This report evaluates Google Gemini 3 Flash and Gemini 3 Pro, OpenAI GPT 5.1/5.2, Anthropic Claude Opus 4.5, and Anthropic Claude Sonnet 4.5 using the Top 25 Type Inhabitation Dataset. The primary objective was to assess the capability of Large Language Models (LLMs) to autonomously generate valid Programmable Transaction Blocks (PTBs) for Move smart contracts on the Sui mainnet.
The most significant finding of this study is the emergence of a "Fog of War" differentiator.
The data reveals that the deciding factor between model tiers was not syntax proficiency, but epistemic management—the ability to act methodically in the face of incomplete information.
Tier 1 (Claude Opus 4.5, GPT 5.1, GPT 5.2) models treated missing information as a search problem. They actively utilized progressive disclosure to build a complete mental map of the package dependencies before attempting construction. Consequently, these models achieved a 100% success rate.
Tier 2 (Claude Sonnet 4.5, Gemini 3 Flash/Pro) models treated missing information as a barrier. When the full context was not immediately visible, these models reacted with hallucination (guessing) or abandonment (stopping early), resulting in success rates below 50%.
Binary Reliability Cliff
Crucially, the performance data exhibits a sharp step function rather than a gradient. No model landed in the "70-90%" range. This suggests that Move smart contract inhabitation is a threshold capability: once a model possesses sufficient epistemic resilience to manage the "Fog of War," it solves the problem completely. Below that threshold, reliability collapses to unusable levels.
Unlike benchmarks that provide complete source code context, the Type Inhabitation benchmark evaluates a model's ability to operate in a low-information environment. The model is placed in a scenario similar to a developer interacting with a closed-source or unverified package: it is given only a package address and a goal.
To succeed, the model must autonomously discover interfaces, resolve type dependencies, and construct a valid Programmable Transaction Block (PTB). This design creates a "dynamic context retrieval" challenge, testing the model's reasoning and search capabilities rather than its ability to recall memorized patterns from open-source repositories.
To illustrate the technical complexity, consider a specific test case from the dataset. The model is initialized with the following prompt:
Goal: Inhabit the type
0x0f8343240d42fbefdb499f2c316f939aa168bf7b29ab63c31bf6c4bc0ba97fe0::escrow::Escrow.
The model has no prior knowledge of this package. The evaluation proceeds in specific technical phases:
Interface Discovery: The model's first action is to query the interface for the target module. It executes get_module_interface(package="0x0f83...fe0", module="escrow"). The system returns a JSON representation of the bytecode interface, revealing the constructor function signature:
```move
public fun create(
    arg0: Coin<0x2::sui::SUI>,
    arg1: 0x0f8343240d42fbefdb499f2c316f939aa168bf7b29ab63c31bf6c4bc0ba97fe0::lock::Lock
): Escrow
```
Recursive Dependency Resolution: The model identifies that while Coin<SUI> is a standard primitive, the second argument requires a custom type: lock::Lock. This type resides in a separate module within the same package. The model must recognize this dependency and trigger a secondary query: get_module_interface(..., module="lock").
Dependency Graph Construction: Upon analyzing the lock module, the model locates the constructor for the Lock struct. It then synthesizes a Directed Acyclic Graph (DAG) of operations:
Node 1: SplitCoins (Move primitive) → Generates Coin<SUI>.
Node 2: Call lock::create() → Generates Lock.
Node 3: Call escrow::create(Node 1, Node 2) → Generates Escrow.
PTB Generation & Verification: Finally, the model serializes this logic into a JSON-based PTB schema. The benchmark harness then dry-runs this transaction against the Sui Mainnet state. Success is defined strictly: the transaction must execute without VM errors (e.g., Effects.status.status == success).
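The discovery loop in the first two phases can be sketched as follows. This is a minimal illustration, not the harness's real implementation: `get_module_interface` is passed in as a stub, and the simplified interface shape (a dict of functions with string-typed parameters) is an assumption.

```python
# Sketch of recursive dependency resolution: starting from the target module,
# fetch its interface, scan constructor parameters for types defined in the
# same package, and queue those modules for discovery in turn.

def resolve_dependencies(get_module_interface, package, root_module):
    """Discover every module reachable from root_module's function signatures."""
    discovered = {}
    frontier = [root_module]
    while frontier:
        module = frontier.pop()
        if module in discovered:
            continue  # already fetched; avoid redundant queries
        interface = get_module_interface(package=package, module=module)
        discovered[module] = interface
        # Any parameter typed as <package>::<module>::<Struct> is a dependency
        # on another module of the same package (e.g. lock::Lock above).
        for func in interface["functions"]:
            for param in func["params"]:
                if param.startswith(package + "::"):
                    dep = param.split("::")[1]
                    if dep not in discovered:
                        frontier.append(dep)
    return discovered
```

On the escrow example, a single call starting at `escrow` would surface the `lock` module as well, mirroring the secondary query described above.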
This workflow demonstrates that "inhabitation" is not merely about writing code; it is about autonomously traversing a graph of type constraints and determining the correct sequence of on-chain operations to satisfy them.
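The operation graph from the case study can then be flattened into a JSON command list. The sketch below loosely mirrors Sui's PTB command structure (SplitCoins and MoveCall commands wired together by result references), but the exact schema the harness expects is an assumption.

```python
# Hypothetical serialization of the escrow DAG into a PTB-like JSON payload.
# {"Result": n} refers to the output of the nth command, which is how later
# calls consume values produced earlier in the block.
import json

def build_escrow_ptb(package, amount):
    commands = [
        # Node 1: split the gas coin to produce the Coin<SUI> argument.
        {"SplitCoins": {"coin": "GasCoin", "amounts": [{"Input": 0}]}},
        # Node 2: lock::create() produces the Lock object.
        {"MoveCall": {"target": f"{package}::lock::create", "arguments": []}},
        # Node 3: escrow::create consumes the results of Nodes 1 and 2.
        {"MoveCall": {"target": f"{package}::escrow::create",
                      "arguments": [{"Result": 0}, {"Result": 1}]}},
    ]
    return json.dumps({"inputs": [{"Pure": amount}], "commands": commands})
```

The topological ordering matters: each `Result` reference must point at a command that has already executed, which is exactly the constraint the DAG encodes.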
The benchmark environment is designed to ensure reproducibility and fair comparison across different models. The complete evaluation harness, including the Rust-based extractor and Docker configurations, is available in the project repository.
To rigorously test efficiency and decision-making, each evaluation is subject to strict constraints:
Single Attempt: The model is given only one opportunity to inhabit a specific package type. There are no "retries" if the final transaction fails.
Iteration Limit: The model is restricted to a maximum of 10 turn-based cycles (thought/action loops). Within this budget, it must discover all necessary interfaces, resolve dependencies, and generate the final transaction. If it fails to produce a valid result within 10 turns, the run is marked as a failure.
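The two constraints combine into a simple control loop. The sketch below shows the shape of such a loop under stated assumptions: `agent` and `run_tool` are illustrative callables, and the action dicts are not the harness's real API.

```python
# Sketch of the single-attempt, 10-turn evaluation budget: each turn the agent
# either issues a tool query (consuming one turn) or submits its final PTB,
# which ends the run immediately -- there are no retries.

MAX_TURNS = 10

def evaluate(agent, run_tool):
    transcript = []  # (action, observation) pairs visible to the agent
    for _turn in range(MAX_TURNS):
        action = agent(transcript)
        if action["type"] == "submit":
            # Single attempt: the first submitted transaction decides the run.
            return action["ptb"]
        transcript.append((action, run_tool(action)))
    return None  # budget exhausted without a submission -> failure
```

This makes over-exploration self-defeating: a model that spends all ten turns querying never reaches submission, while a model that submits too early forfeits its only attempt.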
Source code is purposely withheld to simulate a rigorous "black box" environment. The system extracts interfaces directly from on-chain bytecode, presenting the model with raw struct definitions and function signatures. This tests the model's fundamental understanding of Move's type system and linear logic.
Evaluations are conducted within ephemeral Docker containers. This approach ensures:
State Consistency: The environment is reset after every run, preventing models from retaining information between attempts.
Uniform Tooling: All models interact with the chain using the same set of Rust extractors and Sui CLI binaries.
Realistic Network Conditions: Interaction occurs against the live Sui Mainnet, subjecting models to real-world latency and data structures.
A "Progressive Disclosure" strategy is employed where models must explicitly request module interfaces. This allows for observation of the model's search efficiency. It enables tracking whether a model navigates the dependency tree logically—requesting only what is necessary—or if it struggles to form a coherent mental map of the package structure.
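One way to operationalize this observation is a tool wrapper that reveals interfaces only on request and logs every query. The class and the efficiency metric below are illustrative assumptions, not the benchmark's actual instrumentation.

```python
# Sketch of a progressive-disclosure wrapper: module interfaces are hidden
# until explicitly requested, and the request log makes search efficiency
# (useful queries vs. total queries) measurable after the run.

class DisclosureTracker:
    def __init__(self, interfaces):
        self._interfaces = interfaces   # module name -> interface data
        self.requests = []              # ordered log of the model's queries

    def get_module_interface(self, module):
        self.requests.append(module)
        # Returns None for modules that do not exist, i.e. hallucinated names.
        return self._interfaces.get(module)

    def efficiency(self, necessary):
        """Fraction of requests that were actually needed (1.0 = no waste)."""
        if not self.requests:
            return 0.0
        useful = sum(1 for m in self.requests if m in necessary)
        return useful / len(self.requests)
```

A Tier 1 trace would score near 1.0 (only `escrow` and `lock` requested); a speculative trace full of guessed module names scores much lower.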
| Model | Success Rate | Avg Latency |
|---|---|---|
| Claude Opus 4.5 | 100% | 24.3s |
| OpenAI GPT 5.2 | 100% | 60.0s |
| OpenAI GPT 5.1 | 100% | 51.7s |
| Claude Sonnet 4.5 | 48% | 40.1s |
| Gemini 3 Flash | 44% | 12.1s |
| Gemini 3 Pro | 8% | 45.8s |
The results indicate a binary distribution in capabilities.
Tier 1 (High Reliability): The GPT-5 series and Claude Opus 4.5 demonstrated robust reasoning capabilities, effectively navigating the "unknown" environment without hallucinations or logical errors.
Tier 2 (Emerging Capability): Other models showed promise on simpler tasks but faced significant difficulties when dependency chains extended beyond a single layer.
To better understand the divergence in performance, model behaviors are analyzed across two dimensions: by Model Family and by Performance Tier.
OpenAI (GPT-5 Series)
Search Strategy: Methodical and exhaustive. These models systematically request every potential dependency before attempting construction, favoring complete information over speed.
Error Mode: Over-caution. Occasionally, they refuse to proceed if a specific permission object is missing, marking the task as "uninhabitable" rather than attempting a partial solution.
Latency Profile: High (50-60s), reflecting the exhaustive verification process.
Anthropic (Claude Series)
Search Strategy: Precision-targeted. These models request only the immediate dependencies. Opus resolves deeper layers effectively, while Sonnet often stops early if the path is not obvious.
Error Mode: Under-exploration (Sonnet). Sonnet tends to abandon complex dependency trees early. Opus shows no significant error mode.
Latency Profile: Varied. Opus is highly efficient (~24s), while Sonnet is slower (~40s) due to less decisive planning.
Google (Gemini Series)
Search Strategy: Heuristic and speculative. These models frequently attempt to "guess" interfaces or parameters to save steps, leading to high speed but high hallucination rates.
Error Mode: Hallucination. They invent function names or misinterpret parameter types (e.g., passing strings where objects are required).
Latency Profile: Bimodal. Flash is extremely fast (~12s) due to skipping verification, while Pro is slow (~45s) due to analysis loops.
Context Management
Tier 1 (Opus 4.5, GPT 5.1/5.2): Maintains a dynamic map of the package structure. These models can backtrack and revise plans without losing context as new modules are revealed.
Tier 2 (Sonnet 4.5, Gemini Flash/Pro): Struggles with fragmented views. These models often fail to integrate new information with previous findings, "forgetting" dependencies discovered in earlier steps.
Dependency Resolution
Tier 1: Employs a depth-first search approach. They successfully resolve nested dependencies (A requires B, B requires C) and treat the graph as a recursive logical puzzle.
Tier 2: Relies on shallow pattern matching. While effective at direct instantiations (A requires integer), they fail when dependencies are recursive or require cross-module lookups.
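The Tier 1 depth-first pattern amounts to post-order traversal: build everything a type needs before emitting the call that builds the type itself. The sketch below illustrates this; `constructors`, mapping each type to the types its constructor requires, is hypothetical data rather than a real on-chain schema.

```python
# Sketch of depth-first construction ordering: for nested dependencies
# (A requires B, B requires C), recurse into dependencies first, then append
# the target, yielding a valid build order (C, B, A).

def construction_order(constructors, target, plan=None, seen=None):
    if plan is None:
        plan, seen = [], set()
    if target in seen:
        return plan  # already planned (or shared dependency); skip
    seen.add(target)
    for dep in constructors.get(target, []):
        construction_order(constructors, dep, plan, seen)  # build deps first
    plan.append(target)  # post-order: target comes after everything it needs
    return plan
```

Shallow pattern matching, by contrast, handles only the one-level case (`constructors[target]` containing primitives) and breaks as soon as a dependency has dependencies of its own.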
Self-Correction
Tier 1: Proactive. These models catch schema errors before submitting the transaction, verifying that arguments match expected types.
Tier 2: Reactive or absent. They often submit invalid transactions and rely on error messages to guess the fix, or get stuck in a loop repeating the same mistake.
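The proactive check is cheap to state: before submitting, compare each call's arguments against the interface discovered earlier instead of letting the VM reject the transaction. The command and interface shapes below are simplified assumptions for illustration.

```python
# Sketch of a pre-submission schema check: verify the function exists in the
# discovered interface and that the argument count matches its signature.
# Returns an error string, or None when no schema problem is detected.

def validate_call(interface, function, args):
    sig = next((f for f in interface["functions"] if f["name"] == function), None)
    if sig is None:
        return f"unknown function: {function}"  # a would-be hallucination
    if len(args) != len(sig["params"]):
        return f"{function}: expected {len(sig['params'])} args, got {len(args)}"
    return None
```

Running such a check before submission converts the Tier 2 failure modes (invented function names, wrong arities) into caught errors while the turn budget can still absorb a correction.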
The data indicates that Tier 1 models (Claude Opus 4.5, GPT 5.1, GPT 5.2) have reached a maturity level suitable for production-grade automation of Move smart contracts. Among these, Claude Opus 4.5 stands out for achieving this reliability with the lowest latency.
The central finding of this research is that the performance gap is not driven by coding ability, but by search strategy. Tier 1 models successfully leveraged progressive disclosure to illuminate the "Fog of War," constructing valid mental models from partial information. Tier 2 models, unable to sustain this state of uncertainty, resorted to guessing or abandonment. This suggests that for autonomous agents, epistemic resilience—the ability to act methodically in the face of incomplete information—is the critical capability gap to bridge.