This report evaluates Google Gemini 3 Flash and Gemini 3 Pro, OpenAI GPT 5.1/5.2, Anthropic Claude Opus 4.5, and Anthropic Claude Sonnet 4.5 using the Top 25 Type Inhabitation Dataset. The primary objective was to assess the capability of Large Language Models (LLMs) to autonomously generate valid Programmable Transaction Blocks (PTBs) for Move smart contracts on the Sui mainnet.
The most significant finding of this study is the emergence of a "Fog of War" differentiator.
The data reveals that the deciding factor between model tiers was not syntax proficiency, but epistemic management—the ability to act methodically in the face of incomplete information.
Tier 1 (Claude Opus 4.5, GPT 5.1, GPT 5.2) models treated missing information as a search problem. They actively utilized progressive disclosure to build a complete mental map of the package dependencies before attempting construction. Consequently, these models achieved a 100% success rate.
Tier 2 (Claude Sonnet 4.5, Gemini 3 Flash/Pro) models treated missing information as a barrier. When the full context was not immediately visible, these models reacted with hallucination (guessing) or abandonment (stopping early), resulting in success rates below 50%.
Binary Reliability Cliff
Crucially, the performance data exhibits a sharp step function rather than a gradient. No model landed in the "70-90%" range. This suggests that Move smart contract inhabitation is a threshold capability: once a model possesses sufficient epistemic resilience to manage the "Fog of War," it solves the problem completely. Below that threshold, reliability collapses to unusable levels.
Unlike benchmarks that provide complete source code context, the Type Inhabitation benchmark evaluates a model's ability to operate in a low-information environment. The model is placed in a scenario similar to a developer interacting with a closed-source or unverified package: it is given only a package address and a goal.
To succeed, the model must autonomously discover interfaces, resolve type dependencies, and construct a valid Programmable Transaction Block (PTB). This design creates a "dynamic context retrieval" challenge, testing the model's reasoning and search capabilities rather than its ability to recall memorized patterns from open-source repositories.
To illustrate the technical complexity, consider a specific test case from the dataset. The model is initialized with the following prompt:
Goal: Inhabit the type
0x0f8343240d42fbefdb499f2c316f939aa168bf7b29ab63c31bf6c4bc0ba97fe0::escrow::Escrow.
The model has no prior knowledge of this package. The evaluation proceeds in specific technical phases:
Interface Discovery: The model's first action is to query the interface for the target module. It executes get_module_interface(package="0x0f83...fe0", module="escrow"). The system returns a JSON representation of the bytecode interface, revealing the constructor function signature:
```move
public fun create(
    arg0: Coin<0x2::sui::SUI>,
    arg1: 0x0f8343240d42fbefdb499f2c316f939aa168bf7b29ab63c31bf6c4bc0ba97fe0::lock::Lock
): Escrow
```
Recursive Dependency Resolution: The model identifies that while Coin<SUI> is a standard primitive, the second argument requires a custom type: lock::Lock. This type resides in a separate module within the same package. The model must recognize this dependency and trigger a secondary query: get_module_interface(..., module="lock").
Dependency Graph Construction: Upon analyzing the lock module, the model locates the constructor for the Lock struct. It then synthesizes a Directed Acyclic Graph (DAG) of operations:
Node 1: SplitCoins (Move primitive) → Generates Coin<SUI>.
Node 2: Call lock::create() → Generates Lock.
Node 3: Call escrow::create(Node 1, Node 2) → Generates Escrow.
PTB Generation & Verification: Finally, the model serializes this logic into a JSON-based PTB schema. The benchmark harness then dry-runs this transaction against the Sui Mainnet state. Success is defined strictly: the transaction must execute without VM errors (e.g., Effects.status.status == success).
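The discovery loop in the first two phases can be sketched as follows. This is a minimal illustration, not the harness's real implementation: `get_module_interface` is passed in as a stub, and the simplified interface shape (a dict of functions with string-typed parameters) is an assumption.

```python
# Sketch of recursive dependency resolution: starting from the target module,
# fetch its interface, scan constructor parameters for types defined in the
# same package, and queue those modules for discovery in turn.

def resolve_dependencies(get_module_interface, package, root_module):
    """Discover every module reachable from root_module's function signatures."""
    discovered = {}
    frontier = [root_module]
    while frontier:
        module = frontier.pop()
        if module in discovered:
            continue  # already fetched; avoid redundant queries
        interface = get_module_interface(package=package, module=module)
        discovered[module] = interface
        # Any parameter typed as <package>::<module>::<Struct> is a dependency
        # on another module of the same package (e.g. lock::Lock above).
        for func in interface["functions"]:
            for param in func["params"]:
                if param.startswith(package + "::"):
                    dep = param.split("::")[1]
                    if dep not in discovered:
                        frontier.append(dep)
    return discovered
```

On the escrow example, a single call starting at `escrow` would surface the `lock` module as well, mirroring the secondary query described above.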
This workflow demonstrates that "inhabitation" is not merely about writing code; it is about autonomously traversing a graph of type constraints and determining the correct sequence of on-chain operations to satisfy them.
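The operation graph from the case study can then be flattened into a JSON command list. The sketch below loosely mirrors Sui's PTB command structure (SplitCoins and MoveCall commands wired together by result references), but the exact schema the harness expects is an assumption.

```python
# Hypothetical serialization of the escrow DAG into a PTB-like JSON payload.
# {"Result": n} refers to the output of the nth command, which is how later
# calls consume values produced earlier in the block.
import json

def build_escrow_ptb(package, amount):
    commands = [
        # Node 1: split the gas coin to produce the Coin<SUI> argument.
        {"SplitCoins": {"coin": "GasCoin", "amounts": [{"Input": 0}]}},
        # Node 2: lock::create() produces the Lock object.
        {"MoveCall": {"target": f"{package}::lock::create", "arguments": []}},
        # Node 3: escrow::create consumes the results of Nodes 1 and 2.
        {"MoveCall": {"target": f"{package}::escrow::create",
                      "arguments": [{"Result": 0}, {"Result": 1}]}},
    ]
    return json.dumps({"inputs": [{"Pure": amount}], "commands": commands})
```

The topological ordering matters: each `Result` reference must point at a command that has already executed, which is exactly the constraint the DAG encodes.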
The benchmark environment is designed to ensure reproducibility and fair comparison across different models. The complete evaluation harness, including the Rust-based extractor and Docker configurations, is available in the project repository.
To rigorously test efficiency and decision-making, each evaluation is subject to strict constraints:
Single Attempt: The model is given only one opportunity to inhabit a specific package type. There are no "retries" if the final transaction fails.
Iteration Limit: The model is restricted to a maximum of 10 turn-based cycles (thought/action loops). Within this budget, it must discover all necessary interfaces, resolve dependencies, and generate the final transaction. If it fails to produce a valid result within 10 turns, the run is marked as a failure.
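The two constraints combine into a simple control loop. The sketch below shows the shape of such a loop under stated assumptions: `agent` and `run_tool` are illustrative callables, and the action dicts are not the harness's real API.

```python
# Sketch of the single-attempt, 10-turn evaluation budget: each turn the agent
# either issues a tool query (consuming one turn) or submits its final PTB,
# which ends the run immediately -- there are no retries.

MAX_TURNS = 10

def evaluate(agent, run_tool):
    transcript = []  # (action, observation) pairs visible to the agent
    for _turn in range(MAX_TURNS):
        action = agent(transcript)
        if action["type"] == "submit":
            # Single attempt: the first submitted transaction decides the run.
            return action["ptb"]
        transcript.append((action, run_tool(action)))
    return None  # budget exhausted without a submission -> failure
```

This makes over-exploration self-defeating: a model that spends all ten turns querying never reaches submission, while a model that submits too early forfeits its only attempt.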
Source code is purposely withheld to simulate a rigorous "black box" environment. The system extracts interfaces directly from on-chain bytecode, presenting the model with raw struct definitions and function signatures. This tests the model's fundamental understanding of Move's type system and linear logic.
Evaluations are conducted within ephemeral Docker containers. This approach ensures:
State Consistency: The environment is reset after every run, preventing models from retaining information between attempts.
Uniform Tooling: All models interact with the chain using the same set of Rust extractors and Sui CLI binaries.
Realistic Network Conditions: Interaction occurs against the live Sui Mainnet, subjecting models to real-world latency and data structures.
A "Progressive Disclosure" strategy is employed where models must explicitly request module interfaces. This allows for observation of the model's search efficiency. It enables tracking whether a model navigates the dependency tree logically—requesting only what is necessary—or if it struggles to form a coherent mental map of the package structure.
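One way to operationalize this observation is a tool wrapper that reveals interfaces only on request and logs every query. The class and the efficiency metric below are illustrative assumptions, not the benchmark's actual instrumentation.

```python
# Sketch of a progressive-disclosure wrapper: module interfaces are hidden
# until explicitly requested, and the request log makes search efficiency
# (useful queries vs. total queries) measurable after the run.

class DisclosureTracker:
    def __init__(self, interfaces):
        self._interfaces = interfaces   # module name -> interface data
        self.requests = []              # ordered log of the model's queries

    def get_module_interface(self, module):
        self.requests.append(module)
        # Returns None for modules that do not exist, i.e. hallucinated names.
        return self._interfaces.get(module)

    def efficiency(self, necessary):
        """Fraction of requests that were actually needed (1.0 = no waste)."""
        if not self.requests:
            return 0.0
        useful = sum(1 for m in self.requests if m in necessary)
        return useful / len(self.requests)
```

A Tier 1 trace would score near 1.0 (only `escrow` and `lock` requested); a speculative trace full of guessed module names scores much lower.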
| Model | Success Rate | Avg Latency |
|---|---|---|
| Claude Opus 4.5 | 100% | 24.3s |
| OpenAI GPT 5.2 | 100% | 60.0s |
| OpenAI GPT 5.1 | 100% | 51.7s |
| Claude Sonnet 4.5 | 48% | 40.1s |
| Gemini 3 Flash | 44% | 12.1s |
| Gemini 3 Pro | 8% | 45.8s |
The results indicate a binary distribution in capabilities.
Tier 1 (High Reliability): The GPT-5 series and Claude Opus 4.5 demonstrated robust reasoning capabilities, effectively navigating the "unknown" environment without hallucinations or logical errors.
Tier 2 (Emerging Capability): Other models showed promise on simpler tasks but faced significant difficulties when dependency chains extended beyond a single layer.
To better understand the divergence in performance, model behaviors are analyzed across two dimensions: by Model Family and by Performance Tier.
OpenAI (GPT-5 Series)
Search Strategy: Methodical and exhaustive. These models systematically request every potential dependency before attempting construction, favoring complete information over speed.
Error Mode: Over-caution. Occasionally, they refuse to proceed if a specific permission object is missing, marking the task as "uninhabitable" rather than attempting a partial solution.
Latency Profile: High (50-60s), reflecting the exhaustive verification process.
Anthropic (Claude Series)
Search Strategy: Precision-targeted. These models request only the immediate dependencies. Opus resolves deeper layers effectively, while Sonnet often stops early if the path is not obvious.
Error Mode: Under-exploration (Sonnet). Sonnet tends to abandon complex dependency trees early. Opus shows no significant error mode.
Latency Profile: Varied. Opus is highly efficient (~24s), while Sonnet is slower (~40s) due to less decisive planning.
Google (Gemini Series)
Search Strategy: Heuristic and speculative. These models frequently attempt to "guess" interfaces or parameters to save steps, leading to high speed but high hallucination rates.
Error Mode: Hallucination. They invent function names or misinterpret parameter types (e.g., passing strings where objects are required).
Latency Profile: Bimodal. Flash is extremely fast (~12s) due to skipping verification, while Pro is slow (~45s) due to analysis loops.
Context Management
Tier 1 (Opus 4.5, GPT 5.1/5.2): Maintains a dynamic map of the package structure. These models can backtrack and revise plans without losing context as new modules are revealed.
Tier 2 (Sonnet 4.5, Gemini Flash/Pro): Struggles with fragmented views. These models often fail to integrate new information with previous findings, "forgetting" dependencies discovered in earlier steps.
Dependency Resolution
Tier 1: Employs a depth-first search approach. They successfully resolve nested dependencies (A requires B, B requires C) and treat the graph as a recursive logical puzzle.
Tier 2: Relies on shallow pattern matching. While effective at direct instantiations (A requires integer), they fail when dependencies are recursive or require cross-module lookups.
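The Tier 1 depth-first pattern amounts to post-order traversal: build everything a type needs before emitting the call that builds the type itself. The sketch below illustrates this; `constructors`, mapping each type to the types its constructor requires, is hypothetical data rather than a real on-chain schema.

```python
# Sketch of depth-first construction ordering: for nested dependencies
# (A requires B, B requires C), recurse into dependencies first, then append
# the target, yielding a valid build order (C, B, A).

def construction_order(constructors, target, plan=None, seen=None):
    if plan is None:
        plan, seen = [], set()
    if target in seen:
        return plan  # already planned (or shared dependency); skip
    seen.add(target)
    for dep in constructors.get(target, []):
        construction_order(constructors, dep, plan, seen)  # build deps first
    plan.append(target)  # post-order: target comes after everything it needs
    return plan
```

Shallow pattern matching, by contrast, handles only the one-level case (`constructors[target]` containing primitives) and breaks as soon as a dependency has dependencies of its own.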
Self-Correction
Tier 1: Proactive. These models catch schema errors before submitting the transaction, verifying that arguments match expected types.
Tier 2: Reactive or absent. They often submit invalid transactions and rely on error messages to guess the fix, or get stuck in a loop repeating the same mistake.
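The proactive check is cheap to state: before submitting, compare each call's arguments against the interface discovered earlier instead of letting the VM reject the transaction. The command and interface shapes below are simplified assumptions for illustration.

```python
# Sketch of a pre-submission schema check: verify the function exists in the
# discovered interface and that the argument count matches its signature.
# Returns an error string, or None when no schema problem is detected.

def validate_call(interface, function, args):
    sig = next((f for f in interface["functions"] if f["name"] == function), None)
    if sig is None:
        return f"unknown function: {function}"  # a would-be hallucination
    if len(args) != len(sig["params"]):
        return f"{function}: expected {len(sig['params'])} args, got {len(args)}"
    return None
```

Running such a check before submission converts the Tier 2 failure modes (invented function names, wrong arities) into caught errors while the turn budget can still absorb a correction.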
The data indicates that Tier 1 models (Claude Opus 4.5, GPT 5.1, GPT 5.2) have reached a maturity level suitable for production-grade automation of Move smart contracts. Among these, Claude Opus 4.5 stands out for achieving this reliability with the lowest latency.
The central finding of this research is that the performance gap is not driven by coding ability, but by search strategy. Tier 1 models successfully leveraged progressive disclosure to illuminate the "Fog of War," constructing valid mental models from partial information. Tier 2 models, unable to sustain this state of uncertainty, resorted to guessing or abandonment. This suggests that for autonomous agents, epistemic resilience—the ability to act methodically in the face of incomplete information—is the critical capability gap to bridge.