Every day, there’s a shiny new large language model (LLM) claiming to outperform the rest. For users, choosing the right model for a specific use case can feel like flying blind in a storm of hype around the latest new thing. Things have reached the point where even the model drop-down in the base ChatGPT app now offers 7 different models to choose from.
The stakes are even higher when you move beyond mere text generation and want your AI to perform useful tasks - especially on the blockchain. At Ask Gina, we’re building a new kind of wallet interface with a personal AI agent (Gina) built in. Our users can simply ask, “What’s hot today?” or “Swap 10% of my USDC into AERO on Base,” and in seconds Gina fetches the right information and handles all aspects of the on-chain transaction.
But going from an LLM to a crypto-savvy agent embedded in a wallet (with the ability to call web APIs, fetch market data, and execute transactions) isn’t trivial. Through the process of building Gina, we tested multiple models, integrated them with various crypto tools, measured performance, and documented our findings. Below, we’ll share some findings from real-world data - and a few key takeaways we’ve learned about picking a model that can do more than just chat.
First, a bit of context on “tool-calling”. When large language models are restricted to just generating text, they’re limited in how “agentic” they can be. For an agent to perform actions - like pulling real-time data, booking tickets, or facilitating payments - it needs to leverage tool calling. Think of tools as extensions of the LLM’s capabilities, allowing it to interface with APIs or blockchain smart contracts.
A great metaphor we’ve heard in the wild is to think of tools as extra pairs of “hands” that let an agent do more things: the more hands it has, the more tasks it can handle.
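For the curious, here’s a minimal sketch of what tool definitions can look like, written as OpenAI-style function schemas in TypeScript. The tool names and parameters (getMarketData, executeSwap) are hypothetical stand-ins, not Gina’s actual tool stack.

```typescript
// A minimal sketch of tool definitions in the OpenAI-style function schema.
// The tool names and parameters below are hypothetical stand-ins, not Gina's actual stack.
const tools = [
  {
    type: "function",
    function: {
      name: "getMarketData",
      description: "Fetch price, volume, and TVL for a token on a given chain",
      parameters: {
        type: "object",
        properties: {
          token: { type: "string", description: "Token symbol, e.g. AERO" },
          chain: { type: "string", description: "Chain name, e.g. base" },
        },
        required: ["token", "chain"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "executeSwap",
      description: "Build and submit a swap transaction from the user's wallet",
      parameters: {
        type: "object",
        properties: {
          fromToken: { type: "string" },
          toToken: { type: "string" },
          amount: { type: "string", description: "Human-readable amount, e.g. '10'" },
        },
        required: ["fromToken", "toToken", "amount"],
      },
    },
  },
];
```

These definitions get passed alongside each chat request, and the model responds with a structured call (tool name plus arguments) instead of free-form text when it decides a tool is needed.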
Not all models have native support for tool calling (for example, the DeepSeek API doesn’t yet), and even among those that do, performance can vary wildly. This is why it’s crucial to test both:
(1) how well the model recognizes the right tool to call for the job at hand and
(2) the quality and accuracy of the output or transaction it produces.
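To make check (1) concrete, here’s a rough sketch of the idea, assuming the OpenAI Node SDK and a tools array like the one sketched earlier; the expected-tool label comes from our own task list, and this isn’t meant as our exact harness.

```typescript
import OpenAI from "openai";

// Sketch: did the model pick the expected tool for a given prompt?
// Assumes the OpenAI Node SDK and a `tools` array like the one sketched earlier.
const client = new OpenAI();

async function checkToolSelection(
  prompt: string,
  expectedTool: string,
  tools: OpenAI.Chat.Completions.ChatCompletionTool[],
) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    tools,
  });

  const call = response.choices[0].message.tool_calls?.[0];
  const picked = call?.function.name ?? "none";

  return {
    picked,
    correct: picked === expectedTool,                        // check (1): right tool for the job
    args: call ? JSON.parse(call.function.arguments) : null, // check (2) grades these arguments / the final output
  };
}

// e.g. await checkToolSelection("Swap 1 USDC into AERO", "executeSwap", tools)
```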
We ran trials for each model on Gina, measuring their performance on tasks that rely on tool-calling, such as:
Fetching market data (using a market data API call),
Generating charts (using a charting API call), and
Executing on-chain swaps (using RPCs to execute transaction calldata).
Data Collection
We built a testing suite that records:
The task (e.g., “What’s hot?”, “Chart me PEPE”, “Swap 1 USDC into AERO”),
The model used (e.g., GPT 4o, Claude 3.5 Sonnet, Gemini 2.0 Flash Beta, o3-mini),
The time taken to produce a final output,
And a rating of accuracy/relevance. This rating is an interpretation of sometimes subjective results for a given task, but it gives a preliminary indication of where each model currently stands.
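Concretely, each trial ends up looking roughly like the record below. The field names and placeholder values are illustrative, not our exact schema or measured numbers.

```typescript
// Sketch of the shape of one recorded trial.
// Field names and placeholder values are illustrative, not our exact schema or measured numbers.
interface TrialRecord {
  task: string;               // e.g. "Swap 1 USDC into AERO"
  model: string;              // e.g. "gpt-4o", "claude-3-5-sonnet", "gemini-2.0-flash", "o3-mini"
  toolCalls: string[];        // tools invoked, in order
  timeToFirstTokenMs: number; // streaming latency
  totalTimeMs: number;        // time to final output
  succeeded: boolean;         // did the fetch / transaction actually complete?
  rating: 1 | 2 | 3 | 4 | 5;  // subjective accuracy/relevance score
}

// Placeholder example only, not a real measurement.
const example: TrialRecord = {
  task: "Chart me PEPE",
  model: "gemini-2.0-flash",
  toolCalls: ["getMarketData", "renderChart"],
  timeToFirstTokenMs: 400,
  totalTimeMs: 5200,
  succeeded: true,
  rating: 4,
};
```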
Below are the results, averaged across multiple trials we ran this past week:
Our testing revealed insights across multiple categories—speed, accuracy, transaction success, tool compatibility, and user experience.
Below are some takeaways, viewed through an admittedly subjective lens:
Speed vs. Completeness
Gemini 2.0 Flash often provides the fastest “time to first token (letter or word shown)” and renders visual results quickly, a clear advantage for impatient users. However, due to minimal text exposition, there’s sometimes less clarity about how it chose to solve a task.
Claude 3.5 Sonnet can be slower overall but is also more comprehensive - it often makes multiple tool calls, fetches additional context, and tends toward robust explanations. This is an example of tool ordering and handover done well, but at the cost of added latency.
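As a rough illustration, time to first token and total time can be measured off a streaming response as sketched below. This assumes the OpenAI Node SDK; other providers expose similar streaming iterators.

```typescript
import OpenAI from "openai";

// Sketch: measure time-to-first-token and total time for a streamed response.
// Assumes the OpenAI Node SDK; other providers expose similar streaming iterators.
const client = new OpenAI();

async function measureLatency(model: string, prompt: string) {
  const start = performance.now();
  let firstTokenMs: number | null = null;
  let text = "";

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta && firstTokenMs === null) {
      firstTokenMs = performance.now() - start;
    }
    text += delta;
  }

  return { firstTokenMs, totalMs: performance.now() - start, outputChars: text.length };
}
```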
Accuracy & Relevance
Wrapped tokens remain a recurring source of confusion for certain models, highlighting how much tool optimization, correct prompt scaffolding, and correct token metadata selection matter. If a model hands the market-data tool the wrong query (e.g., mixing up WETH vs. ETH), it can produce inaccurate results for metrics like TVL, volume, etc.
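One cheap guardrail is to normalize wrapped-token symbols on the tool side before any market-data query goes out. The sketch below uses a deliberately small, hypothetical mapping; the right entries depend on which venue and metric you’re actually querying.

```typescript
// Sketch: normalize wrapped-token symbols before they reach the market-data tool,
// so a model that says "WETH" when the user means ETH still gets consistent metrics.
// The mapping is a small, hypothetical example.
const WRAPPED_TO_NATIVE: Record<string, string> = {
  WETH: "ETH",
  WMATIC: "MATIC",
};

function normalizeSymbol(symbol: string): string {
  const upper = symbol.trim().toUpperCase();
  return WRAPPED_TO_NATIVE[upper] ?? upper;
}

// normalizeSymbol("weth") -> "ETH"
```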
GPT 4o tends to be especially thorough in checking user balances before execution—boosting its average success rate on transaction execution in general. This thoroughness reflects better tool ordering (e.g., first check portfolio, then proceed) and is part of why it excels at more complicated transactions.
Transaction Success Consistency
For simple swaps, nearly all models that properly invoked our "execute transaction" tool succeeded—except Gemini 2.0 Flash, which stumbled when multiple tool calls were needed. This suggests potential gaps in tool handover (switching from quote fetching to swap execution).
Claude 3.5 Sonnet occasionally saw transactions revert for multi-step operations (like a swap + bridging). Meanwhile, GPT 4o and o3-mini properly sequenced their calls and completed these tasks, demonstrating stronger tool ordering & handover capabilities and higher average success rates for complex operations.
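The pattern we want is sketched below: enforce the handover on the application side rather than trusting the model to sequence it, i.e. balance check, then quote, then execution. The types and function names are hypothetical stand-ins for our actual tools.

```typescript
// Sketch: enforce the balance check -> quote -> execution ordering on the application
// side rather than trusting the model to sequence it. Types and names are hypothetical.
type Hex = `0x${string}`;

interface SwapTools {
  getBalance(token: string): Promise<bigint>;
  getQuote(fromToken: string, toToken: string, amountIn: bigint): Promise<{ calldata: Hex }>;
  executeTransaction(calldata: Hex): Promise<{ hash: string }>;
}

async function swapWithHandover(
  tools: SwapTools,
  fromToken: string,
  toToken: string,
  amountIn: bigint,
) {
  // 1. Check the balance first (the step GPT 4o tended to do unprompted).
  const balance = await tools.getBalance(fromToken);
  if (balance < amountIn) {
    throw new Error("Insufficient balance: refusing to hand over to execution");
  }

  // 2. Fetch a fresh quote, then 3. hand its calldata to execution.
  const quote = await tools.getQuote(fromToken, toToken, amountIn);
  return tools.executeTransaction(quote.calldata);
}
```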
Chaining tools & Observability
Many of the tasks in our tests require chaining multiple calls (e.g., figuring out the right asset/token contract, retrieving market data for the right pair, generating a chart, then placing a swap). If the model doesn’t optimize or observe tool outputs properly—for instance, if it overlooks an error or fails to confirm available balances—it can lead to timeouts or failed transactions.
Observability is key: a model that clearly displays or logs each call’s outcome (like “Balance check: success”) is easier to debug and improve. Reasoning models like o3-mini and good old Claude 3.5 Sonnet continue to lead in this regard.
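In practice this can be as simple as wrapping every tool call so that its outcome is always logged; a sketch, with illustrative names:

```typescript
// Sketch: wrap every tool call so its outcome ("Balance check: success") is always
// logged, which makes failed chains much easier to debug. Names are illustrative.
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

function withObservability(name: string, fn: ToolFn): ToolFn {
  return async (args) => {
    const start = Date.now();
    try {
      const result = await fn(args);
      console.log(`[tool] ${name}: success in ${Date.now() - start}ms`);
      return result;
    } catch (err) {
      console.error(`[tool] ${name}: failed after ${Date.now() - start}ms`, err);
      throw err; // surface the failure so the agent loop can react instead of silently timing out
    }
  };
}

// e.g. const loggedBalanceCheck = withObservability("Balance check", checkBalance);
```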
User Experience (UX): Streaming Text & Readability
As mentioned above, across models, the time to first token can vary. However, after the first bit of text is rendered, some models stream text slowly, which can frustrate users waiting for an immediate response (GPT 4o). Others might chunk or batch the output, leading to a faster perceived experience (o3-mini).
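To compare “streams slowly” against “chunks the output” more objectively, a small extension of the latency harness above records chunk sizes and the gaps between them (again assuming the OpenAI Node SDK):

```typescript
import OpenAI from "openai";

// Sketch: record the size of each streamed chunk and the gap between chunks, to
// compare how different models "feel" after the first token. Assumes the OpenAI Node SDK.
const client = new OpenAI();

async function streamingCadence(model: string, prompt: string) {
  const gaps: number[] = [];
  const sizes: number[] = [];
  let last = performance.now(); // the first gap approximates time to first token

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (!delta) continue;
    const now = performance.now();
    gaps.push(now - last);
    sizes.push(delta.length);
    last = now;
  }

  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return { chunks: sizes.length, avgChunkChars: avg(sizes), avgGapMs: avg(gaps) };
}
```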
Formatting remains a major differentiator. Models that produce concise bullet points, short summaries, or relevant headings score higher on readability—especially on mobile. This not only improves observability for end-users but can also highlight where errors might have occurred in the tool chain.
If you’re building an AI agent that needs to do more than chat—like interacting with APIs, pulling real-time data, or executing blockchain transactions—choosing the right model can make or break your product. Our experience building Gina to satisfy crypto-specific tasks highlights a few key lessons:
It’s not always about the “best” LLM in isolation—it’s about the one that plays nicest with your tool stack. Different models for different tools.
Speed, reliability, and the ability to handle multi-step tasks are just as critical as raw language quality.
Real-world tests (like bridging assets or fetching legitimate market data) expose nuance that simple text benchmarks often miss.
A practical challenge to consider is simply the cost of running comprehensive model evaluations in production. Running a thorough test suite across multiple models for every code deployment can quickly become expensive, especially when testing complex multi-step operations. What counts as best practice here is still an open question across the AI field in general.
We hope our data and learnings help you make more informed decisions when integrating LLMs into your workflows. As new models emerge, we’ll keep testing and sharing updates. Feel free to shoot us a message - you can DM @askgina.eth or one of the Gina squad (@sidshekhar, @ericjuta) on Farcaster or X.
And of course, as a friendly reminder, Gina is now live for early beta - sign up at askgina.ai for early trial access.