# Introducing the Crypto AI Benchmark Alliance

By [CAIBA](https://paragraph.com/@caiba) · 2025-06-09

---

AI is quickly becoming the go-to starting point for crypto users. Whether you're chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you've asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets.

In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. They give builders clear standards and tools for improvement. With its high-stakes transactions and rapid pace of innovation, crypto demands the same rigor.

![CAIBA Founding Members](https://storage.googleapis.com/papyrus_images/cefc2f139a6913314be2f10b17caf2e455012dbbe8071645a04421bb33de520a.png)

CAIBA Founding Members

To address this critical need, 14 leading projects — [Cyber](https://x.com/BuildOnCyber), [Alchemy](https://x.com/Alchemy), [EigenLayer](https://x.com/eigenlayer), [Goldsky](https://x.com/goldskyio), [IOSG](https://x.com/IOSGVC), [LazAI](https://x.com/LazAINetwork), [Magic Newton](https://x.com/MagicNewton), [Metis](https://x.com/MetisL2), [MyShell](https://x.com/myshell_ai), [OpenGradient](https://x.com/OpenGradient), [RootData](https://x.com/RootDataCrypto), [Sentient](https://x.com/SentientAGI), [Surf](https://x.com/Surf_Copilot), and [Thirdweb](https://x.com/thirdweb) — have come together to launch the Crypto AI Benchmark Alliance (CAIBA). CAIBA is an open, community-driven initiative to establish transparent, reliable benchmarks for crypto‑specific AI tasks and to help the entire industry raise the bar together.

**Why Benchmarks Are Essential in Crypto**
------------------------------------------

Across industries, the push for AI evaluation is gaining serious momentum. LMArena recently raised $100 million to build a dedicated benchmarking platform.

Sectors like law and healthcare have already recognized the need for rigorous testing. Legal professionals rely on benchmarks like [Harvey’s BigLaw Bench](https://www.harvey.ai/blog/introducing-biglaw-bench) to assess legal reasoning, while clinicians use Stanford’s [MedHELM](https://crfm.stanford.edu/helm/medhelm/latest/) to evaluate AI performance on high-stakes medical tasks. Similarly, platforms like [Vals.ai](http://vals.ai/) have emerged to test LLMs against task-specific challenges in finance, healthcare, math, and academia.

The need for domain-specific evaluation is clear. A recent [study](https://www.vals.ai/benchmarks/finance_agent-04-22-2025) by [Vals.ai](http://vals.ai/) tested 22 top AI models on finance-specific tasks and found that even the best performers averaged below 50% accuracy. General-purpose models struggled with domain complexity — frequently hallucinating, misreading questions, or failing to use tools correctly.

With over $100 billion locked in DeFi ([DefiLlama](https://defillama.com/)) and AI already being used to automate trading, governance, and onchain analysis, there’s no room for hallucinations or half-truths in crypto. If our industry is going to lean on AI, it needs benchmarks built for it. CAIBA is here to solve this problem.

**What CAIBA Is and How It Works**
----------------------------------

CAIBA is an alliance that publishes industry-specific benchmarks, plus the tools and frameworks developers need to build more accurate crypto AI models and agents.

The effort is larger than testing alone. By bringing together protocols, data providers, researchers, and auditors, CAIBA promotes transparency and fairness while guarding against any single project skewing results.

Shunyu Yao’s influential essay [_The Second Half_](https://ysymyth.github.io/The-Second-Half/) argues that “evaluation is the last unsolved piece of the intelligence puzzle.” CAIBA takes that view to heart by turning real crypto workflows into multistep challenges that test agents on three pillars of fluency:

**Knowledge:** Answering practical questions about protocols, tokens, and onchain data

**Planning:** Charting multi-step tasks

**Action:** Using wallets, explorers, and APIs safely and reliably

Models and agents receive a numerical score for each pillar, and those scores feed a live leaderboard that highlights which ones truly grasp crypto’s complexities. By enabling teams to collect data and run evaluations at scale, CAIBA helps builders pinpoint where their apps and models fall short, leading to improvements in the areas that matter most to users.
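To make the scoring scheme concrete, here is a minimal sketch of how per-pillar scores could roll up into a leaderboard ranking. The agent names, score values, and the unweighted-mean aggregation are all hypothetical illustrations, not CAIBA's actual methodology:

```python
from dataclasses import dataclass

# The three pillars named in the post: Knowledge, Planning, Action.
PILLARS = ("knowledge", "planning", "action")

@dataclass
class AgentResult:
    name: str
    scores: dict  # pillar name -> score in [0, 100]

    @property
    def overall(self) -> float:
        # Hypothetical aggregation: a simple unweighted mean of the pillars.
        return sum(self.scores[p] for p in PILLARS) / len(PILLARS)

def leaderboard(results):
    """Rank agents by overall score, best first."""
    return sorted(results, key=lambda r: r.overall, reverse=True)

# Invented example agents and scores, purely for illustration.
board = leaderboard([
    AgentResult("agent-a", {"knowledge": 82, "planning": 64, "action": 71}),
    AgentResult("agent-b", {"knowledge": 75, "planning": 80, "action": 69}),
])
```

An agent strong in one pillar but weak in another surfaces clearly in this shape, which is what lets builders pinpoint where their models fall short.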

To ensure accountability, CAIBA publishes its grading systems and public datasets on open platforms like GitHub and Hugging Face, under permissive licenses whenever possible. Like [GAIA](https://huggingface.co/gaia-benchmark) and [Vals.ai](http://vals.ai/)’s benchmarks, some question‑and‑answer sets are kept private to prevent overfitting and to protect confidentiality. When distribution is restricted, that data is overseen by a rotating council of protocols, auditors, and researchers.

**CAIA: The First Benchmark for Crypto AI Agents**
--------------------------------------------------

Launched alongside the alliance itself, the Crypto AI Agents benchmark (CAIA) is CAIBA’s inaugural evaluation. CAIA builds on general-purpose benchmarks like GAIA and adds domain-specific adaptations to test whether AI agents can perform real, analyst-level tasks in crypto.

The benchmark evaluates agents across three core crypto workflows. Scoring well on CAIA indicates that an agent has the practical skills of a junior crypto analyst. High performing models are able to parse onchain data, explain tokenomics, and navigate projects with context and accuracy, much like a human would.

**Workflows Evaluated and Representative Tasks**

![](https://storage.googleapis.com/papyrus_images/f6ac10313a833a94877f234d5a2466c773379ef5ac31b36857c2ba4c3a2a0523.png)

CAIA evaluates both foundational models (like GPT-4o, Claude 3.7, Gemini 2.5, and DeepSeek-R1) and crypto-native agents. Model scores are published on a public leaderboard, and those meeting a performance threshold receive a Crypto-Ready badge that signals reliability to builders and users alike.
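The badge logic described above amounts to a pass/fail cutoff over the leaderboard score. A minimal sketch, assuming a hypothetical threshold value (the post does not specify CAIBA's actual cutoff):

```python
# Hypothetical cutoff; CAIBA's real Crypto-Ready threshold is not stated in the post.
CRYPTO_READY_THRESHOLD = 70.0

def crypto_ready(overall_score: float, threshold: float = CRYPTO_READY_THRESHOLD) -> bool:
    """Return True if an agent's overall score earns the Crypto-Ready badge."""
    return overall_score >= threshold

# Invented agents and scores, for illustration only.
badges = {
    name: crypto_ready(score)
    for name, score in {"agent-a": 72.3, "agent-b": 74.7, "agent-c": 49.9}.items()
}
```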

**Roadmap**
-----------

CAIBA will continue expanding its evaluation coverage with three additional benchmarks already planned for 2025:

*   **Crypto Named Entity Recognition (CNER)**: Inspired by traditional Named Entity Recognition, this measures how well models identify protocols, tokens, wallets, and contracts _to reduce false positives in crypto data._
    
*   **Blockchain-Use Benchmark**: Based on the Mind2Web framework, this evaluates how effectively agents follow natural-language instructions _to complete tasks on live crypto websites and test real-world usability._
    
*   **Crypto LM Arena**: Modeled after crowdsourced evaluation platforms, this uses community voting _to assess the usefulness and accuracy of AI responses and highlight the most effective models._
    
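The CNER benchmark above is inspired by classical Named Entity Recognition, which is conventionally scored with precision, recall, and F1 over labeled entities. A rough sketch of what that evaluation could look like for crypto entities — the entity examples, labels, and scoring choices here are invented for illustration, not CAIBA's actual design:

```python
def entity_prf(gold: set, predicted: set):
    """Precision/recall/F1 over (entity, type) pairs, as in standard NER scoring."""
    tp = len(gold & predicted)  # true positives: exact matches on entity and type
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical annotations for a single passage.
gold = {("Uniswap", "protocol"), ("UNI", "token"), ("vitalik.eth", "wallet")}
pred = {("Uniswap", "protocol"), ("UNI", "token"), ("Ethereum", "protocol")}

p, r, f = entity_prf(gold, pred)
```

Penalizing both spurious entities (precision) and missed ones (recall) is what directly targets the false positives the CNER bullet calls out.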

Together, these represent a foundation for holding crypto AI to a higher standard. CAIBA will grow into a complete platform where builders test and improve their agents, and users compare models with confidence. If crypto is to trust AI, standards must be built now because the tools of tomorrow depend on the work done today.

**Help Shape the Standard**
---------------------------

CAIBA is open to everyone:

*   **Projects & researchers**: Join the alliance, contribute datasets, or submit an agent.
    
*   **Developers**: Propose new tasks that track emerging primitives.
    
*   **Everyday users**: Share the questions you wish AI could answer.
    

Crypto keeps evolving; let’s make sure AI keeps up. Learn more or get involved at [caiba.ai](https://caiba.ai/) or by contacting **@James\_dai** on Telegram.

---

*Originally published on [CAIBA](https://paragraph.com/@caiba/introducing-the-crypto-ai-benchmark-alliance)*
