<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>CAIBA</title>
        <link>https://paragraph.com/@caiba</link>
        <description>CAIBA is a community-governed initiative that sets standards for evaluating AI model performance in crypto-specific contexts.</description>
        <lastBuildDate>Thu, 16 Apr 2026 03:56:04 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <image>
            <title>CAIBA</title>
            <url>https://storage.googleapis.com/papyrus_images/72e99b34e063037e5d89a093a3c7a900e2f7b208bf79a75d1596a35e8a33a36b.png</url>
            <link>https://paragraph.com/@caiba</link>
        </image>
        <copyright>All rights reserved</copyright>
        <item>
            <title><![CDATA[Introducing the Crypto AI Benchmark Alliance]]></title>
            <link>https://paragraph.com/@caiba/introducing-the-crypto-ai-benchmark-alliance</link>
            <guid>nyvdQEK79TFWq2Y44bGg</guid>
            <pubDate>Mon, 09 Jun 2025 21:26:44 GMT</pubDate>
            <description><![CDATA[AI is quickly becoming the go-to starting point for crypto users. Whether you&apos;re chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you&apos;ve asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets. In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. ...]]></description>
            <content:encoded><![CDATA[<p>AI is quickly becoming the go-to starting point for crypto users. Whether you&apos;re chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you&apos;ve asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets.</p><p>In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. They provide builders with clear standards and tools for improvement. With its high-stakes transactions and rapid pace of innovation, crypto requires the same rigor.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/cefc2f139a6913314be2f10b17caf2e455012dbbe8071645a04421bb33de520a.png" alt="CAIBA Founding Members" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="">CAIBA Founding Members</figcaption></figure><p>To address this critical need, 14 leading projects — including <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/BuildOnCyber">Cyber</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/Alchemy">Alchemy</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/eigenlayer">EigenLayer</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/goldskyio">Goldsky</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/IOSGVC">IOSG</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" 
href="https://x.com/LazAINetwork">LazAI</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/MagicNewton">Magic Newton</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/MetisL2">Metis</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/myshell_ai">MyShell</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/OpenGradient">OpenGradient</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/RootDataCrypto">RootData</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/SentientAGI">Sentient</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/Surf_Copilot">Surf</a>, and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/thirdweb">Thirdweb</a> — have come together to launch the Crypto AI Benchmark Alliance (CAIBA). CAIBA is an open, community-driven initiative to establish transparent, reliable benchmarks for crypto‑specific AI tasks and to help the entire industry raise the bar together.</p><h2 id="h-why-benchmarks-are-essential-in-crypto" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Why Benchmarks Are Essential in Crypto</strong></h2><p>Across industries, the push for AI evaluation is gaining serious momentum. LMArena recently raised $100 million to build a dedicated benchmarking platform.</p><p>Sectors like law and healthcare have already recognized the need for rigorous testing. 
Legal professionals rely on benchmarks like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.harvey.ai/blog/introducing-biglaw-bench">Harvey’s BigLaw Bench</a> to assess legal reasoning, while clinicians use Stanford’s <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://crfm.stanford.edu/helm/medhelm/latest/">MedHELM</a> to evaluate AI performance on high-stakes medical tasks. Similarly, platforms like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a> have emerged to test LLMs against task-specific challenges in finance, healthcare, math, and academia.</p><p>The need for domain-specific evaluation is clear. A recent <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.vals.ai/benchmarks/finance_agent-04-22-2025">study</a> by <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a> tested 22 top AI models on finance-specific tasks and found that even the best performers averaged below 50% accuracy. General-purpose models struggled with domain complexity — frequently hallucinating, misreading questions, or failing to use tools correctly.</p><p>With over $100 billion locked in DeFi (<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://defillama.com/">DefiLlama</a>) and AI already being used to automate trading, governance, and onchain analysis, there’s no room for hallucinations or half-truths in crypto. If our industry is going to lean on AI, it needs benchmarks built for it. 
CAIBA is here to solve this problem.</p><h2 id="h-what-caiba-is-and-how-it-works" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>What CAIBA Is and How It Works</strong></h2><p>CAIBA is an alliance that publishes industry-specific benchmarks, plus the tools and frameworks developers need to build more accurate crypto AI models and agents.</p><p>The effort is larger than testing alone. By bringing together protocols, data providers, researchers, and auditors, CAIBA promotes transparency and fairness while guarding against any single project skewing results.</p><p>Shunyu Yao’s influential essay, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://ysymyth.github.io/The-Second-Half/"><em>The Second Half of AI</em></a>, argues that “evaluation is the last unsolved piece of the intelligence puzzle.” CAIBA takes that view to heart by turning real crypto workflows into multi-step challenges that test agents on three pillars of fluency:</p><p><strong>Knowledge:</strong> Answering practical questions about protocols, tokens, and onchain data</p><p><strong>Planning:</strong> Charting multi-step tasks</p><p><strong>Action:</strong> Using wallets, explorers, and APIs safely and reliably</p><p>Models and agents receive a numerical score for each pillar, and those scores feed a live leaderboard that highlights which ones truly grasp crypto’s complexities. By enabling teams to collect data and run evaluations at scale, CAIBA helps builders pinpoint where their apps and models fall short, leading to improvements in the areas that matter most to users.</p><p>To ensure accountability, CAIBA publishes its grading systems and public datasets on open-source platforms like GitHub and Hugging Face under permissive licenses, when allowed. 
Like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://huggingface.co/gaia-benchmark">GAIA</a> and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a>’s benchmarks, some question‑and‑answer sets are kept private to prevent over‑fitting and to protect confidentiality. When distribution is restricted, this data is overseen by a rotating council of protocols, auditors, and researchers.</p><h2 id="h-caia-the-first-benchmark-for-crypto-ai-agents" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>CAIA: The First Benchmark for Crypto AI Agents</strong></h2><p>Launched alongside CAIBA, the Crypto AI Agents benchmark (CAIA) is the alliance’s inaugural evaluation. CAIA builds on general-purpose benchmarks like GAIA and incorporates domain-specific adaptations to test whether AI agents can perform real, analyst-level tasks in crypto.</p><p>The benchmark evaluates agents across three core crypto workflows. Scoring well on CAIA indicates that an agent has the practical skills of a junior crypto analyst. High-performing models are able to parse onchain data, explain tokenomics, and navigate projects with context and accuracy, much like a human would.</p><p><strong>Workflows Evaluated and Representative Tasks</strong></p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/f6ac10313a833a94877f234d5a2466c773379ef5ac31b36857c2ba4c3a2a0523.png" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>CAIA evaluates both foundational models (like GPT-4o, Claude 3.7, Gemini 2.5, DeepSeek-R1) and crypto-native agents. 
Model scores are published on a public leaderboard, and those meeting a performance threshold receive a Crypto-Ready badge as a signal of reliability for builders and users alike.</p><h2 id="h-roadmap" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Roadmap</strong></h2><p>CAIBA will continue expanding its evaluation coverage with three additional benchmarks already planned for 2025:</p><ul><li><p><strong>Crypto Named Entity Recognition (CNER)</strong>: Inspired by traditional Named Entity Recognition, this measures how well models identify protocols, tokens, wallets, and contracts <em>to reduce false positives in crypto data.</em></p></li><li><p><strong>Blockchain-Use Benchmark</strong>: Based on the Mind2Web framework, this evaluates how effectively agents follow natural-language instructions <em>to complete tasks on live crypto websites and test real-world usability.</em></p></li><li><p><strong>Crypto LM Arena</strong>: Modeled after crowdsourced evaluation platforms, this uses community voting <em>to assess the usefulness and accuracy of AI responses and highlight the most effective models.</em></p></li></ul><p>Together, these represent a foundation for holding crypto AI to a higher standard. CAIBA will grow into a complete platform where builders test and improve their agents, and users compare models with confidence. 
If crypto is to trust AI, standards must be built now because the tools of tomorrow depend on the work done today.</p><h2 id="h-help-shape-the-standard" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Help Shape the Standard</strong></h2><p>CAIBA is open to everyone:</p><ul><li><p><strong>Projects &amp; researchers</strong>: Join the alliance, contribute datasets, or submit an agent.</p></li><li><p><strong>Developers</strong>: Propose new tasks that track emerging primitives.</p></li><li><p><strong>Everyday users</strong>: Share the questions you wish AI could answer.</p></li></ul><p>Crypto keeps evolving; let’s make sure AI keeps up. Learn more or get involved at <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://caiba.ai/">caiba.ai</a> or by contacting <strong>@James_dai</strong> on Telegram.</p>]]></content:encoded>
            <author>caiba@newsletter.paragraph.com (CAIBA)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/11ed955d2cbeb296e41b7cddbd697f0ae42d5e6d5c13ea4a863e71abf7862155.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>