<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>CAIBA</title>
        <link>https://paragraph.com/@caiba</link>
        <description>CAIBA is a community-governed initiative that sets standards for evaluating AI model performance in crypto-specific contexts.</description>
        <lastBuildDate>Thu, 16 Apr 2026 03:56:04 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <image>
            <title>CAIBA</title>
            <url>https://storage.googleapis.com/papyrus_images/72e99b34e063037e5d89a093a3c7a900e2f7b208bf79a75d1596a35e8a33a36b.png</url>
            <link>https://paragraph.com/@caiba</link>
        </image>
        <copyright>All rights reserved</copyright>
        <item>
            <title><![CDATA[Introducing the Crypto AI Benchmark Alliance]]></title>
            <link>https://paragraph.com/@caiba/introducing-the-crypto-ai-benchmark-alliance</link>
            <guid>nyvdQEK79TFWq2Y44bGg</guid>
            <pubDate>Mon, 09 Jun 2025 21:26:44 GMT</pubDate>
            <description><![CDATA[AI is quickly becoming the go-to starting point for crypto users. Whether you&apos;re chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you&apos;ve asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets. In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. ...]]></description>
            <content:encoded><![CDATA[<p>AI is quickly becoming the go-to starting point for crypto users. Whether you&apos;re chasing the next viral memecoin, bridging assets, or checking if a contract is safe, chances are you&apos;ve asked AI for help. But relying on AI without rigorous benchmarks is like navigating crypto blindfolded. One bad answer can lead to exploited protocols, misrouted funds, or drained wallets.</p><p>In industries where accuracy is mission-critical, like law and medicine, benchmarks are built to keep AI honest. They provide builders with clear standards and tools for improvement. With its high-stakes transactions and rapid pace of innovation, crypto requires the same rigor.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/cefc2f139a6913314be2f10b17caf2e455012dbbe8071645a04421bb33de520a.png" alt="CAIBA Founding Members" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="">CAIBA Founding Members</figcaption></figure><p>To address this critical need, 14 leading projects — including <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/BuildOnCyber">Cyber</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/Alchemy">Alchemy</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/eigenlayer">EigenLayer</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/goldskyio">Goldsky</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/IOSGVC">IOSG</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" 
href="https://x.com/LazAINetwork">LazAI</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/MagicNewton">Magic Newton</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/MetisL2">Metis</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/myshell_ai">MyShell</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/OpenGradient">OpenGradient</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/RootDataCrypto">RootData</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/SentientAGI">Sentient</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/Surf_Copilot">Surf</a>, and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://x.com/thirdweb">Thirdweb</a> — have come together to launch the Crypto AI Benchmark Alliance (CAIBA). CAIBA is an open, community-driven initiative to establish transparent, reliable benchmarks for crypto‑specific AI tasks and to help the entire industry raise the bar together.</p><h2 id="h-why-benchmarks-are-essential-in-crypto" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Why Benchmarks Are Essential in Crypto</strong></h2><p>Across industries, the push for AI evaluation is gaining serious momentum. LMArena recently raised $100 million to build a dedicated benchmarking platform.</p><p>Sectors like law and healthcare have already recognized the need for rigorous testing. 
Legal professionals rely on benchmarks like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.harvey.ai/blog/introducing-biglaw-bench">Harvey’s BigLaw Bench</a> to assess legal reasoning, while clinicians use Stanford’s <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://crfm.stanford.edu/helm/medhelm/latest/">MedHELM</a> to evaluate AI performance on high-stakes medical tasks. Similarly, platforms like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a> have emerged to test LLMs against task-specific challenges in finance, healthcare, math, and academia.</p><p>The need for domain-specific evaluation is clear. A recent <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.vals.ai/benchmarks/finance_agent-04-22-2025">study</a> by <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a> tested 22 top AI models on finance-specific tasks and found that even the best performers averaged below 50% accuracy. General-purpose models struggled with domain complexity — frequently hallucinating, misreading questions, or failing to use tools correctly.</p><p>With over $100 billion locked in DeFi (<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://defillama.com/">DefiLlama</a>) and AI already being used to automate trading, governance, and onchain analysis, there’s no room for hallucinations or half-truths in crypto. If our industry is going to lean on AI, it needs benchmarks built for it. 
CAIBA is here to solve this problem.</p><h2 id="h-what-caiba-is-and-how-it-works" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>What CAIBA Is and How It Works</strong></h2><p>CAIBA is an alliance that publishes industry-specific benchmarks, plus the tools and frameworks developers need to build more accurate crypto AI models and agents.</p><p>The effort is larger than testing alone. By bringing together protocols, data providers, researchers, and auditors, CAIBA promotes transparency and fairness while guarding against any single project skewing results.</p><p>Shunyu Yao’s influential essay, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://ysymyth.github.io/The-Second-Half/"><em>The Second Half of AI</em></a>, argues that “evaluation is the last unsolved piece of the intelligence puzzle.” CAIBA takes that view to heart by turning real crypto workflows into multi-step challenges that test agents on three pillars of fluency:</p><p><strong>Knowledge:</strong> Answering practical questions about protocols, tokens, and onchain data</p><p><strong>Planning:</strong> Charting multi-step tasks</p><p><strong>Action:</strong> Using wallets, explorers, and APIs safely and reliably</p><p>Models and agents receive a numerical score for each pillar, and those scores feed a live leaderboard that highlights which ones truly grasp crypto’s complexities. By enabling teams to collect data and run evaluations at scale, CAIBA helps builders pinpoint where their apps and models fall short, leading to improvements in the areas that matter most to users.</p><p>To ensure accountability, CAIBA publishes its grading systems and public datasets on open-source platforms like GitHub and Hugging Face under permissive licenses, when allowed. 
Like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://huggingface.co/gaia-benchmark">GAIA</a> and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://vals.ai/">Vals.ai</a>’s benchmarks, some question‑and‑answer sets are kept private to prevent over‑fitting and to protect confidentiality. When distribution is restricted, this data is overseen by a rotating council of protocols, auditors, and researchers.</p><h2 id="h-caia-the-first-benchmark-for-crypto-ai-agents" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>CAIA: The First Benchmark for Crypto AI Agents</strong></h2><p>Launched alongside CAIBA, the Crypto AI Agents benchmark (CAIA) is the alliance’s inaugural evaluation. CAIA builds on general-purpose benchmarks like GAIA and incorporates domain-specific adaptations to test whether AI agents can perform real, analyst-level tasks in crypto.</p><p>The benchmark evaluates agents across three core crypto workflows. Scoring well on CAIA indicates that an agent has the practical skills of a junior crypto analyst. High-performing models are able to parse onchain data, explain tokenomics, and navigate projects with context and accuracy, much like a human would.</p><p><strong>Workflows Evaluated and Representative Tasks</strong></p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/f6ac10313a833a94877f234d5a2466c773379ef5ac31b36857c2ba4c3a2a0523.png" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>CAIA evaluates both foundational models (like GPT-4o, Claude 3.7, Gemini 2.5, DeepSeek-R1) and crypto-native agents. 
Model scores are published on a public leaderboard, and those meeting a performance threshold receive a Crypto-Ready badge as a signal of reliability for builders and users alike.</p><h2 id="h-roadmap" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Roadmap</strong></h2><p>CAIBA will continue expanding its evaluation coverage with three additional benchmarks already planned for 2025:</p><ul><li><p><strong>Crypto Named Entity Recognition (CNER)</strong>: Inspired by traditional Named Entity Recognition, this measures how well models identify protocols, tokens, wallets, and contracts <em>to reduce false positives in crypto data.</em></p></li><li><p><strong>Blockchain-Use Benchmark</strong>: Based on the Mind2Web framework, this evaluates how effectively agents follow natural-language instructions <em>to complete tasks on live crypto websites and test real-world usability.</em></p></li><li><p><strong>Crypto LM Arena</strong>: Modeled after crowdsourced evaluation platforms, this uses community voting <em>to assess the usefulness and accuracy of AI responses and highlight the most effective models.</em></p></li></ul><p>Together, these represent a foundation for holding crypto AI to a higher standard. CAIBA will grow into a complete platform where builders test and improve their agents, and users compare models with confidence. 
If crypto is to trust AI, standards must be built now because the tools of tomorrow depend on the work done today.</p><h2 id="h-help-shape-the-standard" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Help Shape the Standard</strong></h2><p>CAIBA is open to everyone:</p><ul><li><p><strong>Projects &amp; researchers</strong>: Join the alliance, contribute datasets, or submit an agent.</p></li><li><p><strong>Developers</strong>: Propose new tasks that track emerging primitives.</p></li><li><p><strong>Everyday users</strong>: Share the questions you wish AI could answer.</p></li></ul><p>Crypto keeps evolving; let’s make sure AI keeps up. Learn more or get involved at <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://caiba.ai/">caiba.ai</a> or by contacting <strong>@James_dai</strong> on Telegram.</p>]]></content:encoded>
            <author>caiba@newsletter.paragraph.com (CAIBA)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/11ed955d2cbeb296e41b7cddbd697f0ae42d5e6d5c13ea4a863e71abf7862155.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>