# DeAI Data Stack: Challenges and Considerations

By [Lucidity Finance](https://paragraph.com/@lucidityfinance) · 2024-11-29

---

As we build towards powering on-chain recommendations, some of the most frequent and important questions we hear concern the quantity and quality of on-chain data available to power meaningful recommendations. While model architectures play an important role in addressing those concerns, they only take us halfway.

Data is one of the three fundamental neural scaling constraints, and this blog aims to deconstruct the key challenges, processes, and considerations in building a rich and robust data pipeline from an onchain model builder's standpoint. The other two constraints are compute and parameters (model size), but more on them later.

First things first: building a data pipeline for AI models involves multiple steps, each with its own set of challenges, and even more so when you're targeting onchain use cases.

Data Collection
---------------

This includes gathering data from a variety of onchain and offchain sources relevant to the use case (in our case, any datasets that can contribute to the characterisation of wallets, assets, and protocol smart contracts).
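To make this concrete, here is a minimal sketch of what wallet characterisation could look like once transactions are collected. It assumes a simplified transaction schema (`to`, `value`, `timestamp`) purely for illustration, not any chain's actual RPC format:

```python
from collections import Counter

def wallet_features(txs):
    """Derive simple characterisation features for a wallet from its
    transaction history. Each tx is a dict with 'to', 'value' (wei) and
    'timestamp' (unix seconds); a simplified schema for illustration."""
    if not txs:
        return {"tx_count": 0}
    values = [t["value"] for t in txs]
    counterparties = Counter(t["to"] for t in txs)
    timestamps = [t["timestamp"] for t in txs]
    return {
        "tx_count": len(txs),
        "unique_counterparties": len(counterparties),
        "total_value_wei": sum(values),
        "avg_value_wei": sum(values) // len(values),
        "active_days": (max(timestamps) - min(timestamps)) / 86400,
        "top_counterparty": counterparties.most_common(1)[0][0],
    }
```

Features of this kind (activity span, counterparty diversity, value distribution) are the raw material a recommendation model consumes downstream.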

### Challenges:

*   Onchain data is simply not rich enough, given its limited history and engagement compared to Internet data. For instance, Ethereum has processed ~2.5 billion transactions to date; that rises only to ~8 billion even if we consider the top 10 chains. Compare this with the ~15 trillion tokens from publicly available sources used to pre-train Llama 3.
    

### Developments:

*   **DataDAOs**: Decentralized networks enabling community-owned data ecosystems. Over the past year, several DataDAOs have emerged, aggregating diverse datasets across domains like finance, healthcare, and social media. Projects such as [Masa](https://www.masa.ai/), [Vana](https://www.vana.org/), [Grass](https://www.getgrass.io/), etc. have gained significant traction, facilitating user-owned data collection and management across varied sectors.
    
*   **Synthetic data**: Artificially generated datasets that mirror real-world data in a privacy-preserving manner. While its scalability remains to be seen, synthetic data can be a very cost-effective answer to AI models' rapidly increasing data needs. Teams like [Firstbatch](https://www.firstbatch.xyz/), [Openledger](https://www.openledger.xyz/), [Mizu](https://mizu.global/) have been building in this direction.
    
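As a toy illustration of the synthetic-data idea, the sketch below fits a log-normal distribution to a sample of real transaction values and draws fresh values from it, so no real record is reused. Production generators are model-based and far more sophisticated; this is only the distribution-matching intuition:

```python
import math
import random
import statistics

def synthesize_transactions(real_values, n, seed=0):
    """Sample n synthetic transaction values that mirror the real sample's
    log-scale mean and spread, without reusing any real record."""
    logs = [math.log(v) for v in real_values]
    mu, sigma = statistics.mean(logs), statistics.pstdev(logs)
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]
```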

Data Labelling and Post-processing
----------------------------------

Once the data is collected, it needs to be cleaned to make it usable for training AI models, which involves multiple steps. The goal is to ensure data integrity by removing noise and inconsistencies, adding context and structure, and refining outputs for usability in downstream applications.
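One such step, normalising heterogeneous per-chain schemas into a single canonical one while dropping incomplete rows and duplicates, can be sketched as follows. The field names and aliases are illustrative, not any chain's actual schema:

```python
def normalize_records(raw_records):
    """Map records from heterogeneous per-source schemas onto one canonical
    schema, dropping incomplete rows and deduplicating by tx hash."""
    field_map = {  # per-source aliases -> canonical field name
        "txHash": "tx_hash", "hash": "tx_hash",
        "from": "sender", "fromAddress": "sender",
        "to": "recipient", "toAddress": "recipient",
        "value": "value",
    }
    seen, cleaned = set(), []
    for rec in raw_records:
        row = {field_map[k]: v for k, v in rec.items() if k in field_map}
        # keep only complete rows, and only the first copy of each tx hash
        if {"tx_hash", "sender", "recipient", "value"} <= row.keys():
            if row["tx_hash"] not in seen:
                seen.add(row["tx_hash"])
                cleaned.append(row)
    return cleaned
```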

### Challenges:

*   Blockchain data can be noisy and incomplete. Data collected across different chains is often neither standardized nor interoperable. Furthermore, incentivised crowdsourcing of data can invite adversarial user behaviour and sybil attacks. The labelling process can also be time-consuming and costly, and can introduce inherent biases into the resulting datasets.
    

### Developments:

*   **Labelling**: There have been interesting developments around building incentivised networks of massive-scale, community-owned, labelled datasets and trustless verifications to eliminate biases. Teams including [FractionAI](https://fractionai.xyz/), [Kiva](https://www.kivaai.com/), [Synesis](https://www.synesis.one/), etc. have been leading the charge on this front.
    
*   **Data aggregators and marketplaces**: Enable permissionless exchange and monetisation of tokenised and processed datasets across ecosystems. Teams including [Ocean Protocol](https://oceanprotocol.com/), [0xScope](https://www.0xscope.com/) have made notable strides, offering frameworks for data processing, sharing and pricing in a decentralized manner.
    

Data Storage and Querying
-------------------------

This involves secure and scalable storage and handling of the collected data. Key considerations include maximising throughputs and minimising latency while managing growing data volumes queried across distributed systems. This is essential for large-scale model training and serving.
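A basic pattern underlying such distributed storage is content addressing: split a dataset into fixed-size chunks and index each by its hash, so chunks can be fetched in parallel from many nodes and verified on read. A minimal sketch:

```python
import hashlib

def chunk_and_address(data: bytes, chunk_size: int = 256 * 1024):
    """Split a dataset blob into fixed-size chunks and content-address each
    one by its SHA-256 digest. The resulting index lets a reader fetch
    chunks in parallel and verify each chunk against its digest."""
    index = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        index.append({"offset": off, "size": len(chunk),
                      "cid": hashlib.sha256(chunk).hexdigest()})
    return index
```

Real networks layer erasure coding, replication, and incentive accounting on top of this, but the chunk-plus-digest index is the core that makes parallel, verifiable reads possible.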

### Challenges:

*   Onchain AI applications and agentic networks will demand high throughput and data availability. Siloed data clusters and the lack of high-performance data storage solutions can massively impede the development of onchain intelligence networks.
    

### Developments:

*   **DA layers**: [0G](http://0g.ai)’s DA layer offers infinitely scalable infrastructure specifically focused on handling the vast datasets required by AI models, queryable across its own storage as well as third-party storage solutions.
    
*   **Storage solutions**: Solutions like [Hyperline](https://www.hyperline.xyz/)’s unified data lakehouse, and [Arweave](https://arweave.org/)’s permanent, tamper-proof storage, can enable seamless preservation and querying of datasets used in model training and auditing.
    

Data Verification and Privacy-preservation
------------------------------------------

Building AI models requires increasing amounts of offchain user data (in addition to onchain data) to enable meaningfully personalised onchain interactions. That involves ensuring the data collected is authentic, tamper-proof, and free of users’ PII (personally identifiable information).
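A toy sketch of those two requirements together: redact obvious PII, then commit to the cleaned record with a salted hash so later tampering is detectable. The email regex is a stand-in for a fuller PII scrubber, and the salt is a placeholder:

```python
import hashlib
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_and_seal(record, salt=b"demo-salt"):
    """Strip obvious PII (emails only here, as a stand-in for a fuller
    scrubber), then commit to the redacted record with a salted SHA-256
    digest so any later tampering is detectable."""
    redacted = {k: EMAIL.sub("[REDACTED]", v) if isinstance(v, str) else v
                for k, v in record.items()}
    # canonical JSON encoding so the same record always hashes identically
    payload = json.dumps(redacted, sort_keys=True).encode()
    return redacted, hashlib.sha256(salt + payload).hexdigest()
```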

### Challenges:

*   The privacy and security of user-owned data are serious concerns for any data collection task, and more so given the high volume of meaningful data required for model training. Verifying data authenticity is an equally daunting task, especially when dealing with crowdsourced or offchain data.
    

### Developments:

*   **ZK-proofs**: [Space and Time](https://www.spaceandtime.io/)’s zk-proven SQL-based data warehouse combines a decentralized data warehouse with “Proof of SQL” protocol to allow querying onchain/offchain data and verify the result in a trustless manner. This enables enforcing verifiability of inputs at scale.
    
*   **Privacy**: [Provably](https://provably.ai/) focuses on privacy-preserving, verifiable analytics to enable AI and data teams to collaborate on live private data collections. Developments in FHE (Fully Homomorphic Encryption) for privacy-preserving machine learning are also worth mentioning here.
    
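Alongside zk-proofs, a simpler and widely used commitment scheme for data verification is a Merkle tree: a publisher commits to a dataset with one root hash, and a consumer can verify that any single record belongs to it without downloading the rest. A minimal sketch:

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root of a binary Merkle tree over the leaves (duplicating the last
    node on odd-sized levels)."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Sibling hashes (with a left/right flag) from leaf idx up to the root."""
    level = [_h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = idx ^ 1
        proof.append((level[sib], sib < idx))  # flag: sibling is on the left
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(leaf, proof, root):
    """Check one leaf against the published root using only the proof."""
    h = _h(leaf)
    for sib, sib_is_left in proof:
        h = _h(sib + h) if sib_is_left else _h(h + sib)
    return h == root
```

The proof is logarithmic in dataset size, which is what makes per-record verification practical at the data volumes model training demands.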

Closing Thoughts
================

The foundations of a robust data stack are being steadily established to meet the growing demand for rich, diverse, scalable, and verifiable data essential for onchain intelligence networks. With advancements in data collection, augmentation, storage, querying, and verification, the ecosystem is evolving to address the complexities of powering onchain AI. The challenges, though significant, seem surmountable.

Lucidity, focused on enabling personalized recommendations—one of the most impactful onchain AI applications—is dedicated to working with innovative teams across the data pipeline. Together, we aim to drive meaningful progress toward a more inclusive, transparent, and efficient Web3 ecosystem.

Further Reading
===============

*   [Delphi Digital DeAI II: Seizing the Means of Production](https://members.delphidigital.io/reports/seizing-the-means-of-production-deai-part-ii)
    
*   [Challenges and applications of Large Language Models](https://arxiv.org/pdf/2307.10169)
    
*   [Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective](https://arxiv.org/pdf/2112.06409v3)
    
*   [Towards decentralized AI part 1: Data collection](https://www.firstbatch.xyz/blog/towards-decentralized-ai-part-1-data-collection)
    
*   [The modern DeAI Stack: Menlo Ventures](https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/)
    
*   [DeAI-map](https://www.topology.vc/deai-map)

---

*Originally published on [Lucidity Finance](https://paragraph.com/@lucidityfinance/deai-data-stack-challenges-and-considerations)*
