# DeAI Data Stack: Challenges and Considerations

**Published by:** [Lucidity Finance](https://paragraph.com/@lucidityfinance/)
**Published on:** 2024-11-29
**URL:** https://paragraph.com/@lucidityfinance/deai-data-stack-challenges-and-considerations

## Content

As we build towards powering on-chain recommendations, some of the most frequent and important questions we hear concern the quantity and quality of on-chain data available to power meaningful recommendations. While model architectures play an important role in addressing those concerns, they only take us halfway. Data is one of the three fundamental neural scaling constraints, and this blog aims to deconstruct the key challenges, processes, and considerations in building a rich and robust data pipeline from an onchain model builder's standpoint. The other two constraints are compute and parameters (model size), but more on them later.

First things first: building a data pipeline for AI models involves multiple steps, each with its own set of challenges, all the more so when you're targeting onchain use cases.

### Data Collection

This includes gathering data from a variety of onchain/offchain sources relevant to the use case (in our case, any datasets that can contribute to the characterisation of a wallet, asset, or protocol smart contracts).

**Challenges:** Onchain data is simply not rich enough, given its limited history and engagement compared to Internet data. For instance, Ethereum has processed ~2.5 billion transactions to date; only ~8 billion even if we consider the top 10 chains. Compare this with the ~15 trillion data tokens from publicly available sources used to pre-train Llama 3.

**Developments:**

**DataDAOs:** Decentralized networks enabling community-owned data ecosystems. Over the past year, several DataDAOs have emerged, aggregating diverse datasets across domains like finance, healthcare, and social media.
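To put the scale gap in perspective, here is a rough back-of-the-envelope comparison using the transaction and token counts cited above. The tokens-per-transaction figure is an illustrative assumption, not a measured value:

```python
# Back-of-the-envelope: onchain data volume vs. an LLM pre-training corpus.
# Transaction counts are the approximate figures cited above;
# tokens_per_tx is an assumed value for illustration only.
eth_txs = 2.5e9          # ~2.5B Ethereum transactions to date
top10_txs = 8e9          # ~8B transactions across the top 10 chains
llama3_tokens = 15e12    # ~15T tokens used to pre-train Llama 3

tokens_per_tx = 200      # assumption: tokens per transaction serialized as text
onchain_tokens = top10_txs * tokens_per_tx

print(f"Estimated onchain tokens: {onchain_tokens:.2e}")
print(f"Llama 3 corpus is ~{llama3_tokens / onchain_tokens:.0f}x larger")
```

Even under this generous serialization assumption, the available onchain corpus falls short of modern pre-training corpora by roughly an order of magnitude, which is why data collection is the first bottleneck.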
Projects such as Masa, Vana, and Grass have gained significant traction, facilitating user-owned data collection and management across varied sectors.

**Synthetic data:** Artificially generated datasets that mirror real-world data in a privacy-preserving manner. While its scalability remains to be seen, synthetic data can be a very cost-effective answer to AI models' rapidly increasing data needs. Teams like Firstbatch, Openledger, and Mizu have been building in this direction.

### Data Labelling and Post-processing

Once the data is collected, it needs to be cleaned to make it usable for training AI models, which involves multiple steps. The goal here is to ensure data integrity by removing noise and inconsistencies, adding context and structure, and refining outputs for usability in downstream applications.

**Challenges:** Blockchain data can be noisy and incomplete. The data collected across different chains is also often not standardized and not interoperable. Furthermore, incentivised crowdsourcing of data can lead to corrupt user behaviours and sybil attacks. The labelling process can also be time-consuming, costly, and prone to introducing biases into the resulting datasets.

**Developments:**

**Labelling:** There have been interesting developments around building incentivised networks of massive-scale, community-owned, labelled datasets with trustless verification to eliminate biases. Teams including FractionAI, Kiva, and Synesis have been leading the charge on this front.

**Data aggregators and marketplaces:** These enable permissionless exchange and monetisation of tokenised and processed datasets across ecosystems. Teams including Ocean Protocol and 0xScope have made notable strides, offering frameworks for data processing, sharing, and pricing in a decentralized manner.

### Data Storage and Querying

This involves secure and scalable storage and handling of the collected data.
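As a toy illustration of the tamper-proof storage property this layer must provide, the sketch below content-addresses a dataset by its SHA-256 digest, so any later modification of the stored bytes is detectable on retrieval. This is a generic pattern (as used by systems like Arweave conceptually), not any specific project's API; the dict-backed store is a stand-in for a real backend:

```python
import hashlib
import json

def store(backend: dict, records: list) -> str:
    """Store records under a content address (their SHA-256 digest)."""
    blob = json.dumps(records, sort_keys=True).encode()
    digest = hashlib.sha256(blob).hexdigest()
    backend[digest] = blob
    return digest  # the digest can be published onchain as a commitment

def retrieve(backend: dict, digest: str) -> list:
    """Retrieve records, verifying them against the content address."""
    blob = backend[digest]
    if hashlib.sha256(blob).hexdigest() != digest:
        raise ValueError("stored data does not match its commitment")
    return json.loads(blob)

backend = {}
cid = store(backend, [{"wallet": "0xabc", "tx_count": 42}])
assert retrieve(backend, cid) == [{"wallet": "0xabc", "tx_count": 42}]
```

Because the address *is* the hash of the content, a model builder can audit that the training data they fetch is byte-for-byte what was committed to, without trusting the storage provider.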
Key considerations include maximising throughput and minimising latency while managing growing data volumes queried across distributed systems. This is essential for large-scale model training and serving.

**Challenges:** Onchain AI applications and agentic networks will demand high throughput and data availability. Siloed data clusters and the unavailability of high-performance data storage solutions can massively impede the development of onchain intelligence networks.

**Developments:**

**DA layers:** 0G's DA layer offers infinitely scalable infrastructure specifically focused on handling the vast datasets required by AI models, queryable across its own storage as well as third-party storage solutions.

**Storage solutions:** Solutions like Hyperline's unified data lakehouse and Arweave's permanent, tamper-proof storage can enable seamless preservation and querying of the datasets used in model training and auditing.

### Data Verification and Privacy Preservation

AI model building needs increasing amounts of offchain user data (in addition to onchain data) to enable meaningfully personalised onchain interactions. That involves ensuring that the data collected is authentic, tamper-proof, and free of users' PII (personally identifiable information).

**Challenges:** The privacy and security of user-owned data are serious concerns for any data collection task, all the more so given the high volume of meaningful data required for model training. Verifying data authenticity is an equally daunting task, especially when dealing with crowdsourced/offchain data.

**Developments:**

**ZK-proofs:** Space and Time's zk-proven SQL-based data warehouse combines a decentralized data warehouse with a "Proof of SQL" protocol, allowing users to query onchain/offchain data and verify the results in a trustless manner. This enables enforcing verifiability of inputs at scale.

**Privacy:** Provably focuses on privacy-preserving, verifiable analytics to enable AI and data teams to collaborate on live private data collections.
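As a minimal sketch of the PII-scrubbing step that any such pipeline needs before offchain user data reaches model training, the snippet below drops known PII fields and redacts email-like strings from free text. The field names and the single regex are illustrative assumptions; a production scrubber would cover far more identifier types:

```python
import re

# Fields treated as PII in this sketch (illustrative, not exhaustive).
PII_FIELDS = {"email", "name", "phone", "ip_address"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub_record(record: dict) -> dict:
    """Drop known PII fields and redact email-like strings in free text."""
    clean = {k: v for k, v in record.items() if k not in PII_FIELDS}
    for k, v in clean.items():
        if isinstance(v, str):
            clean[k] = EMAIL_RE.sub("[REDACTED]", v)
    return clean

raw = {"wallet": "0xabc", "email": "a@b.com", "note": "contact a@b.com"}
print(scrub_record(raw))  # wallet kept, email field dropped, note redacted
```

Note that scrubbing obvious identifiers is only a baseline; wallet addresses themselves can be deanonymising, which is why the stronger cryptographic approaches above (ZK proofs, FHE) matter for this layer.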
Developments in FHE (Fully Homomorphic Encryption) for privacy-preserving machine learning are also worth mentioning here.

### Closing Thoughts

The foundations of a robust data stack are being steadily established to meet the growing demand for rich, diverse, scalable, and verifiable data essential for onchain intelligence networks. With advancements in data collection, augmentation, storage, querying, and verification, the ecosystem is evolving to address the complexities of powering onchain AI. The challenges, though significant, seem surmountable. Lucidity, focused on enabling personalized recommendations, one of the most impactful onchain AI applications, is dedicated to working with innovative teams across the data pipeline. Together, we aim to drive meaningful progress toward a more inclusive, transparent, and efficient Web3 ecosystem.

### Further Reading

- Delphi Digital DeAI II: Seizing the Means of Production
- Challenges and Applications of Large Language Models
- Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective
- Towards Decentralized AI, Part 1: Data Collection
- The Modern DeAI Stack: Melno ventures
- DeAI-map

## Publication Information

- [Lucidity Finance](https://paragraph.com/@lucidityfinance/): Publication homepage
- [All Posts](https://paragraph.com/@lucidityfinance/): More posts from this publication
- [RSS Feed](https://api.paragraph.com/blogs/rss/@lucidityfinance): Subscribe to updates
- [Twitter](https://twitter.com/LucidityFinance): Follow on Twitter