Decentralized training has picked up a lot of steam over the past ~12 months, and, after seeing Pluralis Research release a breakthrough in model parallelism, I wanted to take a moment to consolidate my understanding of AI training in general and the raison d'être for decentralized training.
This post is primarily a way for me to sharpen my own thinking in what is a rapidly evolving, highly competitive, relatively technical, and increasingly consequential domain. And, with that, gain perspective on where things are today and where they might be headed.
I’ll first overview the current state of centralized AI training, then outline the rationale for decentralizing it. From there, I’ll highlight research breakthroughs in distributed and decentralized AI, challenges facing decentralized AI today, and run through some cool decentralized AI companies.
I’m not trying to offer any particularly unique insights – but I hope my step-by-step breakdown is a helpful (and enjoyable) orientation! Note: this topic is technical and can quickly get unwieldy – I use LLMs to help me parse through papers and understand and explain things, and I recommend you do too!
Centralized training is what OpenAI, Google, and Meta use to train AI models – i.e. running AI workloads inside single, massive data centers. But this approach is becoming unsustainable: “scaling laws” are plateauing, and the costs, energy, and supply-chain demands of ever-larger data centers are growing exponentially, quickly outpacing gains from Moore’s Law.
Distributed training splits large AI model jobs across various computational units (i.e., multiple GPUs or even multiple data centers), thereby reducing the load on any single location and unlocking further progress.
Decentralized training takes this one step further and seeks to replace the data center altogether, pooling untrusted, heterogeneous smartphones, laptops, desktops, and colocation racks owned by anyone in the world.
Until recently, bandwidth and coordination overheads made this decentralized concept impossible. However, recent breakthroughs like DiLoCo, Streaming DiLoCo, DisTrO, DeMo, SWARM Parallelism, and DiPaCo have made significant progress.
While still early and evolving, two core pillars of decentralized AI infrastructure are seeing iteration and improvement: 1/ the design of sustainable crypto incentive mechanisms (i.e. programmatic token distributions to contributors of decentralized AI networks), and 2/ the ability to quickly and cheaply verify the authenticity of the heterogeneous, untrusted compute contributed to AI networks (via zero-knowledge proofs, trusted execution environments, multi-party computation, and fully homomorphic encryption).
Continued progress in decentralized AI research, incentive design, and verification mechanisms is laying the foundation for a decentralized, global-scale GPU network that can train state-of-the-art AI models far more cheaply than today's centralized approach, while giving independent researchers, smaller labs, and the general public direct influence over – and financial upside from – how the next generation of AI is built and governed.
Centralized AI training is what OpenAI, Google, Meta, and Anthropic use to build the large AI models that power apps you use daily – ChatGPT, Gemini, Llama, and Claude.
Centralized training involves building a data center and loading it with as many GPUs as possible. This approach stems from “scaling laws”, coined by OpenAI in 2020, which have consistently shown that models with more parameters, trained on more data with more compute, tend to perform better.
Leading clusters, like xAI's Colossus and Meta's Llama training clusters, each exceed 100,000 H100 GPUs. Meanwhile, Microsoft and OpenAI are rumoured to be planning 300,000+ GPU clusters by late 2025. But, while these centralized efforts continue to expand, problems are emerging with centralized training.
It is very capital intensive. AI data center capex is projected to hit $5.2T by 2030, with a single frontier data center like xAI's or Meta's requiring ~$5-10B in upfront capex. Add $3-9B in hardware costs (at ~$25-30K per H100) and $500M-1B+ in annual opex for labor (40-60%), electricity (15-25%), cooling, and maintenance.
Beyond just cost, GPU supply is scarce and politicized, leading to 6-12 month wait times due to supply-chain choke points. And, with data center energy demand escalating, there is immense grid strain on local infrastructure.
Concentrating so much compute in so few hands also creates governance risk. Google’s 2024 “woke-filter” incident and OpenAI’s 2023 governance crisis highlight how small, opaque groups of people can shape model behavior at global scale and raise concerns about misuse, lack of accountability, and the imposition of a narrow set of values (think Orwell’s 1984 or the hyper-personalized manipulation loops of Black Mirror).
Relatedly, centralized AI models rely on a foundation of blind trust: users have no proof that AI outputs haven’t been tampered with. Centralized systems also struggle with brittle scaling, leading to inefficiencies and frustrating rate limits for API users of ChatGPT or Claude.
The way progress in AI is measured is also flawed – traditional benchmarks are increasingly gamed by centralized labs and fail to reflect real-world performance (see Meta's recent Llama 4 controversy).
These escalating challenges demand a re-evaluation of how we build and deploy AI.
Multi‑datacenter (distributed) training helps, but the next leap is letting any node that meets a quality bar contribute computing power (decentralized training).
Centralization worked while scaling laws (more model parameters + bigger datasets + more compute ≈ better performance) still fit inside a single hyperscale campus. But they no longer do. To understand why, and how decentralized training presents an alternative, I'll try to break down the mechanics of machine learning and of training large models, and then move through the challenges.
At its core, ML model training is an optimization problem. The goal is to make a model perform a task, like predicting the next word in a sentence or identifying an object in an image. To do this, the model needs to learn from data.
Here's an overview of the process:
Prediction: The model takes some input data and makes a prediction.
Loss Calculation: This prediction is then compared to the actual correct answer. The difference is quantified by a loss function. A high loss means the model was very wrong. A low loss means it was close to correct.
Gradient Calculation: To improve, the model needs to know how to adjust its internal parameters (the weights and biases that define its knowledge). This is where gradients come in. Gradients tell us the direction and magnitude of change for each parameter to reduce its loss. This calculation process is handled by an algorithm called “backpropagation”. Think of it like finding the steepest downhill path on a complex, multi-dimensional landscape.
Parameter Update: Finally, the model's parameters are updated slightly in the direction indicated by the gradients, with the goal of minimizing the loss.
This process is repeated many, many times, often across billions or trillions of data points, until the model's performance on the task is satisfactory.
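To make those four steps concrete, here's a minimal toy training loop in PyTorch – the model, data, and hyperparameters are placeholders I've chosen for illustration, but the structure is the same one used at billion-parameter scale:

```python
# A minimal, illustrative training loop (PyTorch). The model, data, and
# hyperparameters are toy placeholders -- the point is just the four steps:
# predict, compute loss, backpropagate gradients, update parameters.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in for a "real" model
loss_fn = nn.MSELoss()                        # quantifies how wrong a prediction is
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    x = torch.randn(32, 10)                   # a batch of (toy) input data
    y = torch.randn(32, 1)                    # the "correct answers"

    pred = model(x)                           # 1. Prediction
    loss = loss_fn(pred, y)                   # 2. Loss calculation

    optimizer.zero_grad()
    loss.backward()                           # 3. Gradient calculation (backpropagation)
    optimizer.step()                          # 4. Parameter update (down the gradient)
```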
Training very large models on massive datasets would take an impractical amount of time on a single computer. Distributed training can help expedite the process, and a common strategy within distributed training is data parallelism.
Data parallelism is where you:
replicate a full copy of the model across multiple GPUs, and
have each GPU process a different shard (or subset) of the training data simultaneously.
After processing its shard, each GPU calculates its gradients, and these gradients are then synchronized and averaged across all GPUs to update the model's parameters.
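Here's a toy NumPy simulation of that idea (four simulated "GPUs" and a simple linear model, both my own stand-ins) – real frameworks replace the explicit averaging with an all-reduce such as torch.distributed.all_reduce, but the logic is the same:

```python
# Toy data parallelism: several "GPUs" hold identical weights, compute gradients
# on their own shard, then average those gradients so every replica applies the
# same update and stays in sync. Real systems do this averaging with an
# all-reduce over NVLink/InfiniBand or the network.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)                                                  # shared model weights
shards = [(rng.normal(size=(64, 10)), rng.normal(size=64)) for _ in range(4)]  # 4 "GPUs"

def local_gradient(w, X, y):
    # gradient of mean squared error for y ≈ X @ w on this GPU's shard
    residual = X @ w - y
    return 2 * X.T @ residual / len(y)

for step in range(100):
    grads = [local_gradient(w, X, y) for X, y in shards]   # each GPU works independently
    avg_grad = np.mean(grads, axis=0)                      # "all-reduce": average across GPUs
    w -= 0.01 * avg_grad                                   # every replica applies the same update
```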
GPUs are incredibly good at this: each has thousands of processing cores that chew through the matrix math of a forward and backward pass, and data parallelism lets many of them process different shards of the dataset simultaneously. This significantly expedites training on massive datasets, whether the GPUs sit in a single server, across data centers, or in a distributed setup.
While GPUs excel at parallel computation, a critical bottleneck for larger models in distributed setups is gradient synchronization.
After each GPU processes its shard and calculates its gradients, these individually computed gradients must be aggregated across all participating GPUs to ensure the model learns from the entire global batch of data and to keep all the model copies consistent.
This aggregation process, which happens after every training step, requires a huge data transfer between all devices. For the largest models, which can have trillions of parameters, the size of these gradients is enormous.
This constant, massive data transfer places a huge demand on network infrastructure.
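To get a feel for the scale, here's a rough back-of-envelope calculation using illustrative numbers of my own (a 70B-parameter model, bf16 gradients, two link speeds) – not a benchmark, just an order-of-magnitude sanity check:

```python
# Back-of-envelope arithmetic (illustrative numbers, not a benchmark): how long
# does one naive gradient sync take for a large model over different links?
params = 70e9                     # a 70B-parameter model (assumption)
bytes_per_grad = 2                # bf16 gradients (assumption)
payload_gb = params * bytes_per_grad / 1e9           # ≈ 140 GB per sync

links_gbit_per_s = {
    "InfiniBand-class datacenter fabric (~400 Gbit/s)": 400,
    "Good consumer fiber (~1 Gbit/s)": 1,
}
for name, gbit in links_gbit_per_s.items():
    seconds = payload_gb * 8 / gbit                   # GB -> Gbit, then divide by link speed
    print(f"{name}: ~{seconds:,.0f} s per naive sync ({payload_gb:.0f} GB)")
# The datacenter link syncs in a few seconds; the consumer link takes ~19 minutes --
# per training step -- which is why naive gradient syncing breaks down over the internet.
```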
Tightly integrated data center campuses use specialized high-speed, low-latency interconnects (like NVLink or InfiniBand) to handle this traffic effectively. But once the setup scales to different data centers, multiple campuses, or becomes geographically dispersed, network links stretch out, leading to much lower throughput and significantly higher latency.
At this point, the network can't keep up. GPUs spend more time waiting for data to sync than performing computations, drastically slowing the training process. The network thus becomes the critical bottleneck, preventing global-scale centralized AI training.
Despite these challenges, the large model development landscape has seen a shift since 2024. To understand this, it's helpful to clarify the two main stages of large model training:
Pre-training: This is the initial, computationally intense phase, which I’ve focused on thus far. Here, models learn general language patterns, facts, and reasoning abilities by processing vast amounts of raw, unlabeled text and code. The goal is to build a foundational understanding, typically by predicting the next token or filling in masked words.
Post-training (or Fine-tuning): After pre-training, we try to align model behavior with human preferences, improve their ability to follow instructions, and reduce hallucinations. This involves techniques like Supervised Fine-Tuning (SFT) on curated human-labeled data and Reinforcement Learning from Human Feedback (RLHF) or other forms of Reinforcement Learning (RL), where the model learns by receiving rewards or penalties based on its outputs.
With these stages in mind, we've seen two significant developments:
Smarter Training and Inference: Reasoning models, like OpenAI's o3 and DeepSeek's R1, have demonstrated huge performance gains, not just from continuously scaling raw pre-training compute, but from advanced RL and SFT (which involve their own train-time compute), alongside strategic applications of increased test-time compute for reasoning (e.g. generating pre-answer tokens, efficient inference, or prompt engineering). Sam Lehman's "The World's RL Gym" offers an excellent deep dive, highlighting opportunities for decentralized AI.
Efficiency Through Innovation: DeepSeek successfully trained powerful models on a relatively modest 2,000 H800 GPUs by strategically combining advanced techniques, like Mixture-of-Experts (MoE) architectures, highly optimized low-level CUDA tweaks, and novel optimization algorithms like Group Relative Policy Optimization (GRPO).
Together, these developments demonstrate that architectural innovation and training methodologies are as crucial as raw computational power, opening new avenues for achieving frontier-level model capabilities with a more constrained hardware footprint, challenging the dominance of hyperscale data centers.
So – the massive data transfer required for gradient synchronization forms a critical network bottleneck, preventing centralized AI training from scaling infinitely, especially when attempting to span across geographically dispersed locations.
This challenge has fueled the development of various distributed training strategies to make larger models feasible. The table below aims to clarify the different approaches:
Thus far, I’ve focused on the mechanics and limitations of centralized training. Now, I’ll dive into the techniques that underpin distributed training (methods used by hyperscalers to further scale models) and then pivot to recent breakthroughs making decentralized training a reality.
To manage the immense size of modern AI models and the vast datasets they consume, distributed training leverages three “parallelism axes” to spread a training job across many GPUs and even many servers.
Data Parallelism: As previously discussed, data parallelism involves replicating a full copy of the model across multiple GPUs, with each GPU processing a different shard of the training data simultaneously. It’s effective for scaling with large datasets, but struggles with the constant, massive synchronization of gradients across the network, which becomes increasingly problematic as training becomes geographically dispersed.
Model Parallelism: Instead of replicating the full model, different parts of the model are distributed across multiple devices. This is helpful when the model itself becomes too large to fit entirely into a single GPU and enables the training of truly massive models that would otherwise be computationally impossible. Within model parallelism, there are two sub-strategies:
Tensor Parallelism: this strategy involves splitting the computation within a layer across multiple GPUs, i.e. one matrix multiplication might span several GPUs that then exchange partial results constantly. This helps in training large models by distributing memory load and computation, but it also introduces heavy, low-latency communication demands, typically handled by very fast interconnects within a single server or closely-packed servers.
Pipeline Parallelism: this strategy divides the model’s layers into sequential stages, assigning each stage to a different GPU or group of GPUs. Data/inputs pass forward through this “pipeline”, with each stage computing its part, and then gradients move backward through the pipeline. This spreads memory load and reduces communication requirements, but creates “bubble” or idle time, as each stage waits its turn in the sequence. Modern implementations use techniques like micro-batching to fill these bubbles and reduce idle GPU time.
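To see why micro-batching matters, here's a toy schedule printer (forward pass only, GPipe-style, my own simplification): with few micro-batches most time slots are idle "bubbles", and utilization climbs as more micro-batches are fed through the pipeline:

```python
# A toy schedule printer for pipeline parallelism: S sequential stages, M
# micro-batches, forward pass only. Stage s can start micro-batch m at time
# step s + m, so idle slots ("bubbles") appear at the start and end of the
# schedule -- and more micro-batches shrink the bubble fraction.
def print_pipeline_schedule(num_stages: int, num_microbatches: int) -> None:
    total_steps = num_stages + num_microbatches - 1
    busy = 0
    for s in range(num_stages):
        row = []
        for t in range(total_steps):
            m = t - s                              # micro-batch this stage handles at time t
            if 0 <= m < num_microbatches:
                row.append(f"m{m}")
                busy += 1
            else:
                row.append(" .")                   # bubble: stage is idle
        print(f"stage {s}: " + " ".join(row))
    print(f"utilization: {busy / (num_stages * total_steps):.0%}\n")

print_pipeline_schedule(num_stages=4, num_microbatches=2)   # big bubbles (~40% utilization)
print_pipeline_schedule(num_stages=4, num_microbatches=16)  # bubbles mostly filled (~84%)
```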
Modern distributed algorithms often combine all three of these techniques (3-D parallelism) to compress synchronization steps, thereby reducing the burden on the network. But even these advanced strategies face inherent limits when scaling beyond tightly integrated, single-operator data centers, primarily due to fundamental network limitations, as discussed previously.
The classic distributed training approaches (Data, Tensor, and Pipeline Parallelism) excel within highly integrated, single-operator environments like large data centres. But they struggle to scale to truly global, heterogeneous, and trustless compute networks.
Until recently, the communication overheads and coordination challenges inherent in decentralized compute networks made true decentralized training seem impossible for state-of-the-art models. But research breakthroughs over the past few years have introduced novel approaches that directly tackle these barriers:
Individually, these breakthroughs are impressive! But when combined, they hint at the potential for a reduction in communication between nodes – which should be sufficient to move from high-bandwidth InfiniBand interconnects within data centres to efficient training over consumer-grade fiber optic internet connections. This demonstrates the potential for a globally distributed, decentralized compute network for AI.
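To give a flavor of how these methods cut communication, here's my own rough sketch of the DiLoCo-style pattern (not the authors' code, and simplified: plain SGD inside, momentum SGD outside) – each worker takes many local steps and only syncs a weight delta every H steps:

```python
# Rough sketch (my simplification) of the "local updates + infrequent outer sync"
# pattern: each worker runs H local optimizer steps on its own shard, then workers
# exchange their weight deltas ("pseudo-gradients"), which an outer optimizer
# applies to the shared global model. Communication drops roughly H-fold versus
# syncing gradients every step.
import numpy as np

rng = np.random.default_rng(1)
global_w = rng.normal(size=10)
shards = [(rng.normal(size=(256, 10)), rng.normal(size=256)) for _ in range(4)]

def local_grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

H = 50                                     # local steps between syncs (assumption)
outer_lr, inner_lr, momentum = 0.7, 0.01, 0.9
velocity = np.zeros_like(global_w)         # outer optimizer: SGD with momentum

for outer_round in range(20):
    deltas = []
    for X, y in shards:                    # each worker starts from the global weights...
        w = global_w.copy()
        for _ in range(H):                 # ...and trains locally with no communication
            w -= inner_lr * local_grad(w, X, y)
        deltas.append(global_w - w)        # pseudo-gradient: how far this worker moved
    pseudo_grad = np.mean(deltas, axis=0)  # one sync per H local steps
    velocity = momentum * velocity + pseudo_grad
    global_w -= outer_lr * velocity        # outer update to the shared model
```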
While these advancements address a fundamental bottleneck, bridging the gap to SOTA centralized AI models is still a long way off; it’s not just about communication, it's about the total available dedicated compute, specialized interconnects within single clusters, and optimized software stacks.
There are also important challenges that need to be overcome in establishing trust and verifiability in an open network, and designing robust economic incentives. The goal here isn’t just to equal the efficiency of centralized AI training, but rather to enable a new paradigm where AI can be trained on a far greater scale and accessibility.
The communication breakthroughs discussed above are indeed incredible, but there’s still a huge problem facing decentralized training: trust and verifiability.
In a centralized system, you trust the single entity (e.g., OpenAI or Google) running the data center. But in a decentralized network, where compute providers are unknown and potentially malicious, this trust assumption disappears.
In other words, for decentralized AI systems to truly scale, participants need hard guarantees: that the computations they receive are performed correctly (without errors or malicious alterations), and that sensitive training data remains private. In the case of decentralized inference, it also means knowing that when you ask the network a question expecting Llama 4, it isn't quietly answering with Llama 2 – giving you a worse analysis while profiting from the discrepancy.
This is especially important for high-value use cases like finance, healthcare, defense, and legal compliance. Enterprises in these categories are likely willing to pay a premium for provable correctness, as even a single incorrect inference could trigger expensive recomputations or regulatory penalties.
This is where verifiable compute comes in – introducing methods to cryptographically verify correctness into AI pre-training, post-training, and inference. Here’s a summary of the core approaches:
1. Trusted Execution Environments (TEEs): TEEs (like Intel SGX or AMD SEV) are secure, isolated "black boxes" within a CPU. Code and data enter, are processed privately and securely, and results exit – all without the host machine being able to see or tamper with what's inside. For decentralized AI, TEEs offer hardware-backed verification: if computation occurs within a TEE, you get a strong guarantee of its integrity. The main limitations are that they're hardware-specific, which means limited availability, and they can bottleneck with very large models due to memory constraints.
2. Multi-Party Computation (MPC): MPC is a cryptographic technique enabling multiple parties to jointly perform a computation without revealing their individual private data (e.g., a model can be trained collaboratively across different organizations, each contributing their sensitive datasets, but no single party ever sees the others' raw data). This is transformative for privacy-preserving AI, allowing collaboration in highly regulated sectors like healthcare or finance. But MPC is computationally intensive, making it significantly slower and more resource-hungry than computing on plaintext, and its efficiency often decreases with more participants.
3. Fully Homomorphic Encryption (FHE): FHE is often considered the "holy grail" of encryption because it allows computations to be performed directly on encrypted data without ever decrypting it. Only the data owner can decrypt the final output. For decentralized AI, FHE offers the highest level of privacy: you can send encrypted data to a network, have it processed by an encrypted model, and receive an encrypted result – with no node ever accessing unencrypted information. While incredibly powerful, FHE is still largely an academic and research-intensive field for practical AI applications due to its extremely high computational overhead, often making it orders of magnitude slower than plaintext operations.
4. Zero-Knowledge Machine Learning (zkML): zkML uses Zero-knowledge Proofs (ZKPs) to cryptographically verify that a computation was performed correctly, without revealing any underlying data. This is important for decentralized AI because it enables nodes to prove they faithfully executed a model inference or gradient update without exposing proprietary model weights or sensitive input data. Generating these proofs remains computationally expensive, but overheads are coming down, making them increasingly viable for practical applications.
5. Optimistic Machine Learning: Applies the “innocent-until-proven-guilty” approach to ML – i.e., instead of proving every computation, optimistic ML assumes computations are correct unless challenged. Haseeb does a great job explaining this in detail. If a participant suspects fraud, they can initiate a "dispute" that re-executes or verifies the original computation onchain. This approach can significantly reduce overheads and is ideal for scenarios where verification costs are higher than the expected frequency of fraud. However, there's an inherent latency in dispute resolution (i.e., results aren't immediately final), and it relies on honest "challengers" to monitor the network.
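To make the optimistic flow concrete, here's a toy sketch – the task, the hash commitment, and the dispute handling are all illustrative stand-ins, not any specific protocol:

```python
# Toy sketch of the optimistic ("innocent-until-proven-guilty") flow: a worker
# posts a result plus a commitment; anyone can re-execute the task during a
# dispute window and raise a dispute if the commitment doesn't match.
import hashlib
import json

def run_task(inputs: list[int]) -> int:
    return sum(x * x for x in inputs)          # stand-in for a model inference / training step

def commit(result: int) -> str:
    return hashlib.sha256(json.dumps(result).encode()).hexdigest()

# 1. A worker executes the task and posts (result, commitment).
inputs = [1, 2, 3, 4]
claimed_result = run_task(inputs) + 1          # pretend the worker cheated (or erred)
posted = {"result": claimed_result, "commitment": commit(claimed_result)}

# 2. The result is optimistically accepted -- unless a challenger re-executes
#    the task during the dispute window and finds a mismatch.
def challenge(inputs, posted) -> bool:
    return commit(run_task(inputs)) != posted["commitment"]

if challenge(inputs, posted):
    print("dispute raised: worker gets slashed, honest result replaces the claim")
else:
    print("no dispute: result finalizes after the window closes")
```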
Twill at Delphi Digital put out a decentralized AI report recently, including a table that helps quickly grasp the nuances and trade-offs of each approach:
Ultimately, these diverse technologies are critical enablers for decentralized AI. While no longer purely theoretical, they are still relatively early-stage and not yet battle-tested at enterprise scale or high volume. They represent crucial work in progress, with significant potential to build integrity and privacy into the AI stack as they mature.
Rapid advancements in consumer-grade hardware mean that powerful AI capabilities are no longer confined to hyperscale data centers. Modern consumer hardware like M-series Macs and the latest RTX 4090s can now not only fine-tune but even pre-train billion-parameter models, with projects like Exo Labs showing internet-scale runs reaching 100+ GPUs with reasonable throughput.
Given this, and assuming we solve all communication and verification issues, the question remains: why would anyone contribute their power-hungry GPUs for compute tasks? They could use them for other purposes… or just turn them off. This highlights the need for a robust incentive layer.
The goal is to create an economic system that aligns all participants and ensures the long-term health and growth of the network: incentivizing GPU providers (for reliable compute), data contributors (for high-quality data), and model trainers (for accurate models), while deterring malicious or inefficient behavior (e.g., offline nodes, fraudulent computations).
While specific economic systems, token models, and incentive mechanisms deserve nuance and a separate blog post, what you need to know is that the crypto industry has speedrun experimentation in, and has become very good at, incentive design. Strategies relevant for decentralized AI include: using programmatic and formulaic rewards for compute, data, and developer contributions, staking and slashing to disincentivize bad behaviour, and reputation systems with community governance to cultivate long-term alignment.
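As a toy illustration of the staking-and-slashing piece (illustrative parameters only – real designs layer in emissions schedules, reputation, and governance):

```python
# Toy stake-and-slash ledger: providers post stake; verified work earns a
# programmatic reward, failed verification burns a slice of stake. All numbers
# and rules here are made up for illustration.
stakes = {"node_a": 1000.0, "node_b": 1000.0}
balances = {"node_a": 0.0, "node_b": 0.0}

REWARD_PER_TASK = 5.0        # assumption: flat reward per verified task
SLASH_FRACTION = 0.10        # assumption: 10% of stake burned per failed verification

def settle(node: str, passed_verification: bool) -> None:
    if passed_verification:
        balances[node] += REWARD_PER_TASK
    else:
        stakes[node] -= stakes[node] * SLASH_FRACTION   # slashed stake is burned (or redistributed)

settle("node_a", passed_verification=True)   # honest compute gets paid
settle("node_b", passed_verification=False)  # fraudulent or offline work loses stake
print(stakes, balances)   # {'node_a': 1000.0, 'node_b': 900.0} {'node_a': 5.0, 'node_b': 0.0}
```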
With this incentive layer solved, as Twill also alludes to in his decentralized AI report, the vision for decentralized AI is to create a powerful flywheel/feedback loop:
Research breakthroughs → decentralized training on globally dispersed consumer hardware
Better and cheaper consumer hardware and staking yields → cheaper cryptographic verification
Cheaper verification → more node providers and validators join the network
More network nodes → richer decentralized training and better models
Better models → more network value, model deflation (downward pressure on the unit cost of model capability), and better staking yields
The loop then closes, and this is essentially the end-game for decentralized AI training.
Each cycle lowers the cost of creating trustworthy AI and draws more participants into the network, enriching the training environment with diverse geographies and data sources and, ultimately, overcoming centralized AI’s problems outlined in Section 1 (cost, energy, supply chain, governance, trust, brittle scaling, etc.).
Now, let’s run through some cool companies building in this category!
Focus: Pioneering asynchronous model parallelism and structured compression for decentralized training. Pluralis focuses on "Protocol Learning," where model weights are sharded across nodes, ensuring no single node can reconstruct the full model, thus enabling monetizable, non-extractable models.
Key Tech: Employs an asynchronous pipeline-based parallelism architecture (SWARM) and innovations like "Column-Space Sparsification" for over 90% communication reduction, and Nesterov-based asynchronous gradient correction for Pipeline Parallel (PP) setups. Despite these advances, achieving parity with centralized training requires approximately 300x compression, a challenge that remains Pluralis's core focus.
Progress: R&D-centric. Published the 'Beyond Top k' paper demonstrating over 90% compression in inter-node communication for model-parallel setups. They also have work showing heterogeneous device support in Pipeline Parallel (PP) configurations. Their Nesterov method for PP was accepted to ICML 2025. In June 2025, Pluralis released research extending their framework to fine-tuning, demonstrating an experimental 8B LLaMA model trained across 64 GPUs in 4 geographical regions via pipeline parallelism. No public product or testnet yet.
Funding: $7.6M Seed in March 2025, co-led by USV and CoinFund.
Focus: Decentralized, open-source, human-centric AI models and tools, aiming to democratize AI development by leveraging globally distributed compute and blockchain incentives. Abhay at Nous has a good thread with high-level resources on what they’re up to.
Key Tech: Developed the Psyche Network and the DisTrO (Distributed Training Over-the-Internet) optimizer, which achieves extreme communication compression (up to 10,000x reduction via DCT and 1-bit sign encoding), asynchronous fault tolerance, and decentralized scheduling.
Progress: Have developed highly steerable models through their Hermes series of fine-tunes and successfully trained a 15B-parameter model using DisTrO. More recently, in late May 2025, they started pre-training their new 40B-parameter Nous Consilience model on the Psyche network and have shown promising loss and perplexity reduction curves (see below).
Funding: $70M total, including a $50M Series A @$1B led by Paradigm in April 2025.
Focus: Prime Intellect is building a decentralized training network where anyone can participate and receive verifiable rewards for compute contributions, primarily for RL-based decentralized training.
Key Tech: PRIME-RL (asynchronous RL framework), TOPLOC (lightweight behavior verification without costly zkML), SHARDCAST (asynchronous weight aggregation via gossip protocols), and OpenDiLoCo/PCCL (sparse asynchronous communication optimized for low-bandwidth, heterogeneous devices).
Progress: In May 2025, released INTELLECT-2, a 32B-parameter model fine-tuned entirely via trustless decentralized collaboration across 100+ heterogeneous GPUs on 3 continents, showcasing the feasibility of "training as consensus" with full transparency.
Funding: $20.5M total, including a $15M Series A led by Founders Fund in March 2025.
Focus: A verifiable execution layer for decentralized AI training, aiming to turn global idle compute into a massive open AI cluster (training-as-mining). It acts as a protocol layer that supports task distribution, execution, verification, and incentive allocation.
Key Tech: RL Swarm (decentralized collaborative RL for post-training), Verde (a hybrid verification system balancing verifiability and efficiency via minimal recomputation), and SkipPipe (fault-tolerant routing for unstable networks, improving pipeline training speed). Uses a multi-role game-theoretic incentive system.
Progress: Showing great results in decentralized RL post-training. This process takes a strong base model, gives copies to participants who generate reasoning traces, which are then collected and used to improve the base model. This approach is significantly cheaper than pre-training as nodes primarily perform inference, though it retains a dependency on the quality of the base model. Gensyn is currently in testnet phase for its RL swarm, conducting permissionless post-training of 0.5B to 72B models using reinforcement learning on a custom Ethereum rollup.
Funding: Over $50M total, including a $43M Series A led by a16z crypto in May 2023.
Focus: Onchain federated learning that decentralizes training across data, computation, and models. Unlike pure decentralized training, Flock integrates traditional federated learning with a crypto-native incentive layer, prioritizing privacy and usability.
Key Tech: Adopts the standard Federated Learning paradigm, allowing data owners to train locally and submit aggregated updates on-chain. Integrates VRF-based random selection, PoS staking, and programmatic incentives. Notable for zkFL, a zero-knowledge federated learning scheme for privacy-preserving gradient aggregation.
Progress: Has an active platform with 6,620 models created, 176 training nodes, 236 validation nodes, and 1,212 delegators. Launched products like AI Arena (training platform) and FL Alliance (federated learning client).
Funding: Over $9M raised across two rounds in 2024, with investors including DCG and Lightspeed Faction.
Centralized AI models are absolutely crushing it. The progress made since ChatGPT launched in November 2022 has been phenomenal. However, there are signs that scaling laws, which have fueled this progress, are beginning to wane.
Meanwhile, expectations around AI’s impact on the economy have soared – with some viewing AI as the primary path by which the US grows out of a self-inflicted debt cycle (in combination with a renaissance in energy and a 0-to-1 moment in robotics).
This massive AI TAM, combined with the eye-watering valuations that top AI labs are commanding, has created an opening for disruptive startups in decentralized AI to take at least a small piece of centralized AI’s market share.
There is, without a doubt, massive market risk – this is perhaps the most intensely competitive field we’ve ever seen in technology, and centralized AI companies are incredibly well capitalized and elite at execution. But, given the TAM, founders and VCs alike are super excited.
Personally, I find the AI model training and decentralized AI categories super interesting and am easily nerd-sniped by the principles behind and the potential of decentralized AI.
However, as an early-stage VC, I don’t think there’s much opportunity left to invest in the category: it’s already on everyone’s radar, the valuations are already high, and the category leaders seem to have been founded and backed at Series A by Tier 1 VCs.
At this point, this may devolve into a knife fight between players like Nous and Prime, who are competing for attention and GPUs amongst participants, community members, and retail degens speculating on airdrops.
While I’ve thoroughly enjoyed going deeper and researching decentralized AI training, and am truly excited about the possibility of a global-scale GPU swarm training open models with distributed ownership and governance, I’m left wanting to explore more emergent areas of decentralized AI innovation:
I think capital formation for agents (Virtuals), payments for agents (Nevermined), sovereign agents (Freysa), agent orchestration (Naptha), and swarm inference (Fortytwo) are all examples of exciting and emergent areas (and will report back in future blog posts on what I find!).
If you’ve made it this far, I’d love to hear your thoughts on decentralized AI in general, and also any emergent areas you’re excited about. If you’re thinking of building a company in this space, or are already in the trenches, I'd love to connect and hear your insights – please reach out!