<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Function Network</title>
        <link>https://blog.function.network</link>
        <description>Access AI anytime, anywhere.</description>
        <lastBuildDate>Thu, 23 Apr 2026 02:04:16 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>All rights reserved</copyright>
        <item>
            <title><![CDATA[The Evolution of Function Network]]></title>
            <link>https://blog.function.network/the-evolution-of-function-network</link>
            <guid>4JCK4BuPkJiQqfd5C2dM</guid>
            <pubDate>Tue, 08 Jul 2025 01:21:22 GMT</pubDate>
            <description><![CDATA[AI is one of the most powerful technologies of our generation, but in order for it to reach its fullest potential, it needs to be developed in the open. Limited training resources, rising costs of and restricted access to compute, and monopolization by a handful of tech companies have resulted in second-order effects that hinder progress. We are here to offer a new solution for AI development and contribution. The team has built a decentralized platform on Base that makes AI accessible to everyone, and rew...]]></description>
            <content:encoded><![CDATA[<p>AI is one of the most powerful technologies of our generation, but in order for it to reach its fullest potential, it needs to be developed in the open. Limited training resources, rising costs of and restricted access to compute, and monopolization by a handful of tech companies have resulted in second-order effects that hinder progress. <br><br>We are here to offer a new solution for AI development and contribution. The team has built a decentralized platform on Base that makes AI accessible to everyone and rewards users for their contributions. This includes providing compute to the network, building applications with inference powered by our Developer Platform, and using our ecosystem’s consumer applications.</p><p>AI has limitless scalability and is more performant with global participation. This should be the standard regardless of which entity is fighting for market share.</p><p><br></p><p>Our belief revolves around making AI a collaborative technology where users are fairly compensated for their efforts. The Function team’s years of experience building in web3 help us adapt in a rapidly evolving industry. This is our story.</p><p><br></p><p><strong>Function’s Early Days&nbsp;</strong></p><p><br></p><p>Prior to joining forces, software engineers Erick Ho and Alex Mo were each building in the crypto space, gaining firsthand experience with the challenges of scalability and decentralization. Erick began his career at AWS, which gave him a unique vantage point to witness the rising wave of artificial intelligence and the challenges associated with centralized AI systems. He saw firsthand the computational demands of AI and how they limit access to a select few major companies.</p><p><br></p><p>Erick crossed paths with Alex during their time at Coinbase at the height of the NFT boom in 2021. Alex, who had joined Coinbase as a developer, quickly made his mark by creating an NFT index that enhanced Coinbase’s NFT marketplace. A year later, Erick was brought on to help scale its operations. Their shared drive and determination created a partnership that would become the foundation of Function Network.&nbsp;</p><p><br></p><p><strong>Challenges with AI</strong></p><p><br></p><p>Artificial intelligence began to dominate headlines in 2022, with companies like OpenAI, Google, and Microsoft quickly becoming household names. The advent of large language models (LLMs) and products like OpenAI’s ChatGPT sparked a revolution in how the world interacted with artificial intelligence, while megacaps Google and Microsoft raced to integrate AI into their existing products and services.&nbsp;</p><p><br></p><p>As AI adoption accelerated, the duo saw the challenges they had observed at AWS and Coinbase materialize. Inference was becoming increasingly expensive, as only a handful of compute providers managed to succeed at scale. What’s more, these centralized AI systems created fundamental issues: a lack of seamless AI integration into existing blockchain infrastructures, cost inflation, and censorship-driven outputs.</p><p><br></p><ul><li><p><strong>AI and blockchain integration challenges: Centralized AI models rely on massive datasets stored on central servers, making them more vulnerable to data breaches and cyberattacks that can compromise sensitive information globally.
As well, centralized systems take a siloed approach to AI development, raising transparency concerns, since there is often no visibility into how models are trained on these datasets.</strong></p></li><li><p><strong>Cost inflation: The growing demand for AI is creating a stark divide in the tech landscape. Costs are spiraling out of control while major tech companies dominate critical resources. These large-cap behemoths are </strong><a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://finance.yahoo.com/news/big-tech-set-to-invest-325-billion-this-year-as-hefty-ai-bills-come-under-scrutiny-182329236.html"><strong><u>forecast</u></strong></a><strong> to spend $325 billion on infrastructure this year alone, fueling a cloud compute market already worth $675 billion in 2024. This massive influx of capital is driving up costs and creating a supply crunch that stifles innovation and competition. It is also placing significant strain on centralized cloud providers, leading to increased latency. As demand outpaces supply, many businesses are being priced out of the AI race and struggling to scale their operations.&nbsp;</strong></p></li><li><p><strong>Censorship vulnerabilities: Power and control are concentrated in the hands of a few large corporations, making AI systems more susceptible to censorship by authorities or governing bodies. This can restrict access to information and enable the manipulation of global narratives. Centralized systems can filter and remove information, potentially suppressing legitimate discourse and stifling freedom of speech. Users also often remain unaware of why specific content has been removed or altered, fueling mistrust.</strong></p></li></ul><p><br></p><p>These challenges set Erick and Alex on a path to create a network that is more open, transparent, and inclusive for AI usage and development.<br><br></p><p><strong>The Birth of Function Network</strong></p><p>Function Network launched on Base as a platform for people to use AI, build with AI, and power AI while being rewarded for it. The team has built a full-stack managed AI cloud where any type of user can access an array of open-source LLMs, powered by a global infrastructure network. It is the first decentralized inference network that supports the next wave of AI-powered applications in both web2 and web3.</p><p><br></p><ul><li><p><strong>Provide Compute: At the core of Function’s full-stack decentralized AI ecosystem is Function Network, the foundational layer where anyone can contribute compute to run open-source models. Unlike centralized platforms, Function’s decentralized structure distributes inference across different GPU and other hardware providers, which not only lowers computing costs but also introduces fault tolerance to help guarantee model uptime. Providers are granted the right to offer compute, and earn additional rewards, by staking Function’s native token (FUNC). The stake gives providers economic skin in the game, which also enhances network security.&nbsp;</strong></p></li><li><p><strong>Provide Models: Function empowers anyone to contribute their own models. Model creators can upload, host, and share their open-source LLMs directly on Function. In return, they gain visibility, usage, and rewards. Function levels the playing field for model creators, while making room for participation beyond the “Big 4” (Llama, Mistral, Qwen, and DeepSeek).</strong></p></li><li><p><strong>Developer Platform: Function Network provides developers with an intuitive platform that delivers a one-stop shop for all AI development needs. It offers a suite of OpenAI-compatible APIs (application programming interfaces) that streamline the integration of AI capabilities into other applications (see the sketch after this list). This gives developers easy access to various open-source models and empowers them to create sophisticated AI-powered solutions, for free.</strong></p></li><li><p><strong>Function Chat: Function Chat is the user-facing application of Function’s infrastructure. It features a seamless, conversational AI experience that’s available on both the web and mobile, with a familiar chat interface, much like what people use today with ChatGPT or Claude. Because it runs on a decentralized network of node operators, users can interact with a variety of AI models, reinforcing Function’s core belief in censorship-resistant technology.</strong></p></li></ul>
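<p><br></p><p>As a rough illustration of what “OpenAI-compatible” means in practice, here is a minimal sketch using the standard <code>openai</code> Python client pointed at a different base URL. The gateway URL, API key, and model name below are illustrative placeholders, not Function Network’s actual values:</p><pre><code># Minimal sketch of calling an OpenAI-compatible endpoint.
# The base_url, api_key, and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-gateway.network/v1",  # hypothetical gateway URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-405b-instruct",  # any open-source model the network exposes
    messages=[{"role": "user", "content": "Explain decentralized inference in one sentence."}],
)
print(response.choices[0].message.content)
</code></pre><p>Because the API surface matches OpenAI’s, an existing application can switch providers by changing only the base URL and key.</p>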
<p><br></p><p>Function’s decision to launch on Base is a natural fit: it leverages Coinbase’s powerful distribution network and Base’s success as the fastest-growing L2. By building as an appchain L3, Function gains dedicated blockspace and greater transaction throughput, enabling the platform to scale efficiently. The move reinforces the team’s vision of decentralized, accessible infrastructure for AI innovation.</p><p><br></p><p><strong>Join Function Network</strong></p><p>Function Network is not just a platform; it’s a movement towards a more open, efficient, and accessible AI ecosystem. We invite businesses and developers to join us on testnet and explore the endless possibilities of decentralized AI computing.&nbsp;</p><p>The story of Function Network is far from over. In fact, it’s just the beginning. Start building <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.function.network/"><u>here</u></a>.</p>]]></content:encoded>
            <author>function.network@newsletter.paragraph.com (Alex Mo)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/4e754ce2bbe23a7d556769be0e63c21c.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Revolutionizing AI Inference: Overcoming Vertical Scaling Challenges with Hybrid Solutions]]></title>
            <link>https://blog.function.network/revolutionizing-ai-inference-overcoming-vertical-scaling-challenges-with-hybrid-solutions</link>
            <guid>Qb0Z5wAl7Whd4Qa51umN</guid>
            <pubDate>Wed, 05 Mar 2025 18:06:28 GMT</pubDate>
            <description><![CDATA[As demand for AI grows, models are scaling at an unprecedented rate, pushing the boundaries of what current hardware can support. Today’s landscape is dominated by increasingly large and complex models, with ever bigger models being announced, such as NEAR’s 1.4T-parameter model. These models require immense computational power and memory. Despite the parallel growth of increasingly powerful GPUs, the traditional approach of vertical scaling is beginning to show its limitations. This a...]]></description>
            <content:encoded><![CDATA[<p><em>As demand for AI grows, models are scaling at an unprecedented rate, pushing the boundaries of what current hardware can support. Today’s landscape is dominated by increasingly large and complex models, with ever bigger models being announced, such as NEAR’s 1.4T-parameter model. These models require immense computational power and memory. Despite the parallel growth of increasingly powerful GPUs, the traditional approach of vertical scaling is beginning to show its limitations.&nbsp;</em></p><p><em>This article explores the current state of AI inference, the challenges of vertical scaling, and how horizontal scaling through pipeline parallelism, novel optimization techniques, and hybrid GPU configurations provides a transformative solution as the market trends toward increasingly large and complex models.</em></p><hr><div class="relative header-and-anchor"><h2 id="h-the-state-of-ai-inference-today">The State of AI Inference Today</h2></div><p>AI inference involves running pre-trained models to generate insights, outputs, or predictions. From text and image generation to video interpretation, inference is computationally intensive, requiring high-performance hardware to keep pace with an industry of ever-growing model sizes.</p><p><strong>Massive Models Require Massive Resources</strong>:</p><ul><li><p><strong>Deepseek R1 671B</strong>, with <strong>671 billion parameters</strong>, pushes the boundaries of large-scale AI, requiring <strong>distributed compute infrastructure</strong> to function efficiently.</p></li><li><p>Models like <strong>Llama 3.1 405B</strong> have 405 billion parameters, necessitating multiple high-end GPUs for training and inference.</p></li><li><p>Google’s <strong>Switch C 2048 </strong>takes scaling to another level, requiring terabytes of memory and thousands of GPUs for optimal performance.</p></li><li><p>Computer vision models like <strong>Vision Transformers (ViTs)</strong> and high-resolution generative models like <strong>Stable Diffusion</strong> similarly demand significant VRAM, often exceeding 40GB for large-scale deployments.</p></li></ul><p><strong>Vertical Scaling: The Current Solution</strong>:</p><ul><li><p>Enterprises today rely on <strong>powerful clusters of GPUs</strong> to handle the ever-growing computational demands of AI inference. These GPUs, such as <strong>NVIDIA’s H100 and A100</strong>, are either <strong>self-hosted in enterprise datacenters</strong> or <strong>rented from cloud providers</strong> like <strong>AWS Bedrock, Google Cloud, and Azure AI</strong>.</p></li><li><p>As models continue to grow, so do GPUs. Ever more powerful accelerators, such as NVIDIA’s <strong>GH200 and newly unveiled GB200</strong>, as well as unified-memory chips such as Apple’s <strong>M4</strong>, continue to emerge in the market.</p><ul><li><p><strong>NVIDIA GH200:&nbsp;</strong></p><ul><li><p>The GH200 offers up to 10 times higher performance for applications handling terabytes of data.
It integrates 96GB of HBM3 memory, delivering a bandwidth of 4TB/s.&nbsp;</p></li><li><p>An upcoming version with HBM3e memory will increase capacity to 144GB and bandwidth to over 4.9TB/s.</p></li></ul></li><li><p><strong>NVIDIA GB200:&nbsp;</strong></p><ul><li><p>Recently unveiled, the GB200 provides a combined memory of 1.7TB, designed to handle the most demanding AI workloads with exceptional performance and scalability.</p></li></ul></li><li><p><strong>Apple's M4: </strong>Apple's latest M4 chip supports up to 128GB of unified memory with a bandwidth of 546GB/s.</p></li></ul></li></ul><div class="relative header-and-anchor"><h4 id="h-the-looming-problem-with-vertical-scaling"><strong>The Looming Problem with Vertical Scaling</strong></h4></div><p>As models like <strong>Deepseek R1 671B </strong>and <strong>Llama 3.1 405B</strong> are joined by ever newer and larger models, the amount of VRAM required for both training and inference grows exponentially. Larger models demand more memory per layer and require more GPUs to process the increased parameter counts. This trend is pushing vertical scaling to its breaking point:</p><ul><li><p><strong>Physical Limits of Hardware</strong>: Although ever more powerful GPUs are being produced, GPU design is approaching practical limits in terms of memory and processing power.</p></li><li><p><strong>Limited Scalability</strong>: Due to diminishing returns, adding more powerful GPUs does not scale performance linearly.</p></li><li><p><strong>Skyrocketing Costs</strong>: Clusters of GPUs like the NVIDIA GH200 can cost hundreds of thousands of dollars to own, or tens of thousands per month to rent.</p></li><li><p><strong>Supply Constraints</strong>: Enterprise-grade GPUs are in high demand, often resulting in inflated costs or limited availability.</p></li><li><p><strong>Energy Consumption</strong>: High-end GPUs consume significant amounts of power, leading to high energy &amp; maintenance costs as well as environmental impacts.</p></li></ul><p>These issues underscore the need for alternative strategies as AI continues to scale. Vertical scaling, while crucial, can no longer keep pace with the growth of AI in the market.</p><div class="relative header-and-anchor"><h4 id="h-centralized-ai-providers-and-outages"><strong>Centralized AI Providers &amp; Outages</strong></h4></div><p>Relying on centralized AI service providers poses significant risks, as evidenced by several notable outages and their widespread impacts:</p><p><strong>1. OpenAI's ChatGPT Downtime (June 2024, January 2025):</strong> An outage rendered ChatGPT inaccessible for several hours in June 2024. Then, on January 23, 2025, ChatGPT experienced an outage that prevented users from logging in and led to error messages, prompting users to seek alternative services and affecting multiple downstream enterprises. This shift in user behavior demonstrated the potential for customer attrition and highlighted the risks of relying on a sole provider.&nbsp;</p><p><strong>2. DeepSeek's Server Outage (January 2025): </strong>The free AI chatbot DeepSeek faced 'server is busy' errors, frustrating users and prompting complaints on social media, underscoring the challenges centralized services face in scaling infrastructure to meet growing demand.</p>
<p><strong>3. CrowdStrike-Related IT Outages (July 2024):</strong> A faulty update from cybersecurity firm CrowdStrike caused widespread outages, compromising organizations that relied on its AI-driven cybersecurity solutions. The cascading effects raised data-safety concerns for users and demonstrated the problems centralized services can cause across multiple sectors.</p><p><strong>4. Amazon Web Services (AWS) Disruptions (Dec 2021):</strong> AWS has experienced multiple outages over the years, including a significant one on December 7, 2021, that disrupted services like Disney+ and Netflix’s AI recommendation systems and communication tools. These events illustrate the extensive reach and potential impact of centralized service failures, even for large enterprises.&nbsp;</p><p><strong>5. Anthropic's Claude Outages (June 2024): </strong>Anthropic's Claude AI chatbot experienced outages on the morning of June 4, 2024, coinciding with disruptions in ChatGPT and impacting dependent services. Although the cause of the outage was not disclosed, it highlights the risk of simultaneous failures across multiple AI platforms.</p><p><strong>6. Perplexity AI Outages (June 2024): </strong>Perplexity AI, recognized for its AI-powered search capabilities, experienced service disruptions that same month. The platform displayed messages about reaching its capacity limit, indicating the outage likely resulted from an overload due to high demand. This highlights the critical need for scalable infrastructure to meet growing market demand.&nbsp;&nbsp;</p><p><strong>Implications of Centralized AI Service Dependencies:</strong></p><ul><li><p><strong>Single Point of Failure:</strong> Dependence on a sole provider can lead to widespread disruptions if that provider experiences issues.</p></li><li><p><strong>Operational Risks:</strong> Outages can halt business operations, leading to financial losses and reputational damage.</p></li><li><p><strong>Data Privacy Concerns:</strong> Centralized data storage increases the risk of large-scale breaches.</p></li></ul><div class="relative header-and-anchor"><h4 id="h-historical-mitigation-for-centralized-outages"><strong>Historical Mitigation for Centralized Outages</strong></h4></div><p>To address the risks posed by outages at centralized AI service providers, such as those outlined above, various mitigation strategies have been employed. One prominent approach is the use of platforms like <strong>OpenRouter</strong>, which enable routing across multiple AI providers. While this offers a level of redundancy and operational continuity, it also introduces challenges that highlight the limitations of current solutions.</p><div class="relative header-and-anchor"><h5 id="h-openrouter-a-temporary-fix-for-routing">OpenRouter: A Temporary Fix for Routing</h5></div><p>OpenRouter serves as middleware that routes requests dynamically between different AI providers (e.g., OpenAI, Anthropic, and others). In the event of an outage at one provider, requests can be redirected to another, maintaining functionality. Despite its benefits, OpenRouter is essentially a stopgap, and it introduces a significant technical issue: <strong>non-unified Key-Value (KV) caches</strong>.</p><p>KV caches store intermediate states (e.g., previous token activations) to speed up the processing of subsequent tokens, serving as a ‘memory’ for subsequent requests. However, KV caches are not standardized across providers, meaning data cached by one provider cannot be reused by another when requests are rerouted. This results in higher computational costs, increased latency, and lost context when providers are swapped.&nbsp;</p>
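<p>The sketch below illustrates the problem (it is not OpenRouter’s actual implementation; the provider names and cache layout are hypothetical). Each provider keeps its own opaque, session-keyed KV cache, so a failover forces the new provider to re-process the entire prompt from scratch:</p><pre><code># Illustrative sketch only: per-provider KV caches are not shared,
# so rerouting a session discards all previously cached work.
class Provider:
    def __init__(self, name):
        self.name = name
        self.kv_cache = {}  # session_id -> tokens already processed (opaque to others)

    def generate(self, session_id, prompt_tokens):
        cached = self.kv_cache.get(session_id, [])
        to_compute = len(prompt_tokens) - len(cached)  # only uncached tokens need a forward pass
        print(f"{self.name}: reusing {len(cached)} cached tokens, computing {to_compute}")
        self.kv_cache[session_id] = list(prompt_tokens)
        return "output"

a, b = Provider("provider_a"), Provider("provider_b")
turn1 = list(range(500))          # a 500-token opening prompt
a.generate("s1", turn1)           # provider_a: reusing 0, computing 500
turn2 = turn1 + list(range(100))  # follow-up turn appends 100 tokens
a.generate("s1", turn2)           # provider_a: reusing 500, computing 100
b.generate("s1", turn2)           # rerouted after an outage: reusing 0, computing all 600
</code></pre>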
<p>Looking towards the future, shifting from centralized AI providers to decentralized, distributed inference systems will minimize reliance on single points of failure. Establishing a unified, common format for KV caches across providers would additionally allow cached data to be shared seamlessly. The following sections explore this topic.</p><div class="relative header-and-anchor"><h2 id="h-introducing-horizontal-scaling-with-pipeline-parallelism">Introducing Horizontal Scaling with Pipeline Parallelism</h2></div><p>As the demand for larger, more powerful AI models grows, so too does the need for more scalable and efficient ways to perform inference. Enter <strong>pipeline parallelism</strong>, a technique that embraces horizontal scaling by distributing model computation across multiple devices.&nbsp;</p><p>Pipeline parallelism is not novel: it was originally developed to maximize utilization within a single computing system, achieving high throughput by efficiently partitioning and overlapping tasks across multiple GPUs in the same machine. With the advent of larger models and distributed systems, pipeline parallelism has evolved, incorporating more recent optimization techniques that make it an effective tool for scaling AI inference across distributed computing environments.</p><p>To understand its significance, let’s dive deeper into what pipeline parallelism is, how it works, and how it can change the landscape of AI inference.</p><hr><div class="relative header-and-anchor"><h4 id="h-what-is-pipeline-parallelism"><strong>What is Pipeline Parallelism?</strong></h4></div><p>If vertical scaling is one worker, or one factory, building an entire product from start to finish in a monolithic process, pipeline parallelism is an assembly line of factories: each builds one part of the product, which becomes progressively more complete as it moves down the line.&nbsp;</p><p>Pipeline parallelism applies this concept to AI inference:</p><ul><li><p>A model’s computations are divided into sequential stages, with each stage assigned to a different device.</p></li><li><p>As the input data flows through the pipeline, each device processes its assigned portion before passing the results to the next device in line.</p></li></ul><p>To implement pipeline parallelism, a model’s computational graph (the representation of its operations) is divided into segments. Each segment corresponds to a stage in the pipeline, which is handled by a specific GPU or computational node.
Here’s an example:</p><ol><li><p><strong>Input Embeddings</strong>: The first GPU processes the input data, such as converting text or images into numerical embeddings.</p></li><li><p><strong>Hidden Layers</strong>: The embeddings are passed to the next GPU, which performs calculations for a subset of the model’s layers.</p></li><li><p><strong>Output Generation</strong>: After flowing through all stages, the final device produces the output, whether it’s text, an image, or a classification.</p></li></ol><p>This sequential processing enables multiple devices to work on different parts of the computation simultaneously, optimizing resource usage and reducing bottlenecks.</p>
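<p>As a rough sketch of the idea (device IDs, layer counts, and the even split are illustrative, and real systems also pipeline micro-batches so every stage stays busy), here is a minimal PyTorch partition of a model into sequential stages, one per GPU:</p><pre><code># Minimal pipeline-parallel sketch: split a stack of blocks into
# sequential stages, each pinned to its own device. Assumes 4 visible GPUs.
import torch
import torch.nn as nn

class PipelineStage(nn.Module):
    def __init__(self, layers, device):
        super().__init__()
        self.body = nn.Sequential(*layers).to(device)
        self.device = device

    def forward(self, x):
        return self.body(x.to(self.device))  # move activations onto this stage's device

hidden = 1024
blocks = [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for _ in range(8)]
devices = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]

# Each stage owns 2 of the 8 blocks; stage 0 plays the role of the
# embedding front, stage 3 the output head.
stages = [PipelineStage(blocks[2 * i:2 * i + 2], d) for i, d in enumerate(devices)]

def pipeline_forward(x):
    for stage in stages:  # strictly sequential here; micro-batching would overlap stages
        x = stage(x)
    return x

out = pipeline_forward(torch.randn(4, hidden))
</code></pre>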
<div class="relative header-and-anchor"><h2 id="h-advantages-of-horizontal-scaling-with-pipeline-parallelism">Advantages of Horizontal Scaling with Pipeline Parallelism&nbsp;</h2></div><p>When horizontally scaling an AI inference system using pipeline parallelism across multiple GPUs, the benefits are amplified, especially for large-scale models that cannot fit or run efficiently on a single machine.</p><div class="relative header-and-anchor"><h3 id="h-1-scalability-flexibility-and-adaptability"><strong>1. Scalability, Flexibility, and Adaptability</strong></h3></div><ul><li><p><strong>Support for Large Models</strong>: Pipeline parallelism splits the model into stages distributed across multiple GPUs, allowing inference of models that exceed the memory and compute capacity of any single GPU or machine.</p></li><li><p><strong>Dynamic Expansion</strong>: Additional GPUs can be integrated into the pipeline to handle increasing workloads or deploy more model partitions.&nbsp;</p></li><li><p><strong>Elastic Workload Distribution</strong>: Pipeline stages can be adjusted dynamically to balance workloads across GPUs, ensuring that no single GPU becomes a bottleneck.</p></li><li><p><strong>Modular Design</strong>: Changes to one stage of the pipeline (e.g. updating a layer or swapping hardware) can be made without affecting the entire system.</p></li></ul><hr><div class="relative header-and-anchor"><h3 id="h-2-reliability-and-redundancy"><strong>2. Reliability and Redundancy</strong></h3></div><ul><li><p><strong>Node-Level Redundancy</strong>: Multiple GPUs can be allocated to the same pipeline stage, ensuring that a failure in one GPU doesn’t halt the entire stage. Input data and intermediate activations can additionally be replicated across nodes, reducing the risk of data loss during inference.</p></li><li><p><strong>Increased Uptime</strong>: Idle or underutilized GPUs can act as hot standby nodes, ready to take over when an active GPU fails.</p></li><li><p><strong>Graceful Degradation</strong>: In a multi-GPU setup, failures in one pipeline stage can be mitigated by redistributing tasks to other GPUs or reconfiguring the pipeline dynamically (e.g., high-speed interconnects like NVLink or InfiniBand allow failures in one communication path to be bypassed through alternate routes).</p></li><li><p><strong>Reduced Impact of Node Failures</strong>: Horizontal scaling ensures redundancy, preventing single GPU or machine failures from causing complete system downtime, as failures are isolated within a single stage.</p></li></ul><hr><div class="relative header-and-anchor"><h3 id="h-3-cost-effectiveness"><strong>3. Cost-Effectiveness</strong></h3></div><ul><li><p><strong>Efficient Use of Resources</strong>: Distributed pipeline systems can use GPUs with smaller memory capacity, reducing the need for expensive high-memory devices.</p></li><li><p><strong>Hybrid GPU Integration</strong>: High-end GPUs with large VRAM capacities, like the NVIDIA GH200, are expensive and often in short supply. Pipeline parallelism supports combining enterprise-grade GPUs (e.g. NVIDIA GH200) and consumer-grade GPUs (e.g. RTX 3090), balancing cost and performance in hybrid setups.</p></li></ul><hr><div class="relative header-and-anchor"><h3 id="h-4-throughput-latency-and-energy-consumption"><strong>4. Throughput, Latency, and Energy Consumption</strong></h3></div><ul><li><p><strong>Parallel Execution</strong>: Stages of the pipeline operate concurrently, processing multiple input batches in parallel, which significantly increases overall throughput.</p></li><li><p><strong>Optimized GPU Utilization</strong>: Each GPU focuses on specific parts of the model, ensuring all devices are used efficiently and consistently.</p></li><li><p><strong>Reduced Bottlenecks</strong>: By breaking the model into smaller pipeline stages, each GPU handles a fraction of the total computation, reducing the time per stage and overall latency for batch inference.</p></li><li><p><strong>Overlap of Computation and Communication</strong>: Pipeline parallelism allows concurrent data transfer and computation, hiding communication delays and minimizing idle time.</p></li><li><p><strong>Optimized Workload Balancing</strong>: Each GPU operates at its most efficient load, minimizing unnecessary power consumption.</p></li><li><p><strong>Selective Activation</strong>: Idle GPUs can remain powered down until required, reducing energy use during low-demand periods.</p></li></ul><p>With these benefits in mind, let’s dive further into how horizontal scaling can affect the landscape of AI inference.</p><hr><div class="relative header-and-anchor"><h2 id="h-unlocking-ai-with-consumer-grade-gpus-and-hybrid-approaches">Unlocking AI with Consumer-Grade GPUs and Hybrid Approaches</h2></div><p>The rise of massive AI models like <strong>Llama 3.1 (405B parameters)</strong> and <strong>Deepseek R1 671B </strong>has highlighted the growing need for efficient and scalable inference solutions. Traditionally, these workloads have been reserved for enterprise-grade GPUs like NVIDIA’s GH200 or H100. However, pipeline parallelism introduces an exciting opportunity: leveraging <strong>consumer-grade GPUs</strong> or <strong>hybrid approaches</strong> that combine consumer and enterprise-grade hardware to achieve high performance at a fraction of the cost.</p><div class="relative header-and-anchor"><h3 id="h-the-case-for-consumer-grade-gpus-in-ai-inference">The Case for Consumer-Grade GPUs in AI Inference</h3></div><p>Consumer-grade GPUs, such as NVIDIA’s <strong>RTX 3090 (24GB VRAM)</strong>, offer impressive computational power at significantly lower costs than their enterprise counterparts.
While these GPUs were not initially designed for multi-device AI workloads, pipeline parallelism makes it possible to use them effectively by distributing the workload across multiple devices.</p><div class="relative header-and-anchor"><h4 id="h-why-consider-consumer-grade-gpus"><strong>Why Consider Consumer-Grade GPUs?</strong></h4></div><ol><li><p><strong>Affordability</strong>:</p><ul><li><p>Consumer GPUs are often 5-10x cheaper than enterprise GPUs with comparable raw performance.</p></li></ul></li><li><p><strong>Availability</strong>:</p><ul><li><p>Consumer GPUs are widely available, making them a practical choice for organizations with budget constraints.</p></li></ul></li><li><p><strong>Hybrid Potential</strong>:</p><ul><li><p>Combining consumer-grade GPUs with enterprise-grade GPUs allows for cost-effective scaling while retaining high-end capabilities for bottleneck stages.</p></li></ul></li></ol><p>Let’s explore how pipeline parallelism enables these possibilities by comparing costs, latency, and throughput.</p><hr><p>Below are examples of configurations an enterprise might use with hybrid pipeline parallelism. Many enterprises currently run H100 clusters, so we will use that as the baseline and compare two very large models. The figures that follow are theoretical estimates of potential hybrid setups.</p><div class="relative header-and-anchor"><h3 id="h-gpu-configurations">GPU Configurations</h3></div><p>Calculations (overhead is spare VRAM relative to the model’s requirement, reproduced programmatically below):</p><p>Llama 3.1 405B (8-bit) requires ~486GB of GPU memory. [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.substratus.ai/blog/llama-3-1-405b-gpu-requirements">Source</a>]</p><p><code>8 x 80GB = 640GB ~ 31% Overhead</code></p><p><code>24 x 24GB = 576GB ~ 18% Overhead</code></p><p><code>4 x 80GB + 12 x 24GB = 608GB ~ 25% Overhead</code></p><p>Deepseek R1 671B requires ~1,342GB of GPU memory. [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://dev.to/askyt/deepseek-r1-671b-complete-hardware-requirements-optimal-deployment-setup-2e48">Source</a>]</p><p><code>20 x 80GB = 1600GB ~ 19% Overhead</code></p><p><code>64 x 24GB = 1536GB ~ 14% Overhead</code></p><p><code>10 x 80GB + 32 x 24GB = 1568GB ~ 17% Overhead</code></p>
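<p>A small script makes these overhead figures reproducible (capacities and requirements are the same numbers used above):</p><pre><code># Reproduces the sizing above: overhead = spare VRAM relative to the
# model's memory requirement.
H100_GB, RTX3090_GB = 80, 24

configs = {
    "Enterprise": {"h100": 20, "rtx3090": 0},
    "Consumer":   {"h100": 0,  "rtx3090": 64},
    "Hybrid":     {"h100": 10, "rtx3090": 32},
}

required_gb = 1342  # Deepseek R1 671B, per the source above
for name, c in configs.items():
    total = c["h100"] * H100_GB + c["rtx3090"] * RTX3090_GB
    overhead = (total - required_gb) / required_gb
    print(f"{name:10s} {total:5d}GB  ~{overhead:.0%} overhead")

# Enterprise  1600GB  ~19% overhead
# Consumer    1536GB  ~14% overhead
# Hybrid      1568GB  ~17% overhead
</code></pre>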
[<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://dev.to/askyt/deepseek-r1-671b-complete-hardware-requirements-optimal-deployment-setup-2e48">Source</a>] </p><p><code>20 x 80GB = 1600GB ~ 19% Overhead</code></p><p><code>64 x 24GB = 1546GB ~ 14% Overhead</code></p><p><code>10 x 80GB + 32 x 24GB = 1568GB ~ 17% Overhead</code></p><p></p><table style="min-width: 332px"><colgroup><col><col style="width: 282px"><col></colgroup><tbody><tr><td colspan="1" rowspan="1"><p>Configuration</p></td><td colspan="1" rowspan="1" colwidth="282"><p>Llama 3.1 405B (8 Bit)</p></td><td colspan="1" rowspan="1"><p>Deepseek R1 671B</p></td></tr><tr><td colspan="1" rowspan="1"><p>Enterprise</p></td><td colspan="1" rowspan="1" colwidth="282"><p>8 x H100 (80GB vRAM)</p></td><td colspan="1" rowspan="1"><p>20 x H100 (80GB vRAM)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Consumer</p></td><td colspan="1" rowspan="1" colwidth="282"><p>24 x RTX 3090 (24GB vRAM)</p></td><td colspan="1" rowspan="1"><p>64 x RTX 3090 (24GB vRAM)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Hybrid</p></td><td colspan="1" rowspan="1" colwidth="282"><p>4 x H100 + 12 x RTX 3090</p></td><td colspan="1" rowspan="1"><p>10 * H100 + 32 x RTX 3090</p></td></tr></tbody></table><div class="relative header-and-anchor"><h3 id="h-upfront-cost-buy">Upfront Cost (Buy)</h3></div><p>H100 cost per unit: $27,988 [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/NVIDIA-Hopper-Graphics-5120-Bit-Learning/dp/B0CXBNNNSD">Source</a>] </p><p>RTX 3090 cost per unit: $1,790[<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/NVIDIA-RTX-3090-Founders-Graphics/dp/B08HR6ZBYJ">Source</a>]</p><table style="min-width: 75px"><colgroup><col><col><col></colgroup><tbody><tr><td colspan="1" rowspan="1"><p>Configuration</p></td><td colspan="1" rowspan="1"><p>Llama 3.1 405B&nbsp;(8-bit)</p></td><td colspan="1" rowspan="1"><p>Deepseek R1 671B</p></td></tr><tr><td colspan="1" rowspan="1"><p>Enterprise</p></td><td colspan="1" rowspan="1"><p><strong>$223,904</strong> (8 × $27,988)</p></td><td colspan="1" rowspan="1"><p><strong>$559,760</strong> (20 × $27,988)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Consumer</p></td><td colspan="1" rowspan="1"><p><strong>$42,960</strong> (24 x $1,790)</p></td><td colspan="1" rowspan="1"><p><strong>$114,560</strong> (64 x $1,790)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Hybrid</p></td><td colspan="1" rowspan="1"><p><strong>$133,432 </strong>(4  x $27,988 + 12 x $1,790)</p></td><td colspan="1" rowspan="1"><p><strong>$337,160</strong> ( 10 x $27,988 + 32 * $1,790)</p></td></tr></tbody></table><div class="relative header-and-anchor"><h3 id="h-monthly-cost-rent">Monthly Cost (Rent)</h3></div><p>H100 cost per unit per hour: $2.13 [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/NVIDIA-Hopper-Graphics-5120-Bit-Learning/dp/B0CXBNNNSDhttps://vast.ai/pricing/gpu/H100-PCIE">Source</a>] </p><p>RTX 3090 cost per unit: $0.18 [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/NVIDIA-RTX-3090-Founders-Graphics/dp/B08HR6ZBYJhttps://vast.ai/">Source</a>]</p><p>At an assumed 720 usage hours per month (24 hours/day x 30 days)</p><table style="min-width: 75px"><colgroup><col><col><col></colgroup><tbody><tr><td colspan="1" rowspan="1"><p>Configuration</p></td><td colspan="1" 
rowspan="1"><p>Llama 3.1 405B&nbsp;(8-bit)</p></td><td colspan="1" rowspan="1"><p>Deepseek R1 671B</p></td></tr><tr><td colspan="1" rowspan="1"><p>Enterprise</p></td><td colspan="1" rowspan="1"><p><strong>$12,268.80 </strong>($2.13 x 8 x 720)</p></td><td colspan="1" rowspan="1"><p><strong>$30,672.00</strong> (20 x $2.13 x 720)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Consumer</p></td><td colspan="1" rowspan="1"><p><strong>$3,110.40 </strong>(24 x $0.18 x 720)</p></td><td colspan="1" rowspan="1"><p><strong>$8,294.40</strong> (64 x $0.18 x 720)</p></td></tr><tr><td colspan="1" rowspan="1"><p>Hybrid</p></td><td colspan="1" rowspan="1"><p><strong>$7,689.60</strong> (4 x $2.13 x 720 + 12 x $0.18 x 720)</p></td><td colspan="1" rowspan="1"><p><strong>$19,483.20</strong> (10 x $2.13 x 720 + 32 x $0.18 x 720)</p></td></tr></tbody></table><div class="relative header-and-anchor"><h3 id="h-theoretical-comparisons">Theoretical Comparisons</h3></div><p>Llama 3.1 405B (8-Bit): [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://developer.nvidia.com/blog/boosting-llama-3-1-405b-throughput-by-another-1-5x-on-nvidia-h200-tensor-core-gpus-and-nvlink-switch/">Source</a>]</p><p>Using the source data, we find that with pipeline parallelism with a H200 setup results in 764 tokens/second. </p><p>H200s are approximately 50% more performant than H100s, HBM3e memory bandwidth in H200 is <strong>4.8 TB/s</strong>, while H100 has <strong>3.35 TB/s</strong> (H200 is ~1.43x faster in memory access).</p><p><code>764 tokens/sec x 66.67% =  509.33 tokens/sec</code></p><p>Deepseek R1 671B: [<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://blogs.nvidia.com/blog/deepseek-r1-nim-microservice/">Source</a>]</p><p>Using the source data, we find that with pipeline parallelism with a H200 setup results in 3872 tokens/second. 
<table style="min-width: 536px"><colgroup><col style="width: 120px"><col><col style="width: 107px"><col><col style="width: 124px"><col style="width: 110px"><col></colgroup><tbody><tr><td colspan="1" rowspan="1" colwidth="120"><p>Configuration</p></td><td colspan="3" rowspan="1"><p>Llama 3.1 405B (8-bit)</p></td><td colspan="3" rowspan="1"><p>Deepseek R1 671B</p></td></tr><tr><td colspan="1" rowspan="1" colwidth="120"><p></p></td><td colspan="1" rowspan="1"><p>throughput (tok/sec)</p></td><td colspan="1" rowspan="1" colwidth="107"><p>latency (ms)</p></td><td colspan="1" rowspan="1"><p>monthly cost (USD)</p></td><td colspan="1" rowspan="1" colwidth="124"><p>throughput (tok/sec)</p></td><td colspan="1" rowspan="1" colwidth="110"><p>latency (ms)</p></td><td colspan="1" rowspan="1"><p>monthly cost (USD)</p></td></tr><tr><td colspan="1" rowspan="1" colwidth="120"><p>Enterprise</p></td><td colspan="1" rowspan="1"><p>~510</p></td><td colspan="1" rowspan="1" colwidth="107"><p>~250ms</p></td><td colspan="1" rowspan="1"><p><strong>$12,268.80</strong></p></td><td colspan="1" rowspan="1" colwidth="124"><p>~2580</p></td><td colspan="1" rowspan="1" colwidth="110"><p>~50ms</p></td><td colspan="1" rowspan="1"><p><strong>$30,672.00</strong></p></td></tr><tr><td colspan="1" rowspan="1" colwidth="120"><p>Consumer</p></td><td colspan="1" rowspan="1"><p>~340</p></td><td colspan="1" rowspan="1" colwidth="107"><p>~380ms</p></td><td colspan="1" rowspan="1"><p><strong>$3,110.40</strong></p></td><td colspan="1" rowspan="1" colwidth="124"><p>~1700</p></td><td colspan="1" rowspan="1" colwidth="110"><p>~75ms</p></td><td colspan="1" rowspan="1"><p><strong>$8,294.40</strong></p></td></tr><tr><td colspan="1" rowspan="1" colwidth="120"><p>Hybrid</p></td><td colspan="1" rowspan="1"><p>~435</p></td><td colspan="1" rowspan="1" colwidth="107"><p>~295ms</p></td><td colspan="1" rowspan="1"><p><strong>$7,689.60</strong></p></td><td colspan="1" rowspan="1" colwidth="124"><p>~2200</p></td><td colspan="1" rowspan="1" colwidth="110"><p>~60ms</p></td><td colspan="1" rowspan="1"><p><strong>$19,483.20</strong></p></td></tr></tbody></table><p></p><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/ffa187811017d1134ffa3132d3193220.png" class="image-node embed"></figure><p></p><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/7bd46b0dc22880a5f1d7be9f438bd55c.png" class="image-node embed"></figure><p></p><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/b0fcef4b80d70eb13b8bfe3ce366293c.png" alt="Output image" class="image-node embed"></figure><p></p><div class="relative header-and-anchor"><h3 id="h-limitations-of-pipeline-parallelism"><strong>Limitations of Pipeline Parallelism</strong></h3></div><p>While <strong>pipeline parallelism</strong> allows AI inference to scale horizontally, it comes with a significant challenge: <strong>inter-node communication bandwidth</strong>. Unlike vertical scaling, where all computations happen within a single high-memory GPU, distributed inference requires frequent data transfers between devices, which can quickly become a bottleneck.</p><p><strong>Minimizing these transfers</strong> is crucial for efficiency. Techniques like <strong>activation checkpointing, fused communication, and tensor rematerialization</strong> help reduce bandwidth overhead by strategically managing how and when data moves between nodes. However, even with these optimizations, factors like <strong>network topology, PCIe/NVLink speeds, and interconnects such as InfiniBand</strong> can impact overall performance.</p><p>As models grow, the balance between <strong>compute efficiency and communication overhead</strong> becomes increasingly difficult. High-bandwidth, low-latency connections are essential to maintaining smooth inference, but the scalability of pipeline parallelism will always be constrained by how effectively inter-node communication is handled.</p>
<div class="relative header-and-anchor"><h2 id="h-key-takeaways">Key Takeaways</h2></div><ol><li><p><strong>Balanced Cost-Performance Tradeoff</strong>:</p><ul><li><p>Hybrid setups combine enterprise GPUs (e.g., NVIDIA GH200, H100) and consumer GPUs (e.g., NVIDIA RTX 3090), balancing the high upfront and operational costs of enterprise solutions with the affordability of consumer-grade hardware.</p></li><li><p>This configuration provides a cost-efficient way to scale horizontally without compromising significantly on performance.</p></li></ul></li><li><p><strong>Increased Reliability and Fault Tolerance:</strong></p><ul><li><p>By leveraging multiple GPUs across consumer and enterprise tiers, hybrid systems reduce the risk of single points of failure, ensuring higher availability and redundancy for mission-critical applications.</p></li><li><p>If a GPU or node fails, workloads can be dynamically reallocated to maintain continuity, providing strong fault tolerance.</p></li></ul></li><li><p><strong>Competitive Throughput with Optimized Parallelism</strong>:</p><ul><li><p>Hybrid setups benefit greatly from <strong>pipeline parallelism</strong> and other distributed inference optimizations, achieving throughput (tokens/sec) competitive with fully enterprise configurations; with proper parallelism, hybrid setups may even improve on it.</p></li><li><p>These optimizations mitigate the bottlenecks typically associated with consumer GPUs, such as memory limitations and interconnect latency.</p></li></ul></li><li><p><strong>Scalability Across Diverse Workloads</strong>:</p><ul><li><p>Hybrid setups provide flexibility in scaling horizontally to accommodate growing model sizes and inference demands.</p></li><li><p>They are particularly effective for large-scale models like <strong>Llama 3.1 405B</strong> and <strong>Deepseek R1 671B</strong>, where pure consumer setups might fall short and enterprise-only setups could be cost-prohibitive.</p></li></ul></li><li><p><strong>Future-Proofing for AI Workflows</strong>:</p><ul><li><p>As model sizes continue to grow, hybrid setups provide a scalable and adaptable architecture that can evolve with technological advancements in both consumer and enterprise hardware.</p></li><li><p>They enable organizations to experiment with state-of-the-art models without committing fully to high-cost enterprise solutions.</p></li></ul></li></ol><hr><p></p><p>Hybrid setups represent a <strong>pragmatic approach to horizontally scaling AI inference</strong>, making high-performance AI accessible to a wider range of organizations while optimizing for cost, latency, and throughput.</p><p><strong>Function Network </strong>enables this scalability by providing decentralized AI infrastructure, allowing models to run efficiently across distributed compute resources with seamless optimization for performance and cost. To learn more about how Function Network facilitates <strong>efficient decentralized AI inference</strong>, check out our deep dive on how we're tackling distributed AI inference.</p>]]></content:encoded>
            <author>function.network@newsletter.paragraph.com (Alex Mo)</author>
            <category>ai</category>
            <category>inference</category>
            <category>function</category>
            <category>distributed</category>
            <enclosure url="https://storage.googleapis.com/papyrus_images/4e7b58cb26a10ccb2896704c44dccb69.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>