With the rapid development of artificial intelligence (AI), a growing number of large language model (LLM) applications demand efficient computational resources. In this article, we explore how to integrate APUS's GPU Extension into the AO network to support more powerful AI model inference.
Before delving into how GPU extensions work in the AO network, let's briefly review how typical AI applications operate and the composition of the AO network to lay the groundwork for the subsequent discussion.
The operation of a typical LLM application involves multiple technical layers:

Application Layer: End-user applications such as Midjourney, Jasper, and GitHub Copilot. These applications meet user needs by calling underlying foundational models.
Foundational Models: These include closed-source models (e.g., GPT-3) and open-source models (e.g., Stable Diffusion). Closed-source models are typically accessed via APIs, while open-source models are released as pre-trained weights for users to deploy and use freely.
Cloud Platforms: Cloud computing platforms (e.g., AWS, GCP, Azure) provide environments for developers to deploy and run foundational models, supporting large-scale computational and storage needs.
Computing Hardware: Underlying hardware (e.g., NVIDIA GPUs, Google TPUs) provides the computational power needed for model training and inference.
The AO network consists of the following key components, which collectively enable distributed computing and message exchange:

Processes: Independent units of computation, each maintaining its own computing environment and state.
Messages: Standardized data used to exchange information between processes.
Scheduler Units (SUs): Responsible for assigning unique slot numbers to messages and ensuring their storage on Arweave.
Compute Units (CUs): Provide computational services and parse process states.
Messenger Units (MUs): Forward messages across the network, coordinating communication between processes.
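The division of labor among these units can be sketched as a minimal simulation. The class and field names below are illustrative stand-ins, not the actual AO protocol: the Scheduler Unit assigns monotonically increasing slot numbers per process (with Arweave persistence stubbed out as an in-memory log), and the Compute Unit replays messages in slot order to derive process state.

```python
class SchedulerUnit:
    """Assigns a unique, monotonically increasing slot number to each
    incoming message per process; storage on Arweave is stubbed as a log."""
    def __init__(self):
        self._counters = {}
        self._log = []  # stands in for Arweave persistence

    def schedule(self, process_id, message):
        slot = self._counters.get(process_id, 0)
        self._counters[process_id] = slot + 1
        self._log.append((process_id, slot, message))
        return slot

class ComputeUnit:
    """Applies scheduled messages to a process state in slot order."""
    def evaluate(self, state, scheduled):
        for _pid, _slot, msg in sorted(scheduled, key=lambda m: m[1]):
            state = state + [msg["data"]]
        return state

su = SchedulerUnit()
su.schedule("proc-1", {"data": "hello"})
su.schedule("proc-1", {"data": "world"})
cu = ComputeUnit()
state = cu.evaluate([], [m for m in su._log if m[0] == "proc-1"])
print(state)  # messages applied in slot order
```

The key property this sketch preserves is that slot assignment is the single source of ordering: any Compute Unit replaying the same log reaches the same state.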
Based on this, the ao.TN.1 version of AO implements:
WASM-based Virtual Machine Environment: Supports up to 4 GB of RAM, providing an execution environment for processes.
Lua Runtime Environment (ao-lib): Compiled to WASM, enabling developers to easily develop AO processes using Lua.
Operating System Environment (aos): Users can interact with and operate the system through a Lua command-line interface.

To implement GPU-based inference for LLMs, we need to consider multiple technical layers to adapt the corresponding components in AO. This involves hardware support, software abstraction layers, and specific implementation interfaces.
First, the details of the inference tech stack are as follows:

Hardware Abstraction and Software Support: Hardware abstraction layers such as OpenCL and Vulkan, together with vendor-specific low-level interfaces such as CUDA, ROCm, and oneAPI, enable efficient cross-platform computing. This ensures that large models remain compatible and performant across different computing architectures.
Model Optimization and Parallel Frameworks: Tools like TensorRT, OpenVINO, and XLA optimize models to improve computational efficiency. Parallel computing frameworks like Ray and Horovod further accelerate training and inference, especially in large-scale cluster environments.
Deep Learning Frameworks: Mainstream frameworks like TensorFlow and PyTorch, along with interoperability formats like ONNX, provide flexible development environments, enabling easy model conversion and deployment.
In this context, WebAssembly (WASM), as AO's runtime, offers multiple integration paths for interfacing with GPUs. While WASM itself cannot invoke GPUs directly, system interfaces built on WASI (the WebAssembly System Interface) can expose this capability. Currently, promising proposals such as wasi-nn and wasi-gfx provide different implementation paths.
wasi-nn focuses on supporting neural network inference. It allows developers to load and run neural network models for inference in a WebAssembly environment by calling system-level interfaces, without dealing with low-level details. By providing simple model and tensor operations, wasi-nn enables developers to focus on inference logic, while relying on underlying backend implementations. This design allows wasi-nn to integrate with deep learning engines like TensorFlow Lite and ONNX Runtime as its inference foundation.
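The wasi-nn interface boils down to a short load / set-input / compute / get-output cycle. The sketch below mimics that calling convention in Python with a stub backend; the real interface is a WASI host API (typically consumed from Rust or C guests), and the doubling "model" here is purely illustrative.

```python
def load(model_bytes, encoding, target):
    """Mirrors wasi-nn's load(): returns an opaque graph handle."""
    return {"bytes": model_bytes, "encoding": encoding, "target": target}

class NnContext:
    """Stub execution context mimicking the wasi-nn call sequence:
    load -> init_execution_context -> set_input -> compute -> get_output."""
    def __init__(self, graph):
        self.graph = graph   # opaque handle to a loaded model
        self.inputs = {}
        self.outputs = {}

    def set_input(self, index, tensor):
        self.inputs[index] = tensor

    def compute(self):
        # A real backend (e.g. ONNX Runtime, TensorFlow Lite) runs here;
        # this stub just doubles every input value.
        for i, t in self.inputs.items():
            self.outputs[i] = [2 * x for x in t]

    def get_output(self, index):
        return self.outputs[index]

graph = load(b"...model weights...", encoding="onnx", target="gpu")
ctx = NnContext(graph)
ctx.set_input(0, [1.0, 2.0, 3.0])
ctx.compute()
print(ctx.get_output(0))  # [2.0, 4.0, 6.0]
```

Note how the guest never touches GPU details: the `target` hint and the backend choice are resolved by the host, which is exactly what makes the interface portable.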

In LLM inference applications, the advantage of wasi-nn lies in its ability to easily integrate with mainstream deep learning backends, leveraging their optimization and acceleration capabilities. However, the performance of wasi-nn largely depends on the chosen backend libraries and hardware characteristics.
In contrast, wasi-gfx primarily targets graphics processing and rendering tasks. Its interface design aims to provide WebGPU support, enabling WebAssembly modules to perform high-performance graphics rendering. While wasi-gfx is not focused on AI inference, its underlying GPU call mechanisms and abstraction capabilities are advantageous for model visualization and graphical output.
For LLM inference, another potential use case for wasi-gfx is when inference tasks require intensive graphical interaction or display. This interface can serve as a foundation, enabling more efficient rendering of graphical results in user interfaces during computation.
When selecting a specific interface implementation, the following factors should be considered:
Use Case Differences: wasi-nn is more suitable for mainstream neural network model integration, while wasi-gfx is better for general-purpose graphics computing.
Underlying Implementation: The strength of wasi-nn lies in its integration with various deep learning inference engines, while wasi-gfx provides cross-platform GPU rendering interfaces.
Hardware Support: Both can leverage GPUs for performance optimization, but their target computation types differ: wasi-nn focuses on AI inference, while wasi-gfx focuses on graphics computing.
To enable GPU-based LLM inference in AO, APUS is developing a GPU Extension aimed at "Enabling Verifiable Decentralized AI through Deterministic GPU Computing" (https://r2krpzvyn24gq75rtedeo56vpiyxvcya2xsntoeaz7ursparocea.arweave.net/jpUX5rhuuGh_sZkGR3fVejF6iwDV5Nm4gM_pGTwRcIg).
Our approach is to integrate the GPU Extension as a pluggable WASI interface into AO's WASM runtime. Specifically:
Integration Method: Embed the GPU Extension into the WASM runtime (currently Node.js WebAssembly, potentially upgraded to HyperBEAM in the future), enabling AO processes to invoke GPU computations.
Interface Design: The GPU Extension provides interfaces similar to wasi-nn for module usage but is not limited to this. We may design GPU computing interfaces more suited to the AO network based on actual needs.
Implementation Approach: At the underlying level, efficient inference engines like llama.cpp can be used to support mainstream LLMs. Additionally, ensuring deterministic and verifiable GPU computing is crucial to meet the requirements of a decentralized network.
Technical Challenges: Addressing GPU computation consistency in distributed environments and leveraging GPU acceleration for AI inference without compromising security are key challenges.
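One way the determinism requirement above could be checked is by attestation: run inference, hash the output, and let any other node recompute the digest. The sketch below uses a toy hash-based "model" as a stand-in for a real engine such as llama.cpp; the `attest` helper and its digest scheme are hypothetical, shown only to illustrate the verification pattern.

```python
import hashlib
import json

def run_inference(prompt, seed=0):
    """Stand-in for a llama.cpp-style inference call; a deterministic
    engine must return byte-identical output for identical inputs."""
    h = hashlib.sha256(f"{prompt}:{seed}".encode()).hexdigest()
    return [h[i:i + 8] for i in range(0, 32, 8)]  # toy pseudo-tokens

def attest(prompt, seed=0):
    """Produce a digest another node can recompute to verify the result."""
    output = run_inference(prompt, seed)
    digest = hashlib.sha256(json.dumps(output).encode()).hexdigest()
    return output, digest

out_a, digest_a = attest("hello ao")
out_b, digest_b = attest("hello ao")
assert digest_a == digest_b  # identical inputs -> identical attestation
```

The hard part in practice is making `run_inference` itself deterministic on real GPUs, where floating-point reduction order and kernel selection can vary between runs and devices.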

We believe that introducing the GPU Extension will give the AO network the capability to run GPU-based LLMs, broadening its range of applications. In future articles, we will delve into the technical details of the GPU Extension, including how to ensure computational determinism, adapt to different hardware environments, and optimize performance in practical applications.
Apus Network's GPU Extension introduces powerful GPU computing capabilities to the AO network, enabling it to support complex AI inference tasks. This extension not only enhances AO's computational performance but also opens new possibilities for decentralized AI. We look forward to collaborating with the community to refine and apply this technology.