With the rapid development of artificial intelligence (AI), a growing number of large language model (LLM) applications demand efficient computational resources. In this article, we explore how to integrate APUS's GPU Extension into the AO network to support more powerful AI model inference.
Before delving into how GPU extensions work in the AO network, let's briefly review how typical AI applications operate and the composition of the AO network to lay the groundwork for the subsequent discussion.
The operation of a typical LLM application involves multiple technical layers:

Application Layer: End-user applications such as Midjourney, Jasper, and GitHub Copilot. These applications meet user needs by calling underlying foundational models.
Foundational Models: These include closed-source models (e.g., GPT-3) and open-source models (e.g., Stable Diffusion). Closed-source models are typically accessed via APIs, while open-source models are released as pre-trained weights for users to deploy and use freely.
Cloud Platforms: Cloud computing platforms (e.g., AWS, GCP, Azure) provide environments for developers to deploy and run foundational models, supporting large-scale computational and storage needs.
Computing Hardware: Underlying hardware (e.g., NVIDIA GPUs, Google TPUs) provides the computational power needed for model training and inference.
The AO network consists of the following key components, which collectively enable distributed computing and message exchange:

Processes: Independent units of computation, each maintaining its own computing environment and state.
Messages: Standardized data used to exchange information between processes.
Scheduler Units (SUs): Responsible for assigning unique slot numbers to messages and ensuring their storage on Arweave.
Compute Units (CUs): Provide computational services and parse process states.
Messenger Units (MUs): Forward messages across the network, coordinating communication between processes.
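The division of labor among these units can be sketched as a minimal simulation. The class and field names below are illustrative stand-ins, not the actual AO protocol: the Scheduler Unit assigns monotonically increasing slot numbers per process (with Arweave persistence stubbed out as an in-memory log), and the Compute Unit replays messages in slot order to derive process state.

```python
class SchedulerUnit:
    """Assigns a unique, monotonically increasing slot number to each
    incoming message per process; storage on Arweave is stubbed as a log."""
    def __init__(self):
        self._counters = {}
        self._log = []  # stands in for Arweave persistence

    def schedule(self, process_id, message):
        slot = self._counters.get(process_id, 0)
        self._counters[process_id] = slot + 1
        self._log.append((process_id, slot, message))
        return slot

class ComputeUnit:
    """Applies scheduled messages to a process state in slot order."""
    def evaluate(self, state, scheduled):
        for _pid, _slot, msg in sorted(scheduled, key=lambda m: m[1]):
            state = state + [msg["data"]]
        return state

su = SchedulerUnit()
su.schedule("proc-1", {"data": "hello"})
su.schedule("proc-1", {"data": "world"})
cu = ComputeUnit()
state = cu.evaluate([], [m for m in su._log if m[0] == "proc-1"])
print(state)  # messages applied in slot order
```

The key property this sketch preserves is that slot assignment is the single source of ordering: any Compute Unit replaying the same log reaches the same state.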
Based on this, the ao.TN.1 version of AO implements:
WASM-based Virtual Machine Environment: Supports up to 4 GB of RAM, providing an execution environment for processes.
Lua Runtime Environment (ao-lib): Compiled to WASM, enabling developers to easily develop AO processes using Lua.
Operating System Environment (aos): Users can interact with and operate the system through a Lua command-line interface.

To implement GPU-based inference for LLMs, we need to consider multiple technical layers to adapt the corresponding components in AO. This involves hardware support, software abstraction layers, and specific implementation interfaces.
First, the details of the inference tech stack are as follows:

Hardware Abstraction and Software Support: Hardware abstraction layers such as OpenCL and Vulkan, together with vendor-specific low-level interfaces such as CUDA, ROCm, and oneAPI, enable efficient cross-platform computing. This ensures that large models remain compatible and performant across different computing architectures.
Model Optimization and Parallel Frameworks: Tools like TensorRT, OpenVINO, and XLA optimize models to improve computational efficiency. Parallel computing frameworks like Ray and Horovod further accelerate training and inference, especially in large-scale cluster environments.
Deep Learning Frameworks: Mainstream frameworks like TensorFlow and PyTorch, along with interoperability formats like ONNX, provide flexible development environments, enabling easy model conversion and deployment.
In this context, WebAssembly (WASM), as AO's runtime, offers multiple integration paths for interfacing with GPUs. While WASM itself cannot invoke GPUs directly, system interfaces built on WASI (the WebAssembly System Interface) can expose this capability. Currently, promising proposals such as wasi-nn and wasi-gfx provide different implementation paths.
wasi-nn focuses on supporting neural network inference. It allows developers to load and run neural network models for inference in a WebAssembly environment by calling system-level interfaces, without dealing with low-level details. By providing simple model and tensor operations, wasi-nn enables developers to focus on inference logic, while relying on underlying backend implementations. This design allows wasi-nn to integrate with deep learning engines like TensorFlow Lite and ONNX Runtime as its inference foundation.
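The wasi-nn interface boils down to a short load / set-input / compute / get-output cycle. The sketch below mimics that calling convention in Python with a stub backend; the real interface is a WASI host API (typically consumed from Rust or C guests), and the doubling "model" here is purely illustrative.

```python
def load(model_bytes, encoding, target):
    """Mirrors wasi-nn's load(): returns an opaque graph handle."""
    return {"bytes": model_bytes, "encoding": encoding, "target": target}

class NnContext:
    """Stub execution context mimicking the wasi-nn call sequence:
    load -> init_execution_context -> set_input -> compute -> get_output."""
    def __init__(self, graph):
        self.graph = graph   # opaque handle to a loaded model
        self.inputs = {}
        self.outputs = {}

    def set_input(self, index, tensor):
        self.inputs[index] = tensor

    def compute(self):
        # A real backend (e.g. ONNX Runtime, TensorFlow Lite) runs here;
        # this stub just doubles every input value.
        for i, t in self.inputs.items():
            self.outputs[i] = [2 * x for x in t]

    def get_output(self, index):
        return self.outputs[index]

graph = load(b"...model weights...", encoding="onnx", target="gpu")
ctx = NnContext(graph)
ctx.set_input(0, [1.0, 2.0, 3.0])
ctx.compute()
print(ctx.get_output(0))  # [2.0, 4.0, 6.0]
```

Note how the guest never touches GPU details: the `target` hint and the backend choice are resolved by the host, which is exactly what makes the interface portable.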

In LLM inference applications, the advantage of wasi-nn lies in its ability to easily integrate with mainstream deep learning backends, leveraging their optimization and acceleration capabilities. However, the performance of wasi-nn largely depends on the chosen backend libraries and hardware characteristics.
In contrast, wasi-gfx primarily targets graphics processing and rendering tasks. Its interface design aims to provide WebGPU support, enabling WebAssembly modules to perform high-performance graphics rendering. While wasi-gfx is not focused on AI inference, its underlying GPU call mechanisms and abstraction capabilities are advantageous for model visualization and graphical output.
For LLM inference, another potential use case for wasi-gfx is when inference tasks require intensive graphical interaction or display. This interface can serve as a foundation, enabling more efficient rendering of graphical results in user interfaces during computation.
When selecting a specific interface implementation, the following factors should be considered:
Use Case Differences: wasi-nn is more suitable for mainstream neural network model integration, while wasi-gfx is better for general-purpose graphics computing.
Underlying Implementation: The strength of wasi-nn lies in its integration with various deep learning inference engines, while wasi-gfx provides cross-platform GPU rendering interfaces.
Hardware Support: Both can leverage GPUs for performance optimization, but their target computation types differ: wasi-nn focuses on AI inference, while wasi-gfx focuses on graphics computing.
To enable GPU-based LLM inference in AO, APUS is developing a GPU Extension aimed at "Enabling Verifiable Decentralized AI through Deterministic GPU Computing" (https://r2krpzvyn24gq75rtedeo56vpiyxvcya2xsntoeaz7ursparocea.arweave.net/jpUX5rhuuGh_sZkGR3fVejF6iwDV5Nm4gM_pGTwRcIg).
Our approach is to integrate the GPU Extension as a pluggable WASI interface into AO's WASM runtime. Specifically:
Integration Method: Embed the GPU Extension into the WASM runtime (currently Node.js WebAssembly, potentially upgraded to HyperBEAM in the future), enabling AO processes to invoke GPU computations.
Interface Design: The GPU Extension provides interfaces similar to wasi-nn for module usage but is not limited to this. We may design GPU computing interfaces more suited to the AO network based on actual needs.
Implementation Approach: At the underlying level, efficient inference engines like llama.cpp can be used to support mainstream LLMs. Additionally, ensuring deterministic and verifiable GPU computing is crucial to meet the requirements of a decentralized network.
Technical Challenges: Addressing GPU computation consistency in distributed environments and leveraging GPU acceleration for AI inference without compromising security are key challenges.
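One way the determinism requirement above could be checked is by attestation: run inference, hash the output, and let any other node recompute the digest. The sketch below uses a toy hash-based "model" as a stand-in for a real engine such as llama.cpp; the `attest` helper and its digest scheme are hypothetical, shown only to illustrate the verification pattern.

```python
import hashlib
import json

def run_inference(prompt, seed=0):
    """Stand-in for a llama.cpp-style inference call; a deterministic
    engine must return byte-identical output for identical inputs."""
    h = hashlib.sha256(f"{prompt}:{seed}".encode()).hexdigest()
    return [h[i:i + 8] for i in range(0, 32, 8)]  # toy pseudo-tokens

def attest(prompt, seed=0):
    """Produce a digest another node can recompute to verify the result."""
    output = run_inference(prompt, seed)
    digest = hashlib.sha256(json.dumps(output).encode()).hexdigest()
    return output, digest

out_a, digest_a = attest("hello ao")
out_b, digest_b = attest("hello ao")
assert digest_a == digest_b  # identical inputs -> identical attestation
```

The hard part in practice is making `run_inference` itself deterministic on real GPUs, where floating-point reduction order and kernel selection can vary between runs and devices.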

We believe that introducing the GPU Extension will give the AO network the capability to run GPU-based LLMs, broadening its range of applications. In future articles, we will delve into the technical details of the GPU Extension, including how to ensure computational determinism, adapt to different hardware environments, and optimize performance in practical applications.
Apus Network's GPU Extension introduces powerful GPU computing capabilities to the AO network, enabling it to support complex AI inference tasks. This extension not only enhances AO's computational performance but also opens new possibilities for decentralized AI. We look forward to collaborating with the community to refine and apply this technology.