
The boom in generative AI is compute-bound: adding more computation leads directly to a better product. Typically, the relationship between R&D investment and product value is indirect and clearly sublinear. That is not currently the case with AI, and as a result, the main factor driving the industry today is simply the cost of training and inference.
While we don't know the real numbers, we have heard from reliable sources that the supply of computing power is so tight that demand outstrips it by a factor of more than 10. We therefore believe that access to compute resources, at the lowest total cost of ownership, has become a determining factor in the success of AI companies.
In fact, we have seen many companies spend more than 80% of their total funding on compute resources.
In this article, we try to break down the cost factors for AI companies. The absolute numbers will of course change over time, but we see no sign that AI companies' limited access to compute resources will ease anytime soon. So, hopefully, this is a useful framework for thinking about the problem.
Why are the computational costs of AI models so high?
There is a wide variety of generative AI models, and the cost of inference and training depends on the size and type of model. Fortunately, most of today's most popular models are based on the Transformer architecture, including large language models (LLMs) such as GPT-3, GPT-J, and BERT. While the exact number of operations for inference and training on a Transformer is model-specific (see this paper), there is a fairly accurate rule of thumb that depends only on the number of parameters in the model (i.e., the weights of the neural network) and the number of input and output tokens.
A token is basically a short sequence of a few characters, corresponding to a word or part of a word. The best way to build intuition about tokens is to try a publicly available online tokenizer, such as OpenAI's. For GPT-3, a token averages 4 characters.
The rule of thumb for Transformers is that, for a model with p parameters and an input and output sequence of n tokens in total, a forward pass (i.e., inference) requires approximately 2*n*p floating-point operations (FLOPs)¹. Training the same model requires approximately 6*p floating-point operations per token (i.e., the additional backward pass requires four more operations per parameter²). You can estimate the total training cost by multiplying this by the number of tokens in the training data.
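The rule of thumb above is easy to turn into code. The sketch below uses the GPT-3 numbers from this article; the 300-billion-token training set size is an assumption (the figure commonly cited for GPT-3), not something stated here:

```python
def inference_flops(params: float, tokens: float) -> float:
    # Forward pass: ~2 FLOPs per parameter per token.
    return 2.0 * params * tokens

def training_flops(params: float, training_tokens: float) -> float:
    # Training: ~6 FLOPs per parameter per token
    # (forward pass plus a backward pass costing ~4 more ops per parameter).
    return 6.0 * params * training_tokens

# GPT-3-scale example: 175B parameters.
p = 175e9
print(f"Inference, 1,024 tokens: {inference_flops(p, 1024):.2e} FLOPs")  # ~3.6e14
print(f"Training, 300B tokens:   {training_flops(p, 300e9):.2e} FLOPs")  # ~3.2e23
```

These two outputs reproduce the ~350 TFLOPs inference figure and the ~3.14*10^23 training figure discussed below.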
The memory requirements of a Transformer also depend on the model size. For inference, the p model parameters must fit in memory. For training (i.e., backpropagation), we need to store additional intermediate values per parameter between the forward and backward passes. Assuming 32-bit floating-point numbers, this is an additional 8 bytes per parameter. To train a 175-billion-parameter model, we would need to keep more than a terabyte of data in memory -- more than any GPU in existence today -- so the model must be partitioned across cards. The memory requirements for inference and training can be reduced by using shorter floating-point formats, with 16 bits now commonplace and 8 bits expected in the near future.
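Those memory figures can be sketched the same way, assuming 4 bytes per parameter for FP32 weights plus the 8 extra bytes per parameter for training state:

```python
def inference_memory_bytes(params: float, bytes_per_param: int = 4) -> float:
    # For inference, just the weights must fit in memory.
    return params * bytes_per_param

def training_memory_bytes(params: float, bytes_per_param: int = 4,
                          extra_bytes: int = 8) -> float:
    # Weights plus ~8 extra bytes per parameter of intermediate state.
    return params * (bytes_per_param + extra_bytes)

p = 175e9
print(f"Inference: {inference_memory_bytes(p) / 1e9:.0f} GB")  # 700 GB at FP32
print(f"Training:  {training_memory_bytes(p) / 1e12:.1f} TB")  # ~2.1 TB
```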

The table above shows the size and computational cost of several popular models. GPT-3 has about 175 billion parameters; for 1,024 tokens of input and output, this corresponds to a computational cost of about 350 trillion floating-point operations (i.e., teraflops, or TFLOPs). Training a model like GPT-3 requires about 3.14*10^23 floating-point operations. Other models, such as Meta's LLaMA, have even higher computational requirements. Training such a model is one of the more computationally demanding tasks humanity has undertaken to date.
To summarize: AI infrastructure is expensive because the underlying algorithmic problems are extremely difficult to compute. The algorithmic complexity of sorting a database table with a million entries is trivial compared to the complexity of generating a single word with GPT-3. This means that you have to choose the smallest model that solves your use case.
The good news is that, for Transformers, we can easily estimate how much compute and memory a model of a given size will consume. Choosing the right hardware then becomes the next consideration.
Time and cost on GPUs
How does this computational complexity translate into time? A processor core can typically execute 1-2 instructions per cycle, and because of the end of Dennard scaling, processor clock rates have plateaued around 3 GHz for the last 15 years. Executing a single GPT-3 inference without exploiting any parallelism would take roughly 350 TFLOPs / (3 GHz * 1 FLOP per cycle), or 116,000 seconds, i.e., 32 hours. This is wildly impractical; instead, we need specialized chips to accelerate the task.
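As a sanity check, that serial-execution estimate is a single division:

```python
flops_needed = 3.5e14  # one GPT-3 forward pass at 1,024 tokens (~350 TFLOPs)
serial_rate = 3e9      # a ~3 GHz core retiring one FLOP per cycle
seconds = flops_needed / serial_rate
print(f"{seconds:,.0f} s = {seconds / 3600:.0f} hours")  # ~116,667 s, about 32 hours
```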
Virtually all of today's AI models run on cards with a very large number of specialized cores. For example, the NVIDIA A100 GPU has 512 "tensor cores" that can perform a 4×4 matrix multiplication (equivalent to 64 multiplications and additions, or 128 FLOPs) in a single cycle. AI accelerator cards are usually referred to as GPUs (graphics processing units) because the architecture was originally developed for desktop gaming. In the future, we expect AI accelerators to increasingly become a distinct product family.
With a nominal performance of 312 TFLOPS, the A100 could theoretically reduce GPT-3 inference time to about 1 second. However, this is an overly simplified calculation for a number of reasons. First, for most use cases, the bottleneck is not the computational power of the GPU, but the ability to get data from dedicated graphics memory to the tensor core. Second, 175 billion weights would take up 700 GB and would not fit into the graphics memory of any GPU. Techniques such as partitioning and weight streaming would need to be used. Third, there are some optimizations (e.g., using shorter floating-point representations such as FP16, FP8, or sparse matrices) that are being used to speed up the computation. Overall, however, the figures above give us an intuitive idea of the overall computational cost of LLM today.
Training a Transformer model takes about three times as long per token as inference. However, given that a training dataset is about 300 million times larger than an inference prompt, training takes roughly a billion times longer. On a single GPU, training would take decades; in practice, it is done on large compute clusters in dedicated data centers or, more likely, in the cloud. Training is also harder to parallelize than inference, because updated weights must be exchanged between nodes. Memory and bandwidth between GPUs often become the more important factors, and high-speed interconnects and dedicated fabrics are common. For training the very largest models, creating a suitable network setup may be the primary challenge. Looking ahead, AI accelerators will have networking capabilities on the card or even on the chip.
So, how does this computational complexity translate into cost? As we saw above, a GPT-3 inference that takes about 1 second on an A100 has a raw compute cost of between $0.0002 and $0.0014 per 1,000 tokens (compare OpenAI's pricing of $0.002 per 1,000 tokens). This is a very low price point, and it makes most text-based AI use cases economically viable.
Training GPT-3, on the other hand, is much more expensive. Counting only the raw compute for 3.14*10^23 FLOPs at the rates above, we can estimate a single training run on A100 cards at roughly $560,000. In practice, we will not get anywhere near 100% efficiency out of the GPUs during training, but optimizations can also reduce training time. Other estimates of GPT-3 training costs range from $500,000 to $4.6 million, depending on hardware assumptions. Note that this is the cost of a single run, not the overall cost: multiple runs may be required, and cloud providers will want a long-term commitment (more on this below). Training top-tier models remains expensive, but within reach of a well-funded startup.
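A back-of-envelope version of this training cost estimate, with utilization and hourly price as explicit assumptions (the $2/GPU-hour rate and idealized 100% utilization roughly reproduce the $560,000 figure; real-world utilization of 30-50% pushes the cost into the range of the other estimates):

```python
def training_cost_usd(total_flops: float,
                      gpu_flops: float = 312e12,   # A100 peak, ~312 TFLOPS
                      utilization: float = 1.0,    # idealized; 30-50% is realistic
                      price_per_gpu_hour: float = 2.0) -> float:  # assumed rate
    gpu_hours = total_flops / (gpu_flops * utilization) / 3600
    return gpu_hours * price_per_gpu_hour

print(f"${training_cost_usd(3.14e23):,.0f}")                    # ~$560,000 idealized
print(f"${training_cost_usd(3.14e23, utilization=0.35):,.0f}")  # ~$1.6M at 35% util
```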
In summary, today's generative AI requires significant investment in AI infrastructure. There is no reason to believe this will change in the near future. Training a model like GPT-3 is one of the most computationally intensive tasks ever undertaken by humans. While GPUs are becoming faster and we are finding ways to optimize training, the rapid expansion of AI offsets both of these effects.
AI Infrastructure Considerations
At this point, we have tried to give you some idea of the scale required to perform AI model training and inference, and the underlying parameters that drive them. With this background, we would now like to provide some practical guidelines on how to decide which AI infrastructure to use.
External vs. internal infrastructure GPUs are cool. Many engineers and engineering-minded founders favor configuring their own AI hardware, not only because it allows for fine-grained control over model training, but also because there is some fun to be had in leveraging large amounts of computational power (Appendix A).
However, the reality is that many startups -- especially application companies -- do not need to build their own AI infrastructure on day one. Instead, hosted model services such as OpenAI or Hugging Face (for language) and Replicate (for image generation) allow founders to search quickly for product-market fit without having to manage the underlying infrastructure or models.
These services have become so good that many companies can rely on them directly. Developers can achieve meaningful control over model performance through prompt engineering and higher-order fine-tuning abstractions (i.e., fine-tuning through API calls). Pricing for these services is consumption-based, so it is also often cheaper than running separate infrastructure. We have seen application companies generating over $50 million in ARR, valued at over $1 billion, that run hosted model services under the hood.
On the other hand, some startups -- especially those training new foundation models or building vertically integrated AI applications -- cannot avoid running their own models directly on GPUs. Either because the model effectively is the product and the team is searching for "model-market fit," or because fine-grained control over training and/or inference is required to achieve certain capabilities or to reduce marginal cost at scale. Either way, managing the infrastructure can become a source of competitive advantage.
Cloud vs. building a data center In most cases, the cloud is the right place for your AI infrastructure. For most startups and large companies alike, the lower upfront costs, the ability to scale up and down, regional availability, and fewer distractions from building your own data center are attractive.
There are, however, a few exceptions to this rule:
- If you operate at very large scale, running your own data center may become more cost-effective. The exact price point varies with location and setup, but it typically requires more than $50 million per year of infrastructure spend.
- You need very specific hardware that you cannot get from a cloud provider: for example, GPU types that are not widely available, or unusual memory, storage, or networking requirements.
- You cannot find an acceptable cloud for geopolitical reasons.

If you do want to build your own data center, comprehensive GPU price/performance analyses are already available to guide your setup (e.g., Tim Dettmers' analysis). Beyond the cost and performance of the cards themselves, hardware choice also depends on power, space, and cooling. For example, two RTX 3080 Ti cards together have raw compute power similar to an A100, but they draw 700 W versus 300 W. Over a three-year lifecycle at a market price of $0.10/kWh, the roughly 3,500 kWh-per-year power difference adds about $1,000, nearly doubling the cost of the RTX 3080 Ti.
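The power math in that example can be checked directly. The 700 W and 300 W draws and the $0.10/kWh rate are the figures from the text; continuous, around-the-clock operation is an assumption:

```python
def lifetime_power_cost_usd(watts: float, years: float = 3,
                            usd_per_kwh: float = 0.10) -> float:
    # Energy in kWh over the lifetime, assuming 24/7 operation.
    kwh = watts / 1000 * years * 365 * 24
    return kwh * usd_per_kwh

# Two RTX 3080 Ti cards (~700 W total) vs. one A100 (~300 W)
delta = lifetime_power_cost_usd(700) - lifetime_power_cost_usd(300)
print(f"${delta:,.0f} extra over three years")  # ~$1,051
```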
In summary, we expect the vast majority of startups to use cloud computing.
Comparing cloud service providers Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) all offer GPU instances, but new providers focused specifically on AI workloads have also emerged. Here is a framework we have seen many founders use to choose a cloud provider:
Pricing: The table below shows pricing for some of the major and smaller specialty clouds as of April 7, 2023. This data is for reference only, as instances vary greatly in terms of network bandwidth, data egress costs, additional costs for CPUs and networks, available discounts, and other factors.

Compute on a given piece of hardware is a commodity, so one would expect prices to be fairly uniform; in fact, they are not. While there are substantive feature differences between clouds, they are not enough to explain the nearly 4x spread in on-demand pricing for the NVIDIA A100 across vendors.
At the top of the price range, the large public clouds charge a premium based on brand reputation, proven reliability, and the need to manage a wide variety of workloads. Smaller, specialized AI providers offer lower prices by running purpose-built data centers (e.g., Coreweave) or by arbitraging capacity from other clouds (e.g., Lambda Labs).
In practice, most large buyers negotiate prices directly with cloud providers, often committing to a minimum spend as well as a minimum time commitment (we have seen 1-3 years). After negotiation, the price differences between clouds narrow somewhat, but the rankings in the table above tend to remain stable. It is also worth noting that smaller companies can get aggressive pricing from specialty clouds without a large spend commitment.
Availability: The most powerful GPUs, such as the NVIDIA A100, have been in short supply for the past 12+ months.
Considering the huge buying power and resource pools of the top three cloud providers, it is logical to assume they have the best availability. But, somewhat surprisingly, many startups don't find this to be true. The big cloud providers have a lot of hardware, but also a lot of customer demand to meet -- Azure, for example, is the primary host for ChatGPT -- and are constantly adding/releasing capacity to meet that demand. Meanwhile, Nvidia has committed to making hardware widely available across the industry, including allocating it to new specialty providers. (They are doing this both to be fair and to reduce their dependence on a few large customers who are also competing with them.)
As a result, many startups are finding better chip availability, including for the cutting-edge NVIDIA H100, at smaller cloud providers. If you are willing to work with newer infrastructure companies, you may be able to reduce hardware wait times and potentially save money in the process.
Compute delivery models: Today's large clouds offer only instances with dedicated GPUs, because GPU virtualization remains an unsolved problem. Specialized AI clouds offer other models, such as containers or batch jobs, that can handle individual tasks without incurring instance start-up and teardown costs. If you are comfortable with this model, it can significantly reduce costs.
Network interconnect: For training specifically, network bandwidth is a major factor in choosing a provider. Training certain large models requires clusters with dedicated networking between nodes, such as NVLink. For image generation, egress traffic costs can also be a major cost driver.
Customer support: Large cloud providers serve an enormous customer base across thousands of product SKUs. Unless you are a big customer, it can be hard to get the attention of support or to get issues fixed. Many specialized AI clouds, by contrast, offer fast, responsive support even for small customers. This is partly because they operate at smaller scale, but also because their workloads are more homogeneous, so they are more incentivized to focus on AI-specific features and bugs.
Comparing GPUs All else being equal, the highest-end GPUs will perform best on almost all workloads. However, as you can see in the table below, the best hardware is also quite expensive. Choosing the right type of GPU for your particular application can significantly reduce costs and may make the difference between a viable and non-viable business model.

Deciding how far to go down the list - i.e., determining the most cost-effective GPU choice for your application - is primarily a technical decision that is beyond the scope of this article. However, we will share below some of the selection criteria that we believe are most important:
Training and inference: As we saw in the first section above, training a Transformer model requires storing 8 bytes of data per parameter in addition to the model weights. This means a typical high-end consumer GPU with 12 GB of memory could barely be used to train a 4-billion-parameter model. In practice, large models are trained on clusters of machines, ideally with many GPUs per server, lots of VRAM, and high-bandwidth connections between servers (i.e., clusters built from top-of-the-line data center GPUs).
Specifically, many models are most cost-effective on the NVIDIA H100, but for now it is hard to find and usually requires a long-term commitment of more than a year. The NVIDIA A100 can run most model training; it is easier to find but, for large clusters, may also require a long-term commitment.
Memory requirements: Large LLMs have too many parameters to fit on any single card. They must be partitioned across multiple cards, in a setup similar to training. In other words, you may need H100s or A100s even for LLM inference. Smaller models (e.g., Stable Diffusion), however, require much less VRAM. While the A100 remains popular, we have seen startups use the A10, A40, A4000, A5000, and A6000, and even RTX cards.
Hardware support: While the vast majority of workloads at the companies we talked to are running on Nvidia, some companies are starting to experiment with other vendors. The most common is Google's TPU, and Intel's Gaudi 2 seems to be getting some attention. The challenge with these vendors is that the performance of your model is often highly dependent on the availability of software optimizations for these chips. You may have to do a PoC to get an idea of performance.
Latency requirements: In general, less latency-sensitive workloads (e.g., batch data processing, or applications that do not need interactive UI responses) can use less powerful GPUs. This can cut compute costs by a factor of 3-4 (e.g., compare A100s vs. A10s on AWS). User-facing applications, on the other hand, often need high-end cards to deliver an engaging real-time user experience. Optimizing the model is often necessary to bring costs into a manageable range.
Spikiness: Generative AI companies often see demand spike sharply because the technology is so new and exciting. It is not uncommon for request volume to grow 10x in a day on the back of a new product release, or to grow 50% week over week. Handling these spikes is often easier on lower-end GPUs, since more compute nodes are likely to be available on demand. If this traffic comes from less engaged or less retained users, it often makes sense to serve it with lower-cost resources at the expense of some performance.
Optimizing and scheduling models Software optimization can dramatically affect a model's runtime; 10x gains are not uncommon. However, you will need to determine which methods work best for your particular model and system.
Some techniques work for a fairly wide range of models. Speedups from shorter floating-point representations (i.e., FP16 or FP8 versus the original FP32) or from quantization (INT8, INT4, INT2) are usually roughly linear in the reduction of bit width. This sometimes requires modifying the model, but a growing number of techniques automate mixed- or reduced-precision operation. Pruning neural networks reduces the number of weights by ignoring weights with low values; combined with efficient sparse matrix multiplication, this can yield significant speedups on modern GPUs. Another set of optimization techniques addresses the memory-bandwidth bottleneck (e.g., by streaming model weights).
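As an illustration of quantization, here is a minimal symmetric per-tensor INT8 scheme in NumPy. This is a simplified sketch, not how production stacks do it (those use per-channel scales, calibration data, and fused INT8 kernels):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map floats symmetrically onto [-127, 127] with a single scale factor.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)  # stand-in for a weight matrix
q, s = quantize_int8(w)
# Rounding error is bounded by half a quantization step.
assert np.abs(dequantize(q, s) - w).max() <= s / 2 + 1e-6
print(f"4x smaller: {w.nbytes} bytes -> {q.nbytes} bytes")
```

The 4x memory saving is what makes large models fit on fewer cards; whether the matching speedup materializes depends on hardware support for INT8 math.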
Other optimizations are highly model-specific. For example, Stable Diffusion has seen major progress in reducing the VRAM required for inference. Yet another class of optimizations is hardware-specific: NVIDIA's TensorRT includes a number of optimizations but runs only on NVIDIA hardware. Last but not least, the scheduling of AI tasks can create huge performance bottlenecks or improvements. Assigning models to GPUs so as to minimize weight swapping, picking the best GPU for a task when several are available, and minimizing downtime by batching workloads in advance are all common techniques.
Finally, model optimization remains something of a black art, and most of the startups we have spoken to work with third parties on some of these software aspects. Often, these are not traditional MLOps vendors but companies that specialize in optimizing specific generative models, such as OctoML or SegMind.
The boom in generative AI is computationally based. One of its properties is that adding more computation leads directly to a better product. Typically, R&D investments are more directly related to the value of the product, and the relationship is clearly sublinear. But this is not currently the case with AI, and the main factor driving the industry today is simply the cost of training and reasoning.
While we don't know the real numbers, we have heard from reliable sources that the supply of computing power is so tight that demand is more than 10 times greater! So we believe that access to computing resources at the lowest total cost of ownership is now a determining factor in the success of AI companies.
In fact, we have seen many companies spend more than 80% of their total funding on compute resources.
In this article, we try to break down the cost factors for AI companies. The absolute numbers will of course change over time, but we otherwise I AI companies are limited by access to computing resources that will immediately ease. So, hopefully, this is a helpful framework to think about.
Why are the computational costs of AI models so high?
There is a wide variety of generative AI models, and the cost of inference and training depends on the size and type of model. Fortunately, most of today's most popular models are based on Transformer architectures, including popular Large Language Models (LLMs) such as GPT-3, GPT-J, or BERT. While the exact number of inference and learning operations of a transformer is model-specific (see this paper), there is a fairly accurate rule of thumb that depends only on the parameter The number of parameters (i.e., the weights of the neural network) of the model and the number of input and output Tokens.
Token is basically a short sequence of a few characters. They correspond to words or parts of words. The best way to get an intuition about tokens is to try tokenization using publicly available online tokenizers such as OpenAI. For GPT-3, the average length of a token is 4 characters.
Transformer's rule of thumb is that for a model with an input of p parameters and an output sequence of length n tokens, forward pass-through (i.e., inference) requires approximately 2np floating-point operations (FLOPS)¹. For training the same model, each token requires approximately 6*p floating-point operations (i.e., the additional backward pass requires four more operations ²). You can estimate the total training cost by multiplying this by the number of tokens in the training data.
The memory requirements of the Transformer also depend on the model size. For inference, we need p model parameters to fit in memory. For learning (i.e., backpropagation), we need to store additional intermediate values for each parameter between the forward and backward passes. Assuming we use 32-bit floating point numbers, this is an additional 8 bytes per parameter. For training a model with 175 billion parameters, we need to keep more than a terabyte of data in memory -- more than any GPU in existence today, requiring us to partition the model onto different cards. The memory requirements for inference and training can be optimized by using shorter length floating point values, with 16 bits becoming commonplace and 8 bits expected in the near future.

The table above shows the size and computational cost of several popular models. GPT-3 has about 175 billion parameters, corresponding to 1,024 token inputs and outputs, and a computational cost of about 350 trillion floating-point operations (i.e., Teraflops or TFLOPS). Training a model like GPT-3 requires about 3.14*10^23 floating point operations. Other models such as Meta's LLaMA have much higher computational requirements. Training such a model is one of the more computationally demanding tasks humans have undertaken to date.
To summarize: AI infrastructure is expensive because the underlying algorithmic problems are extremely difficult to compute. The algorithmic complexity of sorting a database table with a million entries is trivial compared to the complexity of generating a single word with GPT-3. This means that you have to choose the smallest model that solves your use case.
The good news is that for transformer, we can easily estimate how much computation and memory a model of a particular size will consume. Therefore, choosing the right hardware becomes the next consideration.
The time and cost debate for GPUs
How does computational complexity translate into time? A processor core can typically execute 1-2 instructions per cycle, and due to the end of Dennard Scaling, processor clock rates have remained stable at around 3 GHz for the last 15 years. Executing a single GPT-3 inference operation without utilizing any parallel architecture would require 350 TFLOPS/(3 GHz*1 FLOP) or 116,000 seconds, or 32 hours. This is highly impractical; instead, we need specialized chips to accelerate this task.
Virtually all of today's AI models run on cards that use a large number of dedicated cores. For example, the NVIDIA A100 graphics processor has 512 "tensor cores" that can perform 4×4 matrix multiplication (equivalent to 64 multiplications and additions, or 128 FLOPS) in a single cycle. Artificial intelligence gas pedal cards are often referred to as GPUs (graphics processing units) because the architecture was originally developed for desktop gaming. In the future, we expect AI to increasingly become a distinct product family.
With a nominal performance of 312 TFLOPS, the A100 could theoretically reduce GPT-3 inference time to about 1 second. However, this is an overly simplified calculation for a number of reasons. First, for most use cases, the bottleneck is not the computational power of the GPU, but the ability to get data from dedicated graphics memory to the tensor core. Second, 175 billion weights would take up 700 GB and would not fit into the graphics memory of any GPU. Techniques such as partitioning and weight streaming would need to be used. Third, there are some optimizations (e.g., using shorter floating-point representations such as FP16, FP8, or sparse matrices) that are being used to speed up the computation. Overall, however, the figures above give us an intuitive idea of the overall computational cost of LLM today.
It takes about three times as long to train a transformer model per token as it does to perform inference. However, given that the training dataset is 300 million times larger than the inference cues, training takes a billion times longer. On a single GPU, training takes decades; in practice, this is done on large compute clusters in dedicated data centers or, more likely, in the cloud. Training is also harder to parallelize than inference because updated weights must be swapped between nodes. memory and bandwidth between GPUs often becomes a more important factor, and high-speed interconnects and dedicated architectures are common. For training very large models, creating a suitable network setup may be the primary challenge. Looking ahead, AI gas pedals will have networking capabilities on the card or even on the chip.
So, how does this computational complexity translate into cost? As we saw above, a GPT-3 inference, which takes about 1 second on the A100, has a raw computational cost of between $0.0002 and $0.0014 for 1000 tokens (compared to OpenAI's pricing of $0.002/1000 token). This is a very low price point, making most text-based AI use cases economically viable.
Training GPT-3, on the other hand, is much more expensive. At the above rate, again calculating only the computational cost of 3.14*10^23 FLOPS, we can estimate the cost of a single training session on the A100 card to be $560,000. In practice, for training, we will not get nearly 100% efficiency on the GPU; but we can also use optimization to reduce training time. Other estimates of GPT-3 training costs range from $500,000 to $4.6 million, depending on hardware assumptions. Note that this is the cost of a single run, not the overall cost. Multiple runs may be required, and cloud providers will want a long-term commitment (more on this below). Training top-tier models is still expensive, but affordable for well-funded startups.
In summary, today's generative AI requires significant investment in AI infrastructure. There is no reason to believe this will change in the near future. Training a model like GPT-3 is one of the most computationally intensive tasks ever undertaken by humans. While GPUs are becoming faster and we are finding ways to optimize training, the rapid expansion of AI offsets both of these effects.
AI Infrastructure Considerations
At this point, we have tried to give you some idea of the scale required to perform AI model training and inference, and the underlying parameters that drive them. With this background, we would now like to provide some practical guidelines on how to decide which AI infrastructure to use.
External vs. internal infrastructure GPUs are cool. Many engineers and engineering-minded founders favor configuring their own AI hardware, not only because it allows for fine-grained control over model training, but also because there is some fun to be had in leveraging large amounts of computational power (Appendix A).
However, the reality is that many startups -- especially app companies -- do not need to build their own AI infrastructure on day one. Instead, hosted modeling services like OpenAI or Hugging Face (for language) and Replicate (for image generation) allow founders to quickly search for product-market fit without having to manage the underlying infrastructure or models.
These services have become so good that many companies can depend on them directly. Developers can achieve meaningful control over model performance through cue engineering and higher-order fine-tuning abstractions (i.e., fine-tuning through API calls). Pricing for these services is consumption-based, so it is also often cheaper than running separate infrastructure. We have seen a number of application companies generating over $50 million in ARR, valued at over $1 billion, running hosted modeling services in the background.
On the other hand, some startups -- especially those training new base models or building vertically integrated AI applications -- can't avoid running their own models directly on GPUs. Either because the models are actually products and the team is looking for "model-market fit," or because fine-grained control over training and/or inference is required to achieve certain functionality or reduce marginal costs at scale. Either way, managing the infrastructure can be a source of competitive advantage.
Cloud vs. building your own data center

In most cases, the cloud is the right place for your AI infrastructure. For most startups and large companies alike, the lower upfront costs, the ability to scale up and down, regional availability, and fewer distractions from building your own data center are attractive.
There are, however, a few exceptions to this rule:
- If you operate at very large scale, running your own data center may become more cost effective. The exact price point varies by location and setup, but it typically requires more than $50 million per year in infrastructure spending.
- You need very specific hardware that you can't get from a cloud provider: for example, GPU types that are not widely available, or unusual memory, storage, or networking requirements.
- You can't find an acceptable cloud for geopolitical reasons.

If you do want to build your own data center, comprehensive GPU price/performance analyses are already available to guide your setup (e.g., Tim Dettmers's analysis). Beyond the cost and performance of the cards themselves, hardware choice also depends on power, space, and cooling. For example, two RTX 3080 Ti cards together have raw compute power similar to an A100, but they draw 700 W combined versus the A100's 300 W. Over a three-year lifecycle at a market price of $0.10/kWh, that power difference adds roughly $1,000 in electricity, nearly doubling the effective cost of the RTX 3080 Ti.
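The power-cost arithmetic above is easy to check. The sketch below uses the figures from the text (700 W for two RTX 3080 Ti cards vs. 300 W for an A100, $0.10/kWh) and assumes continuous operation over three years:

```python
def power_cost_usd(watts: float, hours: float,
                   usd_per_kwh: float = 0.10) -> float:
    """Electricity cost of running a device at a constant draw."""
    return watts / 1000 * hours * usd_per_kwh

# Figures from the text: two RTX 3080 Ti cards (~700 W combined) vs.
# one A100 (~300 W), assuming 24/7 operation over a three-year lifecycle.
HOURS_3Y = 3 * 365 * 24                      # ~26,280 hours
delta_watts = 700 - 300
delta_kwh = delta_watts / 1000 * HOURS_3Y    # ~10,500 kWh
extra_cost = power_cost_usd(delta_watts, HOURS_3Y)
print(f"{delta_kwh:.0f} kWh difference -> ${extra_cost:.0f} extra")
```

Under these assumptions the 400 W gap works out to roughly a thousand dollars of electricity, on the order of the purchase price of the card itself; lower utilization scales the figure down proportionally.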
In summary, we expect the vast majority of startups to use cloud computing.
Comparing cloud service providers

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) all offer GPU instances, but new providers have emerged that focus specifically on AI workloads. Here is the framework we have seen many founders use to select a cloud provider:
Pricing: The table below shows pricing for some of the major and smaller specialty clouds as of April 7, 2023. This data is for reference only, as instances vary greatly in terms of network bandwidth, data egress costs, additional costs for CPUs and networks, available discounts, and other factors.

The computing power of a particular piece of hardware is a commodity, so we might expect prices to be fairly uniform. But that's not the case. While there are substantial feature differences between clouds, they are not enough to explain the nearly 4x spread in on-demand NVIDIA A100 pricing between vendors.
At the top of the price range, the large public clouds charge a premium based on brand reputation, proven reliability, and the need to manage a wide variety of workloads. Smaller, specialized AI providers offer lower prices by running dedicated data centers (e.g., Coreweave) or by subleasing capacity from other clouds (e.g., Lambda Labs).
In practice, most large buyers negotiate prices directly with cloud providers, often committing to some minimum spend requirement as well as a minimum time commitment (we are seeing 1-3 years). After negotiations, the price differences between clouds narrow somewhat, but we see the rankings in the table above remain relatively stable. It is also important to note that smaller companies can get aggressive pricing from a professional cloud without a large spending commitment.
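To make the commitment trade-off concrete, here is a small sketch of the utilization level at which a one-year committed rate beats paying on demand only for the hours you actually use. All prices are hypothetical illustrations, not real quotes:

```python
HOURS_PER_YEAR = 365 * 24

# Hypothetical rates -- real quotes vary widely by provider and negotiation.
ON_DEMAND = 3.00   # $/GPU-hour, big-cloud A100 on demand
COMMITTED = 1.80   # $/GPU-hour with a one-year minimum commitment

def total_cost(hourly_rate: float, gpus: int, hours: float) -> float:
    """Total spend for a fleet of GPUs billed at a flat hourly rate."""
    return hourly_rate * gpus * hours

def breakeven_utilization(on_demand: float, committed: float) -> float:
    """Fraction of the year you must actually use the GPUs before a
    full-year commitment (paid for every hour) beats on-demand pricing
    (paid only for hours used)."""
    return committed / on_demand

# With the rates above, a commitment pays off once utilization
# exceeds committed/on_demand of the year.
util = breakeven_utilization(ON_DEMAND, COMMITTED)
```

With these illustrative numbers the break-even sits at 60% utilization, which is why commitments make sense for steady training workloads but not for exploratory bursts.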
Availability: The most powerful GPUs, such as the Nvidia A100s, have been in short supply for the past 12+ months.
Considering the huge buying power and resource pools of the top three cloud providers, it is logical to assume they have the best availability. But, somewhat surprisingly, many startups don't find this to be true. The big cloud providers have a lot of hardware, but also a lot of customer demand to meet -- Azure, for example, is the primary host for ChatGPT -- and are constantly adding/releasing capacity to meet that demand. Meanwhile, Nvidia has committed to making hardware widely available across the industry, including allocating it to new specialty providers. (They are doing this both to be fair and to reduce their dependence on a few large customers who are also competing with them.)
As a result, many startups are finding better chip availability, including the cutting-edge Nvidia H100s, at smaller cloud providers. If you are willing to work with a newer infrastructure company, you may be able to reduce your wait time for hardware and potentially save money in the process.
Compute delivery models: Today's large clouds offer only instances with dedicated GPUs, because GPU virtualization remains an unsolved problem. Specialized AI clouds offer other models, such as containers or batch jobs, that can handle individual tasks without incurring instance startup and teardown costs. If you are comfortable with this model, it can significantly reduce costs.
Network interconnect: For training specifically, network bandwidth is a major factor in choosing a provider. Training certain large models requires clusters with dedicated networking between nodes, such as NVLink. For image generation, egress traffic costs can also be a major cost driver.
Customer support: Large cloud providers serve an enormous customer base across thousands of product SKUs. Unless you are a large customer, it can be difficult to get their attention or have issues resolved. Many specialized AI clouds, by contrast, offer fast, responsive support even for small customers. This is partly because they operate at smaller scale, but also because their workloads are more homogeneous, so they have more incentive to focus on AI-specific features and bugs.
Comparing GPUs

All else being equal, the highest-end GPUs will perform best on almost all workloads. However, as the table below shows, the best hardware is also considerably more expensive. Choosing the right type of GPU for your particular application can significantly reduce costs and may make the difference between a viable and a non-viable business model.

Deciding how far to go down the list - i.e., determining the most cost-effective GPU choice for your application - is primarily a technical decision that is beyond the scope of this article. However, we will share below some of the selection criteria that we believe are most important:
Training vs. inference: As we saw in the first section above, training a Transformer model requires storing roughly 8 bytes of optimizer and gradient state per parameter in addition to the model weights. This means a typical high-end consumer GPU with 12 GB of memory could barely be used to train a model with 4 billion parameters. In practice, large models are trained on clusters of machines, ideally with many GPUs per server, lots of VRAM, and high-bandwidth connections between servers (i.e., clusters built with top-of-the-line data center GPUs).
Specifically, many models are most cost-effective on the NVIDIA H100, but as of this writing it is hard to find and usually requires a long-term commitment of more than a year. The NVIDIA A100 can run most model training; it is easier to find, but for large clusters it may also require a long-term commitment.
Memory requirements: Large LLMs have too many parameters to fit on any single card. They need to be split across multiple cards, which requires a setup similar to training. In other words, you may need H100s or A100s even for LLM inference. But smaller models (e.g., Stable Diffusion) require far less VRAM. While the A100 is still popular, we have seen startups use the A10, A40, A4000, A5000, and A6000, and even RTX cards.
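The memory rules of thumb above can be turned into a back-of-the-envelope VRAM check. The bytes-per-parameter figures below (about 12 for training, i.e., 4-byte weights plus roughly 8 bytes of optimizer and gradient state, and about 2 for FP16 inference) are common assumptions, not exact for any particular stack:

```python
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rule-of-thumb memory footprint in GB; ignores activations,
    KV caches, and framework overhead."""
    return params_billion * bytes_per_param

def fits(params_billion: float, card_gb: float, training: bool) -> bool:
    """Does the model plausibly fit on a single card of `card_gb` GB?
    Assumes ~12 bytes/param for training, ~2 bytes/param for FP16
    inference (both are rough rules of thumb)."""
    bytes_per_param = 12.0 if training else 2.0
    return vram_gb(params_billion, bytes_per_param) <= card_gb

# A 175B-parameter LLM needs ~350 GB just for FP16 weights, so it must
# be sharded across several 80 GB cards; a ~1B-parameter image model
# fits comfortably on a 24 GB card like the A10.
```

This kind of estimate is why LLM inference pushes teams toward A100/H100-class cards while image models run happily on much cheaper hardware.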
Hardware support: While the vast majority of workloads at the companies we talked to are running on Nvidia, some companies are starting to experiment with other vendors. The most common is Google's TPU, and Intel's Gaudi 2 seems to be getting some attention. The challenge with these vendors is that the performance of your model is often highly dependent on the availability of software optimizations for these chips. You may have to do a PoC to get an idea of performance.
Latency requirements: In general, less latency-sensitive workloads (e.g., batch data processing, or applications that do not need interactive UI responses) can use less powerful GPUs. This can reduce compute costs by a factor of 3-4 (e.g., comparing A100s vs. A10s on AWS). User-facing applications, by contrast, often require high-end cards to deliver an engaging real-time user experience, and optimizing the model is often necessary to bring costs into a manageable range.
Spikiness: Generative AI companies often see sharp spikes in demand because the technology is so new and exciting. It is not uncommon for request volume to increase 10x in a day on the back of a new product release, or to grow 50% week over week. It is often easier to handle these spikes on lower-end GPUs, because more of those compute nodes are likely to be available on demand. If this traffic comes from less engaged or less retained users, it often makes sense to serve it with lower-cost resources at the expense of some performance.
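One minimal way to act on the "serve low-value spikes on cheaper cards" idea is a routing function like the one below. The tier names and routing criteria are purely illustrative assumptions:

```python
def choose_gpu_tier(is_retained_user: bool, latency_sensitive: bool) -> str:
    """Route a request to a GPU tier. Engaged, latency-sensitive users
    get high-end cards; spiky traffic from less-retained users goes to
    cheaper, more widely available GPUs. Tier names are illustrative."""
    if is_retained_user and latency_sensitive:
        return "A100"   # premium tier: real-time UX for core users
    return "A10"        # budget tier: absorbs demand spikes cheaply
```

In practice the decision would also weigh current queue depth and spot capacity, but the core trade-off (user value vs. marginal compute cost) looks like this.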
Optimizing and scheduling models

Software optimization can dramatically affect a model's runtime; a 10x gain is not uncommon. However, you need to determine which methods will work best for your particular model and system.
Some techniques work for a fairly wide range of models. Speedups from shorter floating-point representations (i.e., FP16 or FP8 versus the original FP32) or from quantization (INT8, INT4, INT2) usually scale roughly linearly with the reduction in bit count. This sometimes requires modifying the model, but a growing number of techniques automate working with mixed or reduced precision. Pruning neural networks reduces the number of weights by ignoring those with low values; combined with efficient sparse matrix multiplication, this can achieve significant speedups on modern GPUs. A further set of optimization techniques addresses memory-bandwidth bottlenecks (e.g., by streaming model weights).
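As an illustration of the quantization idea, here is a minimal symmetric per-tensor INT8 scheme in NumPy. It is a sketch of the technique, not a production quantizer (real systems use per-channel scales, calibration data, and quantization-aware kernels):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map float weights onto
    [-127, 127] with a single scale factor, cutting memory 4x vs FP32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; error is bounded by ~scale/2."""
    return q.astype(np.float32) * scale

# Demo: quantize a random weight matrix and measure the round-trip error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
max_err = np.abs(dequantize(q, s) - w).max()
```

The 4x memory saving is exactly the "linear in the reduction of bits" effect described above; INT4 would halve memory again at the cost of roughly doubling the quantization error.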
Other optimizations are highly model-specific. For example, Stable Diffusion has seen significant progress in reducing the VRAM required for inference. Yet another class of optimizations is hardware-specific: NVIDIA's TensorRT includes several such optimizations but runs only on NVIDIA hardware. Last but not least, the scheduling of AI tasks can create huge performance bottlenecks or improvements. Assigning models to GPUs so as to minimize weight swapping, picking the best available GPU for each task, and minimizing downtime by batching workloads ahead of time are all common techniques.
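The scheduling ideas above can be sketched as a toy batching scheduler: group pending requests by model, then pin each model to one GPU so its weights are not repeatedly swapped in and out. The function and data structures here are hypothetical simplifications:

```python
from collections import defaultdict

def assign_requests(requests, num_gpus):
    """Toy scheduler. `requests` is a list of (model_name, payload)
    pairs. Batches requests by model, schedules the largest batches
    first, and keeps each model resident on a single GPU to avoid
    costly weight swaps. Returns a list of (model, gpu, batch) tuples."""
    batches = defaultdict(list)              # model name -> pending payloads
    for model, payload in requests:
        batches[model].append(payload)

    gpu_load = [0] * num_gpus                # requests assigned per GPU
    resident = {}                            # model -> GPU holding its weights
    plan = []
    for model, batch in sorted(batches.items(), key=lambda kv: -len(kv[1])):
        gpu = resident.get(model)
        if gpu is None:                      # weights not loaded anywhere yet:
            gpu = min(range(num_gpus),       # pick the least-loaded GPU
                      key=lambda g: gpu_load[g])
            resident[model] = gpu
        gpu_load[gpu] += len(batch)
        plan.append((model, gpu, batch))
    return plan
```

A real scheduler would also track VRAM budgets and request deadlines, but the core ideas (batching by model and minimizing weight movement) are the same ones described above.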
Finally, model optimization remains something of a dark art, and most of the startups we have spoken to work with third parties on some of these software aspects. Often these are not traditional MLOps vendors but companies that specialize in optimizing specific generative models, such as OctoML or SegMind.