NVIDIA’s $20B non-exclusive IP licensing deal with Groq looks puzzling if examined through a purely architectural lens. On the surface, it appears to pair the world’s most successful GPU vendor (with a software moat built around abstraction) with a company best known for a deterministic, VLIW-style inference accelerator—an approach many people associate with brittle compiler tooling, hardware-dependent scheduling complexity, and poor long-term ergonomics for end-users.
This is a perspective that tries to address some of that by offering an alternative explanation rooted in economics and distribution rather than instruction sets (or in leveling the competition, or in geopolitical deals to sell more hardware to China). It also came together at 3am, after peeking at Max Weinbach's Twitter thread while putting my infant daughter to bed, so take it with a huge pinch of salt.
tl;dr NVIDIA's response is to turn inference into a sellable product again, not merely a cloud workload: increase utilization per square millimeter, reduce memory pressure, and make deterministic inference deployable outside hyperscaler walls. It does this by moving determinism downward, into hardware, microcode, and the compiler/runtime layers, while keeping user-facing abstractions stable. The result is a plausible future where GPUs support multiple execution personalities, including a Groq-inspired deterministic inference mode, without exposing VLIW complexity to developers.
From a strategic standpoint, there are three key pressures NVIDIA is looking to address:
1) Inference workloads that are increasingly irregular and latency-sensitive, which hyperscalers could look to capitalize on (alongside their vertically integrated silicon)
2) Supply bottlenecks at TSMC, plus soaring demand for HBM and CoWoS packaging
3) Enterprise inference workloads that need to run on-prem, with guarantees around predictable performance and latency
Inference is no longer batch-friendly, relatively uniform, and tolerant of latency variance, and it can no longer be amortized behind the same infrastructure that was set up for training. These days, inference is increasingly dominated by low-batch or batch-one workloads, strict latency requirements, and memory pressure rather than compute pressure, especially from an enterprise standpoint. KV caches grow quickly; mixture-of-experts, speculative decoding, and conditional execution introduce irregularity that GPUs can handle, but not always efficiently.
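To make the memory-pressure point concrete, here is a back-of-the-envelope sketch. The model dimensions (a hypothetical ~70B dense model with grouped-query attention) and byte counts are assumptions picked for illustration, not measurements of any particular deployment.

```python
# Back-of-the-envelope sketch: why batch-one decode is memory-bound.
# Every number below is an assumption for illustration, not a measurement.

params          = 70e9        # hypothetical dense ~70B-parameter model
bytes_per_param = 2           # bf16 weights
layers          = 80
kv_heads        = 8           # grouped-query attention
head_dim        = 128
bytes_per_kv    = 2           # bf16 K and V entries

flops_per_token = 2 * params                 # ~2 FLOPs per weight (one MAC)
weight_bytes    = params * bytes_per_param   # every weight read once per token

def kv_cache_bytes(context_len: int) -> float:
    # K and V, for every layer and every cached position
    return 2 * layers * kv_heads * head_dim * bytes_per_kv * context_len

for ctx in (4_096, 32_768, 128_000):
    bytes_moved = weight_bytes + kv_cache_bytes(ctx)
    intensity = flops_per_token / bytes_moved   # FLOPs per byte moved
    print(f"ctx={ctx:>7,}  KV cache={kv_cache_bytes(ctx) / 1e9:6.1f} GB  "
          f"arithmetic intensity ~{intensity:.2f} FLOP/byte")

# A modern accelerator needs on the order of hundreds of FLOPs per byte of
# HBM traffic to stay compute-bound; ~1 FLOP/byte means the memory system,
# not the tensor cores, sets the token rate at batch size one.
```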
In NVIDIA's view, deterministic execution is arguably now a first-class goal to optimize for. Inference is no longer a throughput problem but a utilization problem, and therefore an economic one.
A well-known fact: NVIDIA's real constraint today is not demand. Demand is effectively unbounded. The constraint is supply: access to advanced TSMC nodes, CoWoS packaging capacity, and high-bandwidth memory. At this stage, adding more theoretical FLOPs to a die matters less than extracting more useful work from every square millimeter of silicon and every byte of memory bandwidth already available.
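A toy calculation makes the supply angle obvious. All figures below (target throughput, per-GPU peak, HBM stacks per GPU) are invented for illustration; only the proportionality matters.

```python
# Toy supply math: the same inference fleet at different effective utilization.
# All figures are invented for illustration; only the proportionality matters.

target_tokens_per_sec = 1_000_000   # fleet-wide decode throughput goal
peak_tokens_per_gpu   = 10_000      # per-GPU rate at 100% effective utilization
hbm_stacks_per_gpu    = 8

for util in (0.10, 0.30, 0.60):
    effective_rate = peak_tokens_per_gpu * util
    gpus_needed = target_tokens_per_sec / effective_rate
    print(f"utilization {util:4.0%}: {gpus_needed:8,.0f} GPUs, "
          f"{gpus_needed * hbm_stacks_per_gpu:9,.0f} HBM stacks")

# Tripling effective utilization cuts the dies, CoWoS packages and HBM stacks
# needed for the same fleet by the same factor, which is precisely the
# constraint NVIDIA is squeezed on.
```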
The most underappreciated consequence of this shift in inference is commercial rather than technical. Deterministic, high-utilization inference makes it possible to sell inference as a product again.
For the last decade, hyperscalers have dominated compute (and inference) by renting it. This works best when workloads are elastic, latency variance is acceptable, and scale hides inefficiency. It works poorly when data must stay on-prem, latency must be bounded, and customers want capital assets rather than usage-based APIs.
Enterprises, regulated industries, defense, healthcare, and industrial customers increasingly fall into the latter bucket. They want predictable performance, known costs, vendor support, and systems they can deploy inside their own environments. Hyperscalers are not particularly good at selling boxes into these markets, but NVIDIA is.
Notably, this is also visible in NVIDIA's architectural direction post-Blackwell: Tensor Core execution is less warp-centric, Tensor Memory lets reuse be expressed explicitly rather than left to cache heuristics, and there are repeated nods to long-lived tensor pipelines (Blackwell is already hinting at this). More responsibility is being pushed into the compiler and runtime, while the user-facing abstraction remains CUDA or Triton.
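As a loose analogy for what "explicit reuse instead of a cache heuristic" means, here is a NumPy sketch; the function names and the tiling are mine, and nothing here reflects actual Tensor Memory or CUDA APIs.

```python
import numpy as np

# Loose analogy only: "hope the cache keeps it resident" versus "the schedule
# states the reuse explicitly". Nothing here reflects real TMEM or CUDA APIs.

def matmul_implicit(A, B):
    # Leaves data movement entirely to the memory hierarchy's heuristics.
    return A @ B

def matmul_explicit_reuse(A, B, tile=128):
    # The schedule itself says which block is staged locally and reused:
    # each tile of A is loaded once and applied to every matching tile of B.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            a_tile = np.ascontiguousarray(A[i:i + tile, k:k + tile])  # explicit staging
            for j in range(0, N, tile):
                C[i:i + tile, j:j + tile] += a_tile @ B[k:k + tile, j:j + tile]
    return C

A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
assert np.allclose(matmul_implicit(A, B), matmul_explicit_reuse(A, B), rtol=1e-3)
```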
This is also where Groq's ideas matter. Groq positions itself on three pillars: deterministic execution, predictable memory access, and high utilization. Deterministic execution, explicit dataflow, and pipeline-centric design allow Groq hardware to deliver consistent latency and very high effective utilization for a narrow class of inference workloads. NVIDIA likely doesn't care about the VLIW part; it cares about this utilization focus and discipline.
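A toy simulation of why determinism is worth paying for: the jitter model below is entirely made up, but it captures the trade a Groq-style static schedule makes, a slightly worse mean in exchange for a tail that collapses onto the mean.

```python
import random
import statistics

# Toy model, assumptions only: per-token latency under two scheduling styles.
# Dynamic scheduling is faster on average but picks up jitter from arbitration,
# cache behavior and batching; a fully static schedule pays a fixed, known cost.

random.seed(0)
N = 100_000

STATIC_TOKEN_MS = 1.10     # fixed, compile-time-known cost per token
DYNAMIC_BASE_MS = 0.90     # better average case...

def jitter_ms():
    return random.expovariate(1 / 0.40)   # ...with a long, heavy tail

dynamic = [DYNAMIC_BASE_MS + jitter_ms() for _ in range(N)]
static = [STATIC_TOKEN_MS] * N

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

print(f"dynamic: mean {statistics.mean(dynamic):.2f} ms, p99 {p99(dynamic):.2f} ms")
print(f"static : mean {statistics.mean(static):.2f} ms, p99 {p99(static):.2f} ms")
# The static pipeline loses a little on the mean, but its p99 equals its mean,
# which is exactly what "guaranteed latency" buys an enterprise customer.
```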
Chiplet-based designs allow NVIDIA to physically separate concerns: tensor-heavy compute regions optimized for persistent pipelines, control-heavy regions for irregular logic, and tiered memory that reflects actual access patterns rather than a flat illusion. Predictable traffic reduces interconnect pressure, locality improves, and memory reuse can be guaranteed rather than hoped for, which means higher yield, (theoretically) better binning, and more usable SKUs per wafer. In effect, the same silicon and the same software stack end up enabling different execution capabilities.
The best part: this chips away at the supply constraints outlined earlier.
1) As opposed to monolithic SMs, one can imagine multiple tensor-heavy compute tiles with reduced warp-scheduler complexity and fewer CUDA cores per tile. Each tile would own local TMEM/SRAM and toggle between dynamic GPU computation and a static inference mode, with a focus on compute utilization and deterministic pipelines. Groq IP could essentially take the form of a static sub-ISA inside these GPUs: not exposed publicly, but used for specific inference graphs by recognizing them (transformer decode, MoE routing, cache updates) and locking them into a deterministic pipeline (a speculative sketch of such a dispatch layer, together with the memory tiering from point 3, follows this list).
2) Control logic doesn't need to pollute these tensor dies, so there could be a separate die for orchestration and CUDA-heavy workloads; it would feed the tensor pipelines and talk to them over NVLink-C2C.
3) Tiered memory chiplets used according to their strengths: HBM for KV cache and weights, SRAM for hot activations and reuse-heavy ops, and DDR for the rest
4) Interconnect as the key underpinning factor here, tying everything together neatly
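Below is the purely speculative sketch of the dispatch layer implied by points 1 and 3. Every name in it (ExecMode, MemoryTier, plan_graph, and so on) is invented for illustration; it describes nothing that exists in CUDA, Triton, or Groq's stack.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Purely speculative sketch; every name here is invented for illustration.

class ExecMode(Enum):
    DYNAMIC = auto()          # ordinary GPU execution: warps, dynamic scheduling
    STATIC_PIPELINE = auto()  # deterministic, compile-time-scheduled pipeline

class MemoryTier(Enum):
    SRAM = auto()   # hot activations, reuse-heavy operands
    HBM = auto()    # weights, KV cache
    DDR = auto()    # everything else

# Graph shapes a hypothetical compiler knows how to lock into a static schedule.
RECOGNIZED_PATTERNS = {"transformer_decode", "moe_dispatch", "kv_cache_update"}

@dataclass
class TensorInfo:
    name: str
    kind: str          # "weights" | "kv_cache" | "activation"
    reuse_count: int

@dataclass
class Graph:
    pattern: str
    tensors: list

@dataclass
class CompiledPlan:
    mode: ExecMode
    placement: dict    # tensor name -> MemoryTier

def place_tensors(graph: Graph) -> dict:
    placement = {}
    for t in graph.tensors:
        if t.kind in ("weights", "kv_cache"):
            placement[t.name] = MemoryTier.HBM
        elif t.reuse_count > 1:               # reuse-heavy: keep on-die
            placement[t.name] = MemoryTier.SRAM
        else:
            placement[t.name] = MemoryTier.DDR
    return placement

def plan_graph(graph: Graph) -> CompiledPlan:
    # Recognized inference graphs get locked into a deterministic pipeline;
    # anything else falls back to normal dynamic GPU execution.
    mode = (ExecMode.STATIC_PIPELINE if graph.pattern in RECOGNIZED_PATTERNS
            else ExecMode.DYNAMIC)
    return CompiledPlan(mode, place_tensors(graph))

decode = Graph("transformer_decode", [
    TensorInfo("w_qkv", "weights", 1),
    TensorInfo("kv", "kv_cache", 1),
    TensorInfo("hidden", "activation", 4),
    TensorInfo("logits", "activation", 1),
])
plan = plan_graph(decode)
print(plan.mode)                  # ExecMode.STATIC_PIPELINE
print(plan.placement["hidden"])   # MemoryTier.SRAM
```

None of this would require exposing a VLIW-style ISA to developers; the toggle lives entirely below the user-facing abstraction.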

None of this comes for free. Static or semi-static inference pipelines only work for stable, common model patterns, and rapid model churn erodes their value. Compiler and runtime complexity grows quickly, and NVIDIA can only justify that investment where scale exists. Inside hyperscalers, vertically integrated silicon will continue to dominate workloads tightly coupled to internal models and infrastructure.
Inference is simply too valuable to give up on. NVIDIA may be looking to productize it as a box and sell it to enterprises rather than hyperscalers (in a similar vein to the point Ben Thompson made in his Stratechery article earlier this month). The bet is that NVIDIA needs to keep its throne and not let the inference game get away from it, and it will do what it does best: sell inference as a product by controlling hardware, software, and distribution.