<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Palet</title>
        <link>https://paragraph.com/@palet</link>
        <description>Notes from the Palet team on AI trends and on building an open protocol for portable context.</description>
        <lastBuildDate>Mon, 27 Apr 2026 07:27:09 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <image>
            <title>Palet</title>
            <url>https://storage.googleapis.com/papyrus_images/a5bf3b9110f7fedbd27912e4696191c1d592646b0118e2e4eb848b5224134b9d.png</url>
            <link>https://paragraph.com/@palet</link>
        </image>
        <copyright>All rights reserved</copyright>
        <item>
            <title><![CDATA[Thoughts, Trends, and Questions]]></title>
            <link>https://paragraph.com/@palet/thoughts-trends-and-questions</link>
            <guid>osTyXN6hUucfqDoa5XUF</guid>
            <pubDate>Fri, 31 May 2024 19:14:11 GMT</pubDate>
            <description><![CDATA[The cost of inference is trending towards zero. Token throughput is trending towards infinity. Context window sizes are getting larger. Companies are spending more on training despite improvements in compute and cost efficiency. Models are quickly becoming commoditized. Compute is quickly becoming commoditized. We’re sharing our notes on trends that we wrote about back in December 2023 (and updated in February 2024). This document has been sitting in our team Notion workspace for almost half ...]]></description>
            <content:encoded><![CDATA[<ul><li><p><strong>The cost of inference is trending towards zero</strong></p></li><li><p><strong>Token throughput is trending towards infinity</strong></p></li><li><p><strong>Context window sizes are getting larger</strong></p></li><li><p><strong>Companies are spending more on training despite improvements in compute and cost efficiency</strong></p></li><li><p><strong>Models are quickly becoming commoditized</strong></p></li><li><p><strong>Compute is quickly becoming commoditized</strong></p></li></ul><hr><p><em>We’re sharing our notes on trends that we wrote about back in December 2023 (and updated in February 2024). This document has been sitting in our team Notion workspace for almost half a year now, so we figured we may as well put it out there rather than letting it collect dust. While some of the observations are dated, others are holding up pretty well. And that’s pretty exciting, because at the time we were just having fun speculating about the near future. Note that there is no particular structure to this document since it was just something we threw together. We hope you find it entertaining!</em></p><hr><h2 id="h-thoughts-on-current-trends" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0">Thoughts on Current Trends</h2><p>Note that these trends are focused on transformer models.</p><p><strong>Moore&apos;s Law</strong></p><p>The doubling of transistor density every two years will lead to faster and more cost-effective computing performance, enhancing the efficiency of model training and inference over time. Cost-effective computing performance is improving at a rate of 1.35x per year.</p><p><strong>Jevons’ Paradox</strong></p><p>When the cost of using a resource decreases due to increased efficiency, it becomes more attractive for consumers and industries to utilize it. That is why, when the internal combustion engine became more efficient, fuel consumption, and as a consequence greenhouse gas emissions, increased. In software development the same phenomenon is described by Wirth’s Law: devs always figure out how to bloat software faster than hardware can keep up. Or, said simply, we have more resources so we do more things.</p><p><strong>Price Competition</strong></p><p>In addition to Moore&apos;s Law, competitive pricing among compute providers is further driving down the cost of processing and generating tokens. Cheaper inference increases accessibility, governed by Jevons&apos; Paradox, where increased efficiency leads to higher overall consumption. This results in unlocks such as increasing context window sizes, more sophisticated planning (agent) workflows, and (arguably) excessive inferencing for things like <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://vercel.com/blog/ai-sdk-3-generative-ui">generative web components</a> (see Wirth’s Law). Maybe ‘generative everything’ is what leads us to the <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://en.wikipedia.org/wiki/Dead_Internet_theory">Dead Internet</a>, e.g. <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://huggingface.co/spaces/jbilcke-hf/ai-tube">AITube</a>. To really drive it home, you would expect demand for token consumption to increase proportionally as the cost of inference decreases.</p><p>But here&apos;s a surprising fact: while inference costs are dropping by a factor of 15x each year, the demand for processing and generating more tokens is increasing significantly faster. We can use context window size as a proxy for estimating just how much, especially since it is the most significant driver of token processing consumption. The answer? Context windows have grown 1,250x each year since 2022.</p>
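<p><em>To make these rates concrete, here’s a minimal Python sketch of the arithmetic. The 2,500x total and the Gemini launch estimates come from the chart below; reading 2022’s typical context as roughly 4k tokens, and treating the 1,250x-per-year figure as a linear annualization, are our assumptions.</em></p><pre><code># The growth rates above, composed. Endpoints: ~4k tokens (2022, our
# assumption) to ~10M tokens (2024 Gemini launch estimates, per the chart).
ctx_2022 = 4_000
ctx_2024 = 10_000_000
years = 2

total = ctx_2024 / ctx_2022      # 2,500x between 2022 and 2024
linear = total / years           # 1,250x per year, annualized linearly
compound = total ** (1 / years)  # ~50x per year, compounded

cost_drop = 15  # inference costs fall ~15x per year, per the text above
print(f"total context growth: {total:,.0f}x")
print(f"linear per year:      {linear:,.0f}x")
print(f"compound per year:    {compound:,.0f}x")
print(f"linear growth outpaces the cost decline by {linear / cost_drop:,.0f}x")
</code></pre>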
<figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/e9646df54916a194ca6fd214e7176b00c9d11b78e77d6f03cda7a307594b9a51.png" alt="Figure 1" class="image-node embed"><figcaption class="">1. We’ll continue to see costs fall as more specialized ASICs and maybe even models implemented in hardware (physically burned to a chip) offer better inference economics. Source: https://artificialanalysis.ai/models/mixtral-8x7b-instruct</figcaption></figure><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/ddb7ed491f3b9b3ce3674309e5234276f6fa1cf81649b7ac082dd1e889d72407.png" alt="Figure 2" class="image-node embed"><figcaption class="">2. GPT-3 Curie is a discontinued OpenAI model that has 6.7B parameters. It scored something like 25 on the MMLU. Similar 7B-parameter models today, like Llama-2 7B, score 45. But that&apos;s another, separate trend: smarter models, same parameter count. For clarity, inferencing Curie and Llama-2 7B (or any 7B model) generally costs the same without going into transformer inference math.</figcaption></figure><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/94bf44b28892c54c293423d2502a5bcb2eb455bc2d830784193e7ee0da2811fe.png" alt="Figure 3" class="image-node embed"><figcaption class="">3. The general trend in state-of-the-art (SOTA) model context window sizes, illustrating two growth patterns: between 2020 and 2022, context windows doubled in length, whereas between 2022 and 2024, they’ve increased 2,500x. Updated in February 2024 with launch estimates for Gemini models.</figcaption></figure><p><strong>More Inference</strong></p><p>We’re finding more places to run models too. For example, Georgi Gerganov’s llama.cpp offloads token processing and generation to the CPU, so now any server or consumer device can serve a model using CPU clock cycles as opposed to GPU only. And there seems to be a lot of work being done getting around memory constraints so that even memory-bound devices can run inference on larger models. Quantization is the obvious one here, but there are also techniques like offloading and distributed inferencing (see Petals), just to run the gamut. WebAssembly might also play a role because it enables inferencing from the browser, meaning that smaller models (which are also cheaper to inference) can be used as a sort of ‘worker’ for low-IQ tasks (e.g. reasoning assists) without running up the cloud bill.</p>
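<p><em>For the curious, here’s roughly what CPU-only inference looks like through the llama-cpp-python bindings to llama.cpp. The bindings and the model file are our choice of example, not something from our notes; the path is a placeholder for any quantized GGUF checkpoint.</em></p><pre><code># A rough sketch of CPU-only inference via llama-cpp-python
# (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=2048,      # context window to allocate
    n_gpu_layers=0,  # 0 = keep every layer on the CPU
)

out = llm("Q: What is quantization, in one sentence? A:", max_tokens=48)
print(out["choices"][0]["text"])
</code></pre>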
<p><strong>Wirth’s Law for Training</strong></p><p>Algorithmic optimizations result in a 3x-per-year decline in the physical compute required to run a training cycle. Yet these efficiencies are offset by a 3.1x increase in the USD cost of the most expensive training run for every year since 2009, another example of Jevons’ Paradox (a.k.a. Wirth’s Law).</p><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/812d4e6f08e9f836da9ad491ef7ebe5e6a00e426c23e405d230e43d498890824.png" alt="Figure 4" class="image-node embed"><figcaption class="">4. Despite algorithmic optimizations that result in a decline in the physical compute requirements to run a training cycle, despite Moore’s Law, and despite price competition between compute providers, co&apos;s are spending more and more every year on training runs. Note: training makes up just 10% of the lifetime costs of a model. It would be interesting to see how much more co&apos;s are spending on inference every year (models get bigger faster than Moore&apos;s Law can keep up). That&apos;s probably going to trend up as more compute is thrown at inference. See Monte Carlo Tree Search, Q</figcaption></figure><p><strong>GPU &gt; CPU</strong></p><p>The general trend is that hyperscalers are running the Apple playbook and vertically integrating, from bare metal to the web interface, going from compute aggregators to end-to-end clock cycle providers. Let’s assume for a moment, given all of the trends, that every clock cycle in the near future will go towards some form of token generation: site rendering and site copy, porn, video games, ads, etc.</p><p>By that measure, the future of the compute market will be defined by the metric of serving floating point operations per second (FLOP/s).</p>
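<p><em>A back-of-the-envelope sketch of why FLOP/s maps onto tokens. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token; that figure is our assumption, and real systems are usually memory-bandwidth bound, so treat the outputs as ceilings, not measurements.</em></p><pre><code># How many output tokens per second does a given FLOP/s budget buy?
def tokens_per_second(flops: float, n_params: float) -> float:
    # ~2 FLOPs per parameter per generated token (assumed rule of thumb)
    return flops / (2 * n_params)

accelerator_flops = 300e12  # a hypothetical 300 TFLOP/s accelerator
for n_params in (7e9, 70e9, 700e9):
    tps = tokens_per_second(accelerator_flops, n_params)
    print(f"{n_params / 1e9:>5.0f}B params -> {tps:>9,.0f} tokens/s ceiling")
</code></pre>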
<p>The demand for cost-effective, high-performance compute will skyrocket (commoditizing hardware) and, naturally, everyone is going to want to go after NVIDIA’s market share.</p><ul><li><p>Groq&apos;s Tensor Streaming Processor and Language Processing Unit (LPU)</p></li><li><p>Bitmain&apos;s custom Tensor Processing Units (TPUs)</p></li><li><p>Google&apos;s TPUs</p></li><li><p>AWS Trainium and Inferentia silicon</p></li><li><p>Apple’s M-Series chips (pray they make enterprise versions)</p></li></ul><p>Some believe this will ultimately lead to a decline in the enterprise values of chip designers and manufacturers, similar to what <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.ft.com/content/81a03045-86f7-4e57-afbd-5ff83679615f">Cisco experienced in the early 2000s.</a></p><p><strong>Kurzweil’s Law</strong></p><blockquote><p>Evolution applies positive feedback in that the more capable methods resulting from one stage of evolutionary progress are used to create the next stage. Each epoch of evolution has progressed more rapidly by building on the products of the previous stage.</p></blockquote><p>It’s likely that once last-gen models get good enough they will be able to aid in the development, one way or another, of the next-gen model. A straightforward example is how data labeling becomes more efficient as processing and token generation costs go down, and as last-gen models get better. This cost reduction also makes it ever more viable to continue integrating modalities into tokens as a unified representation of information, which expands data labeling from just language, to image, to the next modality, and so on. This makes sense since token representations all share the same form as language tokens anyway. See Meta’s ImageBind.</p><p>It’s also likely that multi-modal models will outperform specialist models because they just have more knowledge to work with. And they can think and ‘reason’ across a broader spectrum. Something like what Feynman said about John Tukey, who could keep time by picturing a clock whereas Feynman had to ‘hear’ himself count in his head.</p><p><strong>Open Source</strong></p><p>Open models are lagging behind proprietary models but are improving at a faster rate. This is likely due to the sheer frequency of iteration available to open research and development. All of this is explained much better in the (Google) memo titled ‘<strong>We have no moat, and neither does OpenAI’.</strong></p><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/0635bbd73f32f7609d3974bc40a34021c301b8aae7c6d347b5c8b5c876cad4d9.png" alt="Figure 5" class="image-node embed"><figcaption class="">5. Well-funded co&apos;s releasing open models seem to be catching up to well-funded co&apos;s releasing closed models. Sadly, we haven&apos;t seen any underground or grassroots labs release a SOTA model contender yet. Note: the MMLU is just one of many benchmarks for measuring how &apos;smart&apos; a model is.</figcaption></figure>
<blockquote><p>Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips [Google’s] own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.</p></blockquote><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/3f6ef69c22189c6a87843e736991b8d61651dfa75b240fe2e6532c99d65e5d06.png" alt="Figure 6" class="image-node embed"><figcaption class="">6. Anyone can contribute to open research. This is classic Cathedral v. the Bazaar. The only difference this time is that the open source community is lacking one key resource: compute.</figcaption></figure><hr><h2 id="h-questions-about-the-next-decade-or-two" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0"><strong>Questions About the Next Decade (or Two)</strong></h2><p><strong>Energy</strong></p><p>It’s obvious that this will just boil down to an energy game (it always has been, but now more than ever). That leaves us with a few questions.</p><ul><li><p>Where do solar, coal, gas, nuclear, lithium, and fusion stand? For example, gas plants can be ramped up and down almost on demand, whereas coal plants can’t because of thermal inertia. What other factors need to be taken into consideration?</p></li><li><p>With that said, what are the geopolitical implications? There’s a paper titled <strong>Effects of Energy Consumption on GDP: New Evidence of 24 Countries on Their Natural Resources and Production of Electricity</strong> that supports the idea that energy consumption drives GDP. But it also suggests a ‘complex relationship.’ Doesn’t the relationship become more straightforward? More energy → more compute → more intelligence → more innovation. And it’s no longer about reproduction.</p></li><li><p>Does the energy demand for AI training and inference undermine that of crypto?</p></li><li><p>How fast are we making improvements in performance (FLOP/s) per watt? What is the physical limit?</p></li></ul><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/486eef35190cb22f87a32ae163d29d4afeb1c1c0c93a24dedd680bf1953f7df9.png" alt="Figure 7" class="image-node embed"><figcaption class="">7. Based on the Green500. This is also known as Koomey&apos;s Law.</figcaption></figure><ul><li><p>How does this trend compare to the growing energy demands for training and inferencing bigger (and better) models? Does it outpace it? By how many orders of magnitude per year? (See the sketch below.)</p></li></ul>
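<p><em>Here is one way to frame that comparison in Python. The 4x-per-year training compute figure comes from the chart below; the efficiency rate is our assumption, a Koomey-style doubling of FLOP/s per watt every ~2.5 years.</em></p><pre><code># Training compute growth vs. efficiency growth, compounded.
compute_growth = 4.0                # x per year thrown at training (chart below)
efficiency_growth = 2 ** (1 / 2.5)  # ~1.32x per year, assumed Koomey-style rate

net = compute_growth / efficiency_growth
print(f"net energy demand for training grows ~{net:.1f}x per year")
print(f"compounded over a decade: ~{net ** 10:,.0f}x")
</code></pre>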
<figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/b9bbb3a57bbae118f656d117a672f7a61996a90be6bc299454908351cd677258.png" alt="Figure 8" class="image-node embed"><figcaption class="">8. Companies are throwing 3-4x more compute at training models every year. At what point does the energy demand of a data center reach the energy caps set by public utility companies? One solution could be to network data centers across states as &apos;superclusters.&apos; That way you can overcome local energy caps by arbitraging power consumption across states. Source: https://epochai.org/blog/compute-trends</figcaption></figure><p><strong>Data Centers and Supply Chain</strong></p><p>We’ll assume that the current trends hold for the next decade or so, and that this doesn’t end up being like the dot-com bubble.</p><ul><li><p>What is being overlooked? Who makes the uninterruptible power supply systems? Flywheel backups? Battery backups (like saltwater batteries)? The transfer switches?</p></li><li><p>What companies maintain the HVAC systems to cool down these centers? What is the ideal climate to build a data center in? As centers upgrade to liquid-cooled systems, who supplies/manufactures/maintains those components? Do cities progressively reorganize around data centers instead of ports and waterways?</p></li><li><p>What does the power profile of a data center look like? Who is contracted to build out the utility substations? What company names (suppliers) pop up as you move your finger along the electrical schematic(s) of a data center?</p></li><li><p>Across the entire data center supply chain, which components are hardest to scale up?</p></li><li><p>Some data centers are located in remote locations. Who services the employees that work there? What about security detail? The White House AI Executive Order requires that training runs using over 1e26 FLOPs of compute be reported to the U.S. government. Who handles the reporting? The order also emphasizes the importance of both the AI systems (including models) and the infrastructure supporting them (such as data centers) in terms of national security, economic security, and public health and safety. Do these get nationalized? Public-Private Partnership’ed?</p></li><li><p>What happens to these companies? ↓</p></li></ul>
<figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/38dba7bbc1b60d9be37a00266fb945402a8ea409ed95ed6baa97002e1000ae39.png" alt="Figure 9" class="image-node embed"><figcaption class="">9. Considering historical precedents where the US has intervened to protect &apos;national and economic interests,&apos; such as the intervention in Kuwait in the 90s and the involvement in Chile in the 70s, it&apos;s not crazy to imagine that the entire semiconductor supply chain, from raw materials to data centers, becomes of national interest (and a potential future cause of conflict).</figcaption></figure><p><strong>Education</strong></p><ul><li><p>What degrees or fields of study are susceptible to becoming inference tokens?</p></li><li><p>When can we expect models to work alongside (and eventually replace) humans doing research?</p></li><li><p>Is there a rapidly closing window of opportunity for certain STEM degrees, where the skills and knowledge taught today will no longer be economically viable for humans by the time X cohort of freshmen graduate? And if so, what fields of study are most likely to fall outside the ‘Overton window’ of viable career paths first?</p></li><li><p>This all feels like what happened to the mechanical watch industry when Seiko introduced the quartz watch. A lot of Swiss brands died, but a few, namely Rolex, Omega, and others, pivoted to luxury. People buy mechanical watches because they are beautiful. What skills or professions become Rolex?</p></li><li><p>Does the government prop up ‘bullshit jobs’ like it subsidizes corn, soy, and wheat?</p></li></ul><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/cb9617138720a85fd8b6e37ec9a624b6688901a69b930ee029774e8a20b73556.png" alt="Figure 10" class="image-node embed"><figcaption class="">10. The Philippine call center and business process outsourcing (BPO) market is something like $100B. Yet it&apos;s not hard to imagine that it will get automated away in the next decade. See switchboard operators.</figcaption></figure><p><strong>Real Estate</strong></p><p>Let’s assume models keep getting better and better, to the point where they become economically viable as substitutes for humans in keyboard-and-mouse jobs. This means that knowledge capital can be deployed and scaled anywhere in the world.</p><ul><li><p>Why would companies base their headquarters in places that anchor them to local taxes and jobs when they are free to chase the lowest costs (taxes, climate, real estate, etc.)? Will co’s overcome the tyranny of place?
Or will there be some sort of exit tax on knowledge capital?</p></li></ul><blockquote><p>Because information technology transcends the tyranny of place, it will automatically expose jurisdictions everywhere to de facto global competition on the basis of quality and price… Leading nation-states, with their predatory, redistributive tax regimes and heavy-handed regulations, will no longer be jurisdictions of choice. Seen dispassionately, they offer poor-quality protection and diminished economic opportunity at monopoly prices… The leading welfare states will lose their most talented citizens through desertion.</p></blockquote><ul><li><p>Let’s continue rolling with these assumptions. Will we see a mass exodus from major cities? Will the value of prime real estate in tech hubs like SF and NYC plummet?</p></li></ul><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/df90862fd489aa683f344176e5cca5abdf6bfe9991cb05b2bf615cb89ed1f1fc.png" alt="Figure 11" class="image-node embed"><figcaption class="">11. Human mouse clicks and keystrokes will be replaced by GPUs and ASICs streaming output tokens.</figcaption></figure><p><strong>Ethics</strong></p><ul><li><p>At what point do these models become sentient? It likely doesn’t even matter whether they are conscious or sentient as long as the average person thinks they are or feels a certain way about them. For example, environmentalists care about the earth even though it is not sentient. So when does that happen?</p></li><li><p>People don’t even have to care. Maybe it becomes a form of virtue signaling?</p></li></ul><figure float="none" data-type="figure" class="img-center"><img src="https://storage.googleapis.com/papyrus_images/03828008fc79c0d6244b008e5a42430a57ac51a29c452dc65e83c074fcd9a550.png" alt="Figure 12" class="image-node embed"><figcaption class="">12. Long Term Bet: High-speed, large-scale matrix multiplication will simulate sentient behavior so convincingly that it becomes indistinguishable from actual sentience.</figcaption></figure>]]></content:encoded>
            <author>palet@newsletter.paragraph.com (Palet)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/38c4200cc83fa4745df780a6c1e2e469e2fa1d5df148bd54e5b951abf225f062.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[What We're Working On]]></title>
            <link>https://paragraph.com/@palet/what-we-re-working-on</link>
            <guid>M81Mz5ecTMylVnkDDbgZ</guid>
            <pubDate>Thu, 18 Apr 2024 04:25:33 GMT</pubDate>
            <description><![CDATA[Just two years ago AI could only retain about 8 pages of information. Now, it can memorize the equivalent of 10 King James Bibles. Context window size determines how much models can remember, and it is growing at a staggering rate—1,250x each year. Everything you do, see, or hear now fits as memories in an AI’s context window. Combined with increasingly smarter models, this will be the ultimate competitive edge. But there is also a risk of lock-in with platforms that monopolize your context. Similar to h...]]></description>
            <content:encoded><![CDATA[<ul><li><p><strong>Just two years ago AI could only retain about 8 pages of information</strong></p></li><li><p><strong>Now, it can memorize the equivalent of 10 King James Bibles</strong></p></li><li><p><strong>Context window size determines how much models can remember, and it is growing at a staggering rate—1,250x each year</strong></p></li><li><p><strong>Everything you do, see, or hear now fits as memories in an AI’s context window</strong></p></li><li><p><strong>Combined with increasingly smarter models, this will be the ultimate competitive edge</strong></p></li><li><p><strong>But there is also a risk of lock-in with platforms that monopolize your context</strong></p></li><li><p><strong>Similar to how social media locks you in and prevents you from taking your friends and feed elsewhere</strong></p></li><li><p><strong>An open protocol for portable context lets you move freely between AI apps without having to start over on memory</strong></p></li></ul><hr><p>We started working on Palet with the mission to drive the adoption of open and decentralized technologies for contextualizing intelligence. The motivation to pursue this mission comes from a deep-seated concern for how the future of AI will turn out. We recognize that beyond the pursuit of smarter, faster, and cheaper models, the most significant differentiation will come from which providers can fully integrate your entire life&apos;s context into their platform. And as we’ve seen with social media, this always leads to an ecosystem where the winners dominate by locking you in and keeping you tethered. That is why we set out to develop an open protocol for building context-aware and personalized AI apps. Such a protocol guarantees that users can switch between apps while keeping their data across any service that utilizes it. And it also ensures that developers can build without being disadvantaged by monopolized context.</p><p>Among other things, we also aim to design a protocol with value streams that incentivize everyone to contribute resources, as that is the only way to ensure that we can maintain an open ecosystem that is also decentralized and durable.</p><p>Last winter, we started building our own client app along with the protocol. We haven’t yet settled on a name for the latter, but we’ve been calling the client Palet. It’s a browser that uses AI to capture everything you see, hear, and search for, and lets you easily retrieve information. We think the browser is the ideal starting point for building a great product around context, especially because so much of the information we generate and consume originates from surfing the web. Something can be said about our browsing habits too, and how they reflect personal beliefs — and perhaps how, as models get smarter, we can build personalized agents from them, incorporating your entire browsing context to form intelligence with similar beliefs. That’s the general direction we’re moving towards with Palet anyway.</p><p>But we also want to demonstrate that companies can build a business by offering services on this open protocol. Since context is stored on a separate, personal data repository synced across the network, apps that build intelligence on it benefit from each other. This means there are emergent, novel AI primitives waiting to be discovered.</p>
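<p><em>To make this less abstract, here is a purely hypothetical Python sketch of what one entry in such a personal context repository might look like. The protocol is unnamed and unfinished, so every field below is invented for illustration, not taken from a spec.</em></p><pre><code># Purely hypothetical: one entry in a user-owned context repository.
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    source: str       # e.g. "browser", "microphone", "search"
    content: str      # raw text, transcript, or page excerpt
    timestamp: float  # unix epoch seconds
    tags: list[str] = field(default_factory=list)

# Any app speaking the protocol could read or append entries, so memory
# built up in one AI app carries over to the next.
entry = ContextEntry(
    source="browser",
    content="Read an essay on the Semantic Web and RDF.",
    timestamp=1713400000.0,
    tags=["reading", "semantic-web"],
)
print(entry.source, entry.tags)
</code></pre>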
<p>Ultimately, though, our vision of an open commons for contextual intelligence is not unique and is borrowed from ideas of the Semantic Web. The biggest difference is that the vision of the Semantic Web called for manually adding special tags to pages to make them readable by machine intelligence. By contrast, a Contextual Web can draw meaning and utility from data provided by the activity of the individual user, since, as it turns out, AI (the machine) can understand things as we do. So there is no need for RDF, OWL, and other knowledge representations.</p><p>Anyway, we’ll be making our plans more transparent and sharing updates in the coming weeks. Not to mention, experimenting with different services to see what provides real value. If you’re interested in learning more or want to help out because you understand this problem space, feel free to reach out to us via Twitter, at @get_palet.</p>]]></content:encoded>
            <author>palet@newsletter.paragraph.com (Palet)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/fef12fc5e97da4d927bdb3845bceed31e0f29d76c37464de886c7ba6fa480892.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>