<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Delip Rao</title>
        <link>https://paragraph.com/@delip-rao</link>
        <description></description>
        <lastBuildDate>Wed, 15 Apr 2026 15:42:04 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <image>
            <title>Delip Rao</title>
            <url>https://storage.googleapis.com/papyrus_images/6564d4c07f5ffd980173f0292687715e8b8c2a463df7da23bc106cc716892dd2.png</url>
            <link>https://paragraph.com/@delip-rao</link>
        </image>
        <copyright>All rights reserved</copyright>
        <item>
            <title><![CDATA[Ride the Hardware Lottery!]]></title>
            <link>https://paragraph.com/@delip-rao/ride-the-hardware-lottery</link>
            <guid>I84TPIlxaiBuKKYdTbVG</guid>
            <pubDate>Sat, 20 Nov 2021 04:21:43 GMT</pubDate>
            <description><![CDATA[As you can tell from the previous two posts on Page Street Labs, I have been obsessed with Very Large Parameter (VLP) models lately. I wasn’t always this way. On my personal blog and Twitter feed, I have written enough about the culture of building models by stacking layers and praying it works. Ever since we figured out that adding more parameters (more layers specifically) helps, folks have been pushing that limit. Here’s an example from ImageNet: And most of those efforts are B-O-R-I-N-G (...]]></description>
            <content:encoded><![CDATA[<p>As you can tell from the previous two posts on Page Street Labs, I have been obsessed with Very Large Parameter (VLP) models lately. I wasn’t always this way. On my personal blog and Twitter feed, I have written enough about the culture of building models by stacking layers and praying it works. Ever since we figured out that adding more parameters (more layers specifically) helps, folks have been pushing that limit. Here’s an example from ImageNet:</p><p>And most of those efforts are B-O-R-I-N-G (but with a side of good lessons that may not be widely applicable). However, something is fundamentally different about Very Large Parameter models (think GPT-3 scale and beyond) in terms of their capabilities.</p><p>Sara Hooker from Google published an essay on arXiv (btw, should this be on arXiv too? There’s a BibTeX entry at the end ... sempai cite me please?) explaining how certain areas of research get a lot of attention and win a lot of support — including software and hardware — over others, a phenomenon the article calls “<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://arxiv.org/abs/2009.06489">The Hardware Lottery</a>”, and how this discourages innovation investments in other areas. The hardware lottery is not new, not even in computer science, but the central plea of that essay is “Hardware Lotteries are holding us back and should be avoided.”</p><p>Hardware lotteries, and more generally, resource lotteries, have existed for as long as technology has, simply because of the nature of innovation and the economics of capital-intensive enterprises. For example, sophisticated hydraulic engines were replaced by electrical counterparts over the decades, but until then a lot of interesting ideas and applications (e.g., the Bessemer process and the mass production of steel) came out of hydraulic technology. 
Even with electrical actuators, the strength of the electrical field you can generate limits the pressure you can produce, while hydraulics are limited only by the strength of materials. So, even if electricity were a uniformly superior/efficient technology, there are no absolute winners; it all depends on the context. I can keep going, but this is not a post on the evolution of innovation (a favorite topic of mine, so we will no doubt revisit it in a later post). This is a post about <strong>exploiting the inevitable hardware lotteries.</strong> To do that fully, we need to understand the scaling of models before we write them off as wasteful exercises.</p><h2 id="h-not-all-big-things-are-alike" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0">Not All Big Things Are Alike</h2><p>Our intuitions about scaling are often flawed. Strange things happen at the extremes of scale. Practically all theories about every aspect of life — natural or human-made — break down when scaled in either direction. Let’s take a few examples:</p><p><strong><em>Time:</em></strong> We can plan mentally for the next few hours with little effort, and with the aid of a calendar for the next few days, but many of us struggle to think about consequences over the next few years. Very few people can think about the implications for, say, the next ten years (many of them, unsurprisingly, are famed investors, and that’s not an accident).</p><p><strong><em>Money:</em></strong> As money scales, folks have trouble understanding what it is and what it can do, as the nature of money itself changes with accumulation. As Marx notes:</p><blockquote><p><em>The accumulation of capital, which originally appeared only as its quantitative extension, comes to fruition, as we have seen, through a progressive qualitative change in its composition. 
— Das Kapital (1867)</em></p></blockquote><p>Lottery winners routinely get discombobulated when confronting their winnings, and most Americans have trouble grasping the extent of our national debt.</p><p>The nuance between the simple accumulation of capital and its qualitative effects is best illustrated in this<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://www.quotecounterquote.com/2009/11/rich-are-different-famous-quote.html"> supposed conversation</a> between Fitzgerald and Hemingway in a 1920 Paris café:</p><blockquote><p>Fitzgerald: <em>“The rich are different from us.”</em> Hemingway: <em>“Yes, they have more money.”</em></p></blockquote><p><strong><em>Crowds:</em></strong> People scale differently, too. It is not uncommon for a large collection of mediocre hires to come together and form a brilliant organizational unit. While individual and small-group opinions are less interesting, Twitter, one of the largest opinion-scaling experiments, has produced a sea change in thinking (#metoo, #BLM, ...) and massive information operations at once.</p><p><strong><em>Physical Sciences:</em></strong> Nature is self-organized hierarchically, and it’s not a flaw that entirely new laws are needed to understand different scales (e.g., Quantum Mechanics at a sub-atomic scale and General Relativity at a cosmological scale). 
In other words, every quantitative shift is accompanied by its own qualitative shifts.</p><blockquote><p><em>“The whole becomes not only more than but very different from the sum of its parts.” — Anderson (1972), More is Different.</em></p></blockquote><p>Even in an exact field like mathematics, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://johncarlosbaez.wordpress.com/2018/09/20/patterns-that-eventually-fail/">singular limits</a> exist, and known recurrences break down past a limit.</p><p>So, a natural question: <strong>What happens when we scale the number of parameters in a neural network to absurd levels? Are there “emergent” realities that cannot be explained by the component parts?</strong></p><p>We already see some of this in VLP models like GPT-3, where the model is able to “solve” several unseen problems in natural language or other domains after seeing only a few examples (so-called “zero-shot” / “few-shot” generalization). But we don’t really understand how or why that happens. Studying this emergent reality should be the foremost preoccupation for anyone working on VLP models.</p><h3 id="h-but-what-about-electric-shaver-brains" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0">But What About Electric Shaver Brains?</h3><p>Often discussions in AI wander into (sometimes unwarranted) comparisons with the human brain. One argument against parameter scaling is that since<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://dartmouthalumnimagazine.com/articles/cell-power"> human brains run on the power of an electric shaver</a>, we could be wasting our time, effort, and energy running these VLP models on GPUs/TPUs. This argument is based on some flawed assumptions:</p><p><strong>1. 
The human brain is a perfect piece of engineering and should be mimicked.</strong> This assumption is a trap for human thinking; indeed, few things are as marvelous to the human mind as the human brain. Francois Jacob, in his influential 1977 article, “<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.jstor.org/stable/1744610?seq=1">Evolution and Tinkering</a>”, explains it best:</p><blockquote><p><em>It is hard to realize that the living world as we know it is just one of the many possibilities; that its actual structure results from the history of the earth. … They represent, not a perfect product of engineering, but a patchwork of odd sets pieced together when and where opportunities arose.</em></p></blockquote><p>Opportunism reflects the “very nature of a historical process full of contingency”. In other words, we are a product of a variety of lotteries — physical, ecological, and historical. Changing the nature of these lotteries would lead to different outcomes, and not necessarily provably better ones.</p><p><strong>2. Low power != Fewer parameters.</strong> The ultra-low-power nature of the human brain has more to do with its substrate than the number of connections. 
In fact, studies in evolutionary neurobiology and comparative neuroanatomy reveal strong correlations between body weight, brain weight, and the number of neurons.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/ca5b6d34a4fba26bb9502e15f680171c79166fc07d72b37481a96937c21daa57.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p><em>Brain weight and neuron number scaling in mammals and birds (Striedter and Northcutt, 2020).</em></p><p>And it’s not that the brains of hominid species have remained static either. In fact, some researchers like Suzana Herculano-Houzel argue that <em>Homo sapiens</em> actually got a hardware upgrade over their ancestors primarily due to the invention of cooking, which provides a means to improve the energy density of food, much like upgrading your power supply unit because you added extra GPUs.</p><p>Perhaps the future of AI will be very large parameter models running on<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.cambridge.org/core/books/ultra-low-power-bioelectronics/ED8504DA1504856B74E2502EA859FDEA"> ultra-low-power bioelectronics</a>.</p><blockquote><p><strong>Aside:</strong> <em>Pruning</em> and <em>Distillation</em> of AI models are ways to reduce power consumption while approximating capabilities. Pruning removes inconsequential weights from a model, while distillation trains a separate smaller model to mimic the outputs of a larger model in a teacher-student fashion. 
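To make the pruning idea concrete, here is a minimal sketch of my own (not from any particular library): magnitude pruning simply zeroes out the smallest-magnitude fraction of a layer’s weights.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))          # a toy weight matrix
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"nonzero fraction: {np.mean(w_pruned != 0):.2f}")  # ~0.10
```

In practice, pruned networks are usually fine-tuned afterward to recover accuracy; the point of the aside stands — this compresses an existing capability rather than creating a new one.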
While these approaches have practical applications, they don’t create new capabilities over existing models.</p></blockquote><h2 id="h-ride-the-lottery" class="text-3xl font-header !mt-8 !mb-4 first:!mt-0 first:!mb-0">Ride the Lottery!</h2><p>The hardware lottery is a kind of resource lottery. Resource lotteries for innovation are not new in science and engineering, and if we look back, even nature has several examples of such lotteries. In fact, a general observation would be that <strong><em>resource lotteries are inevitable</em></strong>, and we are better served by focusing on the interesting questions posed by current realities than on an imagined future. In trying to create a uniform exploration of idea spaces divorced from economic/practical realities (to “avoid the hardware lottery”), we would miss out on interesting research opportunities by shunning work simply because it doesn’t fit our current understanding of what the human brain does or is capable of.</p><p>In particular, one has to keep in mind that not all big models are alike, and Very Large Parameter models are uniquely interesting in that they add more capabilities to the model in ways we don’t understand today.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/037fb7c2108a27bbe6e776a0a46c2829aca2fd4e55287084cae2cd6c432a296a.png" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>One way to look at the AI modeling of today is to imagine ourselves in some Cambrian era with all sorts of brains proliferating, from the most efficient to the least efficient. 
With competition for resources, the least efficient options will eventually get culled, but in their wake, they may leave behind an understanding we would not achieve otherwise. Efficiency and capability of intelligent systems are two separate goals, and any argument for limiting the exploration of one in favor of the other comes from over-investing in, and extrapolating, the limitations of current technologies.</p><p>Despite an electric-shaver-like power efficiency, the human brain has limits that some of the Very Large Parameter models transcend (even if that’s unreliable today). A future I would like to live in is one where human brains are augmented with capabilities that seem alien to me at this time of writing, via a second brain that does things very differently from our wet brains.</p><p><strong>Acknowledgments:</strong> Many thanks to <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/MelMitchell1">Melanie Mitchell</a>, Jen-Hao Yeh, and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://twitter.com/cgst">Cristian Strat</a> for comments on early drafts of this.</p><p><strong>Cite this article:</strong></p><pre data-type="codeBlock" text="@misc{clarity:ride-hardware-lottery,
  author = {Delip Rao},
  title = {Ride the Hardware Lottery!},
  howpublished = {\url{https://pagestlabs.com/clarity/ride-hardware-lottery}},
  month = {November},
  year = {2020}
}
"><code>@misc{clarity:ride-hardware-lottery,
  <span class="hljs-attr">author</span> = {Delip Rao},
  <span class="hljs-attr">title</span> = {Ride the Hardware Lottery!},
  <span class="hljs-attr">howpublished</span> = {\url{https://pagestlabs.com/clarity/ride-hardware-lottery}},
  <span class="hljs-attr">month</span> = {November},
  <span class="hljs-attr">year</span> = {<span class="hljs-number">2020</span>}
}
</code></pre>]]></content:encoded>
            <author>delip-rao@newsletter.paragraph.com (Delip Rao)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/6b3dadd27b2edf275a5ba08128612822e75c73e010ba71c62d78f500169606a7.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[GPT-3 Turk's Gambit and The Question of AI Automation]]></title>
            <link>https://paragraph.com/@delip-rao/gpt-3-turk-s-gambit-and-the-question-of-ai-automation</link>
            <guid>0MdNP4w4oo22ifKuZXkO</guid>
            <pubDate>Sat, 20 Nov 2021 04:19:22 GMT</pubDate>
            <description><![CDATA[As I write this issue, I realize how incredibly lucky and privileged I’ve been working as an academic researcher in AI, teaching the latest techniques (including writing a book), consulting for different industry leaders on how to bring AI to their business, working on social media misinformation, and building a variety of AI products at big corporations to startups (mine and others). From that vantage point, I am offering a highly opinionated commentary on piercing the hype, which is a kind ...]]></description>
            <content:encoded><![CDATA[<p>As I write this issue, I realize how incredibly lucky and privileged I’ve been working as an academic researcher in AI, teaching the latest techniques (including writing a<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/Natural-Language-Processing-PyTorch-Applications/dp/1491978236/"> book</a>), consulting for different industry leaders on how to bring AI to their business, working on social media misinformation, and building a variety of AI products at companies ranging from big corporations to startups (mine and others). From that vantage point, I am offering a highly opinionated commentary on piercing the hype, which is a kind of misinformation, around few-shot models like GPT-3. My goal in studying this is twofold: 1) raise awareness about potential safety issues, as technology hype is primarily a safety problem, but mostly, 2) develop clarity in thinking about what is actually possible with such models. <strong>Meditating on the true nature of advanced technologies like GPT-3 and their applications to automation forces us to examine what we mean by automation itself from the perspective of AI models.</strong> We will examine ways in which models and humans have coexisted in the past and the present, and what that bodes for the future, given how frequently the AI technology landscape is changing.</p><h3 id="h-the-tyranny-of-appearances" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>The Tyranny of Appearances</strong></h3><p><a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://dailynous.com/">DailyNous</a>, a philosophy blog, invited a bunch of researchers to share their<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://dailynous.com/2020/07/30/philosophers-gpt-3/"> opinion</a> about GPT-3 and related questions.  
Shortly after the essays appeared, someone claimed in a<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/raphamilliere/status/1289129723310886912"> clickbait</a> tweet (GPT-3 clickbait tweets are now a genre of their own, but we will use this one for illustration):</p><blockquote><p>“I asked GPT-3 to write a response to the philosophical essays ... It&apos;s quite remarkable!”</p></blockquote><p>Folks on Twitter did not disappoint. The tweet and its enclosing 4-page “Response” were retweeted more than a thousand times because the clickbait language and the presentation in the Response document probably made people go, “ZOMG! I can’t believe an AI did this.” To aid that, the Response came with a misleading (and an ethically dubious) “NOTE” to prime the readers into thinking that.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/aff0efb9fd8ec80a663cb70f91786a45ec8795c3c2c40a74c5af823700e0cdeb.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>What was not covered in the original tweet or the Response document was the amount of human involvement needed to produce a text of that length and clarity -- with multiple generations for each sentence and careful picking and choosing that went in the composition of the generated text. 
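The curation workflow described above can be sketched as a loop. This is my own hypothetical illustration, not the actual process used for the Response; `generate_candidates` stands in for repeated calls to a text-generation API.

```python
import random

def generate_candidates(prompt: str, n: int = 10) -> list[str]:
    # Hypothetical stub standing in for a GPT-3 API call that returns
    # n sampled continuations of `prompt`.
    rng = random.Random(prompt)
    return [f"{prompt} [continuation #{rng.randrange(10_000)}]" for _ in range(n)]

def curate(prompt: str, num_sentences: int, pick) -> list[str]:
    """Compose a text sentence by sentence: sample many generations per step
    and let a human (here, the `pick` callback) cherry-pick the best one."""
    chosen = []
    context = prompt
    for _ in range(num_sentences):
        candidates = generate_candidates(context)
        best = pick(candidates)      # the hidden human-in-the-loop step
        chosen.append(best)
        context = best               # condition the next round on the pick
    return chosen

# A stand-in "human" that always picks the lexicographically smallest candidate:
essay = curate("Dear philosophers,", num_sentences=4, pick=min)
print(len(essay))  # 4 curated sentences
```

The point is that the impressive final text is a function of both the model and the (invisible) `pick` step — exactly the part the clickbait framing omitted.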
A more likely scenario for the cogent text in the Response is illustrated here, which raises an interesting design question of how best to faithfully portray generated content (not a topic of this issue but worth exploring from a safety/trust point of view).</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/bdac19d58028ca7adc5ade01df29da28604e49e8e554a5ea01d74b22d7fd0af0.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p><a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.raphaelmilliere.com/">Raphaël Millière</a>, the author of the “Response”, to his credit, published the<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/raphamilliere/status/1289226960279764992?s=20"> details</a> of the production process later, which was only shared a few dozen times as opposed to more than a thousand or so times for the original misleading clickbait. As usual, misinformation flies, and the truth comes limping after it.</p><blockquote><p><strong>Aside:</strong> The word misinformation means many things. Withholding some facts to misrepresent something is a kind of misinformation. 
For a good taxonomy of misinformation, I recommend this<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://medium.com/1st-draft/fake-news-its-complicated-d0f773766c79"> article</a> from<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://cyber.harvard.edu/people/dr-claire-wardle"> Claire Wardle</a>, a fact-checker and misinformation researcher.</p></blockquote><h3 id="h-the-turks-gambit" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>The Turk’s Gambit</strong></h3><p>Such sensational overrepresentations of technology are commonplace in the Valley. Many demos in VC pitches are carefully orchestrated wizard-of-oz shows, much like Von Kempelen impressing Maria Theresa’s court with his chess-playing “automaton” — <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://en.wikipedia.org/wiki/The_Turk">The Turk</a>.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/a9f1f0d84d47596cf10218d87c953ad377370652fd25435f221f658189cd3884.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>There are accounts (a personal favorite is by<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/Turk-Famous-Eighteenth-Century-Chess-Playing-Machine/dp/B000HWZ28Q"> Tom Standage</a>) of how Kempelen, and later Mälzel, captivated audiences ranging from peasants to nobility to the scholars of the time on the Turk’s abilities for almost a century across Europe and America before its limits and workings were discovered. 
<strong>The Turk was a marvel of engineering and ingenuity, but more importantly, a storytelling device.</strong> It captivated generations to come — e.g., Charles Babbage was impressed by it — and raised questions that weren’t asked frequently before, much like the GPT-3 demos are asking of us now:</p><ul><li><p>While Kempelen and Mälzel were showmen and some trickery was expected of them, how does one ethically present results for technologies like GPT-3? As we will see, this is not just a question of ethics and attribution, but also a question of AI Safety — i.e., preventing AI models from being harmfully utilized.</p></li><li><p>How do we avoid the steep descent into the “trough of disillusionment” that inevitably comes after peak hype and fast-forward our way to the “slope of enlightenment” and the “plateau of productivity”? If we clear the clouds of hype, the resulting clarity will make us ask the right questions about the technology.</p></li></ul><h3 id="h-ai-model-capability-hype-is-fundamentally-an-ai-safety-issue" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>AI Model Capability Hype is Fundamentally an AI Safety Issue</strong></h3><p>Bringing clarity amid a model capability hype is useful for identifying true product/business opportunities. But a more critical purpose is to ensure our implementation choices lead to products that are safe for the consumer and the world we inhabit.</p><p>Safety issues from AI model hype can arise in two different ways. The first is when product builders overstate model capabilities and, either knowingly or unknowingly, set wrong expectations with customers. These are usually self-correcting (unless you’re on a 4y startup exit trajectory) as customers inevitably complain and regulatory bodies step in, but not without significant harm to the company building the product and its customers. 
Tesla marketing the driver-assist feature of its cars as “Full Self-Driving” in its product promotion materials (and in tweets from Elon Musk himself) is an example.</p><p>Customers misled into believing these hyped-up capabilities could potentially endanger themselves and others due to misplaced trust. As AI models become easier to use (as GPT-3’s few-shot examples promise), folks building with AI models will increasingly not be the AI experts who designed those models. Building appropriate safety valves into the product and a regulatory framework around its use becomes critical.</p><p>The second way AI models can become unsafe due to hype is by customer overreach. People are inherently creative in how they use their tools. Folks using AI models outside of the advertised purposes for fun or entrepreneurial reasons can similarly bring harm.</p><p>Good policies, responsible communication practices, regulation, and consumer education are indispensable for creating an environment of safe consumption of AI technologies. Many of these practices are often at odds with short-term gains but not necessarily so with long-term rewards. There is a lot more to talk about in AI Safety, but in this issue, I will focus on the question: <strong><em>how do we free ourselves from the tyranny of appearances of AI models and truly understand their automation capabilities?</em></strong></p><h3 id="h-what-is-ai-automation-and-what-isnt" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0">What is AI Automation, and what isn’t?</h3><p>AI automation is not a dualistic experience. One of the dangers of hype that over-attributes capabilities to a system is that we lose sight of the fact that automation is a continuum as opposed to a discrete state. 
In addition to stoking irrational fears about automation, this kind of thinking also throws out of the window any exciting partial-automation possibilities (and products) that lie on the spectrum.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/b8777c9187fb0ca21b37a284bd392180d25387dedead2db90841bee49f281bd1.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>For convenience, we can break up the automation spectrum offered by the deployment of AI models for a task into five ordered levels:</p><ol><li><p><strong>Manual:</strong> The human does all the work for the task.</p></li><li><p><strong>Extend:</strong> The model extends/augments the human capability for the task.</p></li><li><p><strong>Offload:</strong> The model partially offloads the <em>complexity of the task</em> (more on this later) by automatically solving some of it.</p></li><li><p><strong>Fallback:</strong> The model solves the task entirely most times, and occasionally, it cedes control to humans because of the complexity of the task, or the human voluntarily takes over for whatever reason.</p></li><li><p><strong>Replace:</strong> The human becomes irrelevant in solving the task.</p></li></ol><p>I am deriving this categorization from<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.researchgate.net/publication/23882567_Human_and_Computer_Control_of_Undersea_Teleoperators"> Sheridan and Verplanck’s 1978 study</a> on undersea teleoperators, but adapted for modern AI models. 
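For concreteness, the five levels above can be encoded as an ordered type. This is a sketch of mine, not a standard taxonomy API; the names simply mirror the list.

```python
from enum import IntEnum

class AutomationLevel(IntEnum):
    """Five ordered levels of AI automation for a (model, application) pair."""
    MANUAL = 1    # human does all the work
    EXTEND = 2    # model augments human capability
    OFFLOAD = 3   # model absorbs part of the task's complexity
    FALLBACK = 4  # model usually solves the task, occasionally cedes control
    REPLACE = 5   # human is irrelevant to the task

# IntEnum makes the ordering explicit, so deployments can be compared directly:
assert AutomationLevel.OFFLOAD < AutomationLevel.FALLBACK
print(AutomationLevel.FALLBACK.name)  # FALLBACK
```

The ordering matters because, as argued below, a level attaches to a combination of model and application, and those combinations can be ranked.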
On the surface, this representation might appear similar to the<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.sae.org/standards/content/j3016_201806/"> SAE levels of autonomous driving</a> (those were influenced by the 1978 study as well). Still, the critical difference in this article is the inclusion of the task and the relation between the model, the application, and the task. The SAE autonomous driving levels, on the other hand, are focused on a fixed task — driving on “publicly accessible roadways (including parking areas and private campuses that permit public access)”. We cannot talk about the automation capabilities of a model in isolation without the task and its application considered together.</p><h3 id="h-the-interplay-of-task-and-application-complexity-in-ai-automation" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>The Interplay of Task and Application Complexity in AI Automation</strong></h3><p>Traditional software automation is focused on a specific task, and any work related to it is built from scratch. AI model-based automation, on the other hand, is unique in the sense that you can have a model trained on one task — say, face recognition — and used in multiple applications ranging from unlocking your phone to matching a suspect against a criminal database. Each of those applications has a different tolerance for false positives/false negatives (a.k.a. “risk”). This train-once-use-everywhere pattern is becoming increasingly popular with large parameter models that are trained on massive datasets with expensive compute. This pattern is especially true with large model fine-tuning and also with recent zero-shot and few-shot models.</p><p>While this pattern is cheap and convenient, a lot of the problems in AI deployments result from transferring expectations on a model from its use in one scenario to another and being unable to quantify the domain-specific risk correctly. 
Sometimes, just retraining the model on the application-specific dataset may not be sufficient without making changes to the architecture of the model (“architecture engineering”) to handle dataset-specific nuances.</p><p>To illustrate the dependence of automation level on the task and application, consider, for example, the task of machine translation of natural language texts. Google has one of the largest models for machine translation, so let’s consider that as our model of choice. The translation model, and certainly its API, appear general enough to invite an unsuspecting user to try it on her favorite application. Now let’s consider a few application categories where machine translation can be applied — News, Poetry, Movie subtitles, Medical transcripts, and so on. Notice that for the same model and the same task, the automation levels vary widely depending on the application. So it never makes sense to assign “automation levels” to a model, a task, or an application alone — only to the combination of the model and the application.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/007c376e5ed2343752d9095c766ca4c24ebbdd9f3c4e580d28f8a3930ab60a77.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>The automation level assignments in this figure are approximate and may not reflect Google’s current systems. This example is also overly simplified, as performance on “news” may not be homogeneous. Translation quality may differ across different news domains — e.g., financial news vs. political news — or across different languages. 
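One way to operationalize the point that a level belongs to a (model, application) pair is a lookup keyed by the pair. The assignments below are hypothetical placeholders of mine (not Google’s actual numbers, and not necessarily the figure’s values).

```python
# Levels: 1=Manual, 2=Extend, 3=Offload, 4=Fallback, 5=Replace.
# Hypothetical, illustrative assignments for one translation model:
AUTOMATION_LEVELS: dict[tuple[str, str], int] = {
    ("translation-model", "news"): 4,
    ("translation-model", "movie-subtitles"): 3,
    ("translation-model", "medical-transcripts"): 2,
    ("translation-model", "poetry"): 1,
}

def automation_level(model: str, application: str) -> int:
    """Automation is a property of the (model, application) pair,
    never of the model or the task alone."""
    # Unknown pairs default to Manual: no evidence, no automation.
    return AUTOMATION_LEVELS.get((model, application), 1)

print(automation_level("translation-model", "news"))    # 4
print(automation_level("translation-model", "poetry"))  # 1
```

The same model spans the whole spectrum depending on the application, which is exactly why a single "automation level for the model" is meaningless.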
Yet this simplification is useful for illustration purposes.</p><p>Hype (and the subsequent user dissatisfaction) often happens when folks conflate, knowingly or unknowingly, the automation level offered in one application with that of another. For example, someone claiming, after a good experience translating news articles to English, that all of humanity’s poetry will soon be accessible in English.</p><p>Consider an example with GPT-3: the success of AI Dungeon, a text adventure game, illustrates such a phenomenon. In the case of AI Dungeon, the outputs of the model could be interpreted creatively in any way you like (i.e., there are very few “wrong” answers by the model, if any). The error margin is effectively infinite, offering near-zero risk in directly deploying the model, modulo some post hoc filtering for toxic/obscene language and avoidance of sensitive topics. Based on those outcomes alone, it wouldn’t make sense to deploy the model unattended as it stands today for, say, business applications. And in some cases, like healthcare, it may make sense not to deploy the model at all.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/6f986c8693923c0b5386be50bc925f7518a120111f93c1c5d74cc8424d9c3571.png" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><blockquote><p><strong>Aside 1:</strong> So far, when we consider situations where models “fall back” to humans, we haven’t considered the thorny problem of knowing <em>when</em> to fall back.
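A reliable fallback pipeline presupposes a model that reports a usable confidence score. As a minimal sketch (the `predict_with_confidence` function, its outputs, and the threshold are all hypothetical):

```python
# Toy sketch of a human-fallback gate; all names and numbers are hypothetical.
FALLBACK_THRESHOLD = 0.9  # minimum confidence for the model to act unattended

def predict_with_confidence(x):
    """Stand-in for a model that reports how sure it is about its prediction."""
    return "some prediction", 0.42  # pretend (prediction, confidence) output

def answer_or_escalate(x):
    """Route low-confidence predictions to a human reviewer."""
    prediction, confidence = predict_with_confidence(x)
    if confidence >= FALLBACK_THRESHOLD:
        return ("model", prediction)
    return ("human", None)  # escalate: a person handles this input
```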
Today’s Deep Learning models, including GPT-3, are incredibly bad at telling when they are unsure about a prediction, so situations that require reliable fallback to humans cannot take advantage of such models.</p><p><strong>Aside 2:</strong> Modeling improvements can push a model’s generalization capability across a wide range of applications, but their deployability will still vary widely. In fact, in risk-intolerant applications with very little margin for acceptable error (consider the use of facial recognition for policing), we may choose to never deploy a model. In other applications, say the use of facial recognition to organize photos, the margin of acceptable error may be wide enough that one simply shrugs off the model’s failures and hopes for a better model update in the future.</p><p><a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://arxiv.org/abs/2006.06295">Edwards, Perrone, and Doyle (2020)</a> explore the idea of assigning automation levels to “language generation”. This is poorly defined: language generation, unlike self-driving, is not a task but a means to accomplish one of the many tasks in NLP, like dialogue, summarization, QA, and so on. For that reason, it does not make sense to assign an automation level for GPT-3’s language generation capabilities without also considering the task in question.</p></blockquote><h3 id="h-capability-surfaces-task-entropy-and-automatability" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>Capability Surfaces, Task Entropy, and Automatability</strong></h3><p>Another way to view the performance of a model on a task is to consider its <em>Capability Surface</em>. To develop this concept, first, let’s consider an arbitrary, but fixed, ordering of the applications (domains) where the model trained on a task is applied. Now, let’s plot the model’s automation capability level for each of these applications.
Now, consider an imaginary “surface” that connects these points. Let’s call this the capability surface.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/8511b505f5eb29d0f76a9ee29ddb8f76a64b2aefe3cfd942ba352d9a8eea4ea1.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>AI models rarely have a smooth capability surface. We then define <em>Task Entropy</em> as a measure of the roughness of this capability surface. As the model for a task becomes more sophisticated and is trained with increasingly large datasets and compute, the task entropy for that fixed model decreases over time. The task entropy is then a measure of the <em>Automatability</em> of a task using that model.</p><blockquote><p><strong>Aside:</strong> All this can be laid out more formally. But for this publication, I am taking a “poetic license” and focusing on developing intuitions.</p></blockquote><h3 id="h-capability-surfaces-of-few-shot-and-zero-shot-models" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>Capability Surfaces of Few-shot and Zero-shot models</strong></h3><p>In traditional AI modeling (supervised or fine-tuned), the task is usually fixed, and the application domains can vary. However, in zero-shot and few-shot models, such as GPT-3, not only do the application domains vary, but the tasks can vary too.
The tasks solved by a GPT-3-like model may not even be enumerable.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/a5efd62eec9f8e9bb493e5ee656e5e4bc1ad93380e01c20042b0a99c4ed64df9.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>In the case of GPT-3, the task may not even be explicitly defined, except with a list of carefully designed “prompts”. Today, the way to arrive at the “right” prompts is <strong>prospecting by querying</strong> the model with different prompts until something works. Veteran users may by now have developed intuitions for how to structure the prompt for a task based on experiential knowledge. Despite this care, the predictions may be unreliable, so carefully understanding the risks inherent to the application and engineering around them is indispensable.</p><blockquote><p><strong>Aside:</strong> GPT-3 is often touted as a “no code” enabler. This is only partially true. In many real-world problems, such as writing assistance and coding assistance, the amount of boilerplate is so high and the narratives are so predictable in language that it is reasonable to expect GPT-3 to contextually autocomplete big chunks based on the training data it has seen. This is not necessarily a negative. With bigger models like GPT-3, the Lego blocks we play with have become increasingly sophisticated, but a significant amount of talent and, many times, coding is needed to put together something non-trivial at scale. As <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/dennybritz">Denny Britz</a> points out (personal communication), “[the cost of error when writing code with GPT-3 is kind of high.]
If you need to debug and check GPT&apos;s code, and modify it, are you really saving much from copy/pasting Stackoverflow code?” Another problem with the generality of GPT-3-based applications is that they tend to cover only the most common paths, while reality has a fat tail of “one-offs”.</p></blockquote><p>Embracing this way of thinking using capability surfaces and task entropy allows us to develop a gestalt understanding of a model and foresee its many application possibilities without succumbing to hyped-up demos and misrepresented text completion examples.</p><h3 id="h-summary" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0"><strong>Summary</strong></h3><p>Automation is not an all-or-nothing proposition. An AI model’s automation capability is tightly coupled with the task and application it is used in. This realization leads to many exciting partial-automation possibilities that can be highly valuable. Studying a model’s Capability Surface and Task Entropy can be critical when applying the model to a task. While capability surfaces of traditional supervised and fine-tuned models are far from smooth, it only gets worse with few-shot models, where the tasks and applications are uncountably many. Studying capability surfaces of complex models is essential for piercing through the hype and ensuring safe deployments of those models.</p><p><strong>Disclosures:</strong> GPT-3 or similar models did not assist in any of this writing. This article mentions multiple entities. I was not incentivized in any way to include them.
They appear only because of the discussion I wanted to have.</p><p><strong>Acknowledgments:</strong> Many thanks to Jen Hao-Yeh, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/jamescham">James Cham</a>, <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/peterrojas">Peter Rojas</a>, and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/dennybritz?lang=en">Denny Britz</a> for reading/commenting on drafts of this article.</p><p><strong>Cite this article:</strong></p><pre data-type="codeBlock" text="@misc{clarity:ai-automation-1,   
  author = {Delip Rao},
  title = {GPT-3 Turk’s Gambit and The Question of AI Automation},
  howpublished = {\url{https://pagestlabs.com/clarity/ai-automation-1}},
  month = {August},
  year = {2020}
}
"><code>@misc{clarity:ai-automation-1,   
  <span class="hljs-attr">author</span> = {Delip Rao},
  <span class="hljs-attr">title</span> = {GPT-<span class="hljs-number">3</span> Turk’s Gambit and The Question of AI Automation},
  <span class="hljs-attr">howpublished</span> = {\url{https://pagestlabs.com/clarity/ai-automation-<span class="hljs-number">1</span>}},
  <span class="hljs-attr">month</span> = {August},
  <span class="hljs-attr">year</span> = {<span class="hljs-number">2020</span>}
}
</code></pre>]]></content:encoded>
            <author>delip-rao@newsletter.paragraph.com (Delip Rao)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/5004369ac5c1091955f2dde494d4c033e0eb0133070f0b87b7119ad5d6afd65a.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[GPT-3 and A Typology of Hype]]></title>
            <link>https://paragraph.com/@delip-rao/gpt-3-and-a-typology-of-hype</link>
            <guid>rdRckd1UYFZQSm4RBzNW</guid>
            <pubDate>Sat, 20 Nov 2021 04:11:30 GMT</pubDate>
            <description><![CDATA[Language is funny. Words have no “grounded” meanings unless you also take the full context of the reader and the writer, and yet we use words to get to that wordless essence with strangers we will never know. This was in full display when GPT-3 went viral, at least in Tech Twitter, over last weekend. Many researchers, including myself, used the words “GPT-3” and “hype” in the same Tweet to contain people&apos;s expectations. OpenAI&apos;s CEO, Sam Altman, even tweeted out “GPT-3 hype is way t...]]></description>
            <content:encoded><![CDATA[<p>Language is funny. Words have no “grounded” meanings unless you also take the full context of the reader and the writer, and yet we use words to get to that wordless essence with strangers we will never know. This was in full display when GPT-3 went viral, at least in Tech Twitter, over last weekend. Many researchers, including myself, used the words “GPT-3” and “hype” in the same Tweet to contain people&apos;s expectations. OpenAI&apos;s CEO, Sam Altman, even tweeted out “GPT-3 hype is way too much”.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/1b59e9fe7b64c00c3cf4fbce6c99ceec3e989cb3c7196b1d2b95ddeaec02d491.png" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>But is GPT-3 a hype? There is definitely buzz about it on Twitter. Should buzz about something be the same as hype? To examine these questions further, we need to first develop clarity around the word “hype” itself.</p><blockquote><p><strong>Background:</strong> The GPT in GPT-3 stands for <em>Generative Pre-Training</em> or <em>Generative Pre-training Transformers</em> depending on how you parse the earlier papers on the topic. 
In June 2018, Alec Radford and friends at OpenAI used a (then) novel combination of a generative deep learning architecture called the <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://openai.com/blog/language-unsupervised/">Transformer</a> (from Google) and a technique for training with unlabeled data called <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html">unsupervised pre-training</a> (also known as self-supervision). The resulting model is the <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://openai.com/blog/language-unsupervised/">GPT model</a>.</p><p>The <em>self-attention</em> mechanism of the Transformer offers a general way to let parts of the input depend on other parts of the input (with a lot of compute) without the model designer having to specify those relationships either by feature engineering or by architecture engineering. Somewhat presciently, the authors of the original Transformer model titled their paper “Attention Is All You Need”. The combination of Transformers and Unsupervised Pre-Training is not limited to the GPT family of models. There’s a slew of language models (BERT, XLNet, T5, ...) using this combination from Google, Facebook, and various university labs. <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://huggingface.co/">Hugging Face</a>, a natural language processing company, maintains most publicly available transformer models in an easy-to-use open-source <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://github.com/huggingface/transformers">software package</a>.</p><p>By early 2019, OpenAI had progressed their infrastructure to scale the same model to 10x the number of parameters and data.
This was <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://openai.com/blog/better-language-models/">GPT-2</a>. Later in 2019, we saw OpenAI introduce the <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://openai.com/blog/sparse-transformer/">SparseTransformer</a>, an improvement over the earlier transformer models to reliably attend over longer documents. Finally, in 2020, OpenAI released <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://arxiv.org/abs/2005.14165">GPT-3</a> via their <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://openai.com/blog/openai-api/">beta API</a>, which created the buzz in question. GPT-3 not only scales up the amount of data and compute used over GPT-2 but also replaces the vanilla Transformer with the SparseTransformer and other improvements to produce a model with the best zero-shot and few-shot learning performance to date. <em>Few-shot learning</em> refers to AI models/systems that can learn from a handful of examples. <em>Zero-shot learning</em> can do that with no training examples (think “fill in the blanks” kind of problems that rely on knowledge). Traditional ways to consume transformer models use a technique called fine-tuning, where you adapt models for new scenarios by retraining on new labeled data.
Beginning with GPT-2, OpenAI pushed few-shot and zero-shot learning as the primary way to consume transformer models, and it appears the promise has landed with GPT-3.</p><p>A couple of important points to note: 1) Besides OpenAI&apos;s SparseTransformer, there are <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/">consistent improvements in Transformer technology</a> to speed up training, use fewer resources, and improve attention over longer contexts, so one should expect the cost of this training (GPT-3&apos;s training budget is ~10-15 million USD) to go down significantly within the next year; 2) since GPT-3 is entirely trained on publicly available datasets, the only moats to reproducing the work are budget and talent. Of these, the latter is the harder challenge but not insurmountable for big AI research labs, like Baidu, Google, Facebook, Amazon, and some very select startups; and 3) the few-shot problem-solving capability of GPT-3 and other transformer models is not universal.
While the model consistently impresses with few-shot learning for complicated tasks and patterns, it can fail, for example, on something as simple as learning to reverse a string even after seeing 10,000 examples.</p><p>The few-shot learning capabilities of GPT-3 led to some very interesting demos, ranging from <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/jsngr/status/1287026808429383680">automatic</a> <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/sharifshameem/status/1282676454690451457">code</a> <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/zolidev/status/1286349416530620416">generation</a> to “<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/paraschopra/status/1284801028676653060?s=20">search engines</a>” to <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://tinkeredthinking.com/?id=836">writing assistance</a> and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.gwern.net/GPT-3">creative fiction</a>. Many of these applications are first-of-a-kind and enable things that were not possible before, making the excitement and hype around GPT-3 understandable.</p></blockquote><h3 id="h-a-typology-of-hype" class="text-2xl font-header !mt-6 !mb-4 first:!mt-0 first:!mb-0">A Typology of Hype</h3><p>The word hype implies something is amplified (often unjustly) and, therefore, the thing doesn&apos;t deserve as much attention. To say GPT-3 is a technology “hype” is to dismiss what appears to be a qualitatively different model that&apos;s capable of solving complex problems that haven&apos;t been solved before (esp.
in zero/few-shot settings) -- see Melanie Mitchell&apos;s <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/MelMitchell1/status/1285270704313610241?s=20">analogy experiments</a> or Yoav Goldberg&apos;s <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/yoavgo/status/1284322876619882503?s=20">linguistic probing experiments</a>. This is not to say folks aren&apos;t amplifying it, either cluelessly or irresponsibly, with claims of <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/sonyasupposedly/status/1284188369631629312?s=20">sentience</a> and <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://twitter.com/rauchg/status/1282449154107600897?s=20">general intelligence</a>. Then there is a spectrum from folks downplaying it to folks being completely dismissive of it. To call GPT-3 hype and, hence, to not pay attention to it is to throw out the champagne with the cork. Many who worked on Machine Learning long before Deep Learning became formalized as a discipline eagerly dismissed Deep Learning as “hype” in its early days and missed several exciting opportunities to contribute.
Nothing illustrates this better than the sentence <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="http://cs.brown.edu/people/echarnia/">Eugene Charniak</a> wrote in the preface to his <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/Introduction-Deep-Learning-MIT-Press/dp/0262039516">Deep Learning book</a>.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/7f288290304824119bedb0e471a1359376acac8a0f9b48cbceaa91f838c4c34f.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>Prof. Charniak&apos;s humbling confession raises several questions: How does one distinguish real breakthrough technologies from pure manufactured social media shenanigans when faced with a buzzing topic? How does one not get too caught up with folks who downplay and fail to see something? How do we see through potential inflations in the narratives of folks amplifying something? These are important questions to ask, especially with emerging technologies since, by definition, the wider community hasn&apos;t yet had a chance to dive deeper and codify the knowledge surrounding it.</p><p>AI moves faster than most computer science fields, and we want to operate and make decisions in the zone between the “innovation trigger” and the “peak of inflated expectations” of the <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.gartner.com/en/documents/3887767/understanding-gartner-s-hype-cycles">Gartner hype cycle</a>.
We will have to rely on ways of looking at hype other than the frameworks outlined in, say, Christensen&apos;s <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://www.amazon.com/Seeing-Whats-Next-Theories-Innovation/dp/1591391857">Seeing What&apos;s Next</a> and other classics. Further, none of the traditional management literature looks at early hype from a social media POV, where the hype creators have incentives highly disparate from those of the “market”.</p><p>To do this, I will argue in the rest of the post why it is important to consider both the folks amplifying the topic and those downplaying it, to arrive at a gestalt understanding without getting too influenced by individual social media actors. Interestingly, the word “downplay” is not a good antonym for hype in the context of social networks, where the default function is amplification; even downplaying acts as a contributor to the overall buzz. For the sake of convenience, let&apos;s call hype <em>+hype</em> and the opposite of hype <em>-hype</em>. The term <em>-hype</em> can vary from casual hedging to downplaying to being dismissive to something downright vicious. Like how some Inuits purportedly have more than 50 words for snow, I feel it is essential to have different words for different kinds of <em>+hype</em> and <em>-hype</em>.
Since I&apos;m no Tolkien, instead of inventing new words, I will give you a 2x2 to explain the continuum of hype.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/2f86fe48acadb184f35098eb873eebdd7e275ef5d35ad028b196e6107c3e00e0.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>Before I dive into the 2x2 (<a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://drive.google.com/file/d/1cIOsPmJhrar4cCCGitphPN9UC9avSUeF/view?usp=sharing">higher-resolution image here</a>), let me first point out some usual caveats:</p><ol><li><p>The trouble with putting things in discrete categories is that it can give the impression that those things are bounded and non-overlapping, when the reality is a complex entanglement. Since that&apos;s too hairy to examine, I use discrete categories like “folks who created it”, “futurism-types”, and so on. My naming of the categories can be iterated on and improved, since naming is one of the hardest problems not only in Computer Science but elsewhere too.</p></li><li><p>The (X, Y) coordinates for various categories on the 2x2 are also approximate and, in some cases, reflect the bias of their creator. If you disagree with something, it is useful to mentally rearrange those categories or have a discussion about it. I would love to hear that!</p></li><li><p>As with any 2x2, it&apos;s a thinking device and not a definitive map. So, in summary, there is nothing sacrosanct about this 2x2.
Still, hopefully, the rest of the post will explain why this mental model can help us experience the texture of <em>+hype</em> and <em>-hype</em> around an emerging technology better.</p></li></ol><p>To construct this 2x2, a useful axis to partition the hype continuum is the direct experience or knowledge of the person w.r.t. the technology in question. Other possibilities exist, but since I am most interested in understanding the relative merits of the <em>+hype</em> and <em>-hype</em>, direct experience/knowledge is a good discriminator.</p><p>At the edges of the 2x2 is the <em>zone of engagement seekers</em>. You have to watch out for people in these zones. They will co-opt any topic du jour and create content along the +hype and/or the -hype to drive engagement. They use “Wow!”, “SHOCKING!”, and other hyperbole to aid the spread of their message. Followers oblige. In the middle, around the low |hype|, there is a narrow band (in yellow), which is the <em>zone of caution and indecisiveness</em>. These folks don&apos;t talk much because either they choose not to as a policy or they are waiting for more information. These folks are not super useful either for drawing conclusions about the buzz topic. Then we have our four quadrants, A–D. Each quadrant is populated with a few examples of contributors to the overall buzz. As a mental model exercise, I will be going over each of the quadrants, explaining what they are, and applying them to break down the GPT-3 buzz and conclude as necessary.</p><h4 id="h-a-more-direct-experienceknowledge-of-the-tech-hype" class="text-xl font-header !mt-6 !mb-3 first:!mt-0 first:!mb-0">A: (more direct experience/knowledge of the tech, +hype)</h4><p><strong>Folks who created it:</strong> This is understandable, as the authors would want to spread awareness of their work.
In many cases, such folks can overreach, but if they are researchers of repute, then it might be useful to pay close attention to what they are saying and corroborate with other researchers in the field or do some lit search on your own. If it is not the authors but the PR department of the company or university, take all claims with a fist of salt. Interestingly, for GPT-3, unlike GPT-2, we saw very little direct comms from OpenAI. I think this is because OpenAI might have revamped their strategy to invite a bunch of trusted folks in early on the API access and let them build nifty video demos to become billboards for GPT-3. It&apos;s very clever and should be in the playbook of every company building creator tools/APIs.</p><p><strong>Folks who see potential applications:</strong> These are inventor types. They are generally optimistic and come up with ten ideas on how to use anything you show them and are enthusiastic about sharing that. OpenAI harvested many of them either by design or by chance. I don&apos;t have data on this, but most of the early demo videos of GPT-3 apps I saw on Twitter came via Y-Combinator alums. Every demo video became a pitch deck for GPT-3. Tech Twitter lit up after this even though the official arXiv paper first came out by the end of May. While this is exciting &quot;proof of work&quot; that demonstrated what is possible, many of the videos are generated from cherry-picked content, so they appear more magical than they would if you were to use the API directly. If the results are cherry-picked, should we dismiss GPT-3&apos;s buzz? No. <strong><em>Perhaps the real magic of GPT-3, as it stands today, is GPT-3 with a human in the loop.</em></strong> There are numerous products in that category that are worth exploring with this model.
Eventually, OpenAI released the GPT-3 API to a bunch of NLP researchers (I have yet to apply), and we already see interesting results coming out of that.</p><blockquote><p><strong>Aside:</strong> If you watched some of the demo videos and were floored by the capabilities of these new GPT-3-powered products, it is important to note that what you are watching is not the real-time performance and the results are lightly/heavily cherry-picked depending on the task. As a result, the videos can be more impressive/misleading than real-world experience (a general lesson for all video demos). That is not to say there is no signal in those videos. If anything, one should take the results in the videos as the upper bound of what&apos;s possible with current tech. We still have a long way to go to that upper bound. Max Woolf gave <a target="_blank" rel="noopener noreferrer nofollow ugc" class="dont-break-out" href="https://minimaxir.com/2020/07/gpt3-expectations/">a good review of the limitations</a> of the current (July 2020) API that&apos;s worth reading.</p></blockquote><p><strong>Folks who think it will unblock their future work:</strong> These are the hopefuls. But it&apos;s usually tricky to tease them out unless you know them and their work very well.</p><p><strong>Folks who “see the light”:</strong> These are folks who understand something deeper about the tech that others don&apos;t, despite not having a ton of evidence. Think of Hinton, (Yoshua) Bengio, LeCun, or Schmidhuber believing in deep networks way before the compute infrastructure existed. This crowd is usually a minority and is probably not adding much visibility to the buzz, but they are important to listen to. The paradox is there are many undiscovered Hintons in the world right in the thick of the community.
We have no one to blame but our biases.</p><h4 id="h-b-less-direct-experienceknowledge-of-the-tech-hype" class="text-xl font-header !mt-6 !mb-3 first:!mt-0 first:!mb-0">B: (less direct experience/knowledge of the tech, +hype)</h4><p><strong>Angels and Investors:</strong> While some investors have direct experience in the tech they are investing in, many tend to rely on social proofs, pattern matching, and FOMO. For a sufficiently advanced tech like GPT-3 with very little background about it (other than a 70-page ArXiv paper), demos can be compelling to pattern-match against other breakthrough technologies they&apos;ve invested in. Unless they understand that demos can be misleading and there is sufficient community exploration of the limits of the tech in question, investor hype is bound to happen. This doesn&apos;t mean there isn&apos;t a signal in the buzz the investor-type crowd is generating; it just means you need a substantial discounting factor there. Also, since these folks are relatively well connected in social media, the “buzz” we observe from their posts needs to be normalized by their follower counts.</p><blockquote><p><strong>Aside:</strong> The excitement for investors is understandable since few-shot learning in GPT-3 provides the promise of no-code and low-code. Someone wrote GPT-3 is the next Bitcoin in terms of value creation. This is absurd. While GPT-3 certainly makes new possibilities available, I am going to bet with anyone willing that Stackoverflow will be alive and teeming with flesh-and-blood coders asking for and offering help with low-level programming bugs for several years to come. If anything, the investor excitement around GPT-3 reminds me of the early days of Deep Learning, when mentioning DL on the pitch deck was a great fundraising strategy for the founder.
Just as Deep Learning will not help if you don&apos;t have a good business or product, a sprinkling of GPT-3 is very unlikely to create value out of thin air: the ease-of-use promise of such tech necessarily implies more competition, so all the other factors that separate winners from losers become more important than your ability to use GPT-3. Another perspective to consider: when Generative Adversarial Networks (GANs) and style transfer were in vogue, folks thought they would “replace artists” or at least flood the art market with computer-generated art. I won’t hold my breath for that to happen. Instead, GANs are slowly making significant progress in less talked-about areas like data compression, speech synthesis, ...</p></blockquote><p><strong>First-time smitten:</strong> Some technologies enable certain folks to do something for the very first time. Imagine someone who has no idea how to write code and has never developed an app before, but now sees a path to building something. That empowerment can feel exciting, and these folks contributed substantially to the GPT-3 buzz. There is a useful signal here. While these folks are potential customers for future startups, they don&apos;t indicate much about the limits of the technology itself.</p><p><strong>Velvet-rope winners:</strong> OpenAI&apos;s product launch strategy for GPT-3 was, in effect, similar to Clubhouse&apos;s, where folks wait behind a metaphorical “velvet rope” to be admitted to the club. Practically everyone who got access to the API could not wait to post screenshots of the API in action, even if they were posting text generation examples that had been posted a hundred times before, making it purely a signaling action. There is not much signal in the buzz added by this crowd.</p><p><strong>Futurism types:</strong> These are folks who have been waiting for space pods and personal jetpacks. You can count on their excitement for any scientific progress. 
But their contribution to the buzz is not going to inform you much about the tech either.</p><h4 id="h-c-less-direct-experienceknowledge-of-the-tech-hype" class="text-xl font-header !mt-6 !mb-3 first:!mt-0 first:!mb-0">C: (less direct experience/knowledge of the tech, -hype)</h4><p><strong>Cynics, Contrarians, and Negative Campaigns:</strong> Ignore them. <strong>Mooks:</strong> These are folks who blindly retweet whomever they follow, out of allegiance.</p><p>There is no signal in this quadrant, so you can skip ahead to quadrant D, which, in my opinion, is quite interesting.</p><h4 id="h-d-more-direct-experienceknowledge-of-the-tech-hype" class="text-xl font-header !mt-6 !mb-3 first:!mt-0 first:!mb-0">D: (more direct experience/knowledge of the tech, -hype)</h4><p><strong>False alarm survivors:</strong> These are folks who have gotten excited by similar promises in the past and have been let down. As a self-protection mechanism, their default response is to downplay the impact of the tech. This crowd is usually made up of experienced folks who know what they are talking about, which makes their opinions harder to ignore.</p><p><strong>Shackled by your previous work:</strong> Sometimes folks fail to see what&apos;s new in a work like GPT-3 because they&apos;ve been mentally imprisoned by the work they are doing or did in the past. It can come from a place of gross oversimplification — for example, someone dismissing GPT-3 as “just another language model” — or a failure to see the shift in the future as a discontinuity from the past. This is best illustrated by the ImageNet competition results. In 2012, the winning entry used a Deep Learning-based approach and beat all other entries by a wide margin. Between 2012 and 2013, a lot of folks failed to see this “margin” as a shift rather than a linear extension of existing techniques. 
I was one of them, and I didn&apos;t course-correct until 2013-2014.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/a2bdb03cca5bf110eb64450bac2e62e18fd0f89cd4d9e86027cdfd1e6b0c81b9.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>Something similar is happening with GPT-3, if not as dramatically. Zero-shot and few-shot learning are not new. But there appears to be a qualitative (and definitely quantitative) difference between the few-shot learning capabilities of GPT-3 and those of previous transformer models, including GPT-2. Part of this could be due to the narrative the GPT-3 paper itself offers: “We use the same model and architecture as GPT-2 ...”</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/0a40b7c836691392c19ec6c65c21abbdcbb4479e97e2bd17d144f7c909b55095.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p>The opaqueness of the API is not helping external researchers dive deep into GPT-3, but that could change: OpenAI might open up some of its models to more scrutiny than black-box-level <em>probing</em> experiments allow, or GPT-3-like models will get replicated elsewhere in open source. One thing is for sure: GPT-3 has unleashed interest in few-shot and zero-shot learning beyond academic discussions, a trend that will only continue to strengthen. A general trend that subsumes GPT-3 is self-supervision-based representations. 
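</p><p>To make the few-shot setup concrete, here is a minimal sketch in Python of what “learning” means in this setting: a handful of labeled examples are packed into the prompt, and the model is asked to continue the pattern, with no gradient updates at all. The actual model call is deliberately left out, and <code>build_few_shot_prompt</code> is a hypothetical helper for illustration, not OpenAI&apos;s API.</p>

```python
# A sketch of few-shot prompting, standard library only.
# The only "training" is a handful of labeled examples placed in the prompt;
# no model weights are updated. The language model call itself is omitted.

def build_few_shot_prompt(task_description, examples, query):
    """Pack a task description, labeled examples, and a new query into a prompt."""
    lines = [task_description, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # The model is expected to continue the pattern after the final "Output:".
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

examples = [
    ("the movie was a delight", "positive"),
    ("two hours I will never get back", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "a surprisingly moving finale",
)
print(prompt)
```

<p>In a real system, the returned string would be sent to a large language model for completion, and the continuation after the final “Output:” is the prediction. 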
To use Yann LeCun&apos;s famous cake analogy, if supervised learning (what drove progress until 2018-2019) is the icing on the cake, self-supervision is the cake itself. <strong>Self-supervision will change all of Artificial Intelligence</strong>.</p><figure float="none" data-type="figure" class="img-center" style="max-width: null;"><img src="https://storage.googleapis.com/papyrus_images/0b9d4f821aae8e0c7ef26ab306da7ed9303073b3e76d9df7c150d543a3367cd1.jpg" alt="" blurdataurl="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACwAAAAAAQABAAACAkQBADs=" nextheight="600" nextwidth="800" class="image-node embed"><figcaption HTMLAttributes="[object Object]" class="hide-figcaption"></figcaption></figure><p><strong>Shackled by other constraints:</strong> Another source of <em>-hype</em> is folks who think (rightly) that GPT-3 will not work for them today, either because it&apos;s too slow or too expensive. While those are valid constraints today, vesting our beliefs 100% in them will only restrict our freedom to innovate, since many of those constraints will be solved by scientific progress. The buzz generated by such posts is not useful for assessing the value of the technology.</p><h4 id="h-summary" class="text-xl font-header !mt-6 !mb-3 first:!mt-0 first:!mb-0">Summary</h4><p>GPT-3 and the buzz behind it mark the beginning of few-shot learning&apos;s transition from research to actionable products. But every breakthrough technology comes with a lot of social media buzz that can delude our thinking about its capabilities. Examining the buzz closely, as we’ve done here with systematic mental models, can help expose some of our biases. No mental model is perfect, and all come with biases of their own, so using several such models and having a conversation about them is essential. 
To further reduce bias, those conversations should be diverse, open, and inclusive.</p><p>Few-shot learning with GPT-3-like models can provide entirely new ways of building solutions. Given the state of the art today, the sweet spot for leveraging few-shot learning is in situations that involve a human in the loop. It would be an overreach to confer “autonomous” status on these models today, but it would be just as silly not to use them at all because the models make “errors”. Here’s a straightforward business idea: use GPT-3 and humans in the loop to build a data annotation business for training traditional low-resource supervised learning models. A human-in-the-loop model distillation!</p><p>GPT-3 marks the beginning of a Cambrian explosion of few-shot learning products, but that era will not be limited to or dominated by GPT-3 alone. We will see few-shot learning capabilities beyond written text. Imagine the possibilities of few-shot learning from images, videos, speech, time series, and multi-modal data. All this will happen in the early part of this decade, resulting in a proliferation of machine learning in more aspects of our lives. This proliferation will raise the urgency of working on the bias, fairness, explainability, and transparency of ML models. So will the urgency of fighting adversarial applications of ML models.</p><p><strong>Disclosures:</strong> This article mentions multiple entities, including OpenAI. Neither the author nor Page Street Labs was incentivized in any way to include them. They appear only because of the discussion I wanted to have.</p><p><strong>Acknowledgments:</strong> Thanks to Jen-Hao Yeh for reading and commenting on early drafts of this.</p><p><strong>Cite this:</strong></p><pre data-type="codeBlock" text="@misc{clarity:gpt3-hype,
  author = {Delip Rao},
  title = {GPT-3 and a Typology of Hype},
  howpublished = {\url{https://pagestlabs.substack.com/p/gpt-3-and-a-typology-of-hype}},
  month = {July},
  year = {2020}
}
"><code>@misc{clarity:gpt3-hype,
  <span class="hljs-attr">author</span> = {Delip Rao},
  <span class="hljs-attr">title</span> = {GPT-<span class="hljs-number">3</span> and a Typology of Hype},
  <span class="hljs-attr">howpublished</span> = {\url{https://pagestlabs.substack.com/p/gpt-<span class="hljs-number">3</span>-and-a-typology-of-hype}},
  <span class="hljs-attr">month</span> = {July},
  <span class="hljs-attr">year</span> = {<span class="hljs-number">2020</span>}
}
</code></pre>]]></content:encoded>
            <author>delip-rao@newsletter.paragraph.com (Delip Rao)</author>
            <enclosure url="https://storage.googleapis.com/papyrus_images/82556579b84c72a59da7edd21115215d88fb9d5803b0badfd0e72977a48acdc9.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>