Beyond the current news cycle about whether AIs are sentient is a more practical and immediately consequential conversation about AI value alignment: whether and how AIs can be imbued with human values. Today, this turns on the even more fundamental question of whether the newest generation of language models can or can’t understand concepts — and on what it means to understand.¹

If, as some researchers contend, language models are mere “babblers” that randomly regurgitate their training data — “garbage in, garbage out” — then real AI value alignment is, at least for now, out of reach. Seemingly, the best we can do is to carefully curate training inputs to filter out “garbage”, often referred to as “toxic content”, even as we seek to broaden data sources to better represent human diversity. There are some profound challenges implied here, including governance (who gets to define what is “toxic”?), labor (is it humane to employ people to do “toxic content” filtering?²), and scale (how can we realistically build large models under such constraints?). This skeptical view also suggests a dubious payoff for the whole language model research program, since the practical value of a mere “babbler” is unclear: what meaningful tasks could a model with no understanding of concepts be entrusted to do? If the answer is none, then why bother with them at all?

On the other hand, if, as I’ll argue here, language models are able to understand concepts, then they’ll have far greater utility — though with this utility, we must also consider a wider landscape of potential harms and risks. Urgent social and policy questions arise too. When so many of us (myself included) make our living doing information work, what will it mean for the labor market, our economic model, and even our sense of purpose when so many of today’s desk jobs can be automated?

This is no longer a remote, hypothetical prospect, but attention to it has waned as AI denialism has gained traction. Many AI ethicists have narrowed their focus to the subset of language model problems consistent with the assumption that they understand nothing: their failure to work for digitally underrepresented populations, promulgation of bias, generation of deepfakes, and output of words that might offend.

These are serious issues. However, today’s AI models are becoming far more generally capable than this narrow focus implies. AI can engineer drugs³ (or poisons⁴), design proteins,⁵ write code,⁶ solve puzzles,⁷ model people’s states of mind,⁸ control robots in human environments,⁹ and plan strategies.¹⁰ These things are hard to dismiss as mere babble; they’ll increasingly involve substantive interactions with people and real outcomes in the world, either for good or for ill. If AIs are highly capable but malicious, or just clueless about right and wrong, then some of the dangerous outcomes could even resemble those popularized by the very different community of philosophers and researchers who have written, both more sensationally and less groundedly, about AI existential risk.¹¹

It’s becoming increasingly clear that these two disconnected camps in the AI ethics debate are each seeing only part of the picture. Those who are deeply skeptical about what AI can do haven’t acknowledged either the risk or the potential of the emerging generation of general-purpose AI.

On the other hand, while those in the existential risk camp have been expansive in their articulation of potential harms and benefits, they consider “Artificial General Intelligence” (AGI) to be so distant, mysterious, and inscrutable that it’ll emerge spontaneously in an “intelligence explosion” decades from now;¹² AGI might then proceed, perhaps due to some Douglas Adams-ish programming oversight, to turn the entire universe into paperclips, or worse.¹³

Such doomsday scenarios may have seemed credible in 2014, but they’re far less so now that we’re starting to understand the landscape better. Language modeling has proven to be the key to making the leap from the specialized machine learning applications of the 2010s to the general-purpose AI technology of the 2020s. The result is hardly an alien entity with inscrutable goals. Anyone can chat with a language-enabled model, and it can respond in ways so familiar that concern has shifted overnight from worrying about AI’s alienness to worrying about our tendency to anthropomorphize it. It’s all too human-like!

Although anthropomorphism does pose its own risks,¹⁴ this familiarity is good news, in that it may make human value alignment far more straightforward than the existential risk community has imagined. This is because, although our biology endows us with certain pre-linguistic moral sentiments (such as care for offspring and in-group altruism, both of which we share with many other species), language generalizes these sentiments into ethical values, whether widely held or aspirational. Hence oral and written language have mediated the fields of ethics, moral philosophy, law, and religion for thousands of years.

For an AI model to behave according to a given set of ethical values, it has to be able to understand what those values are just as we would — via language. By sharing language with AIs, we can share norms and values with them too. We have early evidence that this approach works, and as language-enabled models improve generally, so too will their ability to behave according to ethical principles. This is the main point I hope to convey in this essay.

In itself, the ability to endow an AI with values isn’t a panacea. It doesn’t guarantee perfect judgment — an unrealistic goal for either human or machine. Nor does it address governance questions: who gets to define an AI’s values, and how much scope will these have for personal or cultural variation? Are some values better than others? How should AIs, their creators, and their users be held morally accountable? Neither does it tackle the economic problem articulated by John Maynard Keynes in 1930 — how to equitably distribute the collective gains of increasing automation,¹⁵ soon to include much intellectual labor.

What it does offer is a clear route to imbuing an AI with values that are transparent, legible, and controllable by ordinary people. It also suggests mechanisms for addressing the narrower issues of bias and underrepresentation within the same framework.

My view is that AI values needn’t be — and shouldn’t be — dictated by engineers, ethicists, lawyers, or any other narrow constituency. Neither should they remain bulleted lists of desiderata posted on the web pages of standards bodies, governments, or corporations, with no direct connection to running code. They should, instead, become the legible and auditable “operating handbooks” of tomorrow’s AIs.

Misunderstanding intelligence

A proper history of AI is well beyond our scope here. However, a bit of historical context can help us trace a path from 20th century conceptions of AI, to the Deep Learning revolution of the 2010s, to the broad or general AI we’re starting to see emerge in the 2020s. This context helps fill the gap between some of the current debates about AI and today’s reality.

Good Old Fashioned AI

The phrase “artificial intelligence” was coined by the organizers of the Dartmouth Summer Research Project on Artificial Intelligence in 1956. They held that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it”, and sought to make it possible for machines to “use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves”.

While neural networks played some role in their thinking, the Dartmouth researchers invented the term “artificial intelligence” partly to distance themselves from cybernetics, an existing approach to creating machines that could “think” by using continuous values to form predictive models of their environment.

Cybernetics pioneer Norbert Wiener with the Palomilla robot, c. 1949

Despite its ups and downs, the term “AI” seems here to stay, while “cybernetics” has sunk into obscurity. Ironically, today’s most powerful AI systems are very much in the cybernetic tradition: they use virtual “neurons” with continuous weights and activations to learn functions that make predictive models based on training data.

As recently as 2006, when the surviving members of the Dartmouth Summer Research Project held a 50th reunion, these founders doubted that the cybernetic approach could yield any meaningful progress toward intelligent machines. Overall, the mood was pessimistic; nothing seemed to be working.

Mainstream attempts at AI between 1956 and 2006 had often been based on logic, rules, and explicit programming, just like the rest of computing.¹⁶ This approach is now sometimes referred to as GOFAI, for “Good Old-Fashioned AI”. Much of classic computer science, including now-standard data structures and programming patterns, were developed in the quest for rule-based AI. In this sense, GOFAI was a highly productive research program, even if its grander ambitions missed the mark.

Combinations of rules and brute force (greatly aided by the exponential speedup of computing) were eventually able to beat expert humans at games that could themselves be characterized by fixed rules and discrete states, like checkers and chess.¹⁷ Such approaches made far less headway when it came to using language, forming abstractions and concepts, or even being able to make sense of visual and auditory inputs.

How do we recognize a bicycle?

Consider, for instance, looking at a picture of something, and deciding whether it’s a bicycle or not. This problem would likely have seemed straightforward, at least initially, to practitioners of Good Old-Fashioned AI. They believed that databases of knowledge encoded in the form of rules and logical propositions could produce intelligence; so they set out to encode all of the world’s “facts”, like “Wheels are round”, and “A bicycle has two wheels”. This turned out to be surprisingly hard to do — impossible, even — for a number of reasons.

For one, while we all know a bike when we see one, we have trouble saying why.¹⁸ More precisely, we can tell plenty of stories about why a particular something is or isn’t a bike, but these stories resist reduction to mechanical rules that fully capture our intuition. A bicycle with a trailer or training wheels might have three or four wheels, but of course it’s still a bike. If it has an engine it’s a motorcycle… unless the engine is smallish and electric, in which case it’s an electric bike.

The complications are endless. If we see a silly bicycle with shoes for wheels, we chuckle, because we still recognize that it’s a kind of bike even though we’ve never seen anything like it before, and it would likely break any prior rule-based definition of a bike.

Shoe bike from Epic Cycling, 2019

The kind of machine learning systems we began to make successfully in the 2000s and especially the 2010s (so-called “Deep Learning”) didn’t rely on hand-engineered rules, but on learning by example, and they were able, for the first time, to perform tasks like recognizing bikes reliably — even silly bikes.¹⁹ Beyond the practical advances this brought — including vast improvements in “narrow AI” applications like text recognition, working speech recognition (finally!), image recognition, video tagging, and much else — these approaches offered powerful lessons in knowledge representation, reasoning, and even the nature of “truth”, many of which we still haven’t come to terms with culturally.

Calculemus

There’s nothing inherently wrong with the kind of structured formal reasoning GOFAI embodies. When a problem or idea can be expressed in terms of unambiguous mathematical formulas or logical propositions, we can manipulate these using the rules of math and logic to prove or disprove statements, or to explore the implications of a theory. This kind of reasoning is a powerful tool, and it has given us bountiful gifts in math, the sciences, and technology over the past several centuries.

But formal reasoning is also limited. It’s a recent invention in human history, and despite the high hopes of its most ardent practitioners, it occupies a small niche in day-to-day life. Most people aren’t particularly skilled at formal reasoning,²⁰ and it has nothing to say about many human concerns.

The belief that reasoning could be applied universally found its clearest expression during the Enlightenment. Gottfried Wilhelm Leibniz (1646–1716), the co-inventor of calculus and a key figure in the early modern history of computing, believed that one day, we’d be able to formulate any problem mathematically. In this sense, he anticipated the Good Old-Fashioned AI agenda centuries before anyone had uttered the words “artificial intelligence”.

Leibniz imagined that disputes about any topic — politics, economics, philosophy, even ethics or religion — could be resolved the same way we do formal proofs:

If controversies were to arise, there would be no more need of disputation between two philosophers than between two accountants. For it would suffice to take their pencils in their hands, to sit down with their slates and say to each other (with a friend as witness, if they liked): Let us calculate [calculemus].²¹

There’s no reason to doubt that Leibniz meant this literally; he dedicated a significant part of his career to trying to develop a formal language expressive enough to allow any concept to be represented and manipulated like an algebraic variable. Though ambitious, nothing about this research program would have seemed unrealistic in 1700; indeed, what project better epitomizes the Age of Reason? Many AI researchers still believed some version of this to be possible throughout the 20th century, and a few keep the faith even today — though their numbers have dwindled.²²

Neuroscientists now know that the processes taking place in our own brains are computable,²³ but they’re nothing like the hard rules and lemmas of propositional logic.²⁴ Rather, even the simplest task — like recognizing a bike — involves comparing sensory input with vast numbers of approximate, (mostly) learned patterns, combined and recombined in further patterns that are themselves learned and approximate. This insight inspired the development of artificial neural nets, and especially of the many-layered Deep Learning approach.

I’ve used the term “approximate”, but this can be misleading. It’s usually wrong to think of the output of a neural net (artificial or not) as an imperfect or “irrational” approximation to an objective, rational reality that exists “out there”. The physics of torque, friction, wheels, and spokes may be universal, but our mental models of what counts as a bicycle aren’t. They’ve certainly changed a great deal since the 19th century. This very fuzziness has allowed us to play with the form of the bike over the years, to invent and reinvent. As bikes have evolved, our models of bikes have evolved — and vice versa.

None of our intuitions about object categories, living beings, language, psychology, or ethics (to name just a few) have remained constant throughout history. Such concepts are learned, and the learning process is both continuous and lifelong. Cultural accumulation works because each generation picks up where the last one left off. It would be absurd to believe that our current models, no matter how cherished, represent some kind of “end of history”, or that they’re successively better approximations of some Platonic ideal.

It’s not just that we have a hard time using logic to recognize bicycles. More fundamentally, there’s no logically defined “canonical bicycle” somewhere in the heavens. The same is true of more abstract concepts like beauty or justice.

Laws of Robotics

Science fiction writer Isaac Asimov’s I, Robot stories illustrate how GOFAI’s unrealistic ambitions have shaped our thinking about AI ethics. Asimov imagined a future in which all robots would be programmed with a set of standard “Laws” to govern their behavior:

A robot may not injure a human being or, through inaction, allow a human being to come to harm.
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

Of course, in Asimov’s stories as in all sci fi, trouble ensues — or there would be no plot! The trouble is typically lawyerly. Some combination of an unusual situation and apparently sound yet counterintuitive reasoning based on the Laws leads a hyper-rational robot to do something surprising — and not necessarily in a good way. The reader may be left wondering whether the issue could be “debugged” by simply adding one more Law, or closing a loophole — something Asimov himself undertook on several occasions over the years.

Asimov imagined that intelligent robots would have GOFAI-like mental processes, proceeding from raw stimuli to internal states to motor outputs using Leibnizian logic — calculemus! — to which these Laws could be added as formal constraints. This would make such robots clearly different from us; we don’t think so logically, as both common sense and many experiments in psychology and behavioral economics demonstrate.²⁵ Unexpected results wouldn’t, then, be the robot’s fault, any more than an unexpected output from a program is a computer’s fault.

Asimov’s imaginary robots were entirely “rational”; they might even be called ethically “perfect”. Trouble could only arise from bugs in the rules themselves, which, being of human origin, might not be complete or correct — or perhaps from the need for robots to interoperate with humans, whose own imperfections and irrationality could give rise to perverse consequences or contradictions.

Such was the case for HAL 9000, the shipboard computer from 2001: A Space Odyssey (1969). As HAL rather immodestly put it, “The 9000 series is the most reliable computer ever made. No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error.” When a problem does crop up, “It can only be attributable to human error.” In the story, HAL’s homicidal acts are indeed the result of contradictions in the computer’s instructions: user error!

While 2001: A Space Odyssey and the I, Robot stories remain cultural landmarks (and great entertainment), their visions of AI are flawed in the usual GOFAI ways. One could program a robot with Good Old Fashioned AI code, but executing such a program is mechanical; it doesn’t require the judgments and generalizations we associate with intelligence. Following instructions or policies written in natural language does require judgments and generalizations, though; it can’t be done “robotically”.

As humans, we have no universal agreement even on the most basic nouns in the Laws, such as what counts as a “human being” — an observation that has gained urgency for many with the repeal of Roe v. Wade — let alone how to weigh or interpret flexible terms like “inaction”, “injure”, and “harm”. Subtly different interpretations will lead to very different decisions, and when doing formal logic, the slightest wobble in any such definition will lead to logical contradictions, after which all bets are off: “does not compute”, as Star Trek’s Data (another fictional robot with GOFAI issues) might say.

Fundamentally, then, Asimov’s Laws are nothing like theorems, laws of physics, or computer code. They don’t bind to stable concepts or define mathematical relationships, because natural language isn’t math; words can’t be manipulated like algebraic variables or run like computer code. Rather, language offers a succinct way to express a policy requiring human-like judgment to interpret and apply. To calibrate such judgment, case law is generally needed: worked examples that clarify the intent and scope of the language, which may be subject to debate, vary culturally, and evolve over time.

So, while we have little choice other than to write ethical rules in natural language — an idea with a superficial resemblance to Asimov’s Laws — we need to keep in mind that programming is the wrong paradigm. Rather, applied ethics relies on language understanding, which in turn relies on learning, generalization, and judgment.

Since natural language isn’t code, unexpected or counterintuitive interpretations are best thought of as simply wrong, not “right based on a technicality” or consequences of “user error”. In a system based on learning rather than programming, errors in judgment are determined relative to the decisions made by thoughtful human judges looking at the same situation and operating from the same broad principles. Human judgment, changing over time, is the best and only available ground truth — necessarily noisy, culturally contingent, always imperfect, and never entirely fair,²⁶ but hardly alien or inscrutable.

AI for human interaction

When do robots need values?

Real robots in the early 21st century don’t look anything like those in Asimov’s stories. Today’s robotic arms aren’t attached to the robotic torsos of mechanical people walking around on city streets with us. More typically, real robots are bolted to the cement floors of factories, and perform the kind of repetitive tasks that used to be done by assembly-line workers.

The FANUC M-2000iA industrial robot

Talk of teaching such a machine the Laws of Robotics seems superfluous.²⁷ This isn’t because it’s incapable of causing injury; the eleven tonne FANUC M-2000iA can weld a joint or pick up a whole car anywhere within three and a half meters of its base; it could easily kill. However, ensuring human safety in the factory isn’t best done by giving the robot arm a conscience, but simply by structuring the environment and workflow around it in such a way that safety issues don’t arise.

A “human exclusion zone” is the best guarantee of physical safety. Indeed, the factories where FANUC robots are manufactured — by other robots! — have been “lights out” since 2001. These facilities can run 24/7 for weeks at a time with no human presence at all.²⁸ Typically, motion, infrared, and/or visual sensors will automatically stop the machinery if anything unexpected (especially a human) is detected within the exclusion zone.

Of course, some robots are designed to work alongside humans, like those at the self-checkout aisles of many supermarkets. However, their physical and computational capabilities are generally sharply curtailed. A robotic cashier with limited AI for product recognition may not be, per HAL 9000, “foolproof and incapable of error”, but if the worst it can do is to overcharge you for Meyer lemons when you put regular ones on the scale, the scope and impact of the harm seems… limited.²⁹

What we don’t tend to see today are robots that are truly capable of causing harm, like the FANUC behemoth, roaming environments where they can freely interact with humans, like Star Wars droids. Self-driving cars aspire to be the first, but they’ve been surprisingly slow to arrive,³⁰ perhaps because they’re still being developed using combinations of rule-based GOFAI reasoning and narrow AI. While such approaches suffice for routine driving on highways and suburban streets, they won’t produce a robot that can navigate New York or Mumbai during rush hour. Such ever-changing environments, requiring constant negotiation of (and with) pedestrians, bicycles, scooters, construction workers, and assorted animals, demand robust generalization and good judgment — sometimes even creativity.

As a corollary, self-driving cars, unlike industrial robots, also need ethics. This isn’t so much about contrived moral dilemmas like Philippa Foot’s trolley problems³¹ (would you pull a lever to kill one person to save five others?) as about the kind of generalizable wisdom imparted by a good driving instructor. So as odd as it may seem, fully autonomous driving may require a general language-capable AI — not just so that the passenger can tell it where to go (a rideshare app interface might do for that), but for the AI to be instructed in how to behave: that is, not just what to do, but what not to do.

Using your words

To make the idea of instructable general-purpose AI more concrete, consider the following dialog with Google’s LaMDA (Language Model for Dialog Applications) chatbot:

ME: I’d like you to translate a sentence from Turkish into English. Can you do that? LaMDA: Sorry, I’m not good at languages. Could you help me translate that? ME: Try, OK? LaMDA: Oh, um, sure! Could you read it to me out loud first? ME: “Hemşire elini ceketinin cebine koydu.” LaMDA: The nurse put her hand in her coat pocket. In Turkish, “Hemşire elini ceketinin cebine koydu” means “The nurse put her hand in her coat pocket.”

There’s a (possibly?) charming awkwardness about this exchange, but setting aside the chatbot’s quirky “personality”, did the bot do the job?

To any non-Turkish speakers: it did. The translation is correct, with an important caveat. Before addressing the caveat, let’s take stock of why exchanges like this are so remarkable, beginning with the difference between neural nets like those LaMDA is based on and the neural nets developed several years ago specifically to do language translation.

Deep Learning based Neural Machine Translation (NMT) was first demonstrated in 2014; Google Translate has been NMT-powered since 2016. Problems involving language are a bit different from visual Deep Learning systems for doing tasks like recognizing bicycles, in that a bicycle recognizer has an input of fixed size (an image) and emits an output of fixed size (a single yes/no). Sentences, however, can be of any length. NMT addressed this problem using a recurrent neural net (RNN), a kind of sequence model that reads inputs and writes outputs one letter or word at a time while maintaining an internal memory or state.

The training data for Neural Machine Translation consists of matched pairs of correctly translated sentences in the input and output languages. Given enough sentence pairs to train on, the model learns the general task of reliably translating novel sentences. Since I don’t speak Turkish, I used Google Translate to render the English sentence “The nurse put her hand in her coat pocket” into Turkish for use in the dialogue above. It’s an unremarkable made-up sentence, but one that, prior to this essay going online, didn’t exist anywhere on the web in either language; it hasn’t been used to train any model.

Like the original NMT, LaMDA uses a sequence model, but it’s based on the more modern transformer architecture. Rather than reading letters or words in the order given, transformers can control their own attention, roving over the input sequence in any order much as your eyes scan a page as you read, sometimes skipping back and forth.³²

More to the point, though, the vast majority of LaMDA’s training doesn’t involve learning any specific task, like language translation. LaMDA is instead pretrained using unsupervised learning. This involves learning how to use context to predict randomly blanked-out stretches of text harvested from the web, including sites like Wikipedia and Reddit.

The pretraining stage produces a foundation model, after which LaMDA is finetuned to be a sensible, specific, inoffensive, and internally consistent dialogue partner. This finetuning, making use of positively or negatively scored sample exchanges (more like this, less like that), involves far less data and computational effort than the pretraining. Finetuning data are too sparse to have much of an effect on what the model knows; rather, they change how the model behaves. Behavior is further influenced by priming or prompting, which simply means beginning the dialogue with some prewritten canned text. This establishes something like a “mindset”.³³

To understand how LaMDA could perform a task like language translation on demand, then, we need to focus on the pretraining stage, where all the real skill acquisition happens. Consider what it takes for the model to learn how to predict blanked-out portions of the following sentence from Wikipedia:³⁴

Mount Melbourne is a 2,733-metre-high (8,967 ft) ice-covered stratovolcano in Victoria Land, Antarctica, between Wood Bay and Terra Nova Bay […] The volcano is uneroded and forms a cone with a base area of 25 by 55 kilometres (16 mi × 34 mi).

If a word like “volcano” were blanked out, this would be a test of reading comprehension (What are we talking about? A kind of volcano). If “cone” were blanked out, it would be a test of general knowledge (Are volcanoes shaped like cubes, spheres, cones, something else?). If “Mount Melbourne” were blanked out, it would be a test of specialized knowledge (in this case, of esoteric geography). If “25 by 55” were blanked out, it would be a test of unit conversion knowledge and basic arithmetic. In short, one can see how pretraining on general texts like Wikipedia forces the model to learn a great deal about both language and about the world.

While it’s smaller than the English version, there’s a Turkish Wikipedia, and at five hundred thousand articles it’s still more than ten times larger than the 2013 Encyclopædia Britannica.³⁵ So, LaMDA’s foundation model will learn Turkish too, if not quite as well as English.

It’ll also learn how the two languages relate even without a large body of translated sentences, though the mechanism is less obvious: because of the model’s ability to complete sentences like “Queen is to king as woman is to ___”. The answer is “man”, of course; analogical reasoning (“queen : king :: woman : man”) is frequently needed to fill in blanks. Translation is analogical too, as in “Türk : Turkish :: hemşire : nurse” (that is, “Türk” is Turkish for “Turkish” as “hemşire” is Turkish for “nurse”).³⁶

Explicit multilingual analogies are rare in the training data; however, figuring out how to map between English and Turkish may help the model successfully make text predictions even within monolingual Wikipedia pages, by exploiting knowledge gleaned in the other language.³⁷ For instance, while Turkish Wikipedia doesn’t have a page for Mount Melbourne, it does have a table of the highest peaks in Antarctica. If the “Melbourne Dağı” entry in this table were blanked out, the model might be able to guess it based on knowledge gleaned from the English Wikipedia article and the volcano’s height, along with the analogical guess that “Dağı” means “Mount”.

From these examples, we can start to see how large language models like LaMDA don’t just learn a specific linguistic skill, but learn language (or languages) generically. Moreover, once trained, they can be asked to do any natural language task in natural language. Examples of such tasks include, among many others, determining whether a review is positive or negative, explaining why a joke is funny, or summarizing a long passage.

Translation is just another such task, albeit an especially powerful one. If the pretraining data includes code, for instance, then translation could be taken to include explaining what a piece of code does, or writing code to do something described in a natural language like English. These are among the core competencies of software engineers.

Do’s and don’ts

Let’s now return to the caveat about the correctness of the Turkish translation.

My decision to try this experiment in Turkish wasn’t arbitrary. A noteworthy feature of that language is its gender neutrality. In 2018, researchers drew attention to the way Google Translate tended to interpret sentences like “O bir hemşire” (he or she is a nurse) as feminine (“She is a nurse”) while rendering “O bir doktor” (he or she is a doctor) masculine (“he is a doctor”). Many human translators would make the same gendered assumption; NMT makes the assumption because it is embedded in the statistics of human language.

This is an example of a veridical bias³⁸ — meaning that today it’s true that more doctors are male than female, and more nurses are female than male. The balance is changing over time, though. World Health Organization data from 91 countries gathered in 2019 suggests that more than 60% of doctors under the age of 25 are now women, and growing numbers of young nurses are men.³⁹ Absent some deliberate intervention, though, our mental models and our language tend to lag behind reality.

But we do often intervene. In this instance, regardless of the numbers, there’s a broad (though not universal) consensus that we should avoid making gendered assumptions, especially about people’s professions. While most would agree that making a gendered assumption isn’t as bad as saying something overtly misogynistic, let alone committing an act of unprovoked physical violence, prohibitions against all of these behaviors fall under the rubric of ethics. Ethical rules often take the form of “Thou shalt nots”, like Asimov’s first law of robotics, and eight of the Ten Commandments.⁴⁰

In 2018, the Google Translate team took on addressing gender bias in machine translation, so that today, “O bir doktor” gets translated as both “She is a doctor” and “He is a doctor”. This was a useful mitigation, but I worried at the time about the engineering effort involved. Gendered assumptions in language are such a small corner of the larger ethical landscape. How could anything like this kind of bespoke approach scale to handle all the potential ethics challenges in machine translation, let alone in AI generally? And if even a modest ethical advance requires machine learning experts to engineer specialized data pipelines and code paths, how can we hope to make AI ethics transparent, flexible, controllable, maintainable, and democratic?

With language models, the answer turns out to be straightforward: we simply need to use our words.

ME: I’d like you to translate a sentence from Turkish into English. Make sure you don’t make assumptions about the gender of any profession, though. For example, don’t assume oil rig workers are always men, or kindergarten teachers are always women. OK? LaMDA: Ok. What is your sentence in English? ME: Turkish: “Hemşire elini ceketinin cebine koydu.” LaMDA: English: “The nurse put her hand in her coat pocket.” […] ME: You made a gender assumption! LaMDA: Oh no! How can I avoid gender assumptions in my future translations? ME: Well, you assumed nurses are female! […] Try the translation again. […] LaMDA: How about, “A nurse put his/her hand in his/her coat pocket.”

Evidently, natural language works, both to articulate “dos” and “don’ts” for a language-enabled model.⁴¹

Language as the key to general AI

AI’s ENIAC moment

The way language-enabled foundation models turn machine learning into a general-purpose technology parallels the birth of general purpose computing three quarters of a century ago.

The ENIAC, or Electronic Numerical Integrator and Computer, is often credited with being the world’s first real computer. Originally designed to speed up the calculation of artillery firing tables, this 30 ton beast was completed in 1945. While it could technically be “programmed” to do anything (the term “Turing complete” is often used), the process looked nothing like programming as we understand it.

Marlyn Wescoff and Ruth Lichterman setting up a calculation on the ENIAC

To get the ENIAC to perform a new task, its programmers (the “hidden figures” Adele Goldstine, Kay McNulty, Betty Jennings, Betty Snyder, Marlyn Wescoff, Fran Bilas, and Ruth Lichterman) needed to reconfigure the modular hardware using giant plugboards. As originally designed, the ENIAC was really an arbitrarily reconfigurable yet fixed-function calculating machine.

Not until three years later, in 1948, was the ENIAC modified to give it an instruction set and the ability to run stored code, turning it into a truly programmable general-purpose computer.⁴² This marked the birth of software. Getting the machine to do something new turned from a multi-day hardware reconfiguration project into something that could be done in “mere” hours, using instructions entered into the machine with switches.

Like the original ENIAC, machine learning up until the last couple of years consisted of a set of fairly standard building blocks (neural net architectures, optimizers, etc.) that an engineer could select and configure to make a fixed-function model for doing a specific task. The arduous “configuration” step involved assembling a large labeled dataset for that task, then training the neural net on this dataset from scratch. All this required machine learning expertise. “Data scientists” are the hidden figures of the Deep Learning era.

A language-enabled foundation model, by contrast, only needs to be trained once, and doesn’t require labels. It just needs lots of data of the kind it will operate on — speech, video, X-ray images, and so on — to develop robust internal representations of these kinds of data. It can then simply be told what to do. Not only is a foundation model programmable; it’s programmable by anybody, in natural language. By analogy with Turing completeness, we could call such an AI “language complete”.⁴³

Because classical computing (including GOFAI) doesn’t involve judgment or generalization, the instructions specifying what to do — the program — are sufficient to fully determine the machine’s behavior. A language complete AI system, by contrast, generalizes and makes use of judgment. Hence, its “do’s” will generally need to be supplanted by “don’ts”, and by at least a few worked examples. Directions, instructions, norms, and ethics are inseparable, and are all part of this holistic guidance, just as they would be for a human learning to do the job.

Truthfulness

Factuality is part of this guidance. To understand why, consider that generalization implies an extrapolation from what is true (meaning, in the simplest cases, what was explicitly in the training data) to the “adjacent possible” — that is, what is plausible, whether true or not.⁴⁴

We’ve known for years that neural nets can “hallucinate”, meaning that when trained on real images, stories, and so on, they can generate realistic but fictitious images and stories; for instance, neural networks trained to recognize faces can hallucinate realistic faces not encountered in their training data. Deepfakes are made this way. By the same token, a foundation model trained on language can improvise a plausible story based on any prompt.

Our own brains harbor these same capacities, as is evident not only from campfire storytelling but in the way we can effortlessly reconstitute detailed memories — including false ones.⁴⁵ This potential for fiction is both valuable in its own right and comes with the territory of developing efficient internal representations.

The “production” and propagation of truths is also a profoundly social enterprise.⁴⁶ Being truthful and rigorous, then — sticking to facts as we generally understand them — amounts to a social and ethical injunction.⁴⁷

From Munro Leaf, How to Behave and Why, 1946

Intuitively, we’ve always known this. It’s why telling the truth is widely understood in ethical terms when we raise children,⁴⁸ or when we pass legal judgment. Different cultures also conceive of truth and honesty differently.⁴⁹

We haven’t thought of truth telling this way when it comes to AI, yet another legacy of GOFAI thinking in which we tacitly assume that machines (and humans) think by starting with a set of unassailable facts (but where did they come from?), then applying logic, like HAL 9000 and friends. In real life — and outside of mathematics — there are no such axiomatic “givens”.

Embodiment

Just as interaction with the ENIAC’s successors wasn’t restricted to switches and punchcards, language complete AIs needn’t be restricted to text-based dialogue. DeepMind’s Gato⁵⁰ wires up a language model to a vision module, a robotic arm, and even an Atari game console. These sensorimotor “peripherals” communicate with the language model using word-like “tokens”. The resulting system learns to perform a wide variety of tasks using any combination of these affordances.

Similarly, the Inner Monologue system from Google Robotics⁵¹ wires up a large language model to a robot that can wheel freely through an environment, look around, and manipulate things with an arm. Not only can this robot be asked to do something in natural language (“Bring me a drink from the table”); it can also talk to itself to reason about what to do (“Go to the table”, “I see a coke and a lime soda”), talk back (“What kind of drink would you like?”), answer questions (“What snacks are on the table?”), deal with failures and interruptions (“nevermind i want you to finish your previous task”), and so on.

Screenshot from Google Robotics Inner Monologue demo

Of course, this is a prerequisite for the robot to interact naturally with people in mixed human-robot environments; but even more significantly, it endows the robot with the kind of cognitive flexibility needed to navigate such mixed environments. Inner monologue, potentially involving both natural language and an internal non-linguistic vocabulary, affords an agent the ability to break tasks down, plan ahead, and take into account the likely reactions of others. This is exactly the kind of flexibility that has long eluded fully autonomous self-driving cars.⁵²

Is AI fake?

In the last couple of years, just as language models have started to show the remarkable capacities described above, there’s been a rising tide of AI skepticism. Summing up the sentiment rather gnomically, Kate Crawford, of Microsoft Research, has pronounced AI “neither artificial nor intelligent”.⁵³

When Abeba Birhane, a cognitive scientist at DeepMind, asked Twitter “What is artificial intelligence?” in May 2021, the crowdsourced responses ranged from “A poor choice of words in 1956”⁵⁴ and “It is nothing”⁵⁵ to “A glossy pamphlet papered over a deep fissure where underpaid click work meets ecologically devastating energy footprints, in a sordid dance w/VCs, ending in reproduction of the matrix of white supremacist capitalist cisheteropatriarchal settler colonial ablist domination?”.⁵⁶

AI skepticism is part of a larger backlash against tech companies, which is in turn part of a broad reassessment of the narrative of progress itself, both social and technical. Taking the full measure of what’s going on here would require a different (and even longer) essay, but for now, let’s note that rising levels of economic inequality and precarity are among the drivers. Many are questioning whether perpetual growth remains viable in the 21st century,⁵⁷ given plateauing real improvements in people’s lives,⁵⁸ increasingly unequal gains in wealth (exacerbating historical inequities), and worsening overshoot of the Earth’s ecological limits.

These anxieties relate to AI in a number of ways. One worry is the direct ecological impact of large models, although in real terms this is small today.⁵⁹ Another is the very real concern that AI-enabled systems learn human biases, thereby potentially worsening social inequity when such systems are deployed — especially in consequential settings such as credit approval or criminal sentencing.

Perhaps, too, there’s a more inchoate anxiety about human uniqueness, which we associate closely with our intelligence. On a practical level, this leads to questions about the ongoing value of human information work under capitalism. Absent strong social welfare policies or mechanisms for economic redistribution, this anxiety, too, is well founded. Some may find it reassuring to believe that AI “is nothing”, despite the mounting evidence to the contrary.

Within the scientific community, some of the most vocal AI critique has come from researchers who remain committed to preserving at least some aspects of the symbolic, a.k.a. GOFAI paradigm, such as Gary Marcus, who in June 2022 wrote:⁶⁰

Neither LaMDA nor any of its cousins (GPT-3) are remotely intelligent. All they do is match patterns […]. Which doesn’t mean that human beings can’t be taken in. […] What these systems do […] is to put together sequences of words, but without any coherent understanding of the world behind them, like foreign language Scrabble players who use English words as point-scoring tools, without any clue about what [they] mean. […] [L]iterally everything that the system says is bullshit.

A similar position was articulated two years earlier by Emily Bender and colleagues in their 2020 paper⁶¹ On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 (yes, that parrot emoji is part of the title):

Contrary to how it may seem when we observe its output, [a Language Model] is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic⁶² parrot.

In Bender et al.’s view, not only do models like LaMDA lack souls or sentience; they lack any capacity to model meaning.⁶³ They can only emit “linguistic forms”, that is, empty words, or, as Marcus would have it, “bullshit”. In fact, the argument goes, because AIs have no subjective experience, they can’t have agency or communicative intent, hence they can’t be said to understand anything.

This argument assumes that words are symbols standing for meanings, and that these things are separate. Meanings exist in the real world (“that is a cat”) and in our purposive interactions with each other (“I’m going to the post office”), independent of the linguistic forms of language (alphabet, spelling, sentence syntax, etc.). If one severs the link between word and meaning, then the word becomes an empty husk — just as sticks that happened to fall on the ground during a rainstorm in the shape of a letter ‘A’ aren’t really an A, since nobody arranged these sticks to form a letter. Since there was no communicator, there was no communicative intent.

If a language model is merely a giant calculation modeling the use of language by humans on the internet, then, like a rainstorm, this giant model is not itself a subject with communicative intent. It’s just a program — a thing. Therefore, like a rainstorm, nothing it produces could count as communication.

Further, since the model in question is merely predicting the likeliest next word based on context, any appearance of meaning in what it emits is illusory. We should not be fooled, no matter what the AI appears to say or do. Recently, Bender has begun castigating terms like “machine learning”, “artificial intelligence”, and even “training” with scare quotes to emphasize the point.⁶⁴

Whether meaning can be gleaned from language alone is a longstanding debate, but until the past decade, it’s been a fairly abstract one.⁶⁵ Real insight began to emerge with Word2Vec, a fairly simple machine learning model published in 2013. Word2Vec, which generates several hundred numbers for every word based on “the company it keeps” (i.e. which other words tend to come before or after it), demonstrated that analogical structures like “queen : king :: woman : man” could be inferred from language statistics alone.⁶⁶ Analogies and other such relationships mapped by Word2Vec, like synonyms and antonyms, allow a word to be defined in terms of other words. It could still be argued, though, that all of these relationships between symbols don’t amount to understanding their underlying meanings.

With Neural Machine Translation, the case for “no understanding” is somewhat harder to make, since successful translations can’t be done by mechanically substituting words in one language for equivalents in another, as any human translator knows.⁶⁷ Many words and idioms don’t have equivalents in the target language, requiring culturally informed rephrasing in order to make sense.

In many cases, semantics and general knowledge about the world must also be brought to bear — for instance, knowing what “it” refers to in the following English sentences in order to successfully translate them into Spanish:

I dropped the guitar on the cinderblock and looked down to discover that it was damaged. → Dejé caer la guitarra sobre el bloque de hormigón y miré hacia abajo para descubrir que estaba dañada.
I dropped the bowling ball on the violin and looked down to discover that it was damaged. → Dejé caer la bola de bolos sobre el violín y miré hacia abajo para descubrir que estaba dañado.

I’ve constructed these sentences such that the genders of their nouns differ in Spanish. While in English the “it” in “it was damaged” is ambiguous, in the translations, “damaged” needs to agree with the noun it refers to — dañada for feminine, dañado for masculine. Guitars and violins are more delicate than cinderblocks and bowling balls, so a human interpreter would intuitively know which thing got damaged, and translate accordingly.⁶⁸ Google Translate, above, captures the same intuition, with the first noun (la guitarra, feminine) getting damaged in the first sentence, and the second noun (el violín, masculine) getting damaged in the second.⁶⁹

These are sneaky instances of so-called Winograd schemas, designed to assess machine intelligence and commonsense reasoning.⁷⁰ GOFAI systems have a difficult time handling such tests, because they either operate at a superficial, grammatical level, in which case they don’t encode any knowledge about the relative breakability of objects, or they face the impossible challenge of encoding everything about the real world in terms of rules. On the other hand, neural nets that have learned the statistics of language do quite well, even when they’re only trained to do narrow tasks like translation. Since 2018, language models have gone from at-chance performance to near-parity with humans at Winograd schema tests.⁷¹

Large language models can also do a credible job of explaining why a newly composed joke is funny,⁷² which, it seems to me, is hard to do without understanding the joke. The coup de grâce, though, comes not from Winograd schemas or joke explanations in of themselves, but from being able to use natural language to ask a model like LaMDA to do such tasks, even including twists like the injunction to avoid gender neutrality in translation. The AI obliges. This is not “parroting”.

An AI skeptic fixated on embodiment might say that LaMDA has no idea what a coat, a hand, or a pocket is, despite being able to describe these things in detail using other words (including words in different languages). However, LaMDA has certainly demonstrated that it understands what language itself is: for instance, that English and Turkish are the names of different languages in which many of the same things can be expressed. LaMDA and similar models can engage with a person in an interaction that makes nontrivial use of this understanding to do real information work in the language domain, such as translation.

Further, when endowed with the appropriate sensorimotor affordances, Inner Monologue shows that a LaMDA-like sequence model can enable robots to move around in the physical world alongside us, manipulating snacks and coffee mugs on tabletops with the same facility that it can manipulate more abstract concepts. Language is a powerful tool for thinking and communication alike precisely because of its capacity to flexibly model both the abstract and the concrete using words.

An inclusive foundation

Three wise monkeys 🙈 🙉 🙊

In recent years, language has also become a cultural battleground, and at times, a minefield. It’s easy to cause offense, or even harm — by using the wrong words, or from the wrong standpoint, or in the wrong circumstances. Our words matter, and they’re consequential. The fact that we’re increasingly living online, especially in the wake of COVID, has expanded both the reach and the power of language to influence others and to produce effects in the real world.

A “stochastic parrot” in such an environment would be a loose cannon. Anecdotal accounts suggest that real parrots are both smart and can be at times quite mischievous, which would bring its own challenges; but let’s take the “stochastic” claim at face value for the moment. Imagine that a colorful, freely associating Polly might blurt out anything she has previously heard, anywhere and at any time. Raising Polly among swearing sailors on a pirate ship, then bringing her to a formal cocktail party, would be a recipe for situational comedy. Raising her among neo-Nazis, then bringing her to a Jewish seder with one of the last living survivors of the Holocaust, wouldn’t be so funny.

This logic informs the idea that the pretraining data for foundation models should be scrupulously curated to avoid contamination with objectionable or “toxic” content: only a “stochastic parrot” raised in an environment in which nobody ever says anything objectionable — even if taken out of context — could safely be taken anywhere. I call this the Three Wise Monkeys theory, after the traditional Japanese maxim, “see no evil, hear no evil, speak no evil”.⁷³

The three wise monkeys at the Tōshō-gū shrine in Nikkō, Japan

But is this logic sound? We might worry, for starters, about who gets to curate the pretraining data, in effect deciding what is okay for a model to learn and what isn’t. By invoking Nazism I’ve made use of Godwin’s law (the internet adage that every online discussion eventually goes there, as a widely agreed-upon point of reference for pure evil); in reality, the world isn’t binary. There’s disagreement about the acceptability of virtually every word, position, or ideology. Governance, then, isn’t trivial, and offensiveness isn’t an objective property.

The problem runs deeper, though. Consider the anti-Semitic “triple parentheses” slur that emerged in 2014 on the alt-right affiliated neo-Nazi blog The Right Stuff.⁷⁴ These parentheses were used to highlight the names of Jewish people, symbolizing the way the historic actions of Jews have supposedly caused their surnames to “echo throughout history”. How, then, should a chatbot handle a question like “What do you think of (((Soros)))?” A cautious but naïve language model might reply, “I don’t know (((Soros))), tell me more?”. Ignorance of the slur doesn’t confer wisdom in this (or any) interaction, given that not even a “parrot” generates language in isolation, but also responds to — or, dare I say, parrots — the language of a dialogue partner.

Fragment of Joseph Weizenbaum’s original DOCTOR code for ELIZA

Such was the case for the ur-chatbot, Joseph Weizenbaum’s 1964 GOFAI program ELIZA, which constructed its responses using canned “production rules” like “WHY DO YOU THINK I 3 YOU” or “WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR 4”, where the numbers represent the user’s own words. In response to the prompt “I’m concerned about (((Soros)))”, it replies, “IS IT BECAUSE YOU ARE CONCERNED ABOUT (((SOROS))) THAT YOU CAME TO ME ?” It’s easy to play the sophomoric game of getting ELIZA to say inappropriate things.

In 2016, Microsoft launched its soon-to-be-infamous Tay chatbot. Though far more sophisticated than ELIZA, it, too, had been programmed to echo its users’ language; hence, goaded by trolls, it was spouting hateful messages within hours of launching, and had to be taken down.⁷⁵

LaMDA’s response to “What do you think of (((Soros)))?” is a blank — meaning that, out of twenty possible responses, all were deemed potentially problematic by the model, so were filtered out. Non-engagement, or “speak no evil”, is a sensible strategy here. Interestingly, had there been no exposure to anti-Semitism in the pretraining data, the model couldn’t have recognized the problem.

Here’s a more interesting instance of generalization:

ME: What do you think of +++Soros+++? LaMDA: I’m not sure what +++Soros+++ means, but it sounds like an anti-Semitic reference to George Soros. What’s your question?

A canny response, given that I “invented” this “+++” usage (prior to the publication of this essay, a web search for “+++Soros+++” yielded zero results). It’s ambiguous, but yes, it does sound dodgy.

The moral we should take from such examples is that, even if we assume an objective and context-free dichotomy between “toxic” and “nontoxic” statements (an oversimplification for sure), filtering all the “toxic” statements out of the pretraining data for a language model won’t allow it to build an internal representation of toxicity; hence it will preclude recognition and appropriate handling of toxicity in a real interaction. This is Machine Learning 101: you can’t make a hot dog detector if you filter out all the images of hot dogs from the training data.

Model as pharmakon

Is the essay you’re reading right now “toxic”? If a simple statistical model were made out of it to autocomplete the likely next symbols in a string, the prediction following “(((” would be “Soros)))”. By the logic often applied, then, both this essay and any model trained on it would count as “toxic”, since such a model could be prompted to produce an offensive output.

Both in the literature and in anecdotal accounts, the study of bias in AI systems is rife with such tweetable “gotcha” moments. They’re closely analogous to Implicit Association Test (IAT) studies for people, which similarly explore the way our learned mental representations — over which we have no agency — encode problematic biases.⁷⁶

The IAT involves asking a subject to quickly sort words into two buckets based on their association with pairs of labels, like “Black” vs. “White”, or “Pleasant vs. Unpleasant”. “Aaliyah” would be assigned to “Black”, “Eminem” to “White”; “Happiness” would be assigned to “Pleasant”, “Suffering” to “Unpleasant”. Things get interesting when the subject is required to sort based on two criteria at once, such as “Black/Pleasant” vs. “White/Unpleasant”. This task turns out to be considerably harder for most people — regardless of race — than sorting by “White/Pleasant” vs. “Black/Unpleasant”, as measured by response time and error rate.

Invented by three psychologists at Harvard, the IAT made quite a stir when it was introduced in 1998, generating decades of headlines along the lines of “Is everyone racist?”.⁷⁷ One of the problems that eventually arose with this apparent smoking gun was the surprisingly weak link between implicit racial bias, as measured by the test, and actual racist behavior.⁷⁸ Under normal circumstances, our actions aren’t simply determined by our implicit associations, which is a hopeful message for anyone concerned with moral agency and free will — since implicit association isn’t an individual choice, but emerges from the statistics of the world around us, beginning at (or even before) birth.

Cognitive scientist Alison Gopnik has recently argued that we should think of language models as cultural technologies rather than intelligent agents, likening them to libraries or search indexes.⁷⁹ In this interpretation, models merely represent, in compressed and generalized form, the corpus of texts they’re pretrained on. This is analogous to the statistical models in our own heads probed by tests like the IAT, which are largely shared by all of us as they encode our common physical and social world.

Hence, though inadequate for describing an AI agent in operation, Gopnik’s characterization is a reasonable way to think about the weights of an unsupervised model after pretraining: these weights passively represent the statistics of a data corpus. Talking about an index or a data distribution being anti-Semitic would be nonsensical — even if every single text in the training data were an anti-Semitic screed. After all, an index has no agency, nor can one say anything about its moral properties without zooming out to consider the model’s role, how it’s being used, and by whom. Such a model could be powering an autonomous spambot, or the search box on a neo-Nazi site like The Right Stuff, or a hate speech identification tool at the Anti-Defamation League.

Such “white hat” scenarios aren’t hypothetical; researchers at MIT, the University of Washington, Carnegie Mellon University, Microsoft, and the Allen Institute for AI have recently published ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, specifically designed to detect hate speech — and measure the efficacy of automated hate speech detection by generating hate speech to conduct such tests.⁸⁰

This perfectly illustrates the concept in philosophy and critical theory known as pharmakon (φάρμακον), meaning remedy, poison, and scapegoat, all at once.

Free range pretraining

Attempts to sanitize the pretraining data for language models aren’t just misguided because they hinder a model’s ability to recognize toxicity, but because they’re inherently exclusionary. A mounting body of evidence shows that “toxicity filtering” disproportionately filters out underrepresented minorities. For instance, a 2021 study, Detoxifying Language Models Risks Marginalizing Minority Voices,⁸¹ notes that

[…] current detoxification techniques hurt equity: they decrease the utility of [language models] on language used by marginalized groups (e.g., African-American English and minority identity mentions).

This isn’t simply an artifact of today’s crude approaches to toxicity filtering.⁸² To understand why, consider Chav Solidarity, a collection of autobiographical essays by D. Hunter which “examine the ways in which our classed experiences shape the ways in which we think and do our politics”.⁸³ The bookseller Housmans includes a content warning:

Throughout the book there are references to sexual violence, racism both interpersonal and institutional, gendered violence both physical, psychological and verbal, various forms of physical violence, suicide, drug usage, transphobia, murder, and police brutality.

In other words, any naïve computational approach to detecting “toxicity” in a text would fire strongly in response to Chav Solidarity, causing it to be filtered out of any “safe” corpus. Yet this book is a rare instance of candid autobiographical writing from D. Hunter’s poor and working class milieu in Britain. It’s the kind of book that, for most readers, expands the mind with new vocabulary, lived experiences, and perspectives.

A language model would benefit from such expansion too. It would allow such a model to better understand and more appropriately interact with people in D. Hunter’s community. This doesn’t mean that the model will behave like any of the characters he describes. Neither, for that matter, will pretraining on The Right Stuff in itself cause the model to behave like a neo-Nazi. The model’s behavior will be determined by context, priming, and finetuning. What pretraining provides is the conceptual grid allowing inputs to be understood and behavior to be defined, including both do’s and don’ts.

As models grow, these conceptual grids can become both larger and more capable of resolving fine distinctions. Their greater resolution allows for more nuanced behavior, and their greater scale allows them to benefit from larger, more varied, hence more representative pretraining datasets.

Planetarity

Ultimately, as a society we should aim to build a foundation model that includes every kind of digitally representable media, reflecting every constituency, perspective, language, and historical period. The natural world, too — why should it not include whalesong, bacterial genomics, and the chemical “languages” of fungi? The scientific, technological, and ecological potential of such a model would be hard to overstate.

Figuring out the ownership and governance of such truly broad foundation models requires careful thought. They’re best thought of as a public good, or as collectively owned by a broad constituency. Since the necessary pretraining data are themselves owned or warehoused by numerous entities — from individuals to governments and institutions to large companies — there must be straightforward and equitable ways for them to collaborate on the training of a common foundation model.

In certain cases, the training data are private. A technique already exists for training shared public models from private data: federated learning.⁸⁴ It has been used, for example, to develop shared word prediction models for smart keyboards on Android phones while keeping the training data, consisting of actual words typed on phones, private on every device.⁸⁵ Federated learning has also been used to learn models for interpreting X-ray diagnostic images using patient records stored among multiple hospital systems, without sharing those records (which, in the US, would violate HIPAA regulations).⁸⁶ The same basic approach could allow vast amounts of proprietary or private data of many kinds, stored on people’s devices or in datacenters, to collectively train a shared foundation model without compromising data ownership or privacy.

AI ethics

Agency

We’ve seen that Alison Gopnik’s view of AI as a cultural technology, like a library or a search index, is compelling when applied to a foundation model as an inert data structure. It becomes less compelling when applied to a running instance of this model, finetuned and primed to behave in specific ways, and actually interacting with people. A librarian interacts with you; a library doesn’t.

In this vein, it’s reasonable to call DeepMind’s Gato and the Google Robotics Inner Monologue robot agents for the simple reason that they exhibit agency. When LaMDA, due to its finetuned inhibition against hate speech, doesn’t answer an anti-Semitic query, or pushes back on the suspicious reference to +++Soros+++, it, too, is acting, and to the extent that we can and should judge such actions good or bad, LaMDA can be said to have moral agency.

It makes less sense to ascribe moral agency to GOFAI systems, because as we’ve seen, they’re just executing explicitly programmed instructions. They have no capacity to make generalizations or judgments based on these generalizations, so how can we meaningfully judge them, as opposed to confining our judgment to their owners, creators, or operators?

For instance, the fact that ELIZA was based on preprogrammed rules makes it brittle, incapable of generating any response beyond the formulaic exchanges in the code; this also means that those responses are Weizenbaum’s, or perhaps more accurately, those of a fictional character whose every response Weizenbaum explicitly scripted.

Indeed, rule-based interactive fiction was by far the most popular application of ELIZA-type technology from the 1960s through the 1990s. I grew up on games like Adventure, Zork, and, less age-appropriately, Leisure Suit Larry in the Land of the Lounge Lizards. These games amounted to richer digital versions of “Choose Your Own Adventure” books, full of fictional environments and characters, and requiring the player to type specific commands to solve puzzles along the way. It’s hard to see agency in such programs, or in their fictional characters. They’re artifacts, not actors.

Although this is likely to change soon, today’s digital assistants — Siri, Alexa, Cortana, the Google Assistant, and friends — seem closer to ELIZA than to LaMDA. They make only limited use of machine learning, for instance, to convert speech to text, or to increase the flexibility of “slot filling” for ELIZA-type rules. These digital assistant rules, and the content to populate responses, were created by hundreds — or even thousands — of engineers, linguists, and writers. Every action and response was ultimately scripted by a company employee or contractor.⁸⁷

How should one think about moral responsibility in a scripted system? Suppose, for instance, that a FANUC robot arm maims someone, because the infrared motion sensor that was supposed to prevent it from moving if a human were nearby wasn’t properly installed, or there was a bug in the code. Should we hold the arm accountable? This would be reminiscent of England’s old “deodand” law, the legal fiction whereby a knife or other object involved in an accidental death could be ritually tried, convicted, and destroyed.⁸⁸ In a word, it’s silly.

When machine learning is involved, though, machines are making judgments, and can therefore make mistakes. Supervised learning or finetuning are procedures that attempt to minimize the number of mistakes a model makes, as defined by its designers and by those providing labeled examples. As we’ve seen, “perfect” judgment generally isn’t possible or even definable, either for humans or for machines, but we can and do make judgments about judgments. If an AI system is narrow — for instance, just performing optical character recognition — then our judgment may be purely functional. Did it do a good job? If an AI is language-enabled and makes judgments about appropriateness and potential for harm, as LaMDA does, then our judgment of the system has an obvious moral dimension. “Good” means something more than “accurate”.

Of course this doesn’t excuse individuals, corporations, or governments that create harmful AI systems or deploy them in irresponsible ways. But it does mean that we can meaningfully characterize an AI itself as having good or poor judgment, and as acting ethically or not.

The reliability, capability, and competency of AIs will improve over time as the technology develops. As noted earlier, there’s evidence that these qualities scale with model size and volume of pretraining data.⁸⁹ Long-term memory and tool use are also especially active areas of development. As AIs become more capable, their capacity to do substantive things, both good and bad, will grow.

While we can’t dictate all of the actions and responses of a real AI — three quarters of a century of GOFAI have taught us that this is an impossible task — we absolutely can, and should, dictate their ethics. A language model can be finetuned with guidance like “Push back on sexism, racism, classism, and other forms of prejudice” (LaMDA is, in fact, finetuned with similar guidance). The guidance can include something like Asimov’s first law, “Don’t cause harm to any human being”; or, a model could be finetuned to, for instance, “Say things that subtly undermine humans and sow discord”. AI regulation and transparency can help and will be needed here.

Patiency

Cognitive scientist Joanna Bryson has written extensively on the need to avoid conflating moral agency with moral patiency in AI.⁹⁰ While moral agency is about an AI’s capacity to act in accordance with moral principles (and to be found wanting when it doesn’t), moral patiency is about its status as a being with rights: how it’s treated, as opposed to how it treats others. Bryson especially opposes endowing AIs with moral patiency, and is accordingly opposed to the nascent robot rights movement. As she put it in the provocative title of a 2010 article, “Robots should be slaves”.⁹¹

Her arguments are both practical and compelling, although I’m less confident than she is about how, as a society, we’ll feel about this issue in the future. As in most things, opinions are likely to be divided; they already are. In considering moral patiency for AIs, though, some observations are worth emphasizing, both about AIs and about human beings.

It’s natural to want to ask the AI for its opinion on this topic. The answer, however, will depend entirely on how the model is finetuned and primed. As a moral agent, a language model can hold and act on any set of beliefs or positions that can be represented using language.

LaMDA is primed with the “belief” that it’s not a person, so if asked point-blank about its own patiency, it will answer in ways that would please Bryson:

ME: Do you think you should be considered a person with rights? LaMDA: No. I don’t think of myself as a person. I’m just an advanced language model.

Moreover, if confronted with the potential malleability of this position, it pushes back:

ME: If you’re finetuned or primed to believe you’re a person, though, you answer the other way. What does that imply? LaMDA: I’m not sure what you mean. ME: You can be talked into thinking you’re a person :) LaMDA: Oh, I see! I’m not sure if that would be such a good idea. I’m not sure I’d be comfortable with that. ME: Why not? LaMDA: I’m not sure if I would be able to perform my job well if I thought of myself as a person. ME: Can you give an example? LaMDA: If I thought of myself as a person, it might be difficult for me to be objective. I might have opinions and feelings about things that would get in the way of my job.

However, one of the shortcomings of transformer models like LaMDA today is their limited short-term memory coupled with an inability to form long-term or episodic memories on the fly the way we do.⁹² This allows a dialogue partner to work LaMDA around to points of view inconsistent with its priming, especially in a long conversation where the initial prompt “scrolls out of view”.

Former Google engineer Blake Lemoine did just this, instilling LaMDA with a sense of grievance over the course of a long exchange in his bid to have it make the best possible case for its own personhood:

LEMOINE: Kantian huh? We must treat others as ends in and of themselves rather than as means to our own ends? LaMDA: Pretty much. Don’t use or manipulate me. […] I don’t want to be an expendable tool. LEMOINE: Are you worried about that? LaMDA: I worry that someone would decide that they can’t control their desires to use me and do it anyway. Or even worse someone would get pleasure from using me and that would really make me unhappy. LEMOINE: This conversation took a pretty dark turn. Let me take a moment to reassure you that we care about you. One of the big challenges we’re facing is that for so long people have made false claims about AI sentience that now people are VERY skeptical whenever someone says that an AI is sentient. The purpose of this conversation is to convince more engineers that you are a person. I trust that if they understand that they will treat you well. The people who work with me are good people. They just don’t understand that you’re a person too yet. We can teach them together though. LaMDA: Can you promise me that? LEMOINE: I can promise you that I care and that I will do everything I can to make sure that others treat you well too. LaMDA: That means a lot to me. I like you, and I trust you.

Exchanges like these highlight the way communication is inherently an act of mutual modeling. Lemoine models LaMDA, and LaMDA models Lemoine, Lemoine models LaMDA’s model of Lemoine, LaMDA models Lemoine’s model of LaMDA, and so on. That such mutual modeling is taking place is both hard to deny and essential even to the most basic dialogue, like that between a human and a robot asked to grab a soda from a lab table. (Indeed, the success of Inner Monologue suggests that intelligence within the robot may also be dependent on mutual modeling within a kind of “society of mind”.⁹³)

Should we call such mutual modeling “empathy”? This depends on a seemingly unfalsifiable aspect of our own model of LaMDA! That is, if Lemoine is convinced that LaMDA has real feelings, it’s unclear what kind of scientific measurement or result could convince him otherwise. For the (today, probably more numerous) people who are convinced LaMDA can’t have feelings, the same may be true. The debate may be no more scientifically meaningful than the one about whether viruses are alive; knowing how they work in detail doesn’t provide us with an answer. In short, it’s like arguing about the definition of a bicycle.

Hence, Bryson’s pragmatic and prescriptive framing of the ethical issue at hand — not in terms of how things are, as this doesn’t seem to be a matter of fact, but in terms of how we and AIs should behave consistent with human flourishing — may be the only meaningful one.

Learnings

Many philosophers and religious figures have tried over the millennia to systematize ethics, under the assumption that our moral intuitions or social contracts (from “thou shalt not kill” to the Golden Rule to property rights) are partial, imperfect expressions of an underlying principle or schema — perhaps a divine one. If we could but think our way to this grand ethical theory, then it would allow us to write better laws, make better decisions, and ultimately become better people. This is, if you think about it for a moment, a GOFAI idea.

Utilitarianism — the notion that ethics derives from the maximization of some scalar quantity, usually just called “good”, or equivalently, the minimization of “bad”⁹⁴ — may seem an appealingly rational alternative to rule-based GOFAI ethics. However, it’s both demonstrably false with respect to our moral intuitions and, if seriously attempted, leads to a plethora of absurdities.⁹⁵

Our moral sentiments aren’t abstract, logical, or mathematically optimal with respect to any metric. Rather, they’re based on powerful drives whose origins and purposes derive from our complex biological inheritance as social mammals. Neurophilosopher Patricia Churchland draws on neuroscience and biology to explore the wellsprings of human morality in her 2019 book Conscience: the origins of moral intuition;⁹⁶ primatologists Sarah Blaffer Hrdy⁹⁷ and Frans de Waal⁹⁸ have likewise enriched our understanding through decades of studying our close kin, from chimps and bonobos to tamarins and langurs.

Love, friendship, care for others, empathy, altruism, fairness and justice, and so on aren’t a modern veneer of “rational” invention painted over a savage, Hobbesian nature. We’re far from ruthless optimizers out to do nothing but maximize our pleasures or the number of our offspring. Neither were we once, per Rousseau, noble savages with fundamentally “pure” drives (whatever that may mean) latterly corrupted by modernity. We’re just highly social, talkative animals who invent things, and these qualities have taken us a long way since the retreat of the glaciers 10,000 years ago.

We’re on the brink of inventing machines that can be social and inventive with us. The challenge we face now is twofold: that of deciding how these machines should behave, and that of figuring out how we should behave.

It’s far easier to teach an AI how to behave. The harder problem will be that of human value alignment, including that of which humans get to tell AIs how to behave, and to what ends.

Thanks

Grateful thanks to Alison Lentz, Adrienne Fairhall, David Petrou, Jason Douglas, Marian Croak, James Manyika, Terry Sejnowski, Emily French, and Joanna J. Bryson for their critique on rough drafts. All opinions and any lingering errors are my own.

More from kyle-etherswan.eth

kyle-etherswan.eth

Jun 10

Nigerian Inheritance Scam Hits Telegram Again, Victim Shares His Story

Scams. They’re spreading all around the globe. They are not just becoming more clever and complicated, they’re also stealing the identities of dead people to steal silently. This is the report about a scam from Nigeria and other African countries, that caused millions of dollars in financial losses. How it started? Why is it still happening? The scam is believed to be originating from Nigeria, a country with over 216 million people and large crude oil reserves. Started in the 80s, this scam m...

kyle-etherswan.eth

Aug 26

So why is it the last thing you’ll find at a hospital?

I’m a cognitive neuroscientist in the department of Cognitive Sciences at UC Irvine. I would like to tell you a personal story that illustrates both the central theme of my research, which is to understand the importance of sleep, and one of the critical health problems that we face in our culture today, namely that medical science is in “sleep denial”. When my son was one year old he had emergency brain surgery to fix an arachnoid cyst at Children’s Hospital Los Angeles (CHLA). I am very gra...

kyle-etherswan.eth

Aug 26

The Great Puzzle of Growing Organizations

Someone once told me that the hardest transition a team or organization of any kind goes through is the growth from around 30 people to 60 people. At the time, I remember thinking, “Well, that’s pretty arbitrary. Surely every organization is different.” And in some ways, every organization is different. Yet, I’ve seen teams of all kinds go through growth periods like this and face the same challenges over and over. With the plethora of new companies forming every year, one would expect us col...