GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about


Since OpenAI first described its new AI language-generating system called GPT-3 in May, hundreds of media outlets (including MIT Technology Review) have written about the system and its capabilities. Twitter has been abuzz about its power and potential. The New York Times published an op-ed about it. Later this year, OpenAI will begin charging companies for access to GPT-3, hoping that its system can soon power a wide variety of AI products and services.

Is GPT-3 an important step toward artificial general intelligence, the kind that would allow a machine to reason broadly in a manner similar to humans without having to train for every specific task it encounters? OpenAI’s technical paper is fairly reserved on this larger question, but to many, the sheer fluency of the system feels as though it might be a significant advance.

We doubt it. At first glance, GPT-3 seems to have an impressive ability to produce human-like text. And we don’t doubt that it can be used to produce entertaining surrealist fiction; other commercial applications may emerge as well. But accuracy is not its strong point. If you dig deeper, you discover that something’s amiss: although its output is grammatical, and even impressively idiomatic, its comprehension of the world is often seriously off, which means you can never really trust what it says.

Below are some illustrations of its lack of comprehension, all of which, as we will see later, were prefigured in an earlier critique that one of us wrote about GPT-3’s predecessor.

Before proceeding, it’s also worth noting that OpenAI has thus far not allowed us research access to GPT-3, despite both the company’s name and the nonprofit status of its oversight organization. Instead, OpenAI put us off indefinitely despite repeated requests, even as it made access widely available to the media. Fortunately, our colleague Douglas Summers-Stay, who had access, generously offered to run the experiments for us.

OpenAI’s striking lack of openness seems to us to be a serious breach of scientific ethics, and a distortion of the goals of the associated nonprofit. Its decision forced us to limit our testing to a comparatively small number of examples, giving us less time to investigate than we would have liked, which means there may be more serious problems that we didn’t have a chance to discern. Even so, within the constraints of a small sample, many major issues with GPT-3 were immediately evident, in every domain of reasoning and comprehension that we tested.

In each example below, the prompt we generated is in ordinary font and the system’s continuation is in bold. Our comments are in italics. (A complete list of the experiments we did can be found here.)

Biological reasoning

Physical reasoning

Social reasoning

Object and individual tracking

Psychological reasoning

Non sequiturs

The depressing thing is that none of this is new. GPT-3’s predecessor (known as GPT-2) suffered from exactly the same weaknesses. As one of us (Gary) put it in February: “On a good day, a system like the widely discussed neural network GPT-2, which produces stories and the like given sentence fragments, can convey something that ostensibly seems to reflect a deep understanding … But no matter how compelling many GPT-2 examples seem, the reality is that its representations are thin … the knowledge gathered by contemporary neural networks remains spotty and pointillistic, arguably useful and certainly impressive, but never reliable.”

Too little has changed. Adding a hundred times more input data has helped, but only a bit. After researchers have spent millions of dollars of computer time on training, devoted a staff of 31 to the challenge, and produced breathtaking amounts of carbon emissions from electricity, GPT-3’s fundamental flaws remain. Its performance is unreliable, its causal understanding is shaky, and incoherence is a constant companion. GPT-2 had problems with biological, physical, psychological, and social reasoning, and a general tendency toward incoherence and non sequiturs. GPT-3 does, too.

More data makes for a better, more fluent approximation to language; it does not make for trustworthy intelligence.

Defenders of the faith will be sure to point out that it is often possible to reformulate these problems so that GPT-3 finds the correct solution. For instance, you can get GPT-3 to give the correct answer to the cranberry/grape juice problem if you give it the following long-winded frame as a prompt:

GPT-3’s continuation to that prompt is, correctly: “B. This is a safe mixture.”

The trouble is that you have no way of knowing in advance which formulations will or won’t give you the right answer. To an optimist, any hint of success means that there must be a pony in here somewhere. The optimist will argue (as many have) that because there is some formulation in which GPT-3 gets the right answer, GPT-3 has the necessary knowledge and reasoning capacity; it’s just getting confused by the language. But the problem is not with GPT-3’s syntax (which is perfectly fluent) but with its semantics: it can produce words in perfect English, but it has only the dimmest sense of what those words mean, and no sense whatsoever about how those words relate to the world.

To understand why, it helps to think about what systems like GPT-3 do. They don’t learn about the world; they learn about text and how people use words in relation to other words. What GPT-3 does is something like a massive act of cutting and pasting, stitching together variations on text that it has seen, rather than digging deeply for the concepts that underlie those texts.

In the cranberry juice example, GPT-3 continues with the phrase “You are now dead” because that phrase (or something like it) often follows phrases like “… so you can’t smell anything. You are very thirsty. So you drink it.” A genuinely intelligent agent would do something entirely different: draw inferences about the potential safety of mixing cranberry juice with grape juice.

All GPT-3 really has is a tunnel-vision understanding of how words relate to one another; it does not, from all those words, ever infer anything about the blooming, buzzing world. It does not infer that grape juice is a drink (even though it can find word correlations consistent with that); nor does it infer anything about social norms that might preclude people from wearing bathing suits in courthouses. It learns correlations between words, and nothing more. The empiricist’s dream is to acquire a rich understanding of the world from sensory data, but GPT-3 never does that, even with half a terabyte of input data.

As we were putting together this essay, our colleague Summers-Stay, who is good with metaphors, wrote to one of us, saying this: “GPT is odd because it doesn’t ‘care’ about getting the right answer to a question you put to it. It’s more like an improv actor who is totally dedicated to their craft, never breaks character, and has never left home but only read about the world in books. Like such an actor, when it doesn’t know something, it will just fake it. You wouldn’t trust an improv actor playing a doctor to give you medical advice.”

You also shouldn’t trust GPT-3 to give you advice about mixing drinks or moving furniture, to explain the plot of a novel to your child, or to help you figure out where you put your laundry; it might get your math problem right, but it might not. It’s a fluent spouter of bullshit, but even with 175 billion parameters and 450 gigabytes of input data, it’s not a reliable interpreter of the world.

Gary Marcus is founder and CEO of Robust.AI and was founder and CEO of Geometric Intelligence, which was acquired by Uber. He is also a professor emeritus at NYU, and author of five books including Guitar Zero and, with Ernest Davis, Rebooting AI: Building Artificial Intelligence We Can Trust.

Ernest Davis is a professor of computer science at New York University. He has authored four books, including Representations of Commonsense Knowledge.