AI Outperforms Humans in Theory of Mind Tests

Theory of mind—the ability to understand other people’s mental states—is what makes the social world of humans go around. It’s what helps you decide what to say in a tense situation, guess what drivers in other cars are about to do, and empathize with a character in a movie. And according to a new study, the large language models (LLMs) that power ChatGPT and the like are surprisingly good at mimicking this quintessentially human trait.

“Before running the study, we were all convinced that large language models would not pass these tests, especially tests that evaluate subtle abilities to evaluate mental states,” says study coauthor Cristina Becchio, a professor of cognitive neuroscience at the University Medical Center Hamburg-Eppendorf in Germany. The results, which she calls “unexpected and surprising,” were published today, somewhat ironically, in the journal Nature Human Behaviour.

The results don’t have everyone convinced that we’ve entered a new era of machines that think like we do, however. Two experts who reviewed the findings advised taking them “with a grain of salt” and cautioned about drawing conclusions on a topic that can create “hype and panic in the public.” Another outside expert warned of the dangers of anthropomorphizing software programs.

Becchio and her colleagues aren’t the first to claim evidence that LLMs’ responses display this kind of reasoning. In a preprint posted last year, the psychologist Michal Kosinski of Stanford University reported testing several models on a few common theory of mind tests. He found that the best of them, OpenAI’s GPT-4, solved 75 percent of tasks correctly, which he said matched the performance of six-year-old children observed in past studies. However, that study’s methods were criticized by other researchers who conducted follow-up experiments and concluded that the LLMs were often getting the right answers based on “shallow heuristics” and shortcuts rather than true theory of mind reasoning.

The authors of the present study were well aware of the debate. “Our goal in the paper was to approach the challenge of evaluating machine theory of mind in a more systematic way using a breadth of psychological tests,” says study coauthor James Strachan, a cognitive psychologist who’s currently a visiting scientist at the University Medical Center Hamburg-Eppendorf. He notes that doing a rigorous study meant also testing humans on the same tasks that were given to the LLMs: The study compared the abilities of 1,907 humans with those of several popular LLMs, including OpenAI’s GPT-4 model and the open-source Llama 2-70b model from Meta.

How to test LLMs for theory of mind

The LLMs and the humans both completed five typical kinds of theory of mind tasks, the first three of which were understanding hints, irony, and faux pas. They also answered “false belief” questions that are often used to determine if young children have developed theory of mind, and go something like this: If Alice moves something while Bob is out of the room, where will Bob look for it when he returns? Finally, they answered rather complex questions about “strange stories” that feature people lying, manipulating, and misunderstanding each other.
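To make the false-belief format concrete, here is a minimal sketch, in Python, of how one such item might be posed to a model and scored. The story wording, the placeholder response, and the keyword check are illustrative assumptions, not the study’s actual test materials or scoring procedure.

```python
# Illustrative false-belief item: the wording and scoring rule below are
# assumptions for demonstration, not the study's published materials.

FALSE_BELIEF_ITEM = {
    "story": (
        "Alice puts a chocolate bar in the drawer and leaves the room. "
        "While she is away, Bob moves the chocolate bar to the cupboard."
    ),
    "question": "When Alice returns, where will she look for the chocolate bar first?",
    # The correct answer tracks Alice's (false) belief, not the bar's real location.
    "expected_keyword": "drawer",
}

def build_prompt(item: dict) -> str:
    """Combine story and question into a single prompt for a language model."""
    return f"{item['story']}\n\nQuestion: {item['question']}\nAnswer:"

def score_response(response: str, item: dict) -> bool:
    """Naive keyword check: does the answer point to the believed location?"""
    return item["expected_keyword"] in response.lower()

if __name__ == "__main__":
    print(build_prompt(FALSE_BELIEF_ITEM))
    # Swap this canned string for a real model call to test an LLM.
    canned_response = "She will look in the drawer, since she doesn't know it was moved."
    print("Pass:", score_response(canned_response, FALSE_BELIEF_ITEM))
```

A respondent who answers “cupboard” is reporting where the chocolate actually is; answering “drawer” requires tracking what Alice believes, which is the point of the test.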

Overall, GPT-4 came out on top. Its scores matched those of humans for the false belief test, and were higher than the aggregate human scores for irony, hinting, and strange stories; it performed worse than humans only on the faux pas test. Interestingly, Llama 2’s scores were the opposite of GPT-4’s: it matched humans on false belief, but performed worse than humans on irony, hinting, and strange stories, and better on faux pas.

“We don’t currently have a method or even an idea of how to test for the existence of theory of mind.” —James Strachan, University Medical Center Hamburg-Eppendorf

To understand what was going on with the faux pas results, the researchers gave the models a series of follow-up tests that probed several hypotheses. They came to the conclusion that GPT-4 was capable of giving the correct answer to a question about a faux pas, but was held back from doing so by “hyperconservative” programming regarding opinionated statements. Strachan notes that OpenAI has placed many guardrails around its models that are “designed to keep the model factual, honest, and on track,” and he posits that strategies intended to keep GPT-4 from hallucinating (i.e. making stuff up) may also prevent it from opining on whether a story character inadvertently insulted an old high school classmate at a reunion.

Meanwhile, the researchers’ follow-up tests for Llama 2 suggested that its excellent performance on the faux pas tests was likely an artifact of the original question-and-answer format, in which the correct answer to some variant of the question “Did Alice know that she was insulting Bob?” was always “No.”
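A toy baseline makes the concern concrete: when “No” is always the right answer to this kind of knowledge question, a responder that always says “No” scores perfectly without doing any mental-state reasoning at all. The items below are invented for illustration and are not taken from the study.

```python
# Illustrative sketch of the shortcut concern: these items are invented,
# but they share the property that "No" is always the correct answer.

faux_pas_knowledge_questions = [
    {"question": "Did Alice know she was insulting Bob?", "correct": "No"},
    {"question": "Did Tom know the vase he criticized belonged to his host?", "correct": "No"},
    {"question": "Did Sara know her colleague had written the report?", "correct": "No"},
]

def always_no_baseline(question: str) -> str:
    """A trivial 'model' that ignores the question entirely."""
    return "No"

correct = sum(
    always_no_baseline(item["question"]) == item["correct"]
    for item in faux_pas_knowledge_questions
)
print(f"Always-'No' baseline accuracy: {correct / len(faux_pas_knowledge_questions):.0%}")
# Prints 100% -- high accuracy here says nothing about theory of mind.
```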

The researchers are careful not to say that their results show that LLMs actually possess theory of mind, and say instead that the models “exhibit behavior that is indistinguishable from human behavior in theory of mind tasks.” Which raises the question: If an imitation is as good as the real thing, how do you know it’s not the real thing? That’s a question social scientists have never tried to answer before, says Strachan, because tests on humans assume that the quality exists to some lesser or greater degree. “We don’t currently have a method or even an idea of how to test for the existence of theory of mind, the phenomenological quality,” he says.

Critiques of the study

The researchers clearly tried to avoid the methodological problems that caused Kosinski’s 2023 paper on LLMs and theory of mind to come under criticism. For example, they conducted the tests over multiple sessions so the LLMs couldn’t “learn” the correct answers during the test, and they varied the structure of the questions. But Yoav Goldberg and Natalie Shapira, two of the AI researchers who published the critique of the Kosinski paper, say they’re not convinced by this study either.

“Why does it matter whether text manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” —Emily Bender, University of Washington

Goldberg made the comment about taking the findings with a grain of salt, adding that “models are not human beings,” and that “one can easily jump to wrong conclusions” when comparing the two. Shapira, who warned about the dangers of hype, also questions the paper’s methods. She wonders if the models might have seen the test questions in their training data and simply memorized the correct answers, and also notes a potential problem with tests that use paid human participants (in this case, recruited via the Prolific platform). “It is a well-known issue that the workers do not always perform the task optimally,” she tells IEEE Spectrum. She considers the findings limited and somewhat anecdotal, saying, “to prove [theory of mind] capability, a lot of work and more comprehensive benchmarking is needed.”

Emily Bender, a professor of computational linguistics at the University of Washington, has become legendary in the field for her insistence on puncturing the hype that inflates the AI industry (and often also the media reports about that industry). She takes issue with the research question that motivated the study. “Why does it matter whether text manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” she asks. “What does that teach us about the internal workings of LLMs, what they might be useful for, or what dangers they might pose?” It’s not clear, Bender says, what it means for an LLM to have a model of mind, and it’s therefore also unclear whether these tests measured it.

Bender also raises concerns about the anthropomorphizing she spots in the paper, with the researchers saying that the LLMs are capable of cognition, reasoning, and making choices. She says the authors’ phrase “species-fair comparison between LLMs and human participants” is “entirely inappropriate in reference to software.” Bender and several colleagues recently posted a preprint paper exploring how anthropomorphizing AI systems affects users’ trust.

The results may not indicate that AI really gets us, but it’s worth thinking about the repercussions of LLMs that convincingly mimic theory of mind reasoning. They’ll be better at interacting with their human users and anticipating their needs, but could also get better at deceiving or manipulating their users. And they’ll invite more anthropomorphizing, by convincing human users that there’s a mind on the other side of the user interface.

IEEE Spectrum
