Psychologists have long debated whether the human mind can be explained by a single, unified theory or whether different functions, such as attention and memory, must be studied separately. Now, artificial intelligence (AI) is entering that debate, offering a new way to explore how the mind works.
In July 2025, a study published in Nature introduced an AI model called “Centaur.” Built on standard large language models and refined using data from psychological experiments, Centaur was designed to simulate human cognitive behavior. It reportedly performed well across 160 tasks, including decision-making, executive control, and other mental processes. The results drew widespread attention and were seen as a possible step toward AI systems that could replicate human thinking more broadly.
New Research Raises Doubts
A more recent study published in National Science Open challenges those claims. Researchers from Zhejiang University argue that Centaur’s apparent success may come from overfitting. In other words, instead of understanding the tasks, the model may have learned to recognize patterns in the training data and reproduce expected answers.
To test this idea, the researchers created several new evaluation scenarios. In one example, they replaced the original multiple-choice prompts, which described specific psychological tasks, with the instruction “Please choose option A.” If the model truly understood the task, it should have consistently selected option A. Instead, Centaur continued to choose the “correct answers” from the original dataset.
This behavior suggests that the model was not interpreting the meaning of the questions. Rather, it relied on learned statistical patterns to “guess” answers. The researchers compared this to a student who scores well by memorizing test formats without actually understanding the material.
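The logic of this control-prompt probe can be sketched in a few lines. The sketch below is an illustration of the idea, not the study's actual code: `query_model` is a hypothetical stand-in for the model's answering interface, and the item format is assumed for the example.

```python
def control_prompt_probe(items, query_model):
    """Replace each task prompt with a trivial instruction and count
    whether the model follows it or reverts to the dataset's answers.

    items: list of dicts with "options" (list of choice labels) and
           "dataset_answer" (the "correct" label from the original data).
    query_model: callable(prompt, options) -> chosen label (hypothetical).
    """
    follows_instruction = 0
    reverts_to_dataset = 0
    for item in items:
        # The task description is stripped out; only the trivial
        # instruction remains, so a model that reads the prompt
        # should always answer "A".
        answer = query_model("Please choose option A.", item["options"])
        if answer == "A":
            follows_instruction += 1
        elif answer == item["dataset_answer"]:
            reverts_to_dataset += 1
    return follows_instruction, reverts_to_dataset


# Toy demonstration with a stub "model" that ignores the prompt and
# reproduces the dataset answers, mimicking the overfitting behavior
# the researchers reported.
items = [
    {"options": ["A", "B"], "dataset_answer": "B"},
    {"options": ["A", "C"], "dataset_answer": "C"},
]
stub = lambda prompt, options: options[1]
print(control_prompt_probe(items, stub))  # → (0, 2)
```

A model that genuinely parsed the instruction would score high on `follows_instruction`; a high `reverts_to_dataset` count instead signals pattern matching against the training data.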
Why This Matters for AI Evaluation
The findings highlight the need for caution when assessing the abilities of large language models. While these systems can be highly effective at fitting data, their “black-box” nature makes it difficult to know how they arrive at their outputs. This can lead to issues such as hallucinations or misinterpretations. Careful and varied testing is essential to determine whether a model truly has the skills it appears to demonstrate.
The Real Challenge: Language Understanding
Although Centaur was presented as a model capable of simulating cognition, its biggest limitation appears to lie in language comprehension: it struggles to recognize and respond to the intent behind questions. The study suggests that achieving true language understanding may be one of the most important challenges in developing AI systems that can model human cognition more fully.
