A Chess Player Who Can’t Think – The LLM Intelligence Illusion


True intelligence requires more than pattern recognition – it demands reasoning, adaptability and a deeper understanding of the world. The future of AI will not be defined by scaling up existing models but by reimagining how machines learn and think. The next decade will reveal whether LLMs can transcend their current constraints or remain impressive yet fundamentally incomplete.


Imagine a chess player who has memorised thousands of games but cannot think beyond them. Every time they make a move, they do so based on pattern recognition, not genuine strategy. If the board setup is familiar, they perform brilliantly. But introduce an unusual position – one that has never appeared before – and they flounder, unable to formulate a novel plan. This is the predicament of large language models (LLMs) today. Despite their fluency, they are not truly reasoning machines, but sophisticated pattern-matchers constrained by their training data.

A growing body of research suggests that LLMs, including state-of-the-art models like GPT-4, face fundamental limitations in multi-step compositional reasoning. Much like the chess player who cannot think beyond memorised positions, these models reduce complex reasoning to subgraph matching – stitching together partial computations they have seen before rather than genuinely solving the problem. This article explores why LLMs struggle with complex reasoning, how their fundamental architecture limits them, and whether new approaches can break this ceiling.

When Imitation Fails

Humans use shortcuts all the time. A seasoned driver doesn’t consciously compute the physics of stopping a car at a red light; they react instinctively. The key difference between humans and LLMs is that humans know when a shortcut is appropriate – and when a problem demands deliberate reasoning instead.

LLMs, trained via next-token prediction, learn to exploit statistical patterns in text, often guessing their way through problems rather than truly solving them. A study at the Allen Institute for AI demonstrated this flaw: when presented with complex logic puzzles, LLMs performed well only when the solutions closely resembled examples they had encountered in training. When faced with novel, truly unseen problems, their success rate plummeted.
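
To make ‘next-token prediction’ concrete, here is a deliberately tiny sketch – a bigram counter, nothing like a real transformer – in which the ‘model’ can only continue with something it has literally seen in its (made-up) training text:

```python
from collections import Counter, defaultdict

training_text = "the cat sat on the mat . the dog sat on the rug ."
tokens = training_text.split()

# Count which token follows which in the training text.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def predict_next(prev_token: str):
    """Return the most frequent continuation seen in training, or None."""
    seen = counts.get(prev_token)
    return seen.most_common(1)[0][0] if seen else None

print(predict_next("the"))    # a continuation seen in training ('cat')
print(predict_next("zebra"))  # None – no statistics means no answer
```

A real LLM generalises far better than this toy, but the training objective is the same in kind: predict the next token from statistical patterns in the data.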

Think of it as a student who aces practice exams by recognising question templates but fails when given a test requiring actual understanding. The model doesn’t ‘think’ through problems – it recognises familiar structures and extrapolates patterns.

Pattern-Matching Is Not Reasoning

One way to visualise an LLM’s reasoning process is as a vast, interconnected web of precomputed pathways. When asked a question, it doesn’t solve it from first principles but instead finds the closest pre-existing path in its training data and follows it.

Researchers have used computational graph analysis to show that GPT-4’s successes in reasoning tasks correlate with how often the required partial computations appeared in its training data. This suggests that LLMs do not genuinely decompose complex problems – they reassemble past solutions, much like a puzzle solver fitting together familiar pieces rather than constructing a new picture from scratch.

This is why they fail spectacularly at problems requiring deep compositional reasoning. If an LLM encounters a scenario it has never seen before, it has no mechanism for abstract problem-solving – it merely tries to map it onto a known template, often producing incorrect or nonsensical results.
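
As a toy illustration of that failure mode (not the researchers’ actual methodology), imagine multi-digit multiplication decomposed into digit-level sub-steps, where the ‘model’ can only answer when every sub-step happens to fall inside a hypothetical set of patterns seen during training:

```python
# Hypothetical coverage of digit-level sub-products in the training data.
SEEN_IN_TRAINING = {("7", "8"), ("7", "6"), ("4", "8")}

def solve_by_recall(a: str, b: str) -> str:
    """'Solve' a * b only if every digit-pair sub-step looks familiar."""
    needed = {(da, db) for da in a for db in b}
    if needed <= SEEN_IN_TRAINING:
        return str(int(a) * int(b))   # every sub-step was seen before
    return "???"                      # one novel sub-step and recall breaks down

print(solve_by_recall("7", "8"))    # '56'  – fully covered by training
print(solve_by_recall("74", "86"))  # '???' – needs the unseen pair ('4', '6')
```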

The Brute-Force Fallacy

Some argue that scaling up LLMs – giving them more training data and larger neural networks – will eventually lead to genuine reasoning. But history warns us otherwise.

Consider chess AI itself. Early engines, like IBM’s Deep Blue, relied on brute-force search, evaluating millions of positions per second to defeat human grandmasters – but they did not ‘think’ like humans. Later engines, like AlphaZero, took a different approach: they learned principles of play through self-play rather than relying on exhaustive search. This led to superior play while examining far fewer positions per move.

The same lesson applies to LLMs. Current models rely on brute-force techniques, such as Chain-of-Thought (CoT) prompting combined with Reinforcement Learning (RL) search. This involves generating thousands (or even millions) of reasoning trajectories and scoring them – a far cry from human cognition. While this approach boosts performance on standardised benchmarks, it remains computationally wasteful and fundamentally different from how humans reason.
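
The pattern is essentially sample-and-score. In the minimal sketch below, generate_reasoning and score are placeholders standing in for a real LLM call and a real reward model or verifier – the point is only the shape of the search, not any particular system’s implementation:

```python
import random

def generate_reasoning(prompt: str, sample_id: int) -> str:
    """Stand-in for sampling one chain-of-thought from an LLM at high temperature."""
    return f"reasoning trajectory #{sample_id} for: {prompt}"

def score(trajectory: str) -> float:
    """Stand-in for a reward model or verifier; here just a random number."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    """Brute-force search: sample many trajectories, keep the best-scoring one."""
    candidates = [generate_reasoning(prompt, i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("Prove that the sum of two even numbers is even.", n=16))
```

Pushing n higher buys benchmark accuracy at the cost of enormous amounts of computation – exactly the brute-force trade-off described above.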

Rich Sutton’s ‘bitter lesson’ in AI states that general-purpose methods leveraging massive computation tend to outperform hand-designed solutions. However, this assumes that the method itself is scalable. In fields like mathematics or programming – where answers are well-defined – search and RL can work well. But in open-ended domains like writing, medicine, or law, where correctness is ambiguous and context-dependent, brute-force methods falter.

Can Text Alone Teach True Intelligence?

A key limitation of LLMs is their reliance on text as the sole training medium. But human intelligence is not derived purely from language – we learn from sensory experience, trial and error, and interaction with the physical world.

This is why current LLMs struggle with tasks requiring an intuitive grasp of reality. They can pass the U.S. Bar Exam yet stumble over multi-digit arithmetic. They can generate convincing legal arguments but struggle with simple physics problems. This suggests that language alone is insufficient to build robust cognitive models.

Researchers are now exploring multimodal models that integrate vision, audio, and even robotics into AI training. By incorporating sensory inputs, these systems may overcome some of the blind spots inherent in text-only training. But even then, the fundamental challenge remains: current AI architectures are designed for pattern-matching, not true reasoning.

The Future – Agent-Based LLMs and Tool Use?

If brute-force reasoning is unsustainable, what’s the alternative? One promising direction is augmenting LLMs with external tools. Rather than trying to ‘teach’ a model everything, why not let it delegate tasks to specialised systems?

For example, an LLM doesn’t need to memorise complex mathematical formulas – it could simply call a calculator. It doesn’t need to store vast legal precedents – it could query a legal database. This agent-based approach is gaining traction, with companies like OpenAI actively developing models that can interact with external tools.
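
A minimal sketch of that delegation loop is shown below; the tool name and request format are illustrative, not any vendor’s actual API. The ‘model’ emits a structured tool request, and a thin runtime executes it and returns the result:

```python
import ast
import operator

def safe_calculator(expression: str):
    """Evaluate a basic arithmetic expression without using eval()."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def walk(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval").body)

TOOLS = {"calculator": safe_calculator}

def run_agent_step(model_output: dict) -> str:
    """If the model requested a tool, execute it; otherwise return its text."""
    tool = model_output.get("tool")
    if tool in TOOLS:
        return str(TOOLS[tool](model_output["input"]))
    return model_output.get("text", "")

# The model doesn't memorise arithmetic – it delegates it:
print(run_agent_step({"tool": "calculator", "input": "12345 * 6789"}))
```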

Another promising approach is hybrid reasoning systems that combine symbolic logic with deep learning. Traditional AI methods, such as knowledge graphs and rule-based reasoning, excel at handling structured data and logical inference. Integrating these capabilities with LLMs could lead to more reliable, interpretable, and genuinely intelligent systems.
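
As a minimal sketch of what such a coupling could look like – with a hand-written triple store standing in for a knowledge graph and a placeholder draft_answer standing in for the LLM – the neural side proposes a structured claim and the symbolic side either verifies it against explicit facts and rules or flags it as unverified:

```python
# Hand-written facts standing in for a knowledge graph; purely illustrative.
FACTS = {("aspirin", "interacts_with", "warfarin"),
         ("warfarin", "is_a", "anticoagulant")}

def entailed(subject: str, relation: str, obj: str) -> bool:
    """Rule-based check: direct facts hold, and 'interacts_with' is symmetric."""
    if (subject, relation, obj) in FACTS:
        return True
    return relation == "interacts_with" and (obj, relation, subject) in FACTS

def draft_answer(question: str) -> tuple:
    """Placeholder for an LLM call that proposes a structured claim."""
    return ("warfarin", "interacts_with", "aspirin")

claim = draft_answer("Does warfarin interact with aspirin?")
print("verified" if entailed(*claim) else "unverified – do not state as fact")
```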

Further reading: “Have o1 Models Cracked Human Reasoning?” by Nouha Dziri on Substack.
