Are LLMs Evolving into Reasoning Models or Simply Mimicking Reasoning?

The future of AI may hinge not just on bigger models, but on building true reasoning—beyond clever imitation.
AI is graduating from recognition to reasoning, and organizations must follow suit by scaling their computing power with purpose-built AI infrastructure. As AI systems that learn by imitating the mechanisms of the human brain continue to advance, we are watching models evolve from rote regurgitation toward genuine reasoning, giving rise to Large Reasoning Models (LRMs). This marks a new chapter in the development of AI, and in what enterprises stand to gain from it. To tap into this potential, organizations will need the right infrastructure and computational resources to support the advancing technology.
LRMs are Commercially Lucrative
Some companies are betting big on LRMs as the basis for commercially lucrative AI assistants. OpenAI, for example, has released its best LRMs and the associated “Deep Research” tool to subscribers paying $200 per month, and is reportedly considering charging up to $20,000 per month for reasoning models that can carry out “PhD-level research.” Some researchers, however, are skeptical of LRMs, questioning whether they truly think and reason or merely mimic human reasoning well enough to succeed on specific benchmarks.
A slew of large reasoning models featuring diverse strategies has made the LRM landscape rather exciting. OpenAI’s o1/o3 series uses reinforcement learning for versatile problem-solving; DeepSeek R1 employs a Mixture-of-Experts architecture for efficiency in coding; Google’s Gemini 2.0 integrates multimodal inputs with a “thinking out loud” mode for enhanced reasoning; and open-source projects such as QwQ and Sky-T1 showcase advanced reasoning through curated data and innovative fine-tuning.
The Reasoning Milestone
Reasoning capabilities are also a milestone in the proliferation of agentic AI systems: autonomous applications that perform tasks on behalf of users, such as scheduling appointments or booking travel itineraries. Whether you’re asking an AI to make a reservation, summarize the literature, fold a towel, or pick up a rock, it first needs to understand its environment (what we call perception), comprehend the instructions, and then move into a planning and decision-making phase before it can act, as sketched below.
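To make that perceive-plan-act framing concrete, here is a minimal sketch of an agentic loop. The names (`Task`, `perceive`, `plan`, `act`) and the hard-coded placeholder logic are illustrative assumptions, not any vendor’s API; a real agent would call an LRM for planning and external tools for actions.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A user request handed to the agent, e.g. 'book a table for two on Friday'."""
    instruction: str
    done: bool = False
    steps_taken: list = field(default_factory=list)


def perceive(task: Task) -> dict:
    # Perception: gather the state of the environment relevant to the task.
    # In a real agent this might mean reading a calendar, a web page, or sensor data.
    return {"instruction": task.instruction, "context": "stub environment state"}


def plan(observation: dict) -> list[str]:
    # Planning: break the instruction into concrete steps.
    # An LRM would produce this decomposition; here we return a placeholder step.
    return [f"step derived from: {observation['instruction']}"]


def act(step: str, task: Task) -> None:
    # Action: execute one step (call a tool, an API, or a device) and record it.
    task.steps_taken.append(step)


def run_agent(task: Task, max_iterations: int = 3) -> Task:
    """Perceive -> plan -> act, repeated until the task is done or we give up."""
    for _ in range(max_iterations):
        observation = perceive(task)
        for step in plan(observation):
            act(step, task)
        task.done = True  # a real agent would verify completion instead of assuming it
        if task.done:
            break
    return task


if __name__ == "__main__":
    result = run_agent(Task("book a table for two on Friday"))
    print(result.steps_taken)
```

The point of the loop structure is that reasoning sits in the middle: without a planning step that decomposes the instruction, the agent can only react, not decide.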
Illusion of Thinking
But a recent research paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” authored by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar, raised serious doubts about the claimed capabilities of LRMs. The authors wrote: “While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality.”
A group of researchers used the GSM8K benchmark to evaluate the reasoning capabilities of LLMs and gauge how far they have actually evolved toward LRMs. GSM8K assesses the mathematical reasoning of models on grade-school-level questions, and the performance of LLMs has been anything but remarkable. “Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark… their performance significantly deteriorates as the number of clauses in a question increases,” the researchers observed.
Imagine AI math whizzes as chefs who have mastered following recipes perfectly. They ace tests like GSM8K (a popular math quiz for AIs) by memorizing ingredient combinations. But what if we tweak the recipe? The study reveals these chefs stumble when simple changes occur, such as swapping “5 apples” for “7 apples”, even when the math stays identical. Their scores drop, revealing that they are not truly reasoning, just mimicking steps. The paper notes: “While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics.”
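As a rough illustration of that kind of perturbation test (not the paper’s actual evaluation harness), the sketch below turns one GSM8K-style question into a template, swaps in fresh numbers, and checks whether a model’s answer still matches the recomputed ground truth. The question template is invented for illustration, and `ask_model` is a hypothetical stand-in for whatever LLM call you would actually use.

```python
import random


def ask_model(question: str) -> int:
    """Hypothetical stand-in for an LLM call; wire up your own API client here."""
    raise NotImplementedError("plug in a real model call")


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate a GSM8K-style template with fresh numbers and its true answer."""
    apples = rng.randint(3, 9)      # e.g. swap "5 apples" for "7 apples"
    price = rng.randint(2, 6)
    question = (
        f"Sam buys {apples} apples at {price} dollars each. "
        f"How many dollars does Sam spend?"
    )
    return question, apples * price  # the ground truth moves with the numbers


def perturbation_accuracy(n_variants: int = 50, seed: int = 0) -> float:
    """Fraction of numerically perturbed variants the model still answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_variants):
        question, truth = make_variant(rng)
        try:
            correct += int(ask_model(question) == truth)
        except NotImplementedError:
            return float("nan")  # no model wired up in this sketch
    return correct / n_variants
```

A model that genuinely reasons should score about the same on these renumbered variants as on the originals; a large gap suggests it has memorized surface patterns rather than the underlying arithmetic.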