Breakthrough Research: Training with Only 1,000 Questions Upends Compute Costs of OpenAI, DeepSeek

A team from Stanford University and the University of Washington finds that test-time scaling can yield competitive results at a fraction of the compute cost of models like OpenAI’s GPT-4

A group of AI researchers has unveiled a new approach that could significantly reduce the costs of developing advanced language models while improving their reasoning capabilities. The method, called test-time scaling, allows models to enhance their performance after training by allocating extra computational power when generating responses. This breakthrough challenges the traditional paradigm followed by AI giants such as OpenAI and DeepSeek, which rely on expensive large-scale training.

According to the research paper, authored by a team from Stanford University and the University of Washington, test-time scaling can yield competitive results with a fraction of the compute cost used by models like OpenAI’s GPT-4. “Our approach demonstrates that supervised fine-tuning on just 1,000 high-quality examples, combined with test-time budget forcing, is sufficient to match or exceed the performance of more complex models,” the authors state.

Traditional AI is like a blitz chess player forced to move within 10 seconds no matter how hard the position is: responses come quickly, but the risk of blunders rises. Test-time scaling instead lets the model allocate its time selectively, spending a few extra seconds on an easy problem and significantly more on a harder one, like a chess player choosing when to slow down and analyse deeper.

Inference does get slower, but the trade-off is higher accuracy on complex tasks. The research paper demonstrates that this extra reasoning time can dramatically improve results, outperforming even models trained with far more data. The real breakthrough is that the approach doesn’t require millions of dollars in extra training; it only changes how the model reasons after training. You don’t need a bigger model, just a smarter way to use it.

A Challenge to AI Heavyweights

Traditionally, improving AI performance has relied on training models on increasingly massive datasets, requiring hundreds of millions of dollars in computing resources. OpenAI’s o1 model, for example, is reported to rely on large-scale reinforcement learning, though its exact methodology remains undisclosed. DeepSeek’s R1 model, an open-weight alternative, also uses large-scale reinforcement learning, but needs millions of examples and multiple training stages to reach its reasoning capabilities.

By contrast, the test-time scaling approach bypasses the need for such extensive training. Instead, it fine-tunes a base model, Qwen2.5-32B-Instruct in this case, on a small dataset and extends its reasoning process dynamically during inference using a method called budget forcing.
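To make the recipe concrete, here is a hypothetical sketch of the fine-tuning step using Hugging Face’s trl library; the dataset identifier, column names, and hyperparameters are illustrative assumptions, not the authors’ exact script.

```python
# Hypothetical sketch of the supervised fine-tuning (SFT) step using trl.
# Dataset id, column names and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed Hub id for the paper's curated set of 1,000 reasoning examples.
dataset = load_dataset("simplescaling/s1K", split="train")

# Concatenate each question with its reasoning trace into one training string
# (column names here are assumed).
def to_text(example):
    return {"text": example["question"] + "\n" + example["solution"]}

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # base model fine-tuned in the paper
    train_dataset=dataset.map(to_text),
    args=SFTConfig(output_dir="s1-sft", num_train_epochs=5),  # illustrative settings
)
trainer.train()
```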

Budget forcing allows researchers to control the amount of thinking time allocated to a model. When the model tries to conclude an answer prematurely, the process forces it to rethink by appending prompts such as “Wait,” leading to improved accuracy and better reasoning. The authors describe this as a sequential scaling process that mirrors human cognitive behaviour: “Suppressing the end-of-thinking token delimiter and appending ‘Wait’ encourages deeper exploration and self-correction”.
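In code, the idea is simple. Below is a minimal sketch of budget forcing with a stubbed-out model call; the delimiter string, function names, and the number of forced continuations are assumptions for illustration, not the paper’s implementation.

```python
# Minimal sketch of budget forcing with a stubbed-out model call.
# The delimiter string and loop bound are illustrative assumptions.

END_OF_THINKING = "<|end_of_thinking|>"  # assumed delimiter token

def model_generate(context: str, stop: str | None) -> str:
    """Stand-in for a real LLM call that generates until `stop` (or completion)."""
    return " ...some reasoning... "  # stub so the sketch runs end to end

def budget_forced_answer(question: str, forced_continuations: int = 2) -> str:
    trace = question
    for _ in range(forced_continuations):
        # Let the model reason until it tries to emit the end-of-thinking delimiter.
        trace += model_generate(trace, stop=END_OF_THINKING)
        # Suppress the delimiter and append "Wait" so the model keeps reasoning,
        # often catching and correcting its own mistakes.
        trace += "Wait"
    # Finally allow the model to close its reasoning and produce the answer.
    trace += model_generate(trace, stop=None)
    return trace

print(budget_forced_answer("How many r's are in 'raspberry'?"))
```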

Drastically Lowering Costs

One of the most striking findings from the study is the cost reduction associated with this method. Unlike OpenAI and DeepSeek, whose training approaches require thousands of high-performance GPUs over weeks or months, the test-time scaling model was trained in just 26 minutes on 16 NVIDIA H100 GPUs. “Training on our full dataset of 59K examples took 394 GPU hours, while our 1K selection required just seven hours,” the paper notes.
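The reported figures are internally consistent, as a quick back-of-the-envelope check shows: 16 GPUs running for 26 minutes comes to roughly seven GPU-hours, the same cost quoted for the 1K-example run.

```python
# Sanity check of the reported training cost: 16 H100s for 26 minutes.
gpus, minutes = 16, 26
gpu_hours = gpus * minutes / 60
print(f"{gpu_hours:.1f} GPU-hours")  # prints 6.9, matching "just seven hours"
```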

This dramatic efficiency gain raises questions about whether the industry’s obsession with large-scale pretraining is sustainable. The research suggests that more compute-efficient methods could democratise AI development, making it accessible to smaller labs and researchers who lack the vast resources of OpenAI or Google DeepMind.

Comparing Performance

Despite its low-resource training, the test-time scaling model (s1-32B) achieves comparable or superior performance on key benchmarks. According to the paper, it exceeded OpenAI’s o1-preview model by up to 27% on competition-level math problems and showed promising results on scientific reasoning tasks. The results indicate that even a relatively small dataset, when carefully curated, can lead to highly effective reasoning models.

However, the method is not without limitations. Unlike RLHF-based models, which align AI outputs with human preferences through reinforcement learning, test-time scaling lacks explicit preference tuning. Furthermore, increasing test-time compute can slow down real-time applications, making it less suitable for use cases requiring immediate responses.

Hybrid Models

The authors argue that future AI research should focus on hybrid models, integrating test-time scaling with reinforcement learning approaches to balance cost, efficiency, and reasoning ability. They also suggest that future iterations could use adaptive budgeting, where the model dynamically adjusts its compute allocation based on task complexity.
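As a rough illustration of what adaptive budgeting could look like, the speculative sketch below maps an estimated task difficulty to a thinking-token budget. Nothing here comes from the paper; the scaling function and constants are invented for illustration.

```python
# Speculative sketch of adaptive budgeting (not implemented in the paper):
# scale the thinking-token budget with an estimated task difficulty.

def thinking_budget(difficulty: float,
                    base_tokens: int = 512,
                    max_tokens: int = 8192) -> int:
    """Map a difficulty estimate in [0, 1] to a token budget."""
    assert 0.0 <= difficulty <= 1.0
    return min(max_tokens, int(base_tokens * (1 + 15 * difficulty)))

# Example: an easy question gets ~512 tokens, a hard one the full 8192.
print(thinking_budget(0.0))  # 512
print(thinking_budget(1.0))  # 8192
```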

As AI models become increasingly central to applications ranging from scientific discovery to automated customer support, test-time scaling offers a compelling alternative to traditional approaches. If widely adopted, it could disrupt the economic model of AI development, making advanced reasoning capabilities available without the astronomical costs of large-scale training.

For now, the model, dataset, and code are open-source and available on GitHub, allowing researchers to experiment with this new paradigm. “This work defines the right metrics–Control, Scaling, and Performance–to enable future research on extrapolating test-time compute,” the authors conclude.
