A team of international researchers from leading academic institutions and tech companies upended the AI reasoning landscape on Wednesday with a new model that matched—and occasionally surpassed—one of China’s most sophisticated AI systems: DeepSeek.
OpenThinker-32B, developed by the Open Thoughts consortium, achieved a 90.6% accuracy score on the MATH500 benchmark, edging past DeepSeek’s 89.4%.
The model also outperformed DeepSeek on general problem-solving tasks, scoring 61.6 on the GPQA-Diamond benchmark to DeepSeek’s 57.6, and posted 68.9 on the LCBv2 coding benchmark.
In other words, it beats a similarly sized version of DeepSeek R1 at general scientific knowledge (GPQA-Diamond) and at MATH500, while losing on the AIME benchmarks, both of which aim to measure math proficiency.
It is also slightly worse than DeepSeek at coding, scoring 68.9 points to DeepSeek’s 71.2, but since the model is fully open source, these scores could improve considerably once the community starts building on it.
What set this achievement apart was its efficiency: OpenThinker required only 114,000 training examples to reach these results, while DeepSeek used 800,000.
The OpenThoughts-114k dataset came packed with detailed metadata for each problem: ground truth solutions, test cases for code problems, starter code where needed, and domain-specific information.
Its custom Curator framework validated code solutions against test cases, while an AI judge handled math verification.
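The Curator framework’s internals aren’t detailed here, but the core idea of validating a generated code solution against ground-truth test cases can be sketched roughly as follows. All names (`verify_solution`, the assumed `solve()` entry point) are illustrative, not taken from the actual framework:

```python
# Minimal sketch of test-case verification for generated code solutions.
# Names are illustrative; the real Curator framework's API will differ.

def verify_solution(solution_code: str, test_cases: list) -> bool:
    """Execute a candidate solution and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(solution_code, namespace)   # load the candidate's function definitions
        solve = namespace["solve"]       # assume the task specifies a `solve()` entry point
    except Exception:
        return False                     # code that fails to load is rejected outright
    for args, expected in test_cases:
        try:
            if solve(*args) != expected:
                return False             # wrong answer on any test case fails the sample
        except Exception:
            return False                 # runtime errors also fail the sample
    return True

# Example: a correct and an incorrect candidate for "add two numbers"
good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
cases = [((1, 2), 3), ((5, 7), 12)]
print(verify_solution(good, cases))  # True
print(verify_solution(bad, cases))   # False
```

Samples whose solutions fail verification can then be dropped, which is how a pipeline like this keeps quality high while the prompt set scales up.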
The team reported using four nodes of eight H100 GPUs each, completing training in approximately 90 hours. A variant trained on a separate dataset of 137,000 unverified samples, run on Italy’s Leonardo supercomputer, consumed 11,520 A100 GPU-hours in roughly 30 hours.
“Verification serves to maintain quality while scaling up diversity and size of training prompts,” the team noted in their documentation. The research indicated that even unverified versions performed well, though they did not match the verified model’s peak results.
The model was built on top of Alibaba’s Qwen2.5-32B-Instruct LLM and supports a modest 16,000-token context window, enough to handle complex mathematical proofs and lengthy coding problems but well below current standards.
This release arrives amid intensifying competition in AI reasoning capabilities, which seems to be advancing at the speed of thought. OpenAI announced on February 12 that all models following GPT-5 would feature reasoning capabilities. One day later, Elon Musk hyped the enhanced problem-solving abilities of xAI’s Grok-3, promising it would be the best reasoning model to date, and just a few hours ago, Nous Research released another open-source reasoning model, DeepHermes, based on Meta’s Llama 3.1.
The field gained momentum after DeepSeek demonstrated comparable performance to OpenAI’s o1 at significantly reduced costs. DeepSeek R1 is free to download, use, and modify, with the training techniques also revealed.
However, unlike Open Thoughts, which decided to open source everything, the DeepSeek development team kept its training data private.
This key difference means developers can likely understand OpenThinker and reproduce its results from scratch more easily than they could with DeepSeek, because they have access to all the pieces of the puzzle.
For the broader AI community, this release demonstrates once again that competitive models can be built without massive proprietary datasets. It may also be a more trusted option for Western developers who remain wary of using a Chinese model, open source or not.
OpenThinker is available for download on Hugging Face. A smaller, less powerful 7-billion-parameter model is also available for lower-end devices.
The Open Thoughts team brings together researchers from American universities including Stanford, Berkeley, and UCLA, alongside Germany’s Juelich Supercomputing Center. It is also backed by the US-based Toyota Research Institute and other players in the EU AI scene.
Edited by Josh Quittner and Sebastian Sinclair