Microsoft's new rStar-Math technique refines small models to outperform OpenAI's o1-preview at math problems


Microsoft is doubling down on the potential of small language models (SLMs) with the unveiling of rStar-Math, a new reasoning technique that can be applied to small models to boost their performance on math problems – performance similar to, and in some cases exceeding, that of OpenAI's o1-preview model.

While it is still in the research stage – as described in a paper published on the pre-print site arXiv.org and credited to eight authors at Microsoft, Peking University and Tsinghua University in China – the technique has been applied to several smaller open-source models, including Microsoft's own Phi-3 mini, Alibaba's Qwen-1.5B (a 1.5-billion-parameter model), and Qwen-7B (a 7-billion-parameter model). It improved performance on all of them, even exceeding OpenAI's previously most advanced model on the MATH (word problem solving) third-party benchmark of 12,500 questions covering branches such as geometry and algebra, at all levels of difficulty.

Ultimately, according to a post on Hugging Face, the researchers plan to make their code and data available on GitHub at https://github.com/microsoft/rStar, although one of the paper's authors, Li Lyna Zhang, wrote in the comments on the Hugging Face post that the team is “still going through the internal review process for the open source release.” As such, “the repository remains private for now. Please stay tuned!”

Community members expressed enthusiasm, calling the innovations “impressive” and praising the combination of Monte Carlo Tree Search (MCTS) with step-by-step reasoning. One commenter highlighted the simplicity of using Q-values for step scoring, while others speculated about future applications in geometric proofs and symbolic reasoning.

This news follows closely on the heels of Microsoft's open-sourcing of its Phi-4 model, a smaller 14-billion-parameter AI system now available on Hugging Face under the permissive MIT license.

While the Phi-4 release has expanded access to high-performance small models, rStar-Math showcases a specialized approach: using smaller AI systems to achieve state-of-the-art results in mathematical reasoning.

rStar-Math works by using several different models and components to help a small target model ‘self-evolve’

The key to rStar-Math is that it leverages Monte Carlo Tree Search (MCTS), a method that mimics human “deep thinking” by iteratively refining step-by-step solutions to mathematical problems.

The researchers used MCTS because it “breaks down complex math problems into simpler single-step generation tasks, reducing the difficulty” for smaller models.
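To make that concrete, here is a minimal, self-contained sketch of what MCTS-style step search with Q-value scoring could look like for this kind of problem. The helpers `propose_steps` (a policy model suggesting candidate next steps) and `rollout_reward` (an estimate of how promising a partial solution is) are illustrative assumptions; this is not Microsoft's actual code, which has not yet been released.

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps          # partial solution: list of reasoning steps
        self.parent = parent
        self.children = []
        self.visits = 0
        self.q_value = 0.0          # running estimate of step quality

def ucb(node, c=1.4):
    # Standard UCB1 score: balance exploitation (Q) against exploration.
    if node.visits == 0:
        return float("inf")
    return node.q_value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts_solve(problem, propose_steps, rollout_reward, iterations=100):
    root = Node(steps=[])
    for _ in range(iterations):
        # 1. Selection: descend to a leaf by maximizing UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: ask the policy model for candidate next steps.
        for step in propose_steps(problem, node.steps):
            node.children.append(Node(node.steps + [step], parent=node))
        # 3. Simulation: estimate how promising this partial solution is.
        leaf = random.choice(node.children) if node.children else node
        reward = rollout_reward(problem, leaf.steps)
        # 4. Backpropagation: update visit counts and Q-values up the tree.
        while leaf is not None:
            leaf.visits += 1
            leaf.q_value += (reward - leaf.q_value) / leaf.visits
            leaf = leaf.parent
    # Return the partial solution rooted at the most-visited first step.
    return max(root.children, key=lambda n: n.visits).steps
```

In rStar-Math, the role that `rollout_reward` plays here is filled by verified rollouts and the process preference model's step scores, which is where the Q-values praised by commenters come in.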

However, they did not apply MCTS alone, as other researchers have done. Instead, in a stroke of brilliance, they also require the model they trained to always output its “chain-of-thought” reasoning steps as both natural language descriptions and Python code.

They mandated that the model include the natural language responses as Python code comments, and only those outputs whose Python code executed successfully would be used to train the model.
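As an illustration of that idea (the exact prompt and output format are not public, so the example below is an assumption), a single verified training step might look like this, with the natural-language reasoning embedded as comments and the step kept only if its Python code runs cleanly:

```python
# Illustrative example of code-augmented chain-of-thought: each reasoning
# step is Python code with the natural-language reasoning as comments.

candidate_step = """
# Step 2: The train travels 120 miles in 2 hours,
# so its speed is distance divided by time.
distance = 120
time = 2
speed = distance / time  # 60 mph
"""

def step_executes(code: str) -> bool:
    # Keep a step only if its Python code executes without error;
    # this filters out reasoning whose arithmetic doesn't hold together.
    try:
        exec(code, {})
        return True
    except Exception:
        return False

training_data = [candidate_step] if step_executes(candidate_step) else []
```

Executing the code acts as a cheap correctness check: a step with faulty arithmetic will often fail or produce an inconsistent value, so flawed chains of thought can be filtered out before training.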

The researchers also trained a “policy model” to generate math reasoning steps and a process preference model (PPM) to select the most promising steps toward solving the problems, and improved both over four rounds of “self-evolution,” with each model improving the other.
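A high-level sketch of what that co-evolution loop could look like follows. The callables `search`, `finetune`, and `train_ppm` are hypothetical stand-ins for the paper's MCTS search, policy fine-tuning, and PPM training routines, which the authors have not yet open-sourced:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Solution:
    steps: List[str]
    verified: bool  # e.g., final answer checks out and Python steps execute

def self_evolve(policy_model, ppm, problems,
                search: Callable, finetune: Callable, train_ppm: Callable,
                rounds: int = 4):
    for _ in range(rounds):
        # Collect only verified step-by-step solutions from the search,
        # with the PPM steering it toward the most promising steps.
        trajectories = [s for p in problems
                        if (s := search(p, policy_model, ppm)).verified]
        # Each model is improved with data produced with the other's help.
        policy_model = finetune(policy_model, trajectories)
        ppm = train_ppm(ppm, trajectories)
    return policy_model, ppm
```

Passing the search and training routines in as callables keeps the sketch runnable as a pure skeleton; the point of the loop is that each round yields higher-quality training trajectories, which is what lets the two models bootstrap each other.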

For their starting data, the researchers said they used “747,000 math word problems from publicly available sources,” along with their solutions, but generated new steps for solving them with the two models described above.

Record-breaking results

After four rounds of self-evolution, rStar-Math achieved significant milestones:

• On the MATH benchmark, the accuracy of the Qwen2.5-Math-7B model jumped from 58.8% to 90.0%, outperforming OpenAI o1-preview.

• On the American Invitational Mathematics Examination (AIME), it solved 53.3% of problems, placing among the top 20% of high school competitors.

These results highlight the power of SLMs in handling complex mathematical reasoning, a domain traditionally dominated by larger systems.

Is smaller better?

In recent years, AI innovation has largely been driven by scaling up language models, with ever more parameters seen as the path to better performance. However, the high costs associated with these large models, from computing resources to energy consumption, have raised questions about scalability.

Microsoft is charting an alternative path, one focused on efficiency. The release of rStar-Math reinforces this approach by demonstrating how SLMs can rival – and in some cases surpass – the capabilities of their larger counterparts.

Microsoft's dual releases of Phi-4 and the rStar-Math paper suggest that compact, specialized models can provide powerful alternatives to the industry's largest systems.

What's more, by outperforming larger rivals in key benchmarks, these models challenge the notion that bigger is always better. They open doors for medium-sized organizations and academic researchers to access cutting-edge capabilities without the financial or environmental burden of large models.


