rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%, respectively. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
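To make the two training innovations concrete, here is a minimal Python sketch of how step-level Q values could be accumulated from MCTS rollouts and then turned into preference pairs for PPM training, as the abstract describes. All names (`StepNode`, `sample_step`, `verify_answer`, the `<answer>` terminal marker) are hypothetical illustrations under stated assumptions, not the released implementation; a real MCTS would also use a UCT selection rule rather than the random revisit used here for brevity.

```python
import random
from dataclasses import dataclass, field

@dataclass
class StepNode:
    """One reasoning step in the search tree."""
    step_text: str
    children: list["StepNode"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        # Average rollout outcome through this step: higher means the step
        # more often leads to a verified-correct final answer.
        return self.value_sum / self.visits if self.visits else 0.0

def rollout_and_backup(root: StepNode, sample_step, verify_answer) -> None:
    """One rollout: extend a trajectory step by step, then propagate the
    terminal reward (1.0 = final answer verified correct, 0.0 = wrong)
    back to every step on the path."""
    path, node = [root], root
    while not node.step_text.endswith("<answer>"):  # hypothetical terminal marker
        if node.children and random.random() < 0.5:
            # Revisit an existing child (a real MCTS would pick by UCT here).
            child = random.choice(node.children)
        else:
            child = StepNode(sample_step(node))  # policy SLM proposes a new step
            node.children.append(child)
        path.append(child)
        node = child
    reward = 1.0 if verify_answer(path) else 0.0
    for n in path:
        n.visits += 1
        n.value_sum += reward

def preference_pairs(node: StepNode) -> list[tuple[str, str]]:
    """Instead of using raw Q values as absolute score labels, pair the
    highest-Q sibling (preferred) against the lowest-Q sibling (rejected)
    at each expansion point; such pairs can train the PPM with a ranking loss."""
    pairs = []
    if len(node.children) >= 2:
        ranked = sorted(node.children, key=lambda c: c.q)
        pairs.append((ranked[-1].step_text, ranked[0].step_text))
    for child in node.children:
        pairs.extend(preference_pairs(child))
    return pairs
```

The key design choice this sketch mirrors is that Q values serve only to rank sibling steps relative to each other, sidestepping the noise of treating them as calibrated per-step scores.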
Community
We present rStar-Math to demonstrate that small language models (SLMs, 1.5B-7B) can rival or even surpass the math reasoning capability of OpenAI o1
holy... shit?
As we are still undergoing the internal review process for open-source release, the repository remains private for now. Please stay tuned!
Very impressive, I love the simplicity of using Q values as annotations! You mention 64 trajectories as some sort of saturation bound; is that right, or have you just not tried scaling this approach even more?
Thank you! On challenging math benchmarks such as AIME, performance nearly saturates with 64 trajectories. For college math, performance continues to improve steadily; however, we did not scale beyond 64 due to the increased search cost. We believe AIME performance can be further improved by synthesizing additional Olympiad-level math problems to improve both the policy model and the process reward model. We leave this as our future work.
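For readers wondering what scaling to 64 trajectories means operationally: at test time, multiple candidate solution trajectories are produced and the PPM decides which answer to return. The paper's actual procedure runs MCTS guided by the PPM; the sketch below shows only the final best-of-N style selection, assuming hypothetical `policy.sample_trajectory` and `ppm.score` helpers.

```python
def select_answer(problem: str, policy, ppm, n: int = 64) -> str:
    """Sample n candidate solution trajectories and return the one the
    process preference model (PPM) scores highest. Raising n buys accuracy
    at extra search cost; per the authors, gains on AIME nearly saturate
    around n = 64, while college-math performance keeps improving."""
    candidates = [policy.sample_trajectory(problem) for _ in range(n)]
    return max(candidates, key=lambda traj: ppm.score(problem, traj))
```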
Thank you for sharing this work. I appreciate the blend of Monte Carlo Tree Search with smaller models to address step-by-step math reasoning. The idea of generating self-verified solutions rather than relying on a larger teacher model is promising, and it is good to see how you handle the complexity of code-based rollouts. I am curious how this approach might adapt to tasks that involve geometric proofs or more symbolic reasoning. It would also be interesting to learn about the practical limits when problems become highly intricate. Overall, this is a thoughtful piece of research, and I look forward to any future expansions into broader math domains.