Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

🖥️Code | 🤗Data | 📄Paper

This repo contains the DeepSeekMath-RL-Step-DPO model. It is obtained by performing Step-DPO on DeepSeekMath-RL.
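
Below is a minimal usage sketch with 🤗 Transformers. The loading code is standard; the prompt format shown (step-by-step reasoning with the final answer in `\boxed{}`) is an assumption and should be checked against the repository's exact template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xinlai/DeepSeekMath-RL-Step-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed prompt format: ask for step-by-step reasoning, final answer in \boxed{}.
prompt = "What is 27 * 14? Please reason step by step, and put your final answer within \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```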

Step-DPO is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, when applied to Qwen2-72B-Instruct, Step-DPO achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
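
For context, Step-DPO applies a DPO-style preference loss at the level of individual reasoning steps rather than whole answers. A sketch of the objective in illustrative notation (see the paper for the exact formulation): given a problem $x$, a prefix of preceding reasoning steps $s_{1 \sim k-1}$, a preferred (correct) next step $s_{\mathrm{win}}$, and a rejected (erroneous) step $s_{\mathrm{lose}}$,

$$
\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, s_{1 \sim k-1},\, s_{\mathrm{win}},\, s_{\mathrm{lose}})}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(s_{\mathrm{win}} \mid x, s_{1 \sim k-1})}{\pi_{\mathrm{ref}}(s_{\mathrm{win}} \mid x, s_{1 \sim k-1})}
- \beta \log \frac{\pi_\theta(s_{\mathrm{lose}} \mid x, s_{1 \sim k-1})}{\pi_{\mathrm{ref}}(s_{\mathrm{lose}} \mid x, s_{1 \sim k-1})}
\right) \right],
$$

where $\pi_{\mathrm{ref}}$ is the reference policy, $\sigma$ the sigmoid, and $\beta$ the usual DPO temperature.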

Contact

Please submit an issue or send me an email.
