---
license: apache-2.0
---
# Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
🖥️[Code](https://github.com/dvlab-research/Step-DPO) | 🤗[Data](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) | 📄[Paper](https://arxiv.org/pdf/2406.18629)
This repo contains the **Qwen2-7B-SFT-Step-DPO** model. It is obtained by performing **Step-DPO** on [**Qwen2-7B-SFT**](https://huggingface.co/xinlai/Qwen2-7B-SFT).
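The model can be loaded with the standard `transformers` chat API. The snippet below is a minimal sketch, not an official usage recipe from this repo: the prompt, generation settings, and dtype/device choices are illustrative assumptions.

```python
# Hypothetical usage sketch: loading Qwen2-7B-SFT-Step-DPO with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xinlai/Qwen2-7B-SFT-Step-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Example math question (illustrative, not from the Step-DPO dataset).
messages = [{"role": "user", "content": "What is 15% of 80?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```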
**Step-DPO** is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, when applied to Qwen2-72B-Instruct, Step-DPO achieves scores of **70.8%** and **94.0%** on the test sets of **MATH** and **GSM8K**, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
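Step-DPO applies a DPO-style preference objective to individual reasoning steps rather than whole answers. As a rough sketch of the underlying loss in its standard DPO form (the function name and the illustrative log-probabilities below are my own, not taken from this repo):

```python
import math

def step_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style loss for one preference pair (here: a correct vs. an
    incorrect reasoning step), relative to a frozen reference policy."""
    # Implicit reward margin between the preferred and dispreferred step.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Negative log-sigmoid: small when the policy already prefers the
    # correct step, large when it prefers the incorrect one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the policy assigns relatively more probability
# to the preferred step than the reference policy does.
print(step_dpo_loss(-1.0, -2.0, -1.5, -1.5))  # policy prefers the correct step
print(step_dpo_loss(-2.0, -1.0, -1.5, -1.5))  # policy prefers the wrong step
```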
## Contact
Please open an issue in the [GitHub repository](https://github.com/dvlab-research/Step-DPO) or contact me by [email](mailto:xinlai@cse.cuhk.edu.hk).