---
library_name: transformers
tags: []
---

This is an outcome-supervised reward model (ORM) trained on Mistral-generated data from the project [RLHFlow/RLHF-Reward-Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling).

The model is trained from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on [RLHFlow/Mistral-ORM-Data](https://huggingface.co/datasets/RLHFlow/Mistral-ORM-Data) for 1 epoch. We use a global batch size of 32 and a learning rate of 2e-6. Samples are packed together and split into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/math/llama-3.1-prm.yaml .
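As a rough illustration of the packing step described above, the following sketch concatenates tokenized samples and cuts the result into fixed-size chunks. The function name `pack_and_chunk` and the choice to drop the trailing partial chunk are assumptions made for this example; the actual training code may handle separators and remainders differently.

```python
from itertools import chain
from typing import List


def pack_and_chunk(samples: List[List[int]], chunk_size: int = 8192) -> List[List[int]]:
    """Concatenate tokenized samples and split them into fixed-size chunks.

    Hypothetical helper: it only illustrates the idea of packing.
    """
    # Flatten all token id lists into one long sequence.
    flat = list(chain.from_iterable(samples))
    # Assumption: drop the trailing remainder so every chunk has exactly chunk_size tokens.
    usable = len(flat) - len(flat) % chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]


# Toy example with a tiny chunk size instead of 8192.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_chunks = pack_and_chunk(toy_samples, chunk_size=4)
print(toy_chunks)
```

Packing avoids spending compute on padding tokens, at the cost of chunks that may start or end in the middle of a sample.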


## Best-of-N (BoN) evaluation results for the Mistral generator

| Model      | Method     | GSM8K     | MATH |
| ------------- | ------------- | ------------- | -------- |
| Mistral-7B | Pass@1 | 77.9 |  28.4   |
| Mistral-7B | Majority Voting@1024 | 84.2 | 36.8  |
| Mistral-7B | Mistral-ORM@1024 | 90.1 | 43.6 |
| Mistral-7B | Mistral-PRM@1024 | 92.4 | 46.3 |
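For readers unfamiliar with the methods in the table, the sketch below contrasts majority voting with reward-model best-of-N selection over a set of sampled solutions. The function names and the reward scores are made up for illustration; in practice the scores would come from the ORM, and the evaluation scripts in the repository may differ in detail.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Return the final answer of the candidate with the highest reward score."""
    return max(candidates, key=lambda c: c[1])[0]


# Hypothetical (final_answer, reward_score) pairs for one problem.
samples = [("12", 0.2), ("12", 0.3), ("15", 0.9)]

voted = majority_vote([answer for answer, _ in samples])
selected = best_of_n(samples)
print(voted, selected)
```

In this toy case the two methods disagree: voting picks the most common answer, while best-of-N trusts the single highest-scored sample.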

## Scaling inference sampling to N=1024 for the Deepseek generator

| Model         | Method                    | GSM8K | MATH |
| ------------- | ------------- | ------------- | -------- |
| Deepseek-7B | Pass@1 | 83.9 | 38.4 |
| Deepseek-7B | Majority Voting@1024 | 89.7 | 57.4  |
| Deepseek-7B | Deepseek-ORM@1024 | 93.4 | 52.4 |
| Deepseek-7B | Deepseek-PRM@1024 | 93.0 | 58.1 |
| Deepseek-7B | Mistral-ORM@1024 (OOD) | 90.3 | 54.9 |
| Deepseek-7B | Mistral-PRM@1024 (OOD) | 91.9 | 56.9 |

## Visualization


![image/png](https://cdn-uploads.huggingface.co/production/uploads/643e59806db6ba8c5ee123f3/i622m76fvKv8drLmwl8Q3.png)

## Usage 

See https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/math for detailed examples. 

## Citation

The automatic annotation was proposed in the Math-shepherd paper:

```
@inproceedings{wang2024math,
  title={Math-shepherd: Verify and reinforce llms step-by-step without human annotations},
  author={Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={9426--9439},
  year={2024}
}
```

If you find the training recipe useful, please consider citing it as follows.

```
@misc{xiong2024rlhflowmath,
  author = {Wei Xiong and Hanning Zhang and Nan Jiang and Tong Zhang},
  title = {An Implementation of Generative PRM},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/RLHFlow/RLHF-Reward-Modeling}}
}
```