---
library_name: transformers
tags: []
---
This is an outcome-supervised reward model (ORM) trained on Mistral-generated data from the project [RLHFlow/RLHF-Reward-Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling).
The model is trained from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on [RLHFlow/Mistral-ORM-Data](https://huggingface.co/datasets/RLHFlow/Mistral-ORM-Data) for 1 epoch. We use a global batch size of 32 and a learning rate of 2e-6, and we pack the samples and split them into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/math/llama-3.1-prm.yaml .
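
The packing step mentioned above can be pictured with a minimal sketch. It is an illustration only, not the code used for training (the actual recipe is in the linked config): the function name `pack_and_chunk` is hypothetical, the inputs are assumed to be already tokenized, and dropping the incomplete final chunk is an assumption.

```python
from typing import Iterable, List


def pack_and_chunk(samples: Iterable[List[int]], chunk_len: int = 8192) -> List[List[int]]:
    """Concatenate tokenized samples into one stream and cut it into fixed-size chunks.

    Hypothetical helper for illustration; not taken from the training code.
    """
    stream: List[int] = []
    for token_ids in samples:
        stream.extend(token_ids)  # pack: samples are laid end to end
    # Assumption: the trailing remainder that does not fill a chunk is dropped.
    n_full = len(stream) // chunk_len
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


# Toy example with a chunk length of 4 instead of 8192.
chunks = pack_and_chunk([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_len=4)
print(chunks)
```

With packing, a chunk can contain the end of one sample and the start of the next, so every chunk is the same length and no padding is needed.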
## BoN evaluation results for the Mistral generator
| Model | Method | GSM8K | MATH |
| ------------- | ------------- | ------------- | -------- |
| Mistral-7B | Pass@1 | 77.9 | 28.4 |
| Mistral-7B | Majority Voting@1024 | 84.2 | 36.8 |
| Mistral-7B | Mistral-ORM@1024 | 90.1 | 43.6 |
| Mistral-7B | Mistral-PRM@1024 | 92.4 | 46.3 |
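
The `@1024` rows compare different ways of selecting one final answer from N sampled solutions. The sketch below contrasts majority voting with best-of-N selection by reward score. It is a toy illustration: the candidate answers and scores are made up, and `majority_vote` and `best_of_n` are hypothetical helper names, not functions from the RLHFlow repository.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Return the final answer of the sample with the highest reward score."""
    return max(candidates, key=lambda c: c[1])[0]


# Made-up (final answer, reward score) pairs for N=4 sampled solutions.
candidates = [("42", 0.10), ("41", 0.35), ("42", 0.20), ("17", 0.90)]

voted = majority_vote([answer for answer, _ in candidates])
picked = best_of_n(candidates)
print(voted, picked)
```

The two methods can disagree: voting picks the most common answer regardless of score, while best-of-N trusts the reward model's single highest-scored sample.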
## Scaling inference sampling to N=1024 for the Deepseek generator
| Model | Method | GSM8K | MATH |
| ------------- | ------------- | ------------- | -------- |
| Deepseek-7B | Pass@1 | 83.9 | 38.4 |
| Deepseek-7B | Majority Voting@1024 | 89.7 | 57.4 |
| Deepseek-7B | Deepseek-ORM@1024 | 93.4 | 52.4 |
| Deepseek-7B | Deepseek-PRM@1024 | 93.0 | 58.1 |
| Deepseek-7B | Mistral-ORM@1024 (OOD) | 90.3 | 54.9 |
| Deepseek-7B | Mistral-PRM@1024 (OOD) | 91.9 | 56.9 |
## Visualization
![image/png](https://cdn-uploads.huggingface.co/production/uploads/643e59806db6ba8c5ee123f3/i622m76fvKv8drLmwl8Q3.png)
## Usage
See https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/math for detailed examples.
## Citation
The automatic annotation was proposed in the Math-shepherd paper:
```
@inproceedings{wang2024math,
title={Math-shepherd: Verify and reinforce llms step-by-step without human annotations},
author={Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={9426--9439},
year={2024}
}
```
If you find the training recipe useful, please consider citing it as follows:
```
@misc{xiong2024rlhflowmath,
author={Wei Xiong and Hanning Zhang and Nan Jiang and Tong Zhang},
title = {An Implementation of Generative PRM},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/RLHFlow/RLHF-Reward-Modeling}}
}
```