RLHFlow/Llama3.1-8B-ORM-Deepseek-Data

	---
	library_name: transformers
	tags: []
	---

	This is an outcome-supervised reward model (ORM) trained on Deepseek-generated data from the project [RLHFlow/RLHF-Reward-Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling).

	The model is trained from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on [RLHFlow/Deepseek-ORM-Data](https://huggingface.co/datasets/RLHFlow/Deepseek-ORM-Data) for 1 epoch. We use a global batch size of 32 and a learning rate of 2e-6, packing the samples and splitting them into chunks of 8192 tokens. See https://github.com/RLHFlow/Online-RLHF/blob/main/math/llama-3.1-prm.yaml for more training details.
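The packing step mentioned above can be illustrated with a minimal sketch. It assumes the samples are already tokenized into lists of token ids and simply concatenates them before cutting fixed-size chunks. The function name `pack_into_chunks` is hypothetical, and how the actual training code handles the leftover tokens or sample separators is not stated here, so treat this as an illustration rather than the project's implementation.

```python
from typing import Iterable, List, Tuple


def pack_into_chunks(
    token_sequences: Iterable[List[int]], chunk_size: int = 8192
) -> Tuple[List[List[int]], List[int]]:
    """Concatenate tokenized samples and cut them into fixed-size chunks.

    Returns the full chunks plus the leftover tokens that did not fill
    a final chunk (what to do with them is left to the caller).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    buffer: List[int] = []
    chunks: List[List[int]] = []
    for sequence in token_sequences:
        buffer.extend(sequence)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    return chunks, buffer


# Toy example with a tiny chunk size so the behaviour is easy to follow.
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_chunks, toy_remainder = pack_into_chunks(toy_samples, chunk_size=4)
print(toy_chunks)     # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(toy_remainder)  # [9]
```

With `chunk_size=8192`, every chunk has the same length, so no padding tokens are needed inside a batch.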


	## BoN evaluation results for the Mistral generator:

	| Model | Method | GSM8K | MATH |
	| ------------- | ------------- | ------------- | -------- |
	| Mistral-7B | Pass@1 | 77.9 | 28.4 |
	| Mistral-7B | Majority Voting@1024 | 84.2 | 36.8 |
	| Mistral-7B | Mistral-ORM@1024 | 90.1 | 43.6 |
	| Mistral-7B | Mistral-PRM@1024 | 92.4 | 46.3 |
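The two selection strategies compared in the table, majority voting and reward-model best-of-N (BoN), can be sketched as follows. This is a simplified illustration, not the evaluation code behind the reported numbers: it assumes each sampled solution has already been reduced to a final-answer string and, for BoN, to a single scalar score from the reward model. The function names and the toy answers and scores are made up.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the N samples."""
    if not answers:
        raise ValueError("answers must not be empty")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Return the answer whose sample received the highest reward score."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and aligned")
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Toy example with N=4 samples for one problem (values are illustrative).
sampled_answers: List[str] = ["12", "15", "12", "18"]
reward_scores: List[float] = [0.31, 0.92, 0.40, 0.10]

print(majority_vote(sampled_answers))             # 12
print(best_of_n(sampled_answers, reward_scores))  # 15
```

The two rules can disagree, as in the toy example: voting follows the most common answer, while BoN follows the single sample the reward model trusts most.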

	## Scaling the inference sampling to N=1024 for the Deepseek generator:

	| Model | Method | GSM8K | MATH |
	| ------------- | ------------- | ------------- | -------- |
	| Deepseek-7B | Pass@1 | 83.9 | 38.4 |
	| Deepseek-7B | Majority Voting@1024 | 89.7 | 57.4 |
	| Deepseek-7B | Deepseek-ORM@1024 | 93.4 | 52.4 |
	| Deepseek-7B | Deepseek-PRM@1024 | 93.0 | 58.1 |
	| Deepseek-7B | Mistral-ORM@1024 (OOD) | 90.3 | 54.9 |
	| Deepseek-7B | Mistral-PRM@1024 (OOD) | 91.9 | 56.9 |

	## Visualization


	![image/png](https://cdn-uploads.huggingface.co/production/uploads/643e59806db6ba8c5ee123f3/i622m76fvKv8drLmwl8Q3.png)

	## Usage

	See https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/math for detailed examples.
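As a rough orientation before reading the linked examples, the sketch below shows one way a generative reward model's output can be turned into a scalar score. It assumes, and this card does not state it, that the model judges a solution by emitting a `+` or `-` label token and that the score is the two-way softmax probability of `+`. The helper `plus_probability` is hypothetical and the logit values are made up; check the repository for the actual prompt format and readout.

```python
import math


def plus_probability(plus_logit: float, minus_logit: float) -> float:
    """Two-way softmax over the logits of the '+' and '-' label tokens.

    Assumption: the reward is the probability mass on '+' after
    restricting the vocabulary to these two candidate tokens.
    """
    # Subtract the max logit for numerical stability.
    shift = max(plus_logit, minus_logit)
    exp_plus = math.exp(plus_logit - shift)
    exp_minus = math.exp(minus_logit - shift)
    return exp_plus / (exp_plus + exp_minus)


# Illustrative logits only; real values come from the model's forward pass.
score = plus_probability(plus_logit=2.0, minus_logit=-1.0)
print(round(score, 4))  # 0.9526
```

A score computed this way lies in [0, 1] and can be used directly to rank the N sampled solutions for best-of-N selection.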

	## Citation

	The automatic annotation was proposed in the Math-shepherd paper:

	```
	@inproceedings{wang2024math,
	title={Math-shepherd: Verify and reinforce llms step-by-step without human annotations},
	author={Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang},
	booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	pages={9426--9439},
	year={2024}
	}

	```

	If you find the training recipe useful, please consider citing it as follows:

	```
	@misc{xiong2024rlhflowmath,
	author={Wei Xiong and Hanning Zhang and Nan Jiang and Tong Zhang},
	title = {An Implementation of Generative PRM},
	year = {2024},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/RLHFlow/RLHF-Reward-Modeling}}
	}
	```