This is a vanilla Bradley-Terry (BT) reward model based on Gemma-2-9B. The training recipe comes from RLHF Workflow.

RewardBench results:

- Chat: 98.04
- Chat Hard: 65.35
- Safety: 89.54
- Reasoning: 92.31

For details, please refer to:

```bibtex
@misc{dong2024rlhf,
  title={RLHF Workflow: From Reward Modeling to Online RLHF},
  author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
  year={2024},
  eprint={2405.07863},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
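For readers unfamiliar with the Bradley-Terry objective named above, the following is a minimal sketch of the standard BT preference probability and pairwise loss on scalar rewards. It is not the training code from RLHF Workflow; the function names and the example scores are hypothetical and chosen only for illustration.

```python
import math


def bt_preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    Equals sigmoid(reward_chosen - reward_rejected), computed in a
    numerically stable way for large positive or negative margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under BT.

    loss = -log(sigmoid(margin)) = softplus(-margin), using the stable
    identity softplus(x) = max(x, 0) + log1p(exp(-|x|)).
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical scalar scores for a (chosen, rejected) response pair.
    r_chosen, r_rejected = 1.5, 0.5
    print("P(chosen preferred):", bt_preference_probability(r_chosen, r_rejected))
    print("BT loss:", bt_loss(r_chosen, r_rejected))
```

The loss shrinks as the reward of the chosen response rises above that of the rejected one, and it equals log 2 when the two rewards are tied.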