Haoxiang-Wang commited on
Commit
113ad53
·
verified ·
1 Parent(s): e27e178

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +5 -2
README.md CHANGED
@@ -1,12 +1,15 @@
1
  ---
2
  license: llama3
3
  ---
 
 
 
 
 
4
  This preference model is trained from [LLaMA3-8B-it](meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).
5
 
6
  The dataset is RLHFlow/pair_preference_model_dataset. It achieves Chat-98.6, Char-hard 65.8, Safety 89.6, and reasoning 94.9 in reward bench.
7
 
8
- See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details of this model.
9
-
10
  ## Service the RM
11
 
12
  Here is an example to use the Preference Model to rank a pair. For n>2 responses, it is recommened to use the tournament style ranking strategy to get the best response so that the complexity is linear in n.
 
1
  ---
2
  license: llama3
3
  ---
4
+
5
+ * **Paper**: [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/pdf/2405.07863) (Published in TMLR, 2024)
6
+ * **Authors**: Hanze Dong*, Wei Xiong*, Bo Pang*, Haoxiang Wang*, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
7
+ * **Code**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
8
+
9
  This preference model is trained from [LLaMA3-8B-it](meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).
10
 
11
  The dataset is RLHFlow/pair_preference_model_dataset. It achieves Chat-98.6, Char-hard 65.8, Safety 89.6, and reasoning 94.9 in reward bench.
12
 
 
 
13
  ## Service the RM
14
 
15
  Here is an example to use the Preference Model to rank a pair. For n>2 responses, it is recommened to use the tournament style ranking strategy to get the best response so that the complexity is linear in n.