NeMo
English
nvidia
llama3.1
reward model
zhilinw commited on
Commit
77dd3fe
·
verified ·
1 Parent(s): 73e4ee7

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +36 -3
README.md CHANGED
@@ -26,8 +26,29 @@ For the same prompt, a response with higher reward score has higher quality than
26
 
27
  A HuggingFace Transformers compatible version converted from this model is available at [https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF)
28
 
29
- Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface!
30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
 
32
 
33
  ## Terms of use
@@ -37,7 +58,7 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions
37
 
38
  ## RewardBench Primary Dataset LeaderBoard
39
 
40
- As of 30 Sept 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench as well as with strong performance in Chat, Safety and Reasoning categories among the models below.
41
 
42
  | Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
43
  |:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
@@ -128,6 +149,16 @@ E-Mail: [Zhilin Wang](mailto:zhilinw@nvidia.com)
128
  If you find this model useful, please cite the following works
129
 
130
  ```bibtex
 
 
 
 
 
 
 
 
 
 
131
  @misc{wang2024helpsteer2,
132
  title={HelpSteer2: Open-source dataset for training top-performing reward models},
133
  author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
@@ -140,6 +171,7 @@ If you find this model useful, please cite the following works
140
 
141
  ## References(s):
142
 
 
143
  * [HelpSteer2](https://arxiv.org/abs/2406.08673)
144
  * [HelpSteer](https://arxiv.org/abs/2311.09528)
145
  * [SteerLM method](https://arxiv.org/abs/2310.05344)
@@ -199,4 +231,5 @@ v1.0
199
  ## Ethical Considerations:
200
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
201
 
202
- Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
 
26
 
27
  A HuggingFace Transformers compatible version converted from this model is available at [https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF)
28
 
29
+ Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface and simply signing up gets you 100k free API calls to this model.
30
 
31
+ Using this reward model for RLHF (specifically, REINFORCE), we were able to tune a Llama-3.1-70B-Instruct model to reach [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6, [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)
32
+
33
+ As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
34
+
35
+ See details on our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
36
+
37
+ ```
38
+ A sweet question!
39
+ Let’s count the “R”s in “strawberry”:
40
+ 1. S
41
+ 2. T
42
+ 3. R
43
+ 4. A
44
+ 5. W
45
+ 6. B
46
+ 7. E
47
+ 8. R
48
+ 9. R
49
+ 10. Y
50
+ There are **3 “R”s** in the word “strawberry”.
51
+ ```
52
 
53
 
54
  ## Terms of use
 
58
 
59
  ## RewardBench Primary Dataset LeaderBoard
60
 
61
+ As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench as well as with strong performance in Chat, Safety and Reasoning categories among the models below.
62
 
63
  | Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
64
  |:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
 
149
  If you find this model useful, please cite the following works
150
 
151
  ```bibtex
152
+ @misc{wang2024helpsteer2preferencecomplementingratingspreferences,
153
+ title={HelpSteer2-Preference: Complementing Ratings with Preferences},
154
+ author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
155
+ year={2024},
156
+ eprint={2410.01257},
157
+ archivePrefix={arXiv},
158
+ primaryClass={cs.LG},
159
+ url={https://arxiv.org/abs/2410.01257},
160
+ }
161
+
162
  @misc{wang2024helpsteer2,
163
  title={HelpSteer2: Open-source dataset for training top-performing reward models},
164
  author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
 
171
 
172
  ## References(s):
173
 
174
+ * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
175
  * [HelpSteer2](https://arxiv.org/abs/2406.08673)
176
  * [HelpSteer](https://arxiv.org/abs/2311.09528)
177
  * [SteerLM method](https://arxiv.org/abs/2310.05344)
 
231
  ## Ethical Considerations:
232
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
233
 
234
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
235
+