Update README.md
Browse files
README.md
CHANGED
@@ -24,7 +24,29 @@ For the same prompt, a response with higher reward score has higher quality than
|
|
24 |
|
25 |
Llama-3.1-Nemotron-70B-Reward-HF has been converted from [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might be slightly different from the [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) as evaluated in NeMo-Aligner, which the evaluation results below are based on.
|
26 |
|
27 |
-
Try
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
28 |
|
29 |
|
30 |
## Terms of use
|
@@ -34,7 +56,7 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions
|
|
34 |
|
35 |
## RewardBench Primary Dataset LeaderBoard
|
36 |
|
37 |
-
As of
|
38 |
|
39 |
| Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
|
40 |
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
|
@@ -107,6 +129,16 @@ E-Mail: [Zhilin Wang](mailto:zhilinw@nvidia.com)
|
|
107 |
If you find this model useful, please cite the following works
|
108 |
|
109 |
```bibtex
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
110 |
@misc{wang2024helpsteer2,
|
111 |
title={HelpSteer2: Open-source dataset for training top-performing reward models},
|
112 |
author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
|
@@ -119,6 +151,7 @@ If you find this model useful, please cite the following works
|
|
119 |
|
120 |
## References(s):
|
121 |
|
|
|
122 |
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
|
123 |
* [HelpSteer](https://arxiv.org/abs/2311.09528)
|
124 |
* [SteerLM method](https://arxiv.org/abs/2310.05344)
|
|
|
24 |
|
25 |
Llama-3.1-Nemotron-70B-Reward-HF has been converted from [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) to support it in the HuggingFace Transformers codebase. Please note that evaluation results might be slightly different from the [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward) as evaluated in NeMo-Aligner, which the evaluation results below are based on.
|
26 |
|
27 |
+
Try hosted inference for free at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward) - it comes with an OpenAI-compatible API interface and simply signing up gets you 100k free API calls to this model.
|
28 |
+
|
29 |
+
Using this reward model for RLHF (specifically, REINFORCE), we were able to tune a Llama-3.1-70B-Instruct model to reach [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) of 57.6, [Arena Hard](https://github.com/lmarena/arena-hard-auto) of 85.0 and [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)
|
30 |
+
|
31 |
+
As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks, edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
|
32 |
+
|
33 |
+
See details on our paper at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257) - as a preview, this model can correctly the question ```How many r in strawberry?``` without specialized prompting or additional reasoning tokens:
|
34 |
+
|
35 |
+
```
|
36 |
+
A sweet question!
|
37 |
+
Let’s count the “R”s in “strawberry”:
|
38 |
+
1. S
|
39 |
+
2. T
|
40 |
+
3. R
|
41 |
+
4. A
|
42 |
+
5. W
|
43 |
+
6. B
|
44 |
+
7. E
|
45 |
+
8. R
|
46 |
+
9. R
|
47 |
+
10. Y
|
48 |
+
There are **3 “R”s** in the word “strawberry”.
|
49 |
+
```
|
50 |
|
51 |
|
52 |
## Terms of use
|
|
|
56 |
|
57 |
## RewardBench Primary Dataset LeaderBoard
|
58 |
|
59 |
+
As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench as well as with strong performance in Chat, Safety and Reasoning categories among the models below.
|
60 |
|
61 |
| Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
|
62 |
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
|
|
|
129 |
If you find this model useful, please cite the following works
|
130 |
|
131 |
```bibtex
|
132 |
+
@misc{wang2024helpsteer2preferencecomplementingratingspreferences,
|
133 |
+
title={HelpSteer2-Preference: Complementing Ratings with Preferences},
|
134 |
+
author={Zhilin Wang and Alexander Bukharin and Olivier Delalleau and Daniel Egert and Gerald Shen and Jiaqi Zeng and Oleksii Kuchaiev and Yi Dong},
|
135 |
+
year={2024},
|
136 |
+
eprint={2410.01257},
|
137 |
+
archivePrefix={arXiv},
|
138 |
+
primaryClass={cs.LG},
|
139 |
+
url={https://arxiv.org/abs/2410.01257},
|
140 |
+
}
|
141 |
+
|
142 |
@misc{wang2024helpsteer2,
|
143 |
title={HelpSteer2: Open-source dataset for training top-performing reward models},
|
144 |
author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev},
|
|
|
151 |
|
152 |
## References(s):
|
153 |
|
154 |
+
* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
|
155 |
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
|
156 |
* [HelpSteer](https://arxiv.org/abs/2311.09528)
|
157 |
* [SteerLM method](https://arxiv.org/abs/2310.05344)
|