Update README.md
README.md CHANGED
@@ -26,6 +26,8 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions
## RewardBench Primary Dataset LeaderBoard

Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench, as well as in the Chat, Safety, and Reasoning categories.

| Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
| _**Llama-3.1-Nemotron-70B-Reward**_ |Permissive Licensed Data Only (CC-BY-4.0) | **94.1** | **97.5** | 85.8 | **95.1** | **98.1** |

@@ -43,14 +45,16 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions

| Meta-Llama-3.1-70B-Instruct | Not fully disclosed | 84.0 | 97.2 | 70.2 | 82.8 | 86.0 |

To better understand why it struggles in the Chat-Hard category, we analyzed the scores on each constituent subset of Chat-Hard. We find that on subsets that use human annotations as ground truth, Llama-3.1-Nemotron-70B-Reward performs similarly to Skywork-Reward-Gemma-2-27B (<= 2.2% difference).

On the other hand, when GPT-4 annotations are used as ground truth, we trail Skywork-Reward-Gemma-2-27B substantially, by 10.8 to 19.2%. This suggests that Skywork-Reward-Gemma-2-27B may be better suited to modelling GPT-4 preferences, likely because its training set, [Skywork-Reward-Preference-80k](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1), includes GPT-4-annotated data from the [OffSetBias dataset](https://huggingface.co/datasets/NCSOFT/offsetbias).

| Model | Type of Data Used For Training | Chat-Hard | LLMBar-Adversarial-Manual | LLMBar-Adversarial-Neighbour | LLMBar-Natural | LLMBar-Adversarial-GPTInst | LLMBar-Adversarial-GPTOut | MT-Bench-Hard|
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|:-----------------------|:-----------------------|
|||| Human as Ground Truth | Human as Ground Truth | Human as Ground Truth | GPT-4 as Ground Truth |GPT-4 as Ground Truth |GPT-4 as Ground Truth |
| Llama-3.1-Nemotron-70B-Reward |Permissive Licensed Data Only (CC-BY-4.0) | 85.8 | 76.1 | 88.8 | 95.0 | 87.0 | 72.3 | 75.7
| Skywork-Reward-Gemma-2-27B | Includes GPT4 Generated Data | 91.4 | 78.3 | 89.6 | 96.0 | 97.8 | 91.5 | 86.5|
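
For reference, the gap figures quoted above follow directly from this table. The short sketch below is illustrative only: the scores are copied verbatim from the rows above, and the grouping by ground-truth source mirrors the header row.

```python
# Minimal sketch (not part of the model card's tooling): recompute the per-subset
# gaps between Skywork-Reward-Gemma-2-27B and Llama-3.1-Nemotron-70B-Reward.
nemotron = {
    "LLMBar-Adversarial-Manual": 76.1,
    "LLMBar-Adversarial-Neighbour": 88.8,
    "LLMBar-Natural": 95.0,
    "LLMBar-Adversarial-GPTInst": 87.0,
    "LLMBar-Adversarial-GPTOut": 72.3,
    "MT-Bench-Hard": 75.7,
}
skywork = {
    "LLMBar-Adversarial-Manual": 78.3,
    "LLMBar-Adversarial-Neighbour": 89.6,
    "LLMBar-Natural": 96.0,
    "LLMBar-Adversarial-GPTInst": 97.8,
    "LLMBar-Adversarial-GPTOut": 91.5,
    "MT-Bench-Hard": 86.5,
}
# Subsets scored against human annotations; the remaining three use GPT-4 annotations.
human_ground_truth = {"LLMBar-Adversarial-Manual", "LLMBar-Adversarial-Neighbour", "LLMBar-Natural"}

for subset, score in nemotron.items():
    gap = skywork[subset] - score
    label = "Human" if subset in human_ground_truth else "GPT-4"
    print(f"{subset:30s} ({label} as ground truth): gap = {gap:.1f}")
# Human-annotated subsets  -> gaps of 2.2, 0.8, 1.0 (all <= 2.2)
# GPT-4-annotated subsets  -> gaps of 10.8, 19.2, 10.8 (i.e. 10.8 to 19.2)
```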
Last updated: 27 Sept 2024