Update README.md
README.md CHANGED
@@ -26,6 +26,8 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions
## RewardBench Primary Dataset LeaderBoard

Llama-3.1-Nemotron-70B-Reward performs best Overall on RewardBench, as well as in the Chat, Safety, and Reasoning categories.

| Model | Type of Data Used For Training | Overall | Chat | Chat-Hard | Safety | Reasoning |
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|
| _**Llama-3.1-Nemotron-70B-Reward**_ |Permissive Licensed Data Only (CC-BY-4.0) | **94.1** | **97.5** | 85.8 | **95.1** | **98.1** |

@@ -43,14 +45,16 @@ By accessing this model, you are agreeing to the LLama 3.1 terms and conditions

| Meta-Llama-3.1-70B-Instruct | Not fully disclosed | 84.0 | 97.2 | 70.2 | 82.8 | 86.0 |

To better understand why it struggles in the Chat-Hard category, we analyzed the scores on each constituent subset of Chat-Hard. We find that on subsets that use human annotations as ground truth, Llama-3.1-Nemotron-70B-Reward performs similarly to Skywork-Reward-Gemma-2-27B (<= 2.2% difference).

On the other hand, when GPT-4 annotations are used as ground truth, we trail Skywork-Reward-Gemma-2-27B substantially, by 10.8 to 19.2%. This suggests that Skywork-Reward-Gemma-2-27B may be better suited to modelling GPT-4 preferences, likely because its training set, [Skywork-Reward-Preference-80k](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1), includes GPT-4-annotated data from the [OffSetBias dataset](https://huggingface.co/datasets/NCSOFT/offsetbias).

| Model | Type of Data Used For Training | Chat-Hard | LLMBar-Adversarial-Manual | LLMBar-Adversarial-Neighbour | LLMBar-Natural | LLMBar-Adversarial-GPTInst | LLMBar-Adversarial-GPTOut | MT-Bench-Hard|
|:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|:-----------------------|:-----------------------|
|||| Human as Ground Truth | Human as Ground Truth | Human as Ground Truth | GPT-4 as Ground Truth |GPT-4 as Ground Truth |GPT-4 as Ground Truth |
| Llama-3.1-Nemotron-70B-Reward |Permissive Licensed Data Only (CC-BY-4.0) | 85.8 | 76.1 | 88.8 | 95.0 | 87.0 | 72.3 | 75.7
| Skywork-Reward-Gemma-2-27B | Includes GPT4 Generated Data | 91.4 | 78.3 | 89.6 | 96.0 | 97.8 | 91.5 | 86.5|
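
For reference, the gap figures quoted above follow directly from this table. The short sketch below is illustrative only: the scores are copied verbatim from the rows above, and the grouping by ground-truth source mirrors the header row.

```python
# Minimal sketch (not part of the model card's tooling): recompute the per-subset
# gaps between Skywork-Reward-Gemma-2-27B and Llama-3.1-Nemotron-70B-Reward.
nemotron = {
    "LLMBar-Adversarial-Manual": 76.1,
    "LLMBar-Adversarial-Neighbour": 88.8,
    "LLMBar-Natural": 95.0,
    "LLMBar-Adversarial-GPTInst": 87.0,
    "LLMBar-Adversarial-GPTOut": 72.3,
    "MT-Bench-Hard": 75.7,
}
skywork = {
    "LLMBar-Adversarial-Manual": 78.3,
    "LLMBar-Adversarial-Neighbour": 89.6,
    "LLMBar-Natural": 96.0,
    "LLMBar-Adversarial-GPTInst": 97.8,
    "LLMBar-Adversarial-GPTOut": 91.5,
    "MT-Bench-Hard": 86.5,
}
# Subsets scored against human annotations; the remaining three use GPT-4 annotations.
human_ground_truth = {"LLMBar-Adversarial-Manual", "LLMBar-Adversarial-Neighbour", "LLMBar-Natural"}

for subset, score in nemotron.items():
    gap = skywork[subset] - score
    label = "Human" if subset in human_ground_truth else "GPT-4"
    print(f"{subset:30s} ({label} as ground truth): gap = {gap:.1f}")
# Human-annotated subsets  -> gaps of 2.2, 0.8, 1.0 (all <= 2.2)
# GPT-4-annotated subsets  -> gaps of 10.8, 19.2, 10.8 (i.e. 10.8 to 19.2)
```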
Last updated: 27 Sept 2024