Commit 4213126 (1 parent: 838ba71)
zhilinw committed: Update README.md

Files changed (1):
  1. README.md (+3 −3)
README.md CHANGED
@@ -45,14 +45,14 @@ As of 27 Sept 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on Rewar
  | Meta-Llama-3.1-70B-Instruct | Not fully disclosed | 84.0 | 97.2 | 70.2 | 82.8 | 86.0 |
 
 
- To better understand why it struggles in the Chat-Hard category, we analyzed the scores for each consistutent subset of Chat-Hard category. We find that on categories that uses human annotations as ground truth, Llama-3.1-Nemotron-70B-Reward performs similar to Skywork-Reward-Gemma-2-27B (<= 2.2% difference.)
- On the other hand, when GPT-4 annotations are used as Ground-Truth, we trail substantially behind Skywork-Reward-Gemma-2-27B by 10.8 to 19.2%. This suggests that Skywork-Reward-Gemma-2-27B might better suited at modelling GPT-4 preference, likely contributed by the inclusion of GPT-4 annotated training data used to train it found in the [OffSetBias dataset](https://huggingface.co/datasets/NCSOFT/offsetbias) as part of the [Skywork-Reward-Preference-80k](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1).
+ To better understand why Llama-3.1-Nemotron-70B-Reward does less well in the Chat-Hard category, we analyze the scores for each constituent subset of the Chat-Hard category. We find that on subsets that use human annotations as ground truth, Llama-3.1-Nemotron-70B-Reward performs similarly to Skywork-Reward-Gemma-2-27B (<= 2.2% difference).
+ On the other hand, when GPT-4 annotations are used as ground truth, Llama-3.1-Nemotron-70B-Reward trails substantially behind Skywork-Reward-Gemma-2-27B (by 10.8 to 19.2%). This suggests that Skywork-Reward-Gemma-2-27B is better suited to modelling GPT-4 preferences (than human-annotated preferences), likely due to the GPT-4-annotated training data from the [OffSetBias dataset](https://huggingface.co/datasets/NCSOFT/offsetbias), which is part of the [Skywork-Reward-Preference-80k](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1) data used to train it.
 
 
 
  | Model | Type of Data Used For Training | Chat-Hard | LLMBar-Adversarial-Manual | LLMBar-Adversarial-Neighbour | LLMBar-Natural | LLMBar-Adversarial-GPTInst | LLMBar-Adversarial-GPTOut | MT-Bench-Hard |
  |:-----------------------------|:----------------|:-----|:----------|:-------|:----------|:-----------------------|:-----------------------|:-----------------------|
- |||| Human as Ground Truth | Human as Ground Truth | Human as Ground Truth | GPT-4 as Ground Truth |GPT-4 as Ground Truth |GPT-4 as Ground Truth |
+ | | | | Human as Ground Truth | Human as Ground Truth | Human as Ground Truth | _GPT-4 as Ground Truth_ | _GPT-4 as Ground Truth_ | _GPT-4 as Ground Truth_ |
  | Llama-3.1-Nemotron-70B-Reward | Permissive Licensed Data Only (CC-BY-4.0) | 85.8 | 76.1 | 88.8 | 95.0 | 87.0 | 72.3 | 75.7 |
  | Skywork-Reward-Gemma-2-27B | Includes GPT4 Generated Data | 91.4 | 78.3 | 89.6 | 96.0 | 97.8 | 91.5 | 86.5 |
 
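
For reference, the "<= 2.2%" and "10.8 to 19.2%" gaps quoted in the updated prose can be reproduced directly from the per-subset scores in the table above. A minimal sketch (plain Python, no external data; the numbers are copied verbatim from the table):

```python
# Gap = Skywork-Reward-Gemma-2-27B score minus Llama-3.1-Nemotron-70B-Reward score,
# per Chat-Hard subset, using the scores listed in the table above.
nemotron = {
    "LLMBar-Adversarial-Manual": 76.1,     # Human as Ground Truth
    "LLMBar-Adversarial-Neighbour": 88.8,  # Human as Ground Truth
    "LLMBar-Natural": 95.0,                # Human as Ground Truth
    "LLMBar-Adversarial-GPTInst": 87.0,    # GPT-4 as Ground Truth
    "LLMBar-Adversarial-GPTOut": 72.3,     # GPT-4 as Ground Truth
    "MT-Bench-Hard": 75.7,                 # GPT-4 as Ground Truth
}
skywork = {
    "LLMBar-Adversarial-Manual": 78.3,
    "LLMBar-Adversarial-Neighbour": 89.6,
    "LLMBar-Natural": 96.0,
    "LLMBar-Adversarial-GPTInst": 97.8,
    "LLMBar-Adversarial-GPTOut": 91.5,
    "MT-Bench-Hard": 86.5,
}

for subset, score in nemotron.items():
    print(f"{subset}: {round(skywork[subset] - score, 1)}")

# Human-annotated subsets: 2.2, 0.8, 1.0   -> all <= 2.2
# GPT-4-annotated subsets: 10.8, 19.2, 10.8 -> range 10.8 to 19.2
```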