Update README.md
README.md (CHANGED)
@@ -55,7 +55,7 @@ As of 27 Sept 2024, Llama-3.1-Nemotron-70B-Reward performs best Overall on Rewar
 To better understand why Llama-3.1-Nemotron-70B-Reward does less well in the Chat-Hard category, we analyze the scores for each constituent subset under the Chat-Hard category. We find that on categories that use human annotations as ground truth, Llama-3.1-Nemotron-70B-Reward performs similarly to Skywork-Reward-Gemma-2-27B (<= 2.2% difference).

-On the other hand, when GPT-4 annotations are used as Ground-Truth, Llama-3.1-Nemotron-70B-Reward trails substantially behind Skywork-Reward-Gemma-2-27B (10.8 to 19.2%). This suggests that Skywork-Reward-Gemma-2-27B can better model GPT-4
+On the other hand, when GPT-4 annotations are used as Ground-Truth, Llama-3.1-Nemotron-70B-Reward trails substantially behind Skywork-Reward-Gemma-2-27B (by 10.8 to 19.2%). This suggests that Skywork-Reward-Gemma-2-27B can better model GPT-4 preferences (but not human-annotated preferences), likely due to the inclusion of GPT-4-annotated training data found in the [OffSetBias dataset](https://huggingface.co/datasets/NCSOFT/offsetbias), which is part of the [Skywork-Reward-Preference-80k](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.1) used to train it.
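The per-subset gaps cited above (<= 2.2% on human-annotated subsets, 10.8 to 19.2% on GPT-4-annotated ones) come from comparing the two models' accuracies on each Chat-Hard subset. Below is a minimal sketch of how such a comparison could be computed; the subset names follow RewardBench's usual Chat-Hard grouping, and the accuracy values are illustrative placeholders, not the actual leaderboard numbers.

```python
# Sketch: per-subset comparison of two reward models on RewardBench Chat-Hard.
# Accuracy values below are placeholders for illustration only.

CHAT_HARD_SUBSETS = [
    "mt-bench-hard",
    "llmbar-natural",
    "llmbar-adver-neighbor",
    "llmbar-adver-GPTInst",
    "llmbar-adver-GPTOut",
    "llmbar-adver-manual",
]

def compare_per_subset(model_a: dict[str, float], model_b: dict[str, float]) -> None:
    """Print the accuracy gap (model_a - model_b) for every subset both models report."""
    for subset in sorted(model_a.keys() & model_b.keys()):
        gap = model_a[subset] - model_b[subset]
        print(f"{subset:<24} {model_a[subset]:5.1f} vs {model_b[subset]:5.1f}  gap = {gap:+.1f}%")

# Placeholder per-subset accuracies (% of preference pairs judged correctly).
nemotron_scores = {"mt-bench-hard": 80.0, "llmbar-natural": 75.0, "llmbar-adver-neighbor": 60.0}
skywork_scores  = {"mt-bench-hard": 81.0, "llmbar-natural": 90.0, "llmbar-adver-neighbor": 75.0}

compare_per_subset(nemotron_scores, skywork_scores)
```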