Update README.md
README.md CHANGED

```diff
@@ -22,8 +22,8 @@ This is an 8B reward model used for PPO training, trained on the UltraFeedback dataset.
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
-
-
+Note this model is finetuned from Llama 3.1, released under the Meta Llama 3.1 community license, included here under `llama_3_license.txt`.
+
 
 ## Performance
 
```