Text Classification · Transformers · Safetensors · English · llama · text-generation-inference · Inference Endpoints
hamishivi committed
Commit 8921e92 · verified · 1 Parent(s): d53ea25

Update README.md

Files changed (1):
  1. README.md +10 -0
README.md CHANGED

@@ -22,6 +22,16 @@ This is a 70B reward model used for PPO training trained on the UltraFeedback da
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
 
+## Performance
+
+We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):
+
+| Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
+|------------------|-------|-------|-----------|--------|-----------|-------------------------|
+| [Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
+| **[Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) (this model)** | | | | | | |
+
+
 
 ## Model description
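The diff above adds RewardBench scores for this reward model, which is tagged as a text-classification (sequence-classification) model. As a rough illustration of how such a model might be used to score a single prompt/response pair, here is a minimal sketch; the Tulu-style chat format and the single-logit classification head are assumptions not stated on this page, so check the model card and tokenizer config before relying on them:

```python
# Hypothetical sketch of scoring with this reward model; the chat template
# and classification-head shape are assumptions, not from the model card.

MODEL_NAME = "allenai/llama-3-tulu-2-70b-uf-mean-rm"  # this model


def format_tulu(prompt: str, response: str) -> str:
    """Wrap a prompt/response pair in the Tulu-style chat format (assumed)."""
    return f"<|user|>\n{prompt}\n<|assistant|>\n{response}"


def score(prompt: str, response: str, model_name: str = MODEL_NAME) -> float:
    """Return a scalar reward for a response (a 70B model needs substantial GPU memory)."""
    # Imports are local so the formatting helper stays dependency-free.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(format_tulu(prompt, response), return_tensors="pt")
    with torch.no_grad():
        # Reward models of this kind typically expose one logit per sequence.
        return model(**inputs).logits[0, 0].item()
```

In PPO training, a score like this would be computed for each sampled response and used as the reward signal; comparing scores for two candidate responses gives a preference judgment.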