Update README.md
- **Reward Model:** The reward model used during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-70b-uf-rm), and the data used to train it [here](https://huggingface.co/datasets/allenai/tulu-2.5-preference-data) - specifically the `ultrafeedback_mean_aspects` split.
- **Value Model:** The value model trained during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-value).
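
Below is a minimal sketch, not part of the original card, of how one might inspect the preference data and score a completion with the reward model. It assumes the reward model loads as a standard `transformers` sequence-classification checkpoint with a single reward logit; the exact head, dtype handling, and prompt formatting may differ, so treat it as illustrative only.

```python
# Hedged sketch: assumes a sequence-classification reward head with one logit.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preference data used to train the reward model (the card points to the
# `ultrafeedback_mean_aspects` split).
prefs = load_dataset("allenai/tulu-2.5-preference-data", split="ultrafeedback_mean_aspects")
print(prefs[0])

# Reward model used during PPO training (70B parameters, so plan hardware accordingly).
rm_name = "allenai/tulu-v2.5-70b-uf-rm"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Score one prompt/response pair. For real use, format the text with the chat
# template described under "Input Format" below; this plain string is illustrative.
pair = "User: What does PPO stand for?\nAssistant: Proximal Policy Optimization."
inputs = tokenizer(pair, return_tensors="pt").to(reward_model.device)
with torch.no_grad():
    score = reward_model(**inputs).logits.squeeze().item()  # scalar preference score
print(score)
```

The linked value model is the PPO critic checkpoint; it is typically only needed for continuing or analyzing PPO training, not for generation.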
## Results
Tulu V2.5 PPO is trained to be a generalist model, and it matches or outperforms Tulu 2+DPO 13B.
It even beats Tulu 2+DPO 70B in some cases, although it loses out on harder reasoning tasks.
For details on training and evaluation, read [our paper](https://link.todo)!

| Model | Size | Alignment | AlpacaEval 2 Winrate (LC) | GSM8k 8-shot CoT Acc. | Average Perf. across Categories |
|-|-|-|-|-|-|
| **Tulu V2.5 PPO 13B (this model)** | 13B | PPO with 70B RM | 58.0 | **26.7** | 62.8 |
| **Tulu V2 DPO 13B** | 13B | DPO | 50.5 | 16.0 | 61.0 |
| **Tulu V2 SFT 13B** | 13B | - | 46.0 | 10.4 | 62.8 |
| **Tulu V2 DPO 70B** | 70B | DPO | **71.5** | 21.2 | **69.4** |

## Input Format
The model is trained to use the following format (note the newlines):