Update README.md
- **Reward Model:** The reward model used during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-70b-uf-rm), and the data used to train it [here](https://huggingface.co/datasets/allenai/tulu-2.5-preference-data) - specifically the `ultrafeedback_mean_aspects` split.
- **Value Model:** The value model trained during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-value).
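
Below is a minimal sketch, not part of the original card, of how one might inspect the preference data and score a completion with the reward model. It assumes the reward model loads as a standard `transformers` sequence-classification checkpoint with a single reward logit; the exact head, dtype handling, and prompt formatting may differ, so treat it as illustrative only.

```python
# Hedged sketch: assumes a sequence-classification reward head with one logit.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preference data used to train the reward model (the card points to the
# `ultrafeedback_mean_aspects` split).
prefs = load_dataset("allenai/tulu-2.5-preference-data", split="ultrafeedback_mean_aspects")
print(prefs[0])

# Reward model used during PPO training (70B parameters, so plan hardware accordingly).
rm_name = "allenai/tulu-v2.5-70b-uf-rm"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Score one prompt/response pair. For real use, format the text with the chat
# template described under "Input Format" below; this plain string is illustrative.
pair = "User: What does PPO stand for?\nAssistant: Proximal Policy Optimization."
inputs = tokenizer(pair, return_tensors="pt").to(reward_model.device)
with torch.no_grad():
    score = reward_model(**inputs).logits.squeeze().item()  # scalar preference score
print(score)
```

The linked value model is the PPO critic checkpoint; it is typically only needed for continuing or analyzing PPO training, not for generation.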
## Results
Tulu V2.5 PPO is trained to be a generalist model, and it matches or outperforms Tulu 2+DPO 13B.
It even beats Tulu 2+DPO 70B in some cases, although it loses out on harder reasoning tasks.
For details on training and evaluation, read [our paper](https://link.todo)!

| Model | Size | Alignment | AlpacaEval 2 Winrate (LC) | GSM8k 8-shot CoT Acc. | Average Perf. across Categories |
|-|-|-|-|-|-|
| **Tulu V2.5 PPO 13B (this model)** | 13B | PPO with 70B RM | 58.0 | **26.7** | 62.8 |
| **Tulu V2 DPO 13B** | 13B | DPO | 50.5 | 16.0 | 61.0 |
| **Tulu V2 SFT 13B** | 13B | - | 46.0 | 10.4 | 62.8 |
| **Tulu V2 DPO 70B** | 70B | DPO | **71.5** | 21.2 | **69.4** |

## Input Format
The model is trained to use the following format (note the newlines):