Text Generation
Transformers
PyTorch
English
llama
conversational
text-generation-inference
Inference Endpoints
hamishivi committed on
Commit c9c93e5
1 Parent(s): 4b62b90

Update README.md

Files changed (1)
  1. README.md +14 -0
README.md CHANGED
@@ -40,6 +40,20 @@ For more details, read the paper:
40     - **Reward Model:** The reward model used during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-70b-uf-rm), and the data used to train it [here](https://huggingface.co/datasets/allenai/tulu-2.5-preference-data) - specifically the `ultrafeedback_mean_aspects` split.
41     - **Value Model:** The value model trained during PPO training can be found [here](https://huggingface.co/allenai/tulu-v2.5-ppo-13b-uf-mean-70b-uf-rm-value).
42
43  +  ## Results
44  +
45  +  Tulu V2.5 PPO is trained to be a generalist model, and matches or outperforms Tulu 2+DPO 13B.
46  +  It even beats Tulu 2+DPO 70B in some cases, although it loses out on harder reasoning tasks.
47  +  For details on training and evaluation, read [our paper](https://link.todo)!
48  +
49  +
50  +  | Model | Size | Alignment | AlpacaEval 2 Winrate (LC) | GSM8k 8-shot CoT Acc. | Average Perf. across Categories |
51  +  |-|-|-|-|-|-|
52  +  | **Tulu V2.5 PPO 13B (this model)** | 13B | PPO with 70B RM | 58.0 | **26.7** | 62.8 |
53  +  | **Tulu V2 DPO 13B** | 13B | DPO | 50.5 | 16.0 | 61.0 |
54  +  | **Tulu V2 SFT 13B** | 13B | - | 46.0 | 10.4 | 62.8 |
55  +  | **Tulu V2 DPO 70B** | 70B | DPO | **71.5** | 21.2 | **69.4** |
56  +
57     ## Input Format
58
59     The model is trained to use the following format (note the newlines):