hamishivi committed on
Commit 1548183
1 Parent(s): ac51e98

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -22,7 +22,7 @@ This model is trained on the UltraFeedback dataset (using the per-aspect/fine-gr
  We used a 70B RM trained on the UltraFeedback dataset, and then used the UltraFeedback prompts during PPO training.
  
  For more details, read the paper:
- [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
+ [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
  
  
  ## .Model description
@@ -44,7 +44,7 @@ For more details, read the paper:
  
  Tulu V2.5 PPO is trained to be a generalist model, and matches or outperforms Tulu 2+DPO 13B.
  It even beats Tulu 2+DPO 70B in some cases, although it loses out in harder reasoning tasks.
- For details on training and evaluation, read [our paper](https://link.todo)!
+ For details on training and evaluation, read [our paper](https://arxiv.org/abs/2406.09279)!
  
  
  | Model | Size | Alignment | GSM8k 8-shot CoT Acc. | AlpacaEval 2 Winrate (LC) | Average Perf. across Open-Instruct evals |
@@ -125,6 +125,7 @@ If you find Tulu 2.5 is useful in your work, please cite it with:
  title={{Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback}},
  author={{Hamish Ivison and Yizhong Wang and Jiacheng Liu and Ellen Wu and Valentina Pyatkin and Nathan Lambert and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi}}
  year={2024},
+ eprint={2406.09279},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
  }
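
The README excerpt in the first hunk above describes the recipe at a high level: a 70B reward model trained on UltraFeedback provides scalar scores, and the policy is then optimized with PPO over the UltraFeedback prompts. The following is a minimal, illustrative sketch of that generate-then-score loop, not the authors' actual training code; the model paths, the 1-label sequence-classifier reward head, and the shared tokenizer are all assumptions made for the example.

```python
# Minimal sketch of the recipe described in the README excerpt above: sample
# responses to UltraFeedback-style prompts from the policy, score them with a
# separate reward model, and use those scalar scores to drive a PPO update.
# This is NOT the authors' training code; model paths below are placeholders.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

POLICY_NAME = "path/to/sft-policy"        # placeholder: the instruction-tuned policy
REWARD_MODEL_NAME = "path/to/70b-uf-rm"   # placeholder: the 70B UltraFeedback RM

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched tokenization

policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME, torch_dtype=torch.bfloat16)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    REWARD_MODEL_NAME, num_labels=1, torch_dtype=torch.bfloat16
)  # assumption: the RM can be loaded as a 1-label sequence classifier

prompts = ["Explain the difference between DPO and PPO in one paragraph."]

# 1. Sample responses from the current policy.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    generations = policy.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)

# 2. Score each prompt + response with the reward model (one scalar per sequence).
texts = tokenizer.batch_decode(generations, skip_special_tokens=True)
rm_inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    rewards = reward_model(**rm_inputs).logits.squeeze(-1)

# 3. In full training, these rewards (plus a KL penalty against the SFT model)
#    would feed a clipped PPO policy-gradient update; that step is omitted here.
print(rewards)
```

In the actual setup, the reward scores (rather than pairwise preferences, as in DPO) are what the PPO objective maximizes, which is why the quality of the reward model matters so much in this recipe.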