This model was trained with Iterative DPO in OpenRLHF.

Datasets and Hyperparameters

- Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
- SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1

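The three resources above are typically combined in each DPO iteration as follows: the current policy (initially the SFT model) samples a few responses per prompt, the reward model scores them, and the best and worst responses become the chosen/rejected pair. The sketch below illustrates that idea only; it is not OpenRLHF's code, and `generate` and `score` are hypothetical callables standing in for policy sampling and reward-model scoring.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: sample n responses from the policy
    score: Callable[[str, str], float],         # hypothetical: reward-model score for (prompt, response)
    n: int = 2,
) -> List[Dict[str, str]]:
    """Sample n responses per prompt and keep the best and worst as a DPO pair."""
    pairs = []
    for prompt in prompts:
        responses = generate(prompt, n)
        scored = [(score(prompt, r), r) for r in responses]
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        # Skip prompts where all samples tie: they carry no preference signal.
        if best[0] > worst[0]:
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [prompt + "!" * i for i in range(n)]


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


pairs = build_preference_pairs(["hello", "world"], toy_generate, toy_score, n=2)
```

With `n=2`, each prompt yields at most one pair, so the number of training pairs per iteration is bounded by the number of prompts sampled in that iteration.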
```
Max Prompt Length: 2048
Max Response Length: 2048
best_of_n: 2 (2 samples for each prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup (0.03) and MinLR (0.1 * init_lr)
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```
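To make the scheduler line concrete, here is a minimal sketch of a cosine schedule with warmup and a minimum learning rate. It assumes that `0.03` is the warmup fraction of total steps and that warmup is linear; neither detail is stated above. `lr_at_step` and `total_steps` are illustrative names, not OpenRLHF APIs.

```python
import math


def lr_at_step(step, total_steps, init_lr=5e-7, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Assumed linear ramp from init_lr / warmup_steps up to init_lr.
        return init_lr * (step + 1) / warmup_steps
    # Cosine decay from init_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = init_lr * min_lr_ratio
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (init_lr - min_lr) * cosine


# total_steps is a hypothetical value chosen only for illustration.
total_steps = 1000
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions the rate peaks at 5e-7 once warmup ends and decays to 5e-8 (0.1 * init_lr) by the final step.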

Evaluation

```
########## First turn ##########
                        score
model           turn
Llama3-iter-dpo 1        8.55

########## Second turn ##########
                        score
model           turn
Llama3-iter-dpo 2     7.95625

########## Average ##########
                        score
model
Llama3-iter-dpo      8.253125
Llama3-sft-baseline      7.69
```
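As a quick consistency check, the reported average matches the mean of the two per-turn scores, and the gap to the SFT baseline follows directly. The variable names below are illustrative only.

```python
import math

first_turn = 8.55
second_turn = 7.95625
sft_baseline = 7.69

# Mean of the two turn scores, compared with the reported average.
average = (first_turn + second_turn) / 2
matches_reported = math.isclose(average, 8.253125, abs_tol=1e-9)

# Improvement of the iterative DPO model over the SFT baseline.
gain = round(average - sft_baseline, 6)
```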