
This model is trained with Iterative DPO in OpenRLHF.
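At the core of each DPO iteration is the pairwise preference loss on chosen/rejected response pairs. A minimal sketch of that objective (the function name and inputs are illustrative assumptions, not the OpenRLHF API) with the Beta value used here:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of policy vs. frozen reference.
    # The loss pushes the margin between chosen and rejected rewards up.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)): small when chosen is clearly preferred.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the iterative variant, each of the 9 iterations samples fresh responses from the current policy, ranks them with a reward model to form pairs, and minimizes this loss before the next round.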

## Datasets and Hyperparameters

- Max Prompt Length: 2048
- Max Response Length: 2048
- `best_of_n`: 2 (2 samples per prompt)
- Learning Rate: 5e-7
- Beta: 0.1
- Scheduler: Cosine with Warmup (0.03) and MinLR (0.1)
- Rollout Batch Size: 20000
- Training Batch Size: 256
- Number of Iterations: 9
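The schedule above (cosine decay after a 3% linear warmup, floored at 10% of the peak learning rate) can be sketched as follows; the helper name and exact interpolation are assumptions, not the OpenRLHF implementation:

```python
import math

def lr_at(step, total_steps, peak_lr=5e-7, warmup_ratio=0.03, min_lr_ratio=0.1):
    # Linear warmup for the first warmup_ratio of training.
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr_ratio * peak_lr.
    min_lr = peak_lr * min_lr_ratio
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

With Rollout Batch Size 20000 and Training Batch Size 256, each iteration yields roughly 20000 preference pairs consumed in ~78 optimizer steps.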