This model was trained with Iterative DPO in OpenRLHF.

## Datasets and Hyperparameters

```
Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
best_of_n: 2 (2 samples per prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup and MinLR
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```
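For reference, the DPO objective that the `Beta: 0.1` hyperparameter above controls can be sketched as follows. This is a minimal illustrative implementation of the standard DPO loss for one preference pair, not OpenRLHF's actual training code; the log-probability values in the usage comment are placeholders, and in the iterative setup each pair would come from scoring the `best_of_n: 2` samples with the reward model and taking the higher-scored response as chosen.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    beta=0.1 matches the hyperparameter listed above. The reference
    model (here the SFT model) stays frozen; only the policy's
    log-probs change during training.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Loss is the negative log-sigmoid of the reward margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the margin is 0 and the loss is ln(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

A larger `beta` penalizes drifting from the reference model more strongly; `5e-7` is a typical learning rate for this kind of preference fine-tuning at the 8B scale.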