OpenRLHF/Llama-3-8b-iter-dpo-179k

	This model was trained with Iterative DPO using OpenRLHF.

	Datasets and Hyperparameters

	- Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
	- SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
	- Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
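
The three artifacts above typically fit together in one round of iterative DPO: the current policy (starting from the SFT model) samples several responses per prompt, the reward model scores them, and the scores decide which response is "chosen" and which is "rejected". The card does not spell out the pairing rule, so the sketch below assumes the highest-scored sample becomes chosen and the lowest-scored becomes rejected. `build_preference_pair` and `toy_score` are hypothetical names for illustration, not OpenRLHF functions, and `toy_score` is only a stand-in for a real reward model.

```python
from typing import Callable, Dict, List


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, str]:
    """Score every candidate and keep the best/worst as one DPO pair.

    Assumption: chosen = highest reward, rejected = lowest reward.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    # Sort ascending by reward so the extremes sit at both ends.
    ranked = sorted(candidates, key=lambda response: score_fn(prompt, response))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}


def toy_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


pair = build_preference_pair(
    "What is DPO?", ["short", "a longer answer"], toy_score
)
```

With only two samples per prompt, this reduces to comparing the two rewards directly; the sorted form is kept so the same helper works for larger sample counts.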

	```
	Max Prompt Length: 2048
	Max Response Length: 2048
	best_of_n: 2 (2 samples for each prompt)
	Learning Rate: 5e-7
	Beta: 0.1
	Scheduler: Cosine with Warmup (0.03) and MinLR (0.1 * init_lr)
	Rollout Batch Size: 20000
	Training Batch Size: 256
	Number of Iterations: 9
	```
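
To make the scheduler line concrete, here is a minimal sketch of a cosine schedule with warmup and a minimum learning rate, using the values listed above (initial LR 5e-7, warmup ratio 0.03, floor at 0.1 * init_lr). It assumes linear warmup and a standard half-cosine decay; the exact implementation in OpenRLHF may differ, and the card does not say whether the schedule restarts at each of the 9 iterations. `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    init_lr: float = 5e-7,
    warmup_ratio: float = 0.03,
    min_lr_ratio: float = 0.1,
) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup toward init_lr (assumed shape of the warmup).
        return init_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    min_lr = init_lr * min_lr_ratio
    # Half-cosine decay from init_lr down to min_lr.
    return min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: sample the schedule over a hypothetical 1000-step run.
peak = lr_at(30, 1000)
final = lr_at(1000, 1000)
```

Under these assumptions the rate peaks at 5e-7 once warmup ends and decays to 5e-8 rather than to zero.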

	Evaluation
	```
	########## First turn ##########
	                         score
	model               turn
	Llama3-iter-dpo     1    8.55
	########## Second turn ##########
	                         score
	model               turn
	Llama3-iter-dpo     2    7.95625
	########## Average ##########
	                         score
	model
	Llama3-iter-dpo          8.253125
	Llama3-sft-baseline      7.69
	```
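
As a quick sanity check, the reported average is consistent with an unweighted mean of the two per-turn scores. The snippet below reproduces it and computes the gap to the SFT baseline; treating the average as a simple mean is an assumption, although it matches the reported figure.

```python
# Per-turn scores and the baseline, copied from the results above.
first_turn = 8.55
second_turn = 7.95625
sft_baseline = 7.69

# Unweighted mean over the two turns (assumed aggregation).
average = (first_turn + second_turn) / 2

# Improvement of the iterative-DPO model over the SFT baseline.
gain = average - sft_baseline

print(f"average={average:.6f} gain={gain:.6f}")
```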