llama3-dpo-lora

This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT. The training dataset was not recorded (the dataset field is None). Evaluation-set results for each logged step are reported in the table at the end of this card.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

The hyperparameters used during training are not listed in this card.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6275	0.1047	100	0.6122	0.2594	-0.0099	0.6920	0.2693	-276.7753	-289.9533	-0.5582	-0.5619
0.5726	0.2094	200	0.5529	-0.0787	-0.6353	0.7040	0.5565	-283.0293	-293.3344	-0.5103	-0.5266
0.5429	0.3141	300	0.5380	-0.1730	-0.8455	0.7260	0.6725	-285.1317	-294.2773	-0.4689	-0.4910
0.5054	0.4187	400	0.5332	-0.0870	-0.8469	0.7240	0.7599	-285.1459	-293.4173	-0.4261	-0.4535
0.5508	0.5234	500	0.5267	-0.0207	-0.8088	0.7180	0.7881	-284.7646	-292.7540	-0.4045	-0.4335
0.5338	0.6281	600	0.5263	0.1981	-0.5901	0.7300	0.7882	-282.5771	-290.5659	-0.4002	-0.4304
0.5064	0.7328	700	0.5175	-0.2007	-1.0076	0.7300	0.8068	-286.7521	-294.5546	-0.3761	-0.4080
0.5349	0.8375	800	0.5197	0.0149	-0.7896	0.7200	0.8045	-284.5727	-292.3984	-0.3853	-0.4161
0.4775	0.9422	900	0.5181	0.0150	-0.7988	0.7260	0.8139	-284.6649	-292.3968	-0.3842	-0.4151
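
The table can be sanity-checked with a few lines of Python. The sketch below is only an illustration: it assumes the usual DPO logging convention in which Rewards/margins is the chosen reward minus the rejected reward, and the names `EVAL_ROWS`, `margin_mismatches`, and `best_step_by_val_loss` are hypothetical helpers, not part of any library. The values are copied from the evaluation columns above.

```python
# Evaluation columns copied from the table above:
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_accuracy, rewards_margin)
EVAL_ROWS = [
    (100, 0.6122, 0.2594, -0.0099, 0.6920, 0.2693),
    (200, 0.5529, -0.0787, -0.6353, 0.7040, 0.5565),
    (300, 0.5380, -0.1730, -0.8455, 0.7260, 0.6725),
    (400, 0.5332, -0.0870, -0.8469, 0.7240, 0.7599),
    (500, 0.5267, -0.0207, -0.8088, 0.7180, 0.7881),
    (600, 0.5263, 0.1981, -0.5901, 0.7300, 0.7882),
    (700, 0.5175, -0.2007, -1.0076, 0.7300, 0.8068),
    (800, 0.5197, 0.0149, -0.7896, 0.7200, 0.8045),
    (900, 0.5181, 0.0150, -0.7988, 0.7260, 0.8139),
]


def margin_mismatches(rows, tol=5e-4):
    """Return the steps where the logged margin differs from
    (chosen - rejected) by more than tol (tol absorbs 4-decimal rounding)."""
    return [
        step
        for step, _, chosen, rejected, _, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_step_by_val_loss(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("Steps with inconsistent margins:", margin_mismatches(EVAL_ROWS))
    print("Step with lowest validation loss:", best_step_by_val_loss(EVAL_ROWS))
```

Under that assumption every row is consistent to within rounding, and the lowest validation loss in the table (0.5175) occurs at step 700 rather than at the final logged step.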