Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V5

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. Per-checkpoint results on the evaluation set are listed in the table at the end of this card.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7395	0.3010	73	0.6468	0.0134	-0.0847	0.9000	0.0981	-142.2149	-145.7866	0.3794	0.3670
0.7285	0.6021	146	0.6128	0.0518	-0.1414	0.7000	0.1932	-142.7814	-145.4018	0.3432	0.3316
0.5488	0.9031	219	0.5896	0.0505	-0.2094	0.8000	0.2599	-143.4620	-145.4151	0.3212	0.3092
0.4181	1.2041	292	0.7451	-0.5895	-1.0121	0.7000	0.4226	-151.4888	-151.8154	0.2582	0.2463
0.6666	1.5052	365	0.6292	-0.4920	-0.8706	0.5000	0.3786	-150.0739	-150.8403	0.2068	0.1950
0.5649	1.8062	438	0.6652	-0.6961	-1.0296	0.6000	0.3335	-151.6640	-152.8809	0.1043	0.0914
0.3129	2.1072	511	0.8072	-1.2644	-1.5342	0.6000	0.2698	-156.7100	-158.5638	0.0071	-0.0060
0.0785	2.4082	584	1.0289	-2.0249	-2.2745	0.6000	0.2496	-164.1127	-166.1691	-0.1558	-0.1700
0.1698	2.7093	657	1.0059	-1.9822	-2.2494	0.6000	0.2673	-163.8624	-165.7420	-0.1662	-0.1805
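
The table can be sanity-checked and summarized with a few lines of Python. The sketch below assumes the usual DPO logging convention, in which Rewards/margins is Rewards/chosen minus Rewards/rejected; the card itself does not state this. The `EvalRow` type and the helper names are hypothetical and exist only for illustration, and only six of the twelve columns are copied over.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    train_loss: float
    val_loss: float
    reward_chosen: float
    reward_rejected: float
    reward_margin: float


# Values copied from the table above (remaining columns omitted for brevity).
ROWS = [
    EvalRow(73, 0.7395, 0.6468, 0.0134, -0.0847, 0.0981),
    EvalRow(146, 0.7285, 0.6128, 0.0518, -0.1414, 0.1932),
    EvalRow(219, 0.5488, 0.5896, 0.0505, -0.2094, 0.2599),
    EvalRow(292, 0.4181, 0.7451, -0.5895, -1.0121, 0.4226),
    EvalRow(365, 0.6666, 0.6292, -0.4920, -0.8706, 0.3786),
    EvalRow(438, 0.5649, 0.6652, -0.6961, -1.0296, 0.3335),
    EvalRow(511, 0.3129, 0.8072, -1.2644, -1.5342, 0.2698),
    EvalRow(584, 0.0785, 1.0289, -2.0249, -2.2745, 0.2496),
    EvalRow(657, 0.1698, 1.0059, -1.9822, -2.2494, 0.2673),
]


def margins_consistent(rows, tol=2e-4):
    """Return True if margin == chosen - rejected for every row, within rounding."""
    return all(
        abs((r.reward_chosen - r.reward_rejected) - r.reward_margin) <= tol
        for r in rows
    )


def lowest_val_loss(rows):
    """Return the checkpoint row with the smallest validation loss."""
    return min(rows, key=lambda r: r.val_loss)


def largest_margin(rows):
    """Return the checkpoint row with the largest reward margin."""
    return max(rows, key=lambda r: r.reward_margin)


if __name__ == "__main__":
    print("margins consistent:", margins_consistent(ROWS))
    best = lowest_val_loss(ROWS)
    print("lowest validation loss:", best.step, best.val_loss)
    widest = largest_margin(ROWS)
    print("largest reward margin:", widest.step, widest.reward_margin)
```

On these rows the lowest validation loss is at step 219 (0.5896) and the largest reward margin is at step 292 (0.4226). Validation loss rises in later checkpoints while training loss keeps falling, which may indicate overfitting, so an earlier checkpoint could be worth comparing against the final one.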