Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V5

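This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not specified. At the final logged evaluation (step 660, epoch 2.9932) it reaches a validation loss of 1.0897, a reward accuracy of 0.4000, and a reward margin of -0.2759. The lowest validation loss, 0.6196, occurs earlier, at step 132.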

Model description

More information needed

The hyperparameters used during training are not listed.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7826	0.2993	66	0.6590	0.0849	0.0090	0.8000	0.0759	-140.7556	-143.3033	0.0847	0.0794
0.639	0.5986	132	0.6196	0.1097	-0.0511	0.9000	0.1607	-141.3567	-143.0557	0.0753	0.0696
0.5359	0.8980	198	0.6393	0.0423	-0.0866	0.8000	0.1290	-141.7119	-143.7288	0.0629	0.0567
0.2727	1.1973	264	0.8080	-1.1508	-1.3039	0.6000	0.1532	-153.8851	-155.6598	-0.0274	-0.0343
0.3407	1.4966	330	0.6648	-0.9615	-1.1845	0.7000	0.2230	-152.6907	-153.7668	-0.0764	-0.0838
0.3991	1.7959	396	0.7534	-1.2141	-1.2811	0.6000	0.0670	-153.6568	-156.2932	-0.1934	-0.2005
0.1309	2.0952	462	0.8973	-1.9586	-1.8725	0.4000	-0.0861	-159.5707	-163.7383	-0.3197	-0.3272
0.0603	2.3946	528	1.0892	-2.8596	-2.5458	0.3000	-0.3138	-166.3034	-172.7478	-0.4837	-0.4920
0.1481	2.6939	594	1.1046	-3.0656	-2.7656	0.4000	-0.2999	-168.5022	-174.8080	-0.5326	-0.5412
0.2564	2.9932	660	1.0897	-2.9914	-2.7155	0.4000	-0.2759	-168.0010	-174.0661	-0.5254	-0.5339
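
The table can be checked programmatically. The sketch below is not part of the original training code. It copies six columns from the rows above into a hypothetical `ROWS` list, finds the step with the lowest validation loss, and checks that each reward margin equals rewards/chosen minus rewards/rejected up to rounding. That relationship is an assumption based on the column names, and the helper names `best_step` and `margin_mismatches` are illustrative only.

```python
# Columns copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected,
#  rewards/accuracies, rewards/margins)
ROWS = [
    (66, 0.6590, 0.0849, 0.0090, 0.8, 0.0759),
    (132, 0.6196, 0.1097, -0.0511, 0.9, 0.1607),
    (198, 0.6393, 0.0423, -0.0866, 0.8, 0.1290),
    (264, 0.8080, -1.1508, -1.3039, 0.6, 0.1532),
    (330, 0.6648, -0.9615, -1.1845, 0.7, 0.2230),
    (396, 0.7534, -1.2141, -1.2811, 0.6, 0.0670),
    (462, 0.8973, -1.9586, -1.8725, 0.4, -0.0861),
    (528, 1.0892, -2.8596, -2.5458, 0.3, -0.3138),
    (594, 1.1046, -3.0656, -2.7656, 0.4, -0.2999),
    (660, 1.0897, -2.9914, -2.7155, 0.4, -0.2759),
]


def best_step(rows):
    """Return the row with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])


def margin_mismatches(rows, tol=5e-4):
    """Return steps where margin differs from chosen - rejected by more than tol.

    Assumes margin = rewards/chosen - rewards/rejected; the tolerance
    absorbs the four-decimal rounding in the table.
    """
    return [
        step
        for step, _loss, chosen, rejected, _acc, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


if __name__ == "__main__":
    step, loss, *_ = best_step(ROWS)
    print(f"lowest validation loss {loss:.4f} at step {step}")
    print("margin mismatches:", margin_mismatches(ROWS))
```

Under these assumptions, validation loss bottoms out within the first epoch while training loss keeps falling. Reward accuracy drops below 0.5 from step 462 onward, which suggests that an earlier checkpoint generalizes better than the final one.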