Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V6

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified (the dataset field is recorded as None). Evaluation-set metrics at each logged step are given in the table below.

Model description

More information needed

The hyperparameters used during training are not listed in this card.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6832	0.3007	69	0.6916	-0.1597	-0.1889	0.4000	0.0292	-114.5179	-85.0475	0.6233	0.6414
0.7529	0.6013	138	0.6560	-0.2047	-0.3472	0.5000	0.1425	-116.1010	-85.4976	0.6177	0.6354
0.6930	0.9020	207	0.6636	0.0291	-0.0598	0.5000	0.0889	-113.2271	-83.1593	0.6143	0.6331
0.4049	1.2026	276	0.6820	-0.9628	-1.4793	0.5000	0.5166	-127.4224	-93.0781	0.5148	0.5312
0.3698	1.5033	345	0.6524	-1.3282	-1.9360	0.6000	0.6078	-131.9892	-96.7321	0.4151	0.4326
0.3176	1.8039	414	0.7491	-1.8527	-2.3707	0.6000	0.5180	-136.3361	-101.9771	0.3469	0.3652
0.3610	2.1046	483	0.8110	-2.2972	-2.7632	0.5000	0.4660	-140.2609	-106.4225	0.2734	0.2932
0.3286	2.4052	552	0.9465	-2.7604	-3.1816	0.6000	0.4212	-144.4454	-111.0542	0.1886	0.2099
0.0545	2.7059	621	0.9591	-2.8498	-3.2567	0.6000	0.4069	-145.1960	-111.9480	0.1780	0.1994
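
The table can be sanity-checked with a few lines of Python. The sketch below assumes the conventional DPO logging definition, in which Rewards/margins is Rewards/chosen minus Rewards/rejected; the card itself does not state this. The rows are copied from the table above, and the helper names (`margin_mismatches`, `best_validation_step`) are hypothetical, introduced only for illustration.

```python
# Rows copied from the training results table:
# (step, training_loss, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (69, 0.6832, 0.6916, -0.1597, -0.1889, 0.0292),
    (138, 0.7529, 0.6560, -0.2047, -0.3472, 0.1425),
    (207, 0.6930, 0.6636, 0.0291, -0.0598, 0.0889),
    (276, 0.4049, 0.6820, -0.9628, -1.4793, 0.5166),
    (345, 0.3698, 0.6524, -1.3282, -1.9360, 0.6078),
    (414, 0.3176, 0.7491, -1.8527, -2.3707, 0.5180),
    (483, 0.3610, 0.8110, -2.2972, -2.7632, 0.4660),
    (552, 0.3286, 0.9465, -2.7604, -3.1816, 0.4212),
    (621, 0.0545, 0.9591, -2.8498, -3.2567, 0.4069),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where chosen - rejected differs from the reported margin.

    Assumption: margin = rewards_chosen - rewards_rejected. The tolerance
    allows for the four-decimal rounding used in the table.
    """
    return [
        step
        for step, _, _, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_validation_step(rows):
    """Return (step, validation_loss) for the row with the lowest validation loss."""
    step, _, val_loss, *_ = min(rows, key=lambda r: r[2])
    return step, val_loss


if __name__ == "__main__":
    print("margin mismatches:", margin_mismatches(ROWS))
    print("lowest validation loss:", best_validation_step(ROWS))
```

Under that assumption every reported margin matches to within rounding. The lowest validation loss in the table is 0.6524 at step 345; later steps show the validation loss rising while the training loss keeps falling, which may be worth considering when choosing a checkpoint.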