Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V3

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the None dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6606	0.3035	78	0.6743	0.0166	-0.0081	0.75	0.0247	-125.3963	-73.3299	0.5935	0.6202
0.5493	0.6070	156	0.6634	-0.1831	-0.2415	0.6667	0.0585	-127.7309	-75.3266	0.5586	0.5847
0.5705	0.9105	234	0.5848	-0.3315	-0.6168	0.6667	0.2853	-131.4834	-76.8105	0.4949	0.5208
0.405	1.2140	312	0.5806	-0.8206	-1.3076	0.5833	0.4870	-138.3913	-81.7017	0.4210	0.4471
0.5029	1.5175	390	0.5738	-1.0140	-1.5365	0.6667	0.5225	-140.6803	-83.6359	0.3256	0.3525
0.1719	1.8210	468	0.6151	-1.3642	-1.9154	0.75	0.5512	-144.4694	-87.1375	0.1929	0.2203
0.3565	2.1245	546	0.6575	-1.6573	-2.3806	0.75	0.7233	-149.1218	-90.0692	0.1075	0.1363
0.4206	2.4280	624	0.7578	-2.2884	-3.1134	0.6667	0.8250	-156.4492	-96.3796	0.0126	0.0427
0.3123	2.7315	702	0.7312	-2.2500	-3.0688	0.6667	0.8189	-156.0040	-95.9953	0.0075	0.0375