Llama-2-7b-hf-DPO-LookAhead-5_TTree1.4_TT0.9_TP0.7_TE0.2_V3

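This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not specified. At the last logged evaluation (step 684, epoch 2.7143), it achieves the following results on the evaluation set: validation loss 1.2551, rewards/chosen -2.5518, rewards/rejected -2.2604, rewards/accuracies 0.25, rewards/margins -0.2914, logps/rejected -124.4866, logps/chosen -113.4918, logits/rejected -0.2003, logits/chosen -0.1939.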

Model description

More information needed

The hyperparameters used during training are not listed.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7369	0.3016	76	0.7203	0.0395	0.0671	0.5	-0.0275	-101.2117	-87.5782	0.3635	0.3676
0.7346	0.6032	152	0.7779	0.0272	0.0797	0.4167	-0.0525	-101.0857	-87.7017	0.3423	0.3459
0.5938	0.9048	228	0.7684	-0.0888	-0.0049	0.25	-0.0839	-101.9318	-88.8616	0.3591	0.3628
0.2822	1.2063	304	0.9053	-0.5756	-0.3819	0.5	-0.1937	-105.7012	-93.7292	0.3058	0.3097
0.1938	1.5079	380	0.9300	-0.8880	-0.7094	0.3333	-0.1786	-108.9764	-96.8538	0.2048	0.2097
0.6894	1.8095	456	1.0636	-1.7609	-1.5117	0.3333	-0.2492	-116.9998	-105.5827	0.0583	0.0644
0.2845	2.1111	532	0.9900	-1.5299	-1.4094	0.3333	-0.1206	-115.9760	-103.2727	-0.0017	0.0048
0.0617	2.4127	608	1.1950	-2.1986	-1.9633	0.25	-0.2353	-121.5159	-109.9597	-0.1517	-0.1451
0.1181	2.7143	684	1.2551	-2.5518	-2.2604	0.25	-0.2914	-124.4866	-113.4918	-0.2003	-0.1939
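
The reward columns above can be cross-checked with a short script. This is a minimal sketch, assuming the usual DPO logging convention in which Rewards/margins is the mean of (chosen reward minus rejected reward). The card itself does not state which trainer produced these metrics, so treat that relation as an assumption. The helper names `recomputed_margins` and `best_step` are hypothetical and exist only for this illustration; the numbers are copied from the table.

```python
# Rows copied from the results table:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (76, 0.7203, 0.0395, 0.0671, -0.0275),
    (152, 0.7779, 0.0272, 0.0797, -0.0525),
    (228, 0.7684, -0.0888, -0.0049, -0.0839),
    (304, 0.9053, -0.5756, -0.3819, -0.1937),
    (380, 0.9300, -0.8880, -0.7094, -0.1786),
    (456, 1.0636, -1.7609, -1.5117, -0.2492),
    (532, 0.9900, -1.5299, -1.4094, -0.1206),
    (608, 1.1950, -2.1986, -1.9633, -0.2353),
    (684, 1.2551, -2.5518, -2.2604, -0.2914),
]


def recomputed_margins(rows):
    """Recompute the margin as chosen minus rejected (assumed convention)."""
    return [round(chosen - rejected, 4) for _, _, chosen, rejected, _ in rows]


def best_step(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


margins = recomputed_margins(ROWS)

# Largest difference between the recomputed and the reported margins;
# small differences are expected because the table values are rounded.
max_gap = max(abs(m - row[4]) for m, row in zip(margins, ROWS))

# A negative margin means the rejected responses received the higher
# implicit reward on average at that evaluation step.
all_negative = all(m < 0 for m in margins)

print("recomputed margins:", margins)
print("max gap vs. reported margins:", max_gap)
print("all margins negative:", all_negative)
print("step with lowest validation loss:", best_step(ROWS))
```

Under that assumption, the recomputed margins agree with the reported column up to rounding, every margin is negative, and the validation loss is lowest at the first logged evaluation, so an earlier checkpoint may be worth comparing against the final one.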