Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V2

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. Results on the evaluation set are reported per checkpoint in the table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed. Training and evaluation results by checkpoint:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7354	0.3029	78	0.7015	-0.0064	0.0037	0.6667	-0.0101	-118.2320	-153.7661	0.5634	0.5426
0.6583	0.6058	156	0.7087	-0.0202	-0.0023	0.5833	-0.0178	-118.2927	-153.9037	0.5270	0.5061
0.723	0.9087	234	0.7499	-0.3620	-0.3783	0.5	0.0163	-122.0522	-157.3222	0.4964	0.4745
0.229	1.2117	312	0.7914	-0.9616	-1.0299	0.5833	0.0683	-128.5688	-163.3184	0.3901	0.3669
0.603	1.5146	390	0.7363	-1.3393	-1.5502	0.5	0.2109	-133.7717	-167.0953	0.3080	0.2854
0.1335	1.8175	468	0.7920	-1.5465	-1.6888	0.4167	0.1423	-135.1577	-169.1670	0.1816	0.1612
0.1427	2.1204	546	0.7712	-1.7940	-2.0501	0.5	0.2561	-138.7705	-171.6423	0.1192	0.0991
0.2443	2.4233	624	0.8586	-2.4320	-2.8184	0.5	0.3864	-146.4533	-178.0219	-0.0246	-0.0443
0.0228	2.7262	702	0.8499	-2.3527	-2.7258	0.5	0.3731	-145.5276	-177.2292	-0.0232	-0.0429
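
The reward columns above can be sanity-checked with a short script. This is a minimal sketch, assuming the usual DPO logging convention (as in TRL's DPOTrainer) in which Rewards/margins is the mean of Rewards/chosen minus Rewards/rejected; this card does not state which trainer produced the logs. The `ROWS` list and the helper names are illustrative, and the values are copied from the table.

```python
# Values copied from the results table:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (78, 0.7015, -0.0064, 0.0037, -0.0101),
    (156, 0.7087, -0.0202, -0.0023, -0.0178),
    (234, 0.7499, -0.3620, -0.3783, 0.0163),
    (312, 0.7914, -0.9616, -1.0299, 0.0683),
    (390, 0.7363, -1.3393, -1.5502, 0.2109),
    (468, 0.7920, -1.5465, -1.6888, 0.1423),
    (546, 0.7712, -1.7940, -2.0501, 0.2561),
    (624, 0.8586, -2.4320, -2.8184, 0.3864),
    (702, 0.8499, -2.3527, -2.7258, 0.3731),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where the logged margin differs from
    chosen - rejected by more than tol (tol absorbs 4-decimal rounding)."""
    return [
        step
        for step, _, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def step_with_lowest_validation_loss(rows):
    """Return the step of the checkpoint with the smallest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


def step_with_largest_margin(rows):
    """Return the step of the checkpoint with the largest reward margin."""
    return max(rows, key=lambda row: row[4])[0]


if __name__ == "__main__":
    print("margin mismatches:", margin_mismatches(ROWS))
    print("lowest validation loss at step:", step_with_lowest_validation_loss(ROWS))
    print("largest reward margin at step:", step_with_largest_margin(ROWS))
```

In this table the lowest validation loss and the largest reward margin fall on different checkpoints, so which checkpoint counts as best depends on the metric chosen.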