dpo-llama-chat-without-none

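This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf. The training dataset was not recorded (the auto-generated card lists it as None). At the last logged evaluation (step 1000, epoch 2.4) it achieves the following results on the evaluation set: validation loss 4.9481, rewards/chosen 4.6795, rewards/rejected 2.8189, rewards/accuracies 0.8547, rewards/margins 1.8606, logps/rejected -60.8495, logps/chosen -50.0326, logits/rejected -0.2216, logits/chosen -0.2323.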

Model description

More information needed

The hyperparameters used during training were not recorded in this card. Training and evaluation metrics, logged every 100 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
6.3	0.24	100	6.1290	3.4767	3.2110	0.5920	0.2657	-56.9286	-62.0606	-0.2723	-0.2654
5.5843	0.48	200	5.8936	3.6904	3.2305	0.6520	0.4599	-56.7330	-59.9230	0.2517	0.2475
5.757	0.72	300	5.6694	3.9164	3.1893	0.7253	0.7271	-57.1450	-57.6631	0.3505	0.3418
5.5385	0.96	400	5.4629	4.1466	3.1351	0.7600	1.0115	-57.6871	-55.3611	0.2059	0.1970
5.2301	1.2	500	5.2891	4.3324	3.0305	0.7880	1.3020	-58.7338	-53.5027	0.1063	0.0968
5.0115	1.44	600	5.1601	4.4582	2.9458	0.8213	1.5124	-59.5800	-52.2452	-0.1082	-0.1154
4.9893	1.68	700	5.0431	4.5787	2.9142	0.8413	1.6645	-59.8968	-51.0404	-0.1716	-0.1829
5.0292	1.92	800	4.9770	4.6501	2.8827	0.8427	1.7673	-60.2111	-50.3266	-0.1929	-0.2042
4.331	2.16	900	4.9577	4.6724	2.8191	0.8480	1.8534	-60.8478	-50.1027	-0.2005	-0.2121
4.5481	2.4	1000	4.9481	4.6795	2.8189	0.8547	1.8606	-60.8495	-50.0326	-0.2216	-0.2323
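
The Rewards/margins column can be cross-checked against the two reward columns. The sketch below assumes that the margin is the reported chosen reward minus the reported rejected reward, which is how preference-optimization trainers commonly log it. The card itself does not state this, so treat it as an assumption. The names `ROWS` and `margin_gaps` are illustrative only; the numbers are copied from the table.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins), copied from the table.
ROWS = [
    (100, 3.4767, 3.2110, 0.2657),
    (200, 3.6904, 3.2305, 0.4599),
    (300, 3.9164, 3.1893, 0.7271),
    (400, 4.1466, 3.1351, 1.0115),
    (500, 4.3324, 3.0305, 1.3020),
    (600, 4.4582, 2.9458, 1.5124),
    (700, 4.5787, 2.9142, 1.6645),
    (800, 4.6501, 2.8827, 1.7673),
    (900, 4.6724, 2.8191, 1.8534),
    (1000, 4.6795, 2.8189, 1.8606),
]


def margin_gaps(rows):
    """Map each step to |(chosen - rejected) - reported margin|.

    Assumption: the logged margin is the chosen reward minus the rejected reward.
    """
    return {
        step: abs((chosen - rejected) - margin)
        for step, chosen, rejected, margin in rows
    }


gaps = margin_gaps(ROWS)
max_gap = max(gaps.values())

for step, gap in gaps.items():
    print(f"step {step:>4}: gap = {gap:.4f}")
```

Under that assumption, every row agrees to within the rounding of the four-decimal values, and the margin grows at each logged step while the rejected reward drifts down.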