model_shp3_dpo1

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. Its results on the evaluation set at each logged checkpoint are given in the training results table below.

Model description

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0698	2.67	100	1.0791	-4.0199	-4.9604	0.5900	0.9405	-316.3408	-282.7867	-0.9293	-0.9672
0.0004	5.33	200	1.5283	-7.2654	-8.3476	0.5800	1.0822	-350.2130	-315.2426	-0.9409	-0.9510
0.0001	8.0	300	1.9265	-9.7069	-11.2491	0.5700	1.5421	-379.2276	-339.6573	-0.9224	-0.9298
0.0001	10.67	400	1.9667	-9.9054	-11.4492	0.5700	1.5439	-381.2295	-341.6420	-0.9181	-0.9248
0.0001	13.33	500	1.9959	-10.0523	-11.6165	0.5700	1.5642	-382.9025	-343.1115	-0.9137	-0.9198
0.0	16.0	600	2.0035	-10.1182	-11.7116	0.5700	1.5934	-383.8533	-343.7699	-0.9121	-0.9185
0.0001	18.67	700	2.0159	-10.1627	-11.7547	0.5700	1.5920	-384.2843	-344.2155	-0.9115	-0.9169
0.0	21.33	800	2.0163	-10.1740	-11.7677	0.5700	1.5937	-384.4142	-344.3281	-0.9103	-0.9160
0.0	24.0	900	2.0220	-10.1842	-11.7817	0.5700	1.5976	-384.5541	-344.4297	-0.9106	-0.9160
0.0001	26.67	1000	2.0149	-10.1791	-11.7765	0.5700	1.5974	-384.5022	-344.3792	-0.9104	-0.9161
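
The column names above match the metrics that preference-optimization trainers such as TRL's DPOTrainer log, although this card does not name the training library, so treat that reading as an assumption. Under it, Rewards/margins should equal Rewards/chosen minus Rewards/rejected, up to rounding. The Python sketch below copies a few columns from the table, checks that relationship, and finds the logged step with the lowest validation loss. The names ROWS, margin_errors, and best_step_by_validation_loss are illustrative and not part of the model's code.

```python
# Values copied from the training results table above.
# Each tuple: (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (100, 1.0791, -4.0199, -4.9604, 0.9405),
    (200, 1.5283, -7.2654, -8.3476, 1.0822),
    (300, 1.9265, -9.7069, -11.2491, 1.5421),
    (400, 1.9667, -9.9054, -11.4492, 1.5439),
    (500, 1.9959, -10.0523, -11.6165, 1.5642),
    (600, 2.0035, -10.1182, -11.7116, 1.5934),
    (700, 2.0159, -10.1627, -11.7547, 1.5920),
    (800, 2.0163, -10.1740, -11.7677, 1.5937),
    (900, 2.0220, -10.1842, -11.7817, 1.5976),
    (1000, 2.0149, -10.1791, -11.7765, 1.5974),
]


def margin_errors(rows):
    """Absolute gap between each reported margin and (chosen - rejected)."""
    return [
        abs((chosen - rejected) - margin)
        for _step, _val_loss, chosen, rejected, margin in rows
    ]


def best_step_by_validation_loss(rows):
    """Step of the logged checkpoint with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


max_error = max(margin_errors(ROWS))
best_step = best_step_by_validation_loss(ROWS)

print(f"largest margin discrepancy: {max_error:.4f}")
print(f"lowest validation loss at step: {best_step}")
```

On these rows the validation loss is lowest at step 100 and rises afterwards while the training loss approaches zero, so an earlier checkpoint may be worth comparing against the final one.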