zephyr-7b-dpo-full-beta-0.2

This model is a fine-tuned version of HuggingFaceH4/mistral-7b-sft-beta. The training dataset is not recorded (the card lists it as None). Evaluation-set results are reported in the table of training results below.

Model description

More information needed

The hyperparameters used during training are not listed in this card. Training and evaluation metrics, logged every 500 steps, were:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5631	0.26	500	0.5260	0.0288	-1.2082	0.75	1.2371	-251.9833	-298.3453	-2.9467	-2.9577
0.5432	0.52	1000	0.5888	-0.0335	-1.8482	0.7540	1.8147	-255.1831	-298.6568	-2.8465	-2.8476
0.5368	0.77	1500	0.5860	-0.4836	-2.3300	0.7619	1.8464	-257.5920	-300.9073	-2.8455	-2.8445
0.0615	1.03	2000	0.6024	-0.5971	-2.6919	0.7778	2.0948	-259.4018	-301.4749	-2.8687	-2.8639
0.0817	1.29	2500	0.6655	-1.3554	-3.8426	0.7738	2.4872	-265.1552	-305.2667	-2.8257	-2.8254
0.0617	1.55	3000	0.6421	-1.2552	-3.7613	0.75	2.5062	-264.7488	-304.7651	-2.7744	-2.7683
0.0765	1.81	3500	0.6582	-1.1492	-4.0394	0.7659	2.8902	-266.1391	-304.2354	-2.7403	-2.7389
0.0178	2.07	4000	0.6797	-1.8485	-5.2549	0.7619	3.4064	-272.2166	-307.7317	-2.7310	-2.7273
0.0165	2.32	4500	0.7359	-2.2096	-6.0498	0.7817	3.8401	-276.1910	-309.5376	-2.7006	-2.7001
0.0094	2.58	5000	0.7864	-2.8828	-6.8542	0.7738	3.9713	-280.2130	-312.9036	-2.7185	-2.7196
0.0094	2.84	5500	0.7953	-3.1897	-7.3009	0.7579	4.1112	-282.4464	-314.4378	-2.6987	-2.7012
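
The table can be sanity-checked with a short script. The sketch below assumes that Rewards/margins is the mean of the chosen reward minus the rejected reward, as is common in DPO trainers; the card itself does not state this. The tuple layout and the helper name `margin_errors` are illustrative only. Values are copied from the table above.

```python
# Each tuple: (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins),
# copied from the training results table.
rows = [
    (500, 0.5260, 0.0288, -1.2082, 1.2371),
    (1000, 0.5888, -0.0335, -1.8482, 1.8147),
    (1500, 0.5860, -0.4836, -2.3300, 1.8464),
    (2000, 0.6024, -0.5971, -2.6919, 2.0948),
    (2500, 0.6655, -1.3554, -3.8426, 2.4872),
    (3000, 0.6421, -1.2552, -3.7613, 2.5062),
    (3500, 0.6582, -1.1492, -4.0394, 2.8902),
    (4000, 0.6797, -1.8485, -5.2549, 3.4064),
    (4500, 0.7359, -2.2096, -6.0498, 3.8401),
    (5000, 0.7864, -2.8828, -6.8542, 3.9713),
    (5500, 0.7953, -3.1897, -7.3009, 4.1112),
]


def margin_errors(rows):
    """Absolute gap between the reported margin and chosen - rejected, per step."""
    return {
        step: abs((chosen - rejected) - margin)
        for step, _val_loss, chosen, rejected, margin in rows
    }


errors = margin_errors(rows)
max_error = max(errors.values())

# Step with the lowest validation loss (assumed here to be a reasonable
# proxy for the best checkpoint; other criteria are possible).
best_step = min(rows, key=lambda r: r[1])[0]

print(f"largest margin mismatch: {max_error:.4f}")
print(f"lowest validation loss at step {best_step}")
```

Under that assumption the reported margins match chosen minus rejected to within rounding. The lowest validation loss occurs at step 500, while training loss keeps falling and the reward margin keeps growing, a pattern that may indicate overfitting to the preference data in later epochs.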