llama-7b-SFT-qlora-wiki_DPO_ds_RM_top_2_1024_r_64_alpha_16

This model is a fine-tuned version of dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6925	0.1	19	0.6761	-0.1021	-0.1593	0.5697	0.0573	-202.1919	-206.8013	1.1506	1.1664
0.6754	0.21	38	0.6738	-0.4156	-0.5460	0.5701	0.1303	-206.0580	-209.9368	1.1257	1.1406
0.6799	0.31	57	0.6666	-0.0458	-0.1454	0.5932	0.0996	-202.0523	-206.2388	1.1176	1.1327
0.6618	0.42	76	0.6637	-0.1458	-0.2745	0.5971	0.1286	-203.3434	-207.2391	1.1195	1.1333
0.6706	0.52	95	0.6607	-0.0386	-0.1827	0.5971	0.1440	-202.4252	-206.1670	1.1334	1.1484
0.668	0.63	114	0.6596	-0.1615	-0.2945	0.6035	0.1330	-203.5434	-207.3955	1.1500	1.1661
0.6712	0.73	133	0.6597	-0.1703	-0.2905	0.5979	0.1202	-203.5037	-207.4840	1.1515	1.1672
0.6715	0.84	152	0.6588	-0.1516	-0.2745	0.6100	0.1229	-203.3436	-207.2964	1.1532	1.1691
0.673	0.94	171	0.6572	-0.1473	-0.2755	0.6128	0.1282	-203.3539	-207.2538	1.1534	1.1690