
zephyr-7b-dpo-qlora

This model is a fine-tuned version of /opt/data/private/xgq/alignment-handbook/data/Qwen-1.5b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5487
  • Rewards/chosen: -1.1270
  • Rewards/rejected: -1.7889
  • Rewards/accuracies: 0.7380
  • Rewards/margins: 0.6620
  • Logps/rejected: -483.5314
  • Logps/chosen: -460.1111
  • Logits/rejected: -1.4133
  • Logits/chosen: -1.4624
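The reward columns above come from the implicit DPO reward: β times the policy-vs-reference log-probability difference on a completion, with Rewards/margins simply chosen minus rejected. A minimal sketch of how these quantities relate (β = 0.1 is the common default in alignment-handbook DPO recipes; the log-probabilities below are illustrative numbers, not values from this run):

```python
import math

def dpo_rewards(logp_chosen, logp_rejected,
                ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit DPO rewards, margin, and pairwise loss for one example."""
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = r_chosen - r_rejected
    # DPO loss is -log(sigmoid(margin)); accuracy counts margin > 0.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return r_chosen, r_rejected, margin, loss

# Illustrative log-probs only (policy vs. frozen reference model)
r_c, r_r, m, loss = dpo_rewards(-460.0, -483.0, -450.0, -466.0)
# r_c = -1.0, r_r = -1.7, m = 0.7
```

A negative reward on both chosen and rejected completions (as in the table above) just means the policy's log-probabilities drifted below the reference's; what the loss optimizes is the margin between them.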

Model description

This is a QLoRA (PEFT) adapter trained with Direct Preference Optimization (DPO) on top of an SFT checkpoint of Qwen2-1.5B, following the alignment-handbook zephyr recipe. Further details needed.

Intended uses & limitations

More information needed

Training and evaluation data

The model was trained and evaluated on the HuggingFaceH4/ultrafeedback_binarized preference dataset (see above). Further details needed.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 3
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 24
  • total_eval_batch_size: 6
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
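The effective batch size follows from the per-device batch, device count, and gradient accumulation, and the scheduler warms up linearly over the first 10% of steps before cosine decay. A sketch of both calculations (the scheduler shape mirrors the standard warmup-then-cosine schedule; `total_steps=1000` below is an arbitrary example, not this run's step count):

```python
import math

def effective_batch_size(per_device, num_devices, grad_accum):
    return per_device * num_devices * grad_accum

def lr_at(step, total_steps, base_lr=5e-6, warmup_ratio=0.1):
    """Linear warmup for warmup_ratio of training, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

bs = effective_batch_size(2, 3, 4)  # 2 x 3 x 4 = 24, the total_train_batch_size above
peak = lr_at(100, 1000)             # end of warmup: the base learning rate, 5e-06
```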

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6873 | 0.0393 | 100 | 0.6862 | 0.0421 | 0.0275 | 0.6587 | 0.0146 | -301.8899 | -343.2013 | -1.9289 | -1.9903 |
| 0.6613 | 0.0785 | 200 | 0.6587 | -0.0475 | -0.1358 | 0.6707 | 0.0882 | -318.2146 | -352.1655 | -1.9013 | -1.9595 |
| 0.6358 | 0.1178 | 300 | 0.6395 | -0.2454 | -0.3991 | 0.6871 | 0.1537 | -344.5503 | -371.9539 | -1.8154 | -1.8744 |
| 0.6277 | 0.1570 | 400 | 0.6237 | -0.5205 | -0.7427 | 0.6976 | 0.2222 | -378.9102 | -399.4672 | -1.8111 | -1.8699 |
| 0.5933 | 0.1963 | 500 | 0.6018 | -0.6962 | -1.0371 | 0.6931 | 0.3410 | -408.3534 | -417.0301 | -1.7721 | -1.8287 |
| 0.5665 | 0.2355 | 600 | 0.5955 | -0.6340 | -1.0330 | 0.6931 | 0.3989 | -407.9362 | -410.8186 | -1.7701 | -1.8241 |
| 0.5322 | 0.2748 | 700 | 0.5795 | -0.7405 | -1.2137 | 0.7111 | 0.4732 | -426.0080 | -421.4653 | -1.7116 | -1.7650 |
| 0.616 | 0.3141 | 800 | 0.5720 | -0.7566 | -1.2468 | 0.7186 | 0.4902 | -429.3149 | -423.0749 | -1.6310 | -1.6828 |
| 0.6129 | 0.3533 | 900 | 0.5755 | -0.4970 | -0.9648 | 0.7290 | 0.4677 | -401.1144 | -397.1149 | -1.6471 | -1.6991 |
| 0.5308 | 0.3926 | 1000 | 0.5657 | -1.1354 | -1.7018 | 0.7186 | 0.5664 | -474.8171 | -460.9562 | -1.5510 | -1.6002 |
| 0.589 | 0.4318 | 1100 | 0.5631 | -1.1476 | -1.7335 | 0.7201 | 0.5859 | -477.9911 | -462.1784 | -1.5444 | -1.5931 |
| 0.5694 | 0.4711 | 1200 | 0.5629 | -1.0450 | -1.6220 | 0.7246 | 0.5770 | -466.8436 | -451.9160 | -1.5333 | -1.5828 |
| 0.5809 | 0.5104 | 1300 | 0.5587 | -0.9745 | -1.5915 | 0.7275 | 0.6170 | -463.7866 | -444.8671 | -1.4997 | -1.5489 |
| 0.5597 | 0.5496 | 1400 | 0.5535 | -1.1201 | -1.7240 | 0.7380 | 0.6039 | -477.0389 | -459.4294 | -1.4968 | -1.5439 |
| 0.5964 | 0.5889 | 1500 | 0.5565 | -0.8900 | -1.4799 | 0.7350 | 0.5899 | -452.6324 | -436.4146 | -1.4828 | -1.5311 |
| 0.5329 | 0.6281 | 1600 | 0.5533 | -1.0959 | -1.7399 | 0.7365 | 0.6440 | -478.6324 | -457.0049 | -1.4628 | -1.5115 |
| 0.5701 | 0.6674 | 1700 | 0.5520 | -1.1059 | -1.7733 | 0.7425 | 0.6673 | -481.9651 | -458.0073 | -1.4578 | -1.5061 |
| 0.5522 | 0.7066 | 1800 | 0.5523 | -1.0511 | -1.7159 | 0.7380 | 0.6648 | -476.2304 | -452.5267 | -1.4461 | -1.4951 |
| 0.5659 | 0.7459 | 1900 | 0.5553 | -0.9300 | -1.5725 | 0.7365 | 0.6425 | -461.8892 | -440.4130 | -1.4492 | -1.4980 |
| 0.5375 | 0.7852 | 2000 | 0.5503 | -1.1096 | -1.7660 | 0.7440 | 0.6564 | -481.2357 | -458.3737 | -1.4278 | -1.4768 |
| 0.5836 | 0.8244 | 2100 | 0.5494 | -1.1522 | -1.8216 | 0.7395 | 0.6694 | -486.8011 | -462.6367 | -1.4142 | -1.4632 |
| 0.5282 | 0.8637 | 2200 | 0.5488 | -1.1628 | -1.8230 | 0.7365 | 0.6602 | -486.9384 | -463.6924 | -1.4117 | -1.4607 |
| 0.5604 | 0.9029 | 2300 | 0.5487 | -1.1347 | -1.7969 | 0.7380 | 0.6621 | -484.3240 | -460.8886 | -1.4144 | -1.4635 |
| 0.5365 | 0.9422 | 2400 | 0.5488 | -1.1196 | -1.7811 | 0.7380 | 0.6615 | -482.7509 | -459.3745 | -1.4142 | -1.4633 |
| 0.5135 | 0.9815 | 2500 | 0.5488 | -1.1271 | -1.7888 | 0.7380 | 0.6617 | -483.5208 | -460.1232 | -1.4135 | -1.4626 |

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.2
  • Pytorch 2.1.2
  • Datasets 3.0.0
  • Tokenizers 0.19.1
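Since this is a PEFT adapter rather than a full model, it is loaded on top of the base model. A minimal sketch using the versions above (the base hub id Qwen/Qwen2-1.5B is taken from the model tree below; the card itself only records a local SFT checkpoint path, so this pairing is an assumption; running it downloads weights from the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model id assumed from the model tree; the SFT checkpoint in the
# card is a local path and is not available on the Hub.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")
model = PeftModel.from_pretrained(base, "Flowersea37/zephyr-7b-dpo-qlora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
```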

Model tree for Flowersea37/zephyr-7b-dpo-qlora

Base model: Qwen/Qwen2-1.5B (this model is a PEFT adapter on top of it)

Dataset used to train Flowersea37/zephyr-7b-dpo-qlora: HuggingFaceH4/ultrafeedback_binarized