zephyr-dpo-qlora-gpt4-5e-6-epoch3

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the generation/GPT4 dataset. It achieves the following results on the evaluation set:

Loss: 1.8389
Rewards/chosen: -16.6323
Rewards/rejected: -19.5704
Rewards/accuracies: 0.6905
Rewards/margins: 2.9381
Rewards/margins Max: 11.9980
Rewards/margins Min: -5.2949
Rewards/margins Std: 7.7821
Logps/rejected: -2216.2207
Logps/chosen: -1948.4508
Logits/rejected: -1.4331
Logits/chosen: -1.5188
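
The reward figures above can be cross-checked with a short sketch. It assumes the run used the standard sigmoid DPO loss, in which the per-pair loss is -log(sigmoid(margin)) and the reported rewards are already scaled by beta. The card states neither, so treat both as assumptions. The helper name sigmoid_dpo_loss is hypothetical.

```python
import math

# Evaluation-set values copied from the list above.
rewards_chosen = -16.6323
rewards_rejected = -19.5704
margin_mean = 2.9381
margin_min = -5.2949
margin_max = 11.9980


def sigmoid_dpo_loss(margin: float) -> float:
    """Per-pair loss -log(sigmoid(margin)), computed as a stable softplus(-margin).

    Assumption: the run used the sigmoid DPO loss; the card does not say.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The mean margin should equal mean(chosen) - mean(rejected).
recomputed_margin = rewards_chosen - rewards_rejected
print(f"recomputed margin: {recomputed_margin:.4f} (reported {margin_mean})")

# Loss implied by individual margins, to show how much negative margins cost.
for label, m in [("mean", margin_mean), ("min", margin_min), ("max", margin_max)]:
    print(f"loss at {label} margin {m:+.4f}: {sigmoid_dpo_loss(m):.4f}")
```

Under these assumptions, the loss at the mean margin is far below the reported evaluation loss of 1.8389. That gap would be consistent with the wide margin spread (Std 7.7821): pairs with strongly negative margins contribute a roughly linear penalty that outweighs the near-zero loss of the well-separated pairs.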

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 16
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
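
The batch-size and scheduler settings above can be illustrated with a minimal sketch. The gradient accumulation value is inferred from the listed totals. The total step count is not stated in this card, so the value below is an estimate taken from the step and epoch columns of the training results (step 1000 at epoch 2.82) and should be treated as hypothetical. The schedule follows the usual linear-warmup then cosine-decay formula and is not taken from the author's training script.

```python
import math

# Values from the hyperparameter list above.
learning_rate = 5e-06
train_batch_size = 2      # per device
eval_batch_size = 4       # per device
num_devices = 8
total_train_batch_size = 16
warmup_ratio = 0.1
num_epochs = 3

# Inferred: 2 per device * 8 devices already gives 16, so accumulation is 1.
grad_accum_steps = total_train_batch_size // (train_batch_size * num_devices)
total_eval_batch_size = eval_batch_size * num_devices

# Assumption: total steps estimated from "step 1000 at epoch 2.82".
estimated_total_steps = round(1000 / 2.82 * num_epochs)
warmup_steps = math.ceil(estimated_total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Learning rate at a step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, estimated_total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print("gradient accumulation steps:", grad_accum_steps)
print("total eval batch size:", total_eval_batch_size)
print("estimated total steps:", estimated_total_steps, "warmup steps:", warmup_steps)
for step in (0, warmup_steps, estimated_total_steps // 2, estimated_total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.3e}")
```

With these settings the learning rate peaks at 5e-06 once warmup ends and decays to zero by the final step.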

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Rewards/margins Max	Rewards/margins Min	Rewards/margins Std	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.4777	0.28	100	0.6755	-0.3600	-0.4244	0.6032	0.0644	0.3559	-0.2210	0.2529	-301.6254	-321.2222	-2.6312	-2.6710
0.1416	0.56	200	0.9053	-6.7040	-7.2161	0.6270	0.5121	2.7698	-1.7984	2.0239	-980.7882	-955.6170	-1.4055	-1.4608
0.0426	0.85	300	0.9213	-7.5636	-8.6200	0.6786	1.0563	4.2652	-2.1614	2.8565	-1121.1776	-1041.5824	-1.6508	-1.7101
0.0537	1.13	400	1.1419	-12.1996	-13.1820	0.6468	0.9824	5.4879	-3.0621	3.7889	-1577.3877	-1505.1829	-1.5926	-1.6576
0.0197	1.41	500	1.6844	-17.1495	-18.8730	0.6667	1.7235	9.4195	-5.1462	6.5774	-2146.4797	-2000.1663	-1.4330	-1.5026
0.0029	1.69	600	1.9743	-14.5461	-17.4661	0.6865	2.9200	12.4008	-5.7167	8.1643	-2005.7900	-1739.8331	-1.4547	-1.5331
0.018	1.97	700	1.8030	-16.5306	-19.1782	0.6786	2.6476	11.2308	-5.2715	7.4338	-2177.0017	-1938.2783	-1.4133	-1.4978
0.0014	2.25	800	1.8519	-16.7236	-19.4930	0.6746	2.7694	11.6630	-5.3047	7.6237	-2208.4844	-1957.5789	-1.4433	-1.5266
0.0034	2.54	900	1.6799	-16.1476	-18.7797	0.6865	2.6322	10.7631	-4.8758	7.0339	-2137.1570	-1899.9781	-1.4489	-1.5324
0.0118	2.82	1000	1.8351	-16.6710	-19.6029	0.6825	2.9319	11.9629	-5.2899	7.7662	-2219.4746	-1952.3245	-1.4296	-1.5156
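
The table above shows training loss falling to near zero while validation loss rises after the first checkpoint. A short sketch makes that pattern explicit by picking the checkpoint with the lowest validation loss and the one with the highest reward accuracy. Only four columns are copied from the table. The tuple layout and variable names are illustrative.

```python
# (step, training loss, validation loss, rewards/accuracies), copied from the table above.
rows = [
    (100, 0.4777, 0.6755, 0.6032),
    (200, 0.1416, 0.9053, 0.6270),
    (300, 0.0426, 0.9213, 0.6786),
    (400, 0.0537, 1.1419, 0.6468),
    (500, 0.0197, 1.6844, 0.6667),
    (600, 0.0029, 1.9743, 0.6865),
    (700, 0.018, 1.8030, 0.6786),
    (800, 0.0014, 1.8519, 0.6746),
    (900, 0.0034, 1.6799, 0.6865),
    (1000, 0.0118, 1.8351, 0.6825),
]

# Checkpoint with the lowest validation loss.
best_loss_row = min(rows, key=lambda r: r[2])
# Highest reward accuracy, and every step that reaches it.
best_accuracy = max(r[3] for r in rows)
best_accuracy_steps = [r[0] for r in rows if r[3] == best_accuracy]

# Gap between validation and training loss at each logged step.
gaps = [(step, val - train) for step, train, val, _ in rows]

print("lowest validation loss:", best_loss_row[2], "at step", best_loss_row[0])
print("highest accuracy:", best_accuracy, "at steps", best_accuracy_steps)
print("validation minus training loss, first and last:", gaps[0], gaps[-1])
```

By validation loss alone, the step-100 checkpoint is the best in the table, whereas reward accuracy peaks later, at steps 600 and 900. Which criterion matters more depends on the intended use, which this card does not specify.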

Framework versions

PEFT 0.7.1
Transformers 4.39.0.dev0
Pytorch 2.1.2+cu121
Datasets 2.14.6
Tokenizers 0.15.2

Model tree for just1nseo/zephyr-dpo-qlora-gpt4-5e-6-epoch3
