
zephyr-7b-gpo-log-i1

This model is a fine-tuned version of DUAL-GPO/zephyr-7b-gpo-log-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.7084
  • Rewards/chosen: -0.3387
  • Rewards/rejected: -0.3762
  • Rewards/accuracies: 0.4641
  • Rewards/margins: 0.0375
  • Logps/rejected: -284.1953
  • Logps/chosen: -296.7821
  • Logits/rejected: -1.6524
  • Logits/chosen: -1.8037
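
Given the PEFT version listed under "Framework versions" below, this repository presumably ships a parameter-efficient adapter rather than full model weights. A minimal loading sketch, assuming the adapter config resolves the base model and the repo includes tokenizer files (the prompt is illustrative):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "DUAL-GPO/zephyr-7b-gpo-log-i1"

# AutoPeftModelForCausalLM loads the base model named in the adapter config,
# then attaches this adapter on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,  # assumed precision; not stated in this card
    device_map="auto",
)
# Assumes the adapter repo includes tokenizer files; otherwise load the
# tokenizer from the base model instead.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

prompt = "Explain preference optimization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```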

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 3
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 12
  • total_eval_batch_size: 6
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
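
These values map one-to-one onto a `transformers.TrainingArguments` configuration. The sketch below is an assumed reconstruction, not the actual training script; the output path is illustrative, and the exact optimizer wiring may differ:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above. With 3 GPUs, a per-device batch
# size of 2 and 2 gradient-accumulation steps give the total train batch size
# of 3 * 2 * 2 = 12 reported in this card.
training_args = TrainingArguments(
    output_dir="zephyr-7b-gpo-log-i1",  # illustrative path
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    seed=42,
    optim="adamw_torch",        # Adam with betas=(0.9, 0.999), epsilon=1e-8
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```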

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6748 | 0.04 | 200 | 0.7007 | -0.3675 | -0.3814 | 0.4446 | 0.0139 | -284.7155 | -299.6654 | -1.8001 | -1.9625 |
| 0.6724 | 0.08 | 400 | 0.7027 | -0.3184 | -0.3527 | 0.4940 | 0.0344 | -281.8482 | -294.7475 | -1.7890 | -1.9524 |
| 0.6749 | 0.12 | 600 | 0.7100 | -0.3255 | -0.3594 | 0.4760 | 0.0339 | -282.5139 | -295.4615 | -1.6820 | -1.8358 |
| 0.6719 | 0.16 | 800 | 0.7050 | -0.3022 | -0.3372 | 0.4775 | 0.0350 | -280.2988 | -293.1357 | -1.7259 | -1.8834 |
| 0.6777 | 0.2 | 1000 | 0.7025 | -0.2948 | -0.3142 | 0.4461 | 0.0194 | -277.9926 | -292.3886 | -1.7123 | -1.8681 |
| 0.6724 | 0.24 | 1200 | 0.7089 | -0.4249 | -0.4720 | 0.4865 | 0.0471 | -293.7763 | -305.4027 | -1.7346 | -1.8939 |
| 0.6763 | 0.28 | 1400 | 0.7065 | -0.3751 | -0.4179 | 0.4746 | 0.0428 | -288.3666 | -300.4254 | -1.6995 | -1.8560 |
| 0.6729 | 0.32 | 1600 | 0.7084 | -0.3379 | -0.3600 | 0.4641 | 0.0221 | -282.5755 | -296.7008 | -1.7340 | -1.8920 |
| 0.6734 | 0.36 | 1800 | 0.7037 | -0.3077 | -0.3258 | 0.4521 | 0.0182 | -279.1587 | -293.6775 | -1.7089 | -1.8649 |
| 0.6754 | 0.4 | 2000 | 0.7073 | -0.4076 | -0.4418 | 0.4671 | 0.0342 | -290.7584 | -303.6719 | -1.7361 | -1.8949 |
| 0.679 | 0.44 | 2200 | 0.7075 | -0.4434 | -0.4787 | 0.4611 | 0.0353 | -294.4463 | -307.2497 | -1.6814 | -1.8362 |
| 0.6692 | 0.48 | 2400 | 0.7067 | -0.3067 | -0.3478 | 0.4716 | 0.0411 | -281.3559 | -293.5765 | -1.6761 | -1.8305 |
| 0.6778 | 0.52 | 2600 | 0.7036 | -0.2610 | -0.2905 | 0.4626 | 0.0294 | -275.6222 | -289.0128 | -1.7120 | -1.8687 |
| 0.6687 | 0.56 | 2800 | 0.7113 | -0.4071 | -0.4423 | 0.4626 | 0.0353 | -290.8080 | -303.6171 | -1.6930 | -1.8484 |
| 0.6741 | 0.6 | 3000 | 0.7067 | -0.3261 | -0.3614 | 0.4671 | 0.0354 | -282.7206 | -295.5167 | -1.6692 | -1.8222 |
| 0.674 | 0.64 | 3200 | 0.7085 | -0.3171 | -0.3556 | 0.4716 | 0.0384 | -282.1313 | -294.6258 | -1.6840 | -1.8385 |
| 0.6712 | 0.68 | 3400 | 0.7083 | -0.3545 | -0.3873 | 0.4626 | 0.0329 | -285.3080 | -298.3568 | -1.6600 | -1.8125 |
| 0.6738 | 0.72 | 3600 | 0.7078 | -0.4016 | -0.4475 | 0.4805 | 0.0458 | -291.3219 | -303.0744 | -1.6368 | -1.7870 |
| 0.6748 | 0.76 | 3800 | 0.7085 | -0.3558 | -0.4037 | 0.4746 | 0.0478 | -286.9418 | -298.4960 | -1.6370 | -1.7875 |
| 0.6746 | 0.8 | 4000 | 0.7097 | -0.3549 | -0.3943 | 0.4641 | 0.0394 | -286.0046 | -298.4026 | -1.6465 | -1.7977 |
| 0.6772 | 0.84 | 4200 | 0.7088 | -0.3280 | -0.3650 | 0.4611 | 0.0369 | -283.0742 | -295.7155 | -1.6640 | -1.8161 |
| 0.6718 | 0.88 | 4400 | 0.7082 | -0.3267 | -0.3617 | 0.4566 | 0.0349 | -282.7410 | -295.5824 | -1.6550 | -1.8062 |
| 0.6737 | 0.92 | 4600 | 0.7085 | -0.3416 | -0.3797 | 0.4656 | 0.0381 | -284.5475 | -297.0699 | -1.6499 | -1.8009 |
| 0.6742 | 0.96 | 4800 | 0.7085 | -0.3387 | -0.3765 | 0.4716 | 0.0378 | -284.2217 | -296.7780 | -1.6508 | -1.8018 |
| 0.6708 | 1.0 | 5000 | 0.7084 | -0.3387 | -0.3762 | 0.4641 | 0.0375 | -284.1953 | -296.7821 | -1.6524 | -1.8037 |
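
The reward columns follow the logging convention of TRL-style preference trainers (assumed here; the card itself does not define them): each "reward" is a scaled log-probability ratio between the policy and the reference model, margins are chosen minus rejected, and accuracy is the fraction of pairs where the chosen response out-scores the rejected one. A toy sketch with made-up numbers:

```python
import torch

beta = 0.1  # hypothetical scaling coefficient; not stated in this card

# Hypothetical per-sequence log-probabilities; real values come from the
# trainer's evaluation loop, not from here.
policy_chosen = torch.tensor([-296.78, -310.20])
ref_chosen = torch.tensor([-293.39, -308.90])
policy_rejected = torch.tensor([-284.20, -281.10])
ref_rejected = torch.tensor([-280.43, -279.00])

# Implicit rewards: beta * (policy logp - reference logp)
rewards_chosen = beta * (policy_chosen - ref_chosen)        # "Rewards/chosen"
rewards_rejected = beta * (policy_rejected - ref_rejected)  # "Rewards/rejected"

margins = rewards_chosen - rewards_rejected                 # "Rewards/margins"
accuracy = (rewards_chosen > rewards_rejected).float().mean()  # "Rewards/accuracies"
print(margins, accuracy.item())
```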

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2