mistral-dpo

This model is a fine-tuned version of TheBloke/OpenHermes-2-Mistral-7B-GPTQ. The training dataset was not recorded (the auto-generated card lists it as "None"). It achieves the following results on the evaluation set:

Loss: 0.8911
Rewards/chosen: 0.5387
Rewards/rejected: 0.4878
Rewards/accuracies: 0.5096
Rewards/margins: 0.0509
Logps/rejected: -174.3804
Logps/chosen: -178.5185
Logits/rejected: -2.5028
Logits/chosen: -2.5350
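
The metric names above match those typically logged by DPO-style preference trainers. As a minimal sketch, assuming the standard sigmoid DPO loss (the card does not state which loss variant was used), the reported margin can be checked against the chosen and rejected rewards, and the loss can be compared with the value a model earns when it cannot tell the two responses apart. The function name `dpo_pair_loss` is hypothetical and used only for illustration.

```python
import math

# Final evaluation metrics as reported above.
rewards_chosen = 0.5387
rewards_rejected = 0.4878
reported_margin = 0.0509

# The margin is the mean chosen reward minus the mean rejected reward.
margin = rewards_chosen - rewards_rejected
assert abs(margin - reported_margin) < 1e-4


def dpo_pair_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Sigmoid DPO loss for one pair: -log(sigmoid(chosen - rejected)).

    Assumption: this is the standard sigmoid variant; the card does not
    confirm the loss type.
    """
    diff = reward_chosen - reward_rejected
    # log1p(exp(-diff)) is algebraically equal to -log(sigmoid(diff)).
    return math.log1p(math.exp(-diff))


# A pair with zero margin costs log(2), roughly 0.6931.
baseline = dpo_pair_loss(0.0, 0.0)
print(f"margin={margin:.4f}, zero-margin loss={baseline:.4f}")
```

Under that assumption, the reported evaluation loss of 0.8911 sits above the zero-margin value of log 2, which is consistent with the near-chance reward accuracy of 0.5096.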

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2
training_steps: 250
mixed_precision_training: Native AMP
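
To make the schedule concrete, the sketch below computes the learning rate at a given step for a linear scheduler with 2 warmup steps and 250 total steps, using the values listed above. It assumes the common convention of a linear ramp from zero during warmup followed by a linear decay to zero at the final step; the function name `linear_lr` is hypothetical and not part of any library.

```python
def linear_lr(step: int,
              base_lr: float = 2e-4,
              warmup_steps: int = 2,
              total_steps: int = 250) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay.

    Assumption: ramp from 0 to base_lr over the warmup steps, then decay
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the learning rate at a few points in the run.
for step in (0, 1, 2, 126, 250):
    print(step, linear_lr(step))
```

With only 2 warmup steps, the rate reaches its peak of 0.0002 almost immediately and decays for the remaining 248 steps.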

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6703	0.0	10	0.6842	-0.0001	-0.0268	0.5865	0.0267	-179.5257	-183.9063	-2.4290	-2.4720
0.7119	0.0	20	0.6751	0.1584	0.0990	0.5769	0.0594	-178.2678	-182.3211	-2.4542	-2.4988
0.647	0.0	30	0.6702	0.3569	0.2540	0.5769	0.1029	-176.7180	-180.3367	-2.4886	-2.5306
0.6748	0.0	40	0.6712	0.3439	0.2229	0.5288	0.1210	-177.0292	-180.4664	-2.5206	-2.5581
0.6513	0.0	50	0.6707	0.4403	0.2838	0.5577	0.1565	-176.4200	-179.5021	-2.5608	-2.5853
0.6103	0.0	60	0.6695	0.6831	0.4769	0.5577	0.2063	-174.4892	-177.0740	-2.5719	-2.5933
1.0313	0.01	70	0.6724	0.7062	0.5084	0.5577	0.1978	-174.1739	-176.8436	-2.5543	-2.5843
0.6876	0.01	80	0.6804	0.6995	0.5144	0.5385	0.1850	-174.1135	-176.9104	-2.5443	-2.5829
0.9661	0.01	90	0.6828	0.7118	0.5376	0.5385	0.1742	-173.8821	-176.7873	-2.5479	-2.5846
0.7354	0.01	100	0.6757	0.6765	0.5039	0.5577	0.1726	-174.2186	-177.1401	-2.5399	-2.5758
1.0127	0.01	110	0.7129	0.6089	0.4855	0.5288	0.1234	-174.4033	-177.8165	-2.5464	-2.5760
1.0366	0.01	120	0.7440	0.6068	0.4946	0.5481	0.1122	-174.3115	-177.8369	-2.5516	-2.5804
1.2145	0.01	130	0.7564	0.6521	0.5396	0.5673	0.1125	-173.8620	-177.3846	-2.5608	-2.5878
0.8342	0.01	140	0.7649	0.6639	0.5519	0.5385	0.1119	-173.7388	-177.2668	-2.5547	-2.5828
0.7402	0.01	150	0.7991	0.5831	0.4883	0.5	0.0948	-174.3747	-178.0745	-2.5498	-2.5775
0.7162	0.01	160	0.8396	0.6134	0.5474	0.5096	0.0659	-173.7835	-177.7718	-2.5445	-2.5713
0.9396	0.01	170	0.8573	0.5700	0.5144	0.5288	0.0556	-174.1144	-178.2057	-2.5326	-2.5629
0.5958	0.01	180	0.8708	0.5526	0.5017	0.5288	0.0509	-174.2406	-178.3789	-2.5227	-2.5540
0.7588	0.02	190	0.8865	0.5428	0.4977	0.5288	0.0450	-174.2806	-178.4775	-2.5207	-2.5493
0.7811	0.02	200	0.8933	0.5797	0.5429	0.5192	0.0368	-173.8286	-178.1080	-2.5171	-2.5434
0.5735	0.02	210	0.8907	0.5577	0.5174	0.5288	0.0403	-174.0838	-178.3279	-2.5069	-2.5366
0.7709	0.02	220	0.8886	0.5602	0.5167	0.5192	0.0435	-174.0907	-178.3035	-2.5041	-2.5361
0.4914	0.02	230	0.8884	0.5237	0.4766	0.5192	0.0471	-174.4924	-178.6684	-2.5050	-2.5375
0.739	0.02	240	0.8910	0.5281	0.4796	0.5192	0.0485	-174.4621	-178.6240	-2.5027	-2.5351
0.5743	0.02	250	0.8911	0.5387	0.4878	0.5096	0.0509	-174.3804	-178.5185	-2.5028	-2.5350
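
One way to read the table is to look for the step with the lowest validation loss and the step with the largest reward margin. The sketch below does this with the (step, validation loss, reward margin) values copied from the table; the variable names are illustrative only.

```python
# (step, validation loss, rewards/margins) copied from the table above.
rows = [
    (10, 0.6842, 0.0267), (20, 0.6751, 0.0594), (30, 0.6702, 0.1029),
    (40, 0.6712, 0.1210), (50, 0.6707, 0.1565), (60, 0.6695, 0.2063),
    (70, 0.6724, 0.1978), (80, 0.6804, 0.1850), (90, 0.6828, 0.1742),
    (100, 0.6757, 0.1726), (110, 0.7129, 0.1234), (120, 0.7440, 0.1122),
    (130, 0.7564, 0.1125), (140, 0.7649, 0.1119), (150, 0.7991, 0.0948),
    (160, 0.8396, 0.0659), (170, 0.8573, 0.0556), (180, 0.8708, 0.0509),
    (190, 0.8865, 0.0450), (200, 0.8933, 0.0368), (210, 0.8907, 0.0403),
    (220, 0.8886, 0.0435), (230, 0.8884, 0.0471), (240, 0.8910, 0.0485),
    (250, 0.8911, 0.0509),
]

# Step with the lowest validation loss.
best_loss_row = min(rows, key=lambda r: r[1])
# Step with the largest reward margin.
best_margin_row = max(rows, key=lambda r: r[2])

print("lowest validation loss:", best_loss_row)
print("largest reward margin:", best_margin_row)
```

On these numbers both criteria point to step 60 (validation loss 0.6695, margin 0.2063), after which validation loss rises to 0.8911 by step 250. This suggests an earlier checkpoint may generalize better than the final one, although the card does not say which checkpoint was published.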

Framework versions

PEFT 0.7.1
Transformers 4.36.0
Pytorch 2.0.1+cu117
Datasets 2.15.0
Tokenizers 0.15.0
