
zephyr-7b-dpo-qlora

This model is a QLoRA adapter fine-tuned with Direct Preference Optimization (DPO) from alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset; a short usage sketch follows the evaluation results below. It achieves the following results on the evaluation set:

  • Loss: 0.5325
  • Rewards/chosen: -1.2325
  • Rewards/rejected: -2.0565
  • Rewards/accuracies: 0.7656
  • Rewards/margins: 0.8240
  • Logps/rejected: -457.4398
  • Logps/chosen: -373.4022
  • Logits/rejected: 0.7596
  • Logits/chosen: 0.5001
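
Since this repository ships a PEFT adapter rather than full model weights, it can be loaded with PEFT's auto classes. The snippet below is a minimal, illustrative sketch: it assumes the adapter config points at the SFT base model and that the repository includes a tokenizer with a chat template; the prompt and generation settings are arbitrary.

```python
# Minimal inference sketch for the adapter on this card.
# Assumptions: the repo ships a tokenizer/chat template; generation
# settings are illustrative, not the card's recommendations.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_id = "lewtun/zephyr-7b-dpo-qlora"

# AutoPeftModelForCausalLM reads the base-model ID from the adapter
# config, downloads the base weights, and applies the LoRA weights on top.
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```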

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed
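
The summary above names HuggingFaceH4/ultrafeedback_binarized as the training dataset. Below is a minimal sketch for inspecting it; the train_prefs split name and the chat-format chosen/rejected fields reflect how that dataset is published, but verify against the dataset card.

```python
# Sketch: inspect the preference data named in this card.
# Assumption: the dataset exposes "train_prefs"/"test_prefs" splits with
# chat-format "chosen"/"rejected" message lists.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
example = ds["train_prefs"][0]

# Each record pairs a prompt with a preferred ("chosen") and a
# dispreferred ("rejected") completion, which is what DPO consumes.
print(example["prompt"][:200])
print(example["chosen"][-1]["content"][:200])    # last message = assistant reply
print(example["rejected"][-1]["content"][:200])
```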

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 32
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
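
As a rough guide to how these settings fit together, here is a hedged training-configuration sketch using TRL's DPOTrainer. Only the hyperparameters listed above come from this card; beta, the LoRA shape (r, lora_alpha, lora_dropout), the sequence lengths, and the simplified preprocessing are illustrative assumptions, and the 8-GPU launch (e.g. via accelerate) is omitted. Note that 4 samples per device across 8 devices accounts for the total train batch size of 32 without gradient accumulation.

```python
# Configuration sketch, NOT the exact training script behind this card.
# Assumed values are marked; hyperparameters from the list above are
# carried over verbatim.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-qlora"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

peft_config = LoraConfig(  # assumed adapter shape; not stated in the card
    r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"
)

args = TrainingArguments(
    output_dir="zephyr-7b-dpo-qlora",
    per_device_train_batch_size=4,   # train_batch_size above
    per_device_eval_batch_size=8,    # eval_batch_size above
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    seed=42,
    bf16=True,                       # assumed precision
)

def to_dpo_format(example):
    # Flatten chat-format records into the prompt/chosen/rejected string
    # columns DPOTrainer expects (a simplified stand-in for full
    # chat-template preprocessing).
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"][-1]["content"],
        "rejected": example["rejected"][-1]["content"],
    }

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
train_ds = ds["train_prefs"].map(to_dpo_format, remove_columns=ds["train_prefs"].column_names)
eval_ds = ds["test_prefs"].map(to_dpo_format, remove_columns=ds["test_prefs"].column_names)

trainer = DPOTrainer(
    model,
    ref_model=None,        # with a PEFT adapter, the frozen base model serves as reference
    args=args,
    beta=0.1,              # assumed; not listed in this card
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1024,       # assumed sequence lengths
    max_prompt_length=512,
)
trainer.train()
```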

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0.6916 | 0.05 | 100 | 0.6912 | 0.0059 | 0.0019 | 0.6484 | 0.0041 | -251.6075 | -249.5596 | -2.2040 | -2.2621 |
| 0.655 | 0.1 | 200 | 0.6498 | -0.0559 | -0.1762 | 0.7070 | 0.1203 | -269.4106 | -255.7421 | -2.1011 | -2.1614 |
| 0.6342 | 0.16 | 300 | 0.6146 | -0.3407 | -0.6269 | 0.7031 | 0.2862 | -314.4839 | -284.2224 | -1.9037 | -1.9793 |
| 0.6121 | 0.21 | 400 | 0.5946 | -0.4657 | -0.8916 | 0.7031 | 0.4259 | -340.9551 | -296.7203 | -1.8717 | -1.9543 |
| 0.5973 | 0.26 | 500 | 0.5938 | -0.3681 | -0.7766 | 0.7305 | 0.4085 | -329.4522 | -286.9666 | -1.8440 | -1.9282 |
| 0.5473 | 0.31 | 600 | 0.5774 | -0.6893 | -1.2264 | 0.7344 | 0.5371 | -374.4341 | -319.0812 | -1.6815 | -1.7726 |
| 0.5792 | 0.37 | 700 | 0.5709 | -0.6635 | -1.2100 | 0.7578 | 0.5465 | -372.7989 | -316.5072 | -1.4783 | -1.5775 |
| 0.5194 | 0.42 | 800 | 0.5590 | -1.0208 | -1.6453 | 0.7461 | 0.6245 | -416.3269 | -352.2357 | -0.3791 | -0.5486 |
| 0.5367 | 0.47 | 900 | 0.5492 | -1.1477 | -1.8521 | 0.7266 | 0.7044 | -437.0040 | -364.9276 | -0.0908 | -0.2899 |
| 0.5575 | 0.52 | 1000 | 0.5450 | -1.1704 | -1.9048 | 0.7344 | 0.7344 | -442.2755 | -367.1964 | 0.2761 | 0.0498 |
| 0.5507 | 0.58 | 1100 | 0.5429 | -1.1040 | -1.8671 | 0.7422 | 0.7631 | -438.5026 | -360.5551 | 0.5339 | 0.2877 |
| 0.5305 | 0.63 | 1200 | 0.5366 | -1.1557 | -1.9243 | 0.7578 | 0.7686 | -444.2217 | -365.7241 | 0.7350 | 0.4755 |
| 0.5171 | 0.68 | 1300 | 0.5304 | -1.3741 | -2.1678 | 0.7656 | 0.7937 | -468.5735 | -387.5681 | 0.7686 | 0.5029 |
| 0.4875 | 0.73 | 1400 | 0.5321 | -1.3228 | -2.1513 | 0.7578 | 0.8285 | -466.9267 | -382.4329 | 0.8566 | 0.5926 |
| 0.5216 | 0.78 | 1500 | 0.5326 | -1.2006 | -2.0034 | 0.7617 | 0.8028 | -452.1298 | -370.2103 | 0.7189 | 0.4630 |
| 0.4894 | 0.84 | 1600 | 0.5327 | -1.2300 | -2.0556 | 0.7656 | 0.8256 | -457.3565 | -373.1585 | 0.7405 | 0.4828 |
| 0.5179 | 0.89 | 1700 | 0.5326 | -1.2313 | -2.0558 | 0.7656 | 0.8245 | -457.3720 | -373.2860 | 0.7604 | 0.5012 |
| 0.5534 | 0.94 | 1800 | 0.5325 | -1.2309 | -2.0558 | 0.7656 | 0.8249 | -457.3779 | -373.2437 | 0.7550 | 0.4957 |
| 0.5539 | 0.99 | 1900 | 0.5325 | -1.2325 | -2.0565 | 0.7656 | 0.8240 | -457.4398 | -373.4022 | 0.7596 | 0.5001 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.0
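
To approximate the training environment, the versions above can be pinned at install time. This is a sketch: it assumes standard PyPI package names and that the +cu121 suffix denotes a CUDA 12.1 PyTorch build.

```
pip install "peft==0.7.1" "transformers==4.36.2" "datasets==2.14.6" "tokenizers==0.15.0"
pip install "torch==2.1.2"  # use a CUDA 12.1 build to match 2.1.2+cu121
```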
