
zephyr-7b-dpo-qlora

This model is a fine-tuned version of /opt/data/private/xgq/alignment-handbook/data/Qwen-1.5b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5487
  • Rewards/chosen: -1.1270
  • Rewards/rejected: -1.7889
  • Rewards/accuracies: 0.7380
  • Rewards/margins: 0.6620
  • Logps/rejected: -483.5314
  • Logps/chosen: -460.1111
  • Logits/rejected: -1.4133
  • Logits/chosen: -1.4624
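The reward columns above come from the implicit DPO reward: β times the policy-vs-reference log-probability difference on a completion, with Rewards/margins simply chosen minus rejected. A minimal sketch of how these quantities relate (β = 0.1 is the common default in alignment-handbook DPO recipes; the log-probabilities below are illustrative numbers, not values from this run):

```python
import math

def dpo_rewards(logp_chosen, logp_rejected,
                ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit DPO rewards, margin, and pairwise loss for one example."""
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = r_chosen - r_rejected
    # DPO loss is -log(sigmoid(margin)); accuracy counts margin > 0.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return r_chosen, r_rejected, margin, loss

# Illustrative log-probs only (policy vs. frozen reference model)
r_c, r_r, m, loss = dpo_rewards(-460.0, -483.0, -450.0, -466.0)
# r_c = -1.0, r_r = -1.7, m = 0.7
```

A negative reward on both chosen and rejected completions (as in the table above) just means the policy's log-probabilities drifted below the reference's; what the loss optimizes is the margin between them.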

Model description

This is a QLoRA (PEFT) adapter trained with Direct Preference Optimization (DPO) on top of an SFT checkpoint of Qwen2-1.5B, following the alignment-handbook zephyr recipe. Further details needed.

Intended uses & limitations

More information needed

Training and evaluation data

The model was trained and evaluated on the HuggingFaceH4/ultrafeedback_binarized preference dataset (see above). Further details needed.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 3
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 24
  • total_eval_batch_size: 6
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
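The effective batch size follows from the per-device batch, device count, and gradient accumulation, and the scheduler warms up linearly over the first 10% of steps before cosine decay. A sketch of both calculations (the scheduler shape mirrors the standard warmup-then-cosine schedule; `total_steps=1000` below is an arbitrary example, not this run's step count):

```python
import math

def effective_batch_size(per_device, num_devices, grad_accum):
    return per_device * num_devices * grad_accum

def lr_at(step, total_steps, base_lr=5e-6, warmup_ratio=0.1):
    """Linear warmup for warmup_ratio of training, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

bs = effective_batch_size(2, 3, 4)  # 2 x 3 x 4 = 24, the total_train_batch_size above
peak = lr_at(100, 1000)             # end of warmup: the base learning rate, 5e-06
```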

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6873 | 0.0393 | 100 | 0.6862 | 0.0421 | 0.0275 | 0.6587 | 0.0146 | -301.8899 | -343.2013 | -1.9289 | -1.9903 |
| 0.6613 | 0.0785 | 200 | 0.6587 | -0.0475 | -0.1358 | 0.6707 | 0.0882 | -318.2146 | -352.1655 | -1.9013 | -1.9595 |
| 0.6358 | 0.1178 | 300 | 0.6395 | -0.2454 | -0.3991 | 0.6871 | 0.1537 | -344.5503 | -371.9539 | -1.8154 | -1.8744 |
| 0.6277 | 0.1570 | 400 | 0.6237 | -0.5205 | -0.7427 | 0.6976 | 0.2222 | -378.9102 | -399.4672 | -1.8111 | -1.8699 |
| 0.5933 | 0.1963 | 500 | 0.6018 | -0.6962 | -1.0371 | 0.6931 | 0.3410 | -408.3534 | -417.0301 | -1.7721 | -1.8287 |
| 0.5665 | 0.2355 | 600 | 0.5955 | -0.6340 | -1.0330 | 0.6931 | 0.3989 | -407.9362 | -410.8186 | -1.7701 | -1.8241 |
| 0.5322 | 0.2748 | 700 | 0.5795 | -0.7405 | -1.2137 | 0.7111 | 0.4732 | -426.0080 | -421.4653 | -1.7116 | -1.7650 |
| 0.616 | 0.3141 | 800 | 0.5720 | -0.7566 | -1.2468 | 0.7186 | 0.4902 | -429.3149 | -423.0749 | -1.6310 | -1.6828 |
| 0.6129 | 0.3533 | 900 | 0.5755 | -0.4970 | -0.9648 | 0.7290 | 0.4677 | -401.1144 | -397.1149 | -1.6471 | -1.6991 |
| 0.5308 | 0.3926 | 1000 | 0.5657 | -1.1354 | -1.7018 | 0.7186 | 0.5664 | -474.8171 | -460.9562 | -1.5510 | -1.6002 |
| 0.589 | 0.4318 | 1100 | 0.5631 | -1.1476 | -1.7335 | 0.7201 | 0.5859 | -477.9911 | -462.1784 | -1.5444 | -1.5931 |
| 0.5694 | 0.4711 | 1200 | 0.5629 | -1.0450 | -1.6220 | 0.7246 | 0.5770 | -466.8436 | -451.9160 | -1.5333 | -1.5828 |
| 0.5809 | 0.5104 | 1300 | 0.5587 | -0.9745 | -1.5915 | 0.7275 | 0.6170 | -463.7866 | -444.8671 | -1.4997 | -1.5489 |
| 0.5597 | 0.5496 | 1400 | 0.5535 | -1.1201 | -1.7240 | 0.7380 | 0.6039 | -477.0389 | -459.4294 | -1.4968 | -1.5439 |
| 0.5964 | 0.5889 | 1500 | 0.5565 | -0.8900 | -1.4799 | 0.7350 | 0.5899 | -452.6324 | -436.4146 | -1.4828 | -1.5311 |
| 0.5329 | 0.6281 | 1600 | 0.5533 | -1.0959 | -1.7399 | 0.7365 | 0.6440 | -478.6324 | -457.0049 | -1.4628 | -1.5115 |
| 0.5701 | 0.6674 | 1700 | 0.5520 | -1.1059 | -1.7733 | 0.7425 | 0.6673 | -481.9651 | -458.0073 | -1.4578 | -1.5061 |
| 0.5522 | 0.7066 | 1800 | 0.5523 | -1.0511 | -1.7159 | 0.7380 | 0.6648 | -476.2304 | -452.5267 | -1.4461 | -1.4951 |
| 0.5659 | 0.7459 | 1900 | 0.5553 | -0.9300 | -1.5725 | 0.7365 | 0.6425 | -461.8892 | -440.4130 | -1.4492 | -1.4980 |
| 0.5375 | 0.7852 | 2000 | 0.5503 | -1.1096 | -1.7660 | 0.7440 | 0.6564 | -481.2357 | -458.3737 | -1.4278 | -1.4768 |
| 0.5836 | 0.8244 | 2100 | 0.5494 | -1.1522 | -1.8216 | 0.7395 | 0.6694 | -486.8011 | -462.6367 | -1.4142 | -1.4632 |
| 0.5282 | 0.8637 | 2200 | 0.5488 | -1.1628 | -1.8230 | 0.7365 | 0.6602 | -486.9384 | -463.6924 | -1.4117 | -1.4607 |
| 0.5604 | 0.9029 | 2300 | 0.5487 | -1.1347 | -1.7969 | 0.7380 | 0.6621 | -484.3240 | -460.8886 | -1.4144 | -1.4635 |
| 0.5365 | 0.9422 | 2400 | 0.5488 | -1.1196 | -1.7811 | 0.7380 | 0.6615 | -482.7509 | -459.3745 | -1.4142 | -1.4633 |
| 0.5135 | 0.9815 | 2500 | 0.5488 | -1.1271 | -1.7888 | 0.7380 | 0.6617 | -483.5208 | -460.1232 | -1.4135 | -1.4626 |

Framework versions

  • PEFT 0.12.0
  • Transformers 4.44.2
  • Pytorch 2.1.2
  • Datasets 3.0.0
  • Tokenizers 0.19.1
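Since this is a PEFT adapter rather than a full model, it is loaded on top of the base model. A minimal sketch using the versions above (the base hub id Qwen/Qwen2-1.5B is taken from the model tree below; the card itself only records a local SFT checkpoint path, so this pairing is an assumption; running it downloads weights from the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base model id assumed from the model tree; the SFT checkpoint in the
# card is a local path and is not available on the Hub.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")
model = PeftModel.from_pretrained(base, "Flowersea37/zephyr-7b-dpo-qlora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B")
```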

Model tree for Flowersea37/zephyr-7b-dpo-qlora

Base model: Qwen/Qwen2-1.5B (this model is a PEFT adapter on top of it)

Dataset used to train Flowersea37/zephyr-7b-dpo-qlora: HuggingFaceH4/ultrafeedback_binarized