pythia410m-dpo-tldr
This model is a fine-tuned version of mnoukhov/pythia410m-sft-tldr on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 0.5395
- Rewards/chosen: -1.3883
- Rewards/rejected: -1.9858
- Rewards/accuracies: 0.7226
- Rewards/margins: 0.5975
- Logps/rejected: -98.0320
- Logps/chosen: -98.0320
- Logps/ref Rejected: -63.5119
- Logps/ref Chosen: -70.2656
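The reported metrics can be cross-checked with a little arithmetic. The sketch below recomputes the reward margin and, assuming the rewards follow the usual DPO form `reward = beta * (policy logp - reference logp)`, backs out the `beta` each side implies. The card does not state `beta`, so the variable names and the formula are assumptions, not part of the original training log.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -1.3883
rewards_rejected = -1.9858
logps_chosen = -98.0320
logps_rejected = -98.0320
logps_ref_chosen = -70.2656
logps_ref_rejected = -63.5119

# The margin is the gap between the chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected

# Assumption: reward = beta * (policy logp - reference logp), the usual DPO form.
# Dividing the reward by the log-probability gap gives the beta each side implies.
implied_beta_chosen = rewards_chosen / (logps_chosen - logps_ref_chosen)
implied_beta_rejected = rewards_rejected / (logps_rejected - logps_ref_rejected)

print(f"margin: {margin:.4f}")
print(f"implied beta (chosen): {implied_beta_chosen:.4f}")
print(f"implied beta (rejected): {implied_beta_rejected:.4f}")
```

Under that assumption the chosen side implies a `beta` of about 0.05, while the rejected side gives a different value. One possible reading is that Logps/rejected was logged with the same value as Logps/chosen, since the two are identical in every reported row. Treat this as a hypothesis, not a confirmed property of the run.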
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- num_epochs: 1.0
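As a rough illustration of how these settings fit together, the sketch below checks the effective batch size and evaluates a plain cosine learning-rate decay. It assumes that the total batch size is the per-device batch size times the accumulation steps times the number of processes, and that no warmup is used (none is listed). `cosine_lr` and `total_steps=1000` are hypothetical names and values chosen for illustration.

```python
import math

# Values copied from the hyperparameter list.
learning_rate = 3e-05
train_batch_size = 16              # per-device batch size
gradient_accumulation_steps = 4
total_train_batch_size = 64

# Assumption: total = per-device batch * accumulation steps * number of processes.
implied_processes = total_train_batch_size // (
    train_batch_size * gradient_accumulation_steps
)


def cosine_lr(step: int, total_steps: int, peak_lr: float = learning_rate) -> float:
    """Cosine decay from peak_lr to 0 over total_steps, assuming no warmup."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print("implied processes:", implied_processes)

# total_steps=1000 is a hypothetical value used only for illustration.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The listed numbers are consistent with a single process, although the card gives distributed_type as multi-GPU and does not state the device count.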
Training results
Training Loss | Epoch | Step | Logps/chosen | Logps/ref Chosen | Logps/ref Rejected | Logps/rejected | Validation Loss | Rewards/accuracies | Rewards/chosen | Rewards/margins | Rewards/rejected |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5961 | 0.2 | 291 | -93.0907 | -70.2656 | -63.5119 | -93.0907 | 0.5659 | 0.7036 | -1.1413 | 0.4667 | -1.6079 |
0.5574 | 0.4 | 582 | -102.6558 | -70.2656 | -63.5119 | -102.6558 | 0.5405 | 0.7216 | -1.6195 | 0.6178 | -2.2373 |
0.5418 | 0.6 | 873 | -100.0813 | -70.2656 | -63.5119 | -100.0813 | 0.5373 | 0.7226 | -1.4908 | 0.6283 | -2.1191 |
0.5339 | 0.8 | 1164 | -98.0320 | -70.2656 | -63.5119 | -98.0320 | 0.5395 | 0.7226 | -1.3883 | 0.5975 | -1.9858 |
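The Epoch and Step columns also give a rough estimate of the training set size, which the card does not report. The sketch below derives optimizer steps per epoch from each evaluation row and multiplies by the total batch size. The result is approximate: the Epoch column is rounded and the last batch of an epoch may be partial.

```python
# (epoch, step) pairs copied from the results table.
checkpoints = [(0.2, 291), (0.4, 582), (0.6, 873), (0.8, 1164)]
total_train_batch_size = 64  # from the hyperparameter list

# Each row gives an estimate of optimizer steps per epoch.
# Round because the epoch column is itself rounded.
steps_per_epoch = {round(step / epoch) for epoch, step in checkpoints}
print("steps per epoch:", steps_per_epoch)

# Approximate number of training examples seen in one epoch.
approx_examples = max(steps_per_epoch) * total_train_batch_size
print("approx. examples per epoch:", approx_examples)
```

All four rows agree on the same steps-per-epoch estimate, so the evaluation interval appears to be fixed at 291 steps.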
Framework versions
- PEFT 0.10.0
- Transformers 4.38.2
- Pytorch 2.1.2+cu121
- Datasets 2.17.0
- Tokenizers 0.15.2
Model tree for mnoukhov/pythia410m-dpo-tldr
- Base model: EleutherAI/pythia-410m-deduped
- Finetuned: mnoukhov/pythia410m-sft-tldr