---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_gpt2_simple_objectives
    results: []
---

# distily_bench_gpt2_simple_objectives

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
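
Since the student keeps the gpt2 architecture, it loads with the standard Transformers API. A minimal usage sketch, assuming the checkpoint is published under the repository id `lapp0/distily_bench_gpt2_simple_objectives` (inferred from the model name, not confirmed by this card):

```python
# Hypothetical repo id, inferred from the model name; adjust as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_gpt2_simple_objectives"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "Knowledge distillation works by"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```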

It achieves the following results on the evaluation set:

- eval_enwikippl: 495.9718
- eval_frwikippl: 3345.0957
- eval_zhwikippl: 2696.0598
- eval_loss: 40.5622
- eval_runtime: 34.3051
- eval_samples_per_second: 58.3
- eval_steps_per_second: 7.288
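
The enwikippl, frwikippl, and zhwikippl metrics are perplexities measured on English, French, and Chinese Wikipedia text. As a rough illustration of how such a figure is computed (the exact evaluation loop is Distily's and not reproduced here), perplexity is the exponential of the mean next-token cross-entropy:

```python
# Illustrative perplexity computation; not Distily's exact eval loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # e.g. the teacher
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("Paris is the capital of France."))
```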

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), activations_weight=0.2, activations_loss_fn=(fn:mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:mse_loss())) (see the sketch after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
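
The distillation_objective above combines a KL-divergence term on the logits with an MSE term on hidden-state activations; the attention term has weight 0 and drops out. A minimal sketch of that weighted combination in plain PyTorch, not Distily's actual `MultiObjective` implementation; reductions and tensor handling are assumptions:

```python
# Sketch of the weighted multi-objective loss described above.
# Not Distily's implementation; reductions and shapes are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, activations_weight=0.2):
    # fn:kl_divergence_loss -- KL between teacher and student
    # next-token distributions over the vocabulary.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(s_logp, t_prob, reduction="batchmean")

    # fn:mse_loss -- MSE over per-layer hidden states, averaged across
    # layers (requires output_hidden_states=True on both forward passes).
    activations_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()

    # attentions_weight is 0 in this run, so no attention term is added.
    return logits_weight * logits_loss + activations_weight * activations_loss
```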

## Resource Usage

Peak GPU Memory: 8.0893 GB
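
A figure like this is commonly read from PyTorch's allocator statistics; this is an assumption, as the card does not say how Distily measures it:

```python
# Common way to obtain a peak-memory figure (assumption: Distily's
# actual measurement method is not documented in this card).
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training loop here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9  # decimal GB
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```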

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 54069.2930 | 57285.3438 | 133.3560 | 34.3024 | 58.305 | 7.288 | 54227.1016 |
| 1000 | 0.0404 | 1437.1931 | 8480.3613 | 43.3980 | 34.5438 | 57.898 | 7.237 | 74395.6406 |
| 2000 | 0.0808 | 997.8790 | 5260.8130 | 42.3937 | 34.3703 | 58.19 | 7.274 | 33120.7461 |
| 3000 | 0.1212 | 830.5260 | 5152.1616 | 41.7965 | 34.348 | 58.228 | 7.278 | 11334.5342 |
| 4000 | 0.1616 | 745.0864 | 4422.4756 | 41.2595 | 34.4519 | 58.052 | 7.256 | 5651.1323 |
| 5000 | 0.2020 | 644.0798 | 4158.1821 | 41.1632 | 34.4631 | 58.033 | 7.254 | 4903.9395 |
| 6000 | 0.2424 | 592.7726 | 3791.3215 | 40.8778 | 34.3097 | 58.293 | 7.287 | 4353.2559 |
| 7000 | 0.2828 | 545.3409 | 3490.1353 | 40.8020 | 34.4207 | 58.105 | 7.263 | 3123.4839 |
| 8000 | 0.3232 | 519.2236 | 3238.8032 | 40.6310 | 34.2625 | 58.373 | 7.297 | 1952.6049 |
| 9000 | 0.3636 | 495.9718 | 3345.0957 | 40.5622 | 34.3051 | 58.3 | 7.288 | 2696.0598 |
| 10000 | 0.4040 | 482.7110 | 3048.2520 | 40.4688 | 34.3828 | 58.169 | 7.271 | 2027.5375 |
| 11000 | 0.4444 | 453.9180 | 2860.8340 | 40.3758 | 34.2822 | 58.339 | 7.292 | 2861.5081 |
| 12000 | 0.4848 | 441.7129 | 2985.2966 | 40.2887 | 34.2175 | 58.45 | 7.306 | 2510.5007 |
| 13000 | 0.5253 | 429.0357 | 2882.7014 | 40.1765 | 34.4175 | 58.11 | 7.264 | 6012.3589 |
| 14000 | 0.5657 | 416.6578 | 2756.2913 | 40.1022 | 34.4762 | 58.011 | 7.251 | 12478.4199 |
| 15000 | 0.6061 | 406.1163 | 2797.8003 | 40.0135 | 34.5042 | 57.964 | 7.246 | 6068.8252 |
| 16000 | 0.6465 | 405.6435 | 2525.5491 | 39.9328 | 34.3124 | 58.288 | 7.286 | 4309.2979 |
| 17000 | 0.6869 | 394.7977 | 2709.6606 | 39.9735 | 34.3165 | 58.281 | 7.285 | 2797.2800 |
| 18000 | 0.7273 | 397.4739 | 2544.8535 | 39.7368 | 34.4016 | 58.137 | 7.267 | 9888.5605 |
| 19000 | 0.7677 | 387.6284 | 2540.5505 | 39.7513 | 34.3493 | 58.225 | 7.278 | 5071.7769 |
| 20000 | 0.8081 | 378.9675 | 2503.9182 | 39.6105 | 34.4198 | 58.106 | 7.263 | 3492.3926 |
| 21000 | 0.8485 | 376.9130 | 2442.8845 | 39.5590 | 34.343 | 58.236 | 7.28 | 10077.8555 |
| 22000 | 0.8889 | 374.1136 | 2348.3101 | 39.5182 | 34.2953 | 58.317 | 7.29 | 3595.5537 |
| 23000 | 0.9293 | 368.7203 | 2389.3955 | 39.4282 | 34.6197 | 57.771 | 7.221 | 11663.1113 |
| 24000 | 0.9697 | 365.7831 | 2363.9253 | 39.4065 | 34.6468 | 57.725 | 7.216 | 5269.2183 |
| 24750 | 1.0 | 363.6872 | 2441.5068 | 39.3040 | 34.7181 | 57.607 | 7.201 | 2566.7729 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0