---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_activation_loss_b
  results: []
---

# distily_bench_gpt2_activation_loss_b

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 225.9773
- eval_frwikippl: 1391.1320
- eval_zhwikippl: 821.2236
- eval_loss: 19.6630
- eval_runtime: 17.2806
- eval_samples_per_second: 57.868
- eval_steps_per_second: 7.234

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

A minimal sketch of the combined distillation objective is shown at the end of this card.

### Resource Usage
Peak GPU Memory: 8.0903 GB

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 24.5150 | 17.2943 | 57.823 | 7.228 | 56988.9141 |
| 1000 | 0.0808 | 713.7677 | 4453.7666 | 20.3910 | 17.3531 | 57.627 | 7.203 | 17866.8926 |
| 2000 | 0.1616 | 521.2028 | 3308.0386 | 20.2010 | 17.3798 | 57.538 | 7.192 | 2471.2515 |
| 3000 | 0.2424 | 433.2541 | 2722.2993 | 20.1000 | 17.3672 | 57.58 | 7.197 | 1283.4985 |
| 4000 | 0.3232 | 387.5081 | 2569.3728 | 20.0170 | 17.3651 | 57.587 | 7.198 | 1167.0867 |
| 5000 | 0.4040 | 332.2302 | 2197.1006 | 19.9310 | 17.283 | 57.86 | 7.233 | 1141.8051 |
| 6000 | 0.4848 | 292.5944 | 1835.8154 | 19.8590 | 17.2939 | 57.824 | 7.228 | 905.3102 |
| 7000 | 0.5657 | 266.3748 | 1648.5508 | 19.7820 | 17.3184 | 57.742 | 7.218 | 844.8045 |
| 8000 | 0.6465 | 244.8321 | 1513.9550 | 19.7310 | 17.3028 | 57.794 | 7.224 | 1150.9904 |
| 9000 | 0.7273 | 225.9773 | 1391.1320 | 19.6630 | 17.2806 | 57.868 | 7.234 | 821.2236 |
| 10000 | 0.8081 | 209.6788 | 1266.0754 | 19.6040 | 17.3446 | 57.655 | 7.207 | 718.9499 |
| 11000 | 0.8889 | 196.7588 | 1248.5234 | 19.5620 | 17.3611 | 57.6 | 7.2 | 611.5998 |
| 12000 | 0.9697 | 179.4194 | 1137.2484 | 19.5120 | 17.3767 | 57.548 | 7.194 | 572.3267 |
| 12375 | 1.0 | 175.7241 | 1080.9574 | 19.4920 | 17.3076 | 57.778 | 7.222 | 584.9987 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
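
### Distillation objective sketch

The `distillation_objective` listed under Training hyperparameters combines a KL-divergence loss on the logits (weight 1) with a cross-entropy-style loss on the hidden states (weight 2.0); the attention component is disabled (weight 0). The snippet below is a minimal sketch of such a weighted combination in plain PyTorch. It is **not** Distily's implementation: the function name, reductions, and the soft cross-entropy reading of `loss_fn=ce` on hidden states are assumptions made for illustration.

```python
# Illustrative sketch only -- not Distily's code. Weights mirror the
# hyperparameters above: logits KL (weight 1) + hidden-state CE (weight 2.0);
# the attention component (weight 0) is omitted.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      student_hidden: torch.Tensor,
                      teacher_hidden: torch.Tensor,
                      logits_weight: float = 1.0,
                      hs_weight: float = 2.0) -> torch.Tensor:
    # Logits component (loss_fn=kl): KL divergence between the student's and
    # the teacher's next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

    # Hidden-state component (loss_fn=ce): soft cross-entropy between the
    # softmax-normalized teacher and student activations (one plausible
    # reading of applying "ce" to hidden states; an assumption).
    ce = -(F.softmax(teacher_hidden, dim=-1)
           * F.log_softmax(student_hidden, dim=-1)).sum(dim=-1).mean()

    return logits_weight * kl + hs_weight * ce
```

Since `layer_mapper=None` and `projector=None` in this run, the sketch compares student and teacher hidden states directly, without layer remapping or a learned projection.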