---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_simple_objectives
  results: []
---

# distily_bench_gpt2_simple_objectives

This student model was distilled from the teacher model [gpt2](https://huggingface.co/gpt2) on an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the eval-phase table below):
- eval_enwikippl: 495.9718
- eval_frwikippl: 3345.0957
- eval_zhwikippl: 2696.0598
- eval_loss: 40.5622
- eval_runtime: 34.3051
- eval_samples_per_second: 58.3
- eval_steps_per_second: 7.288

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), activations_weight=0.2, activations_loss_fn=(fn:mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:mse_loss()))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

### Resource Usage

Peak GPU Memory: 8.0893 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 54069.2930 | 57285.3438 | 133.3560 | 34.3024 | 58.305 | 7.288 | 54227.1016 |
| 1000 | 0.0404 | 1437.1931 | 8480.3613 | 43.3980 | 34.5438 | 57.898 | 7.237 | 74395.6406 |
| 2000 | 0.0808 | 997.8790 | 5260.8130 | 42.3937 | 34.3703 | 58.19 | 7.274 | 33120.7461 |
| 3000 | 0.1212 | 830.5260 | 5152.1616 | 41.7965 | 34.348 | 58.228 | 7.278 | 11334.5342 |
| 4000 | 0.1616 | 745.0864 | 4422.4756 | 41.2595 | 34.4519 | 58.052 | 7.256 | 5651.1323 |
| 5000 | 0.2020 | 644.0798 | 4158.1821 | 41.1632 | 34.4631 | 58.033 | 7.254 | 4903.9395 |
| 6000 | 0.2424 | 592.7726 | 3791.3215 | 40.8778 | 34.3097 | 58.293 | 7.287 | 4353.2559 |
| 7000 | 0.2828 | 545.3409 | 3490.1353 | 40.8020 | 34.4207 | 58.105 | 7.263 | 3123.4839 |
| 8000 | 0.3232 | 519.2236 | 3238.8032 | 40.6310 | 34.2625 | 58.373 | 7.297 | 1952.6049 |
| 9000 | 0.3636 | 495.9718 | 3345.0957 | 40.5622 | 34.3051 | 58.3 | 7.288 | 2696.0598 |
| 10000 | 0.4040 | 482.7110 | 3048.2520 | 40.4688 | 34.3828 | 58.169 | 7.271 | 2027.5375 |
| 11000 | 0.4444 | 453.9180 | 2860.8340 | 40.3758 | 34.2822 | 58.339 | 7.292 | 2861.5081 |
| 12000 | 0.4848 | 441.7129 | 2985.2966 | 40.2887 | 34.2175 | 58.45 | 7.306 | 2510.5007 |
| 13000 | 0.5253 | 429.0357 | 2882.7014 | 40.1765 | 34.4175 | 58.11 | 7.264 | 6012.3589 |
| 14000 | 0.5657 | 416.6578 | 2756.2913 | 40.1022 | 34.4762 | 58.011 | 7.251 | 12478.4199 |
| 15000 | 0.6061 | 406.1163 | 2797.8003 | 40.0135 | 34.5042 | 57.964 | 7.246 | 6068.8252 |
| 16000 | 0.6465 | 405.6435 | 2525.5491 | 39.9328 | 34.3124 | 58.288 | 7.286 | 4309.2979 |
| 17000 | 0.6869 | 394.7977 | 2709.6606 | 39.9735 | 34.3165 | 58.281 | 7.285 | 2797.2800 |
| 18000 | 0.7273 | 397.4739 | 2544.8535 | 39.7368 | 34.4016 | 58.137 | 7.267 | 9888.5605 |
| 19000 | 0.7677 | 387.6284 | 2540.5505 | 39.7513 | 34.3493 | 58.225 | 7.278 | 5071.7769 |
| 20000 | 0.8081 | 378.9675 | 2503.9182 | 39.6105 | 34.4198 | 58.106 | 7.263 | 3492.3926 |
| 21000 | 0.8485 | 376.9130 | 2442.8845 | 39.5590 | 34.343 | 58.236 | 7.28 | 10077.8555 |
| 22000 | 0.8889 | 374.1136 | 2348.3101 | 39.5182 | 34.2953 | 58.317 | 7.29 | 3595.5537 |
| 23000 | 0.9293 | 368.7203 | 2389.3955 | 39.4282 | 34.6197 | 57.771 | 7.221 | 11663.1113 |
| 24000 | 0.9697 | 365.7831 | 2363.9253 | 39.4065 | 34.6468 | 57.725 | 7.216 | 5269.2183 |
| 24750 | 1.0 | 363.6872 | 2441.5068 | 39.3040 | 34.7181 | 57.607 | 7.201 | 2566.7729 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0
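
### Illustrative sketch of the distillation objective

The `distillation_objective` listed under the training hyperparameters combines a KL-divergence term on logits (weight 1) with an MSE term on activations (weight 0.2); the attention term has weight 0 and so contributes nothing. The pure-Python sketch below shows how such a weighted sum could be computed for a single token position. It is not Distily's implementation: the function names, the KL direction (teacher relative to student), and the per-position reduction are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one vocabulary distribution.

    Assumption: the teacher is the reference distribution; the card
    does not state which direction Distily uses.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Mean squared error between two equal-length vectors."""
    n = len(teacher_values)
    return sum((t - s) ** 2 for t, s in zip(teacher_values, student_values)) / n


def multi_objective_loss(
    teacher_logits,
    student_logits,
    teacher_acts,
    student_acts,
    logits_weight=1.0,       # from the card: logits_weight=1
    activations_weight=0.2,  # from the card: activations_weight=0.2
):
    """Hypothetical weighted sum of the two active loss terms.

    The attention term is omitted because attentions_weight=0 on the card.
    """
    logits_term = kl_divergence(teacher_logits, student_logits)
    acts_term = mse(teacher_acts, student_acts)
    return logits_weight * logits_term + activations_weight * acts_term


if __name__ == "__main__":
    # Toy values, not taken from the training run.
    loss = multi_objective_loss(
        teacher_logits=[2.0, 1.0, 0.1],
        student_logits=[1.5, 1.2, 0.3],
        teacher_acts=[1.0, 2.0],
        student_acts=[0.5, 1.5],
    )
    print(f"combined loss: {loss:.4f}")
```

With identical teacher and student inputs both terms vanish, so the combined loss is zero; any mismatch in logits or activations makes it positive.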