metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.13_gpt2
    results: []

distily_bench_obj_cross_v2.13_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set. These values match the step 750 row of the Eval-Phase Metrics table, not the final step 7425 row:

eval_enwikippl: 1272.0
eval_frwikippl: 5952.0
eval_zhwikippl: 44800.0
eval_tinystoriesppl: 816.0
eval_loss: 2.4588
eval_runtime: 12.986
eval_samples_per_second: 46.204
eval_steps_per_second: 11.551

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
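
The distillation_objective listed above combines a KL term on the logits (weight 1), an MSE term on the last hidden state (weight 1.0), and an attention term with weight 0. The following is a minimal, dependency-free sketch of how such a weighted sum could be computed for a single token position. It is not Distily's implementation: the function names (log_softmax, kl_divergence, mse, distillation_loss) are hypothetical, and both the KL direction (teacher as the reference distribution) and the mean reduction in the MSE are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position.

    Assumption: the teacher is the reference distribution.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


def mse(teacher_vec, student_vec):
    """Mean squared error between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(teacher_vec, student_vec)) / len(teacher_vec)


def distillation_loss(teacher_logits, student_logits,
                      teacher_last_hs, student_last_hs,
                      logits_weight=1.0, hs_weight=1.0):
    """Weighted sum of the logits (KL) and last-hidden-state (MSE) terms.

    The attention component has weight 0 in this run, so it is omitted.
    """
    logits_term = kl_divergence(teacher_logits, student_logits)
    hs_term = mse(teacher_last_hs, student_last_hs)
    return logits_weight * logits_term + hs_weight * hs_term


# Toy values, chosen only for illustration.
loss = distillation_loss(
    teacher_logits=[2.0, 0.5, -1.0],
    student_logits=[1.5, 0.7, -0.5],
    teacher_last_hs=[0.1, -0.2, 0.3],
    student_last_hs=[0.0, -0.1, 0.4],
)
print(f"toy distillation loss: {loss:.4f}")
```

Under these assumptions the loss is zero when the student reproduces the teacher's logits and last hidden state exactly, and it grows as either one diverges.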

Resource Usage

Peak GPU Memory: 8.0905 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	1821066133504.0	158329674399744.0	19.3254	12.8793	46.586	11.647	12079595520.0	98956046499840.0
750	0.1010	1272.0	5952.0	2.4588	12.986	46.204	11.551	816.0	44800.0
1500	0.2020	506.0	3248.0	1.7942	13.1654	45.574	11.393	350.0	736.0
2250	0.3030	338.0	1632.0	1.5655	13.13	45.697	11.424	278.0	306.0
3000	0.4040	241.0	916.0	1.3548	13.0817	45.865	11.466	209.0	188.0
3750	0.5051	199.0	624.0	1.2057	13.1347	45.68	11.42	169.0	174.0
4500	0.6061	160.0	510.0	1.0606	13.0674	45.916	11.479	129.0	171.0
5250	0.7071	128.0	468.0	0.9370	13.0344	46.032	11.508	106.5	132.0
6000	0.8081	118.0	418.0	0.8787	13.3306	45.009	11.252	97.0	131.0
6750	0.9091	112.0	400.0	0.8501	13.0719	45.9	11.475	93.0	129.0
7425	1.0	112.0	396.0	0.8454	13.0273	46.057	11.514	91.5	128.0
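
To read the table against the teacher baseline, one option is to divide each final-step student perplexity by the corresponding teacher perplexity. The sketch below does that with values copied from the "teacher eval" and step 7425 rows. The helper name perplexity_ratios is hypothetical, and the ratio is just one convenient way to summarize the gap.

```python
# Perplexities copied from the "teacher eval" row of the table above.
teacher = {
    "enwikippl": 43.75,
    "frwikippl": 61.75,
    "tinystoriesppl": 11.8125,
    "zhwikippl": 19.125,
}

# Perplexities copied from the final row (step 7425, epoch 1.0).
student_final = {
    "enwikippl": 112.0,
    "frwikippl": 396.0,
    "tinystoriesppl": 91.5,
    "zhwikippl": 128.0,
}


def perplexity_ratios(student, teacher):
    """Return student/teacher perplexity per dataset (1.0 would mean parity)."""
    return {name: student[name] / teacher[name] for name in teacher}


ratios = perplexity_ratios(student_final, teacher)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

A ratio above 1.0 means the student is still behind the teacher on that dataset.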

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0