metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.13_gpt2
    results: []

distily_bench_obj_cross_v2.13_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is not specified.

The Distily library was used for this distillation.

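It achieves the following results on the evaluation set. These figures match the step 750 (epoch 0.1010) row of the eval-phase metrics, not the final step 7425 row: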

eval_enwikippl: 1280.0
eval_frwikippl: 5952.0
eval_zhwikippl: 45056.0
eval_tinystoriesppl: 816.0
eval_loss: 2.4587
eval_runtime: 12.9968
eval_samples_per_second: 46.165
eval_steps_per_second: 11.541
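
As a quick sanity check on the throughput figures above, the runtime can be multiplied by the per-second rates to recover the approximate number of evaluation samples and steps. This is only an arithmetic sketch; it assumes the reported rates are simple averages over eval_runtime, and the variable names are illustrative.

```python
# Values copied from the evaluation results above.
eval_runtime = 12.9968            # seconds
eval_samples_per_second = 46.165
eval_steps_per_second = 11.541

# Rate times runtime gives approximate counts. The results are rounded
# because the reported figures are themselves rounded.
approx_samples = round(eval_runtime * eval_samples_per_second)
approx_steps = round(eval_runtime * eval_steps_per_second)

# Samples per step should match the evaluation batch size.
implied_batch_size = approx_samples / approx_steps

print(approx_samples, approx_steps, implied_batch_size)
```

The implied samples-per-step value agrees with the eval_batch_size of 4 listed in the training hyperparameters.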

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=uniform_cons, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
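
To make the learning-rate settings concrete, here is a minimal sketch of a linear schedule with a 0.5 warmup ratio. It assumes the scheduler behaves like the linear warmup-then-decay schedule in Transformers, that training ran for 7425 steps (the last step in the eval-phase table), and that warmup steps are rounded up. The function name linear_lr is hypothetical and not part of Distily.

```python
import math

# Values from the hyperparameter list above.
PEAK_LR = 1e-4
WARMUP_RATIO = 0.5
# Assumption: the last logged step (7425) is the total number of training steps.
TOTAL_STEPS = 7425

# Assumption: warmup steps are rounded up to a whole step.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 750, WARMUP_STEPS, 6000, TOTAL_STEPS):
    print(step, f"{linear_lr(step):.3e}")
```

Under these assumptions the learning rate is still rising for the first half of the single epoch and only reaches its peak of 0.0001 around step 3713.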

Resource Usage

Peak GPU Memory: 8.0905 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	1821066133504.0	158329674399744.0	19.3254	13.0242	46.068	11.517	12079595520.0	98956046499840.0
750	0.1010	1280.0	5952.0	2.4587	12.9968	46.165	11.541	816.0	45056.0
1500	0.2020	506.0	3248.0	1.7942	13.0259	46.062	11.515	352.0	736.0
2250	0.3030	338.0	1624.0	1.5654	13.2237	45.373	11.343	278.0	306.0
3000	0.4040	241.0	920.0	1.3552	13.084	45.857	11.464	210.0	186.0
3750	0.5051	200.0	624.0	1.2059	13.1502	45.627	11.407	169.0	175.0
4500	0.6061	159.0	510.0	1.0607	12.9758	46.24	11.56	128.0	173.0
5250	0.7071	128.0	464.0	0.9374	13.0443	45.997	11.499	106.0	130.0
6000	0.8081	117.5	414.0	0.8786	12.9003	46.51	11.628	96.0	129.0
6750	0.9091	112.5	398.0	0.8508	12.8837	46.571	11.643	93.0	125.0
7425	1.0	112.0	396.0	0.8456	12.9485	46.337	11.584	91.5	125.0
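
The gap between the final student checkpoint and the teacher can be read off the table by dividing the perplexities in the last row by those in the "teacher eval" row. The sketch below does that arithmetic; the dictionary names are illustrative and the values are copied from the table.

```python
# Perplexities copied from the "teacher eval" row and the final (step 7425) row.
teacher = {
    "enwikippl": 43.75,
    "frwikippl": 61.75,
    "tinystoriesppl": 11.8125,
    "zhwikippl": 19.125,
}
student = {
    "enwikippl": 112.0,
    "frwikippl": 396.0,
    "tinystoriesppl": 91.5,
    "zhwikippl": 125.0,
}

# Ratio > 1 means the student is worse (higher perplexity) than the teacher.
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: student/teacher = {ratio:.2f}x")
```

After one epoch the student's perplexity is about 2.6 times the teacher's on enwiki and between 6 and 8 times the teacher's on the other three evaluation sets.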

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0