---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.13_gpt2
  results: []
---

# distily_bench_obj_cross_v2.13_gpt2

This student model was distilled from the teacher model gpt2 using an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:

- eval_enwikippl: 1376.0
- eval_frwikippl: 5856.0
- eval_zhwikippl: 111104.0
- eval_tinystoriesppl: 964.0
- eval_loss: 3.1072
- eval_runtime: 12.9331
- eval_samples_per_second: 46.392
- eval_steps_per_second: 11.598

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=cos, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.5
- num_epochs: 1.0
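The distillation objective above combines a KL divergence on the logits (weight 1) with a cosine distance on the last hidden state (weight 1.0); the attention component has weight 0 and is inactive. A minimal PyTorch sketch of such a combined loss (function name and toy shapes are hypothetical, and this is not the actual Distily implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hs, teacher_hs,
                      logits_weight=1.0, hs_weight=1.0):
    """Sketch: KL on token distributions plus cosine distance on the
    last-layer hidden states (layer_mapper=last in the config above)."""
    # KL divergence between student and teacher token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Cosine distance (1 - similarity) over the hidden dimension
    cos = (1.0 - F.cosine_similarity(student_hs, teacher_hs, dim=-1)).mean()
    return logits_weight * kl + hs_weight * cos

# Toy tensors: batch=2, seq=4, vocab=8, hidden=16
torch.manual_seed(0)
s_logits, t_logits = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
s_hs, t_hs = torch.randn(2, 4, 16), torch.randn(2, 4, 16)
loss = distillation_loss(s_logits, t_logits, s_hs, t_hs)
```

Both components are non-negative, so the combined loss is as well; the weights match the `weight=` fields in the objective string above.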

## Resource Usage

Peak GPU Memory: 8.0905 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 1821066133504.0 | 158329674399744.0 | 20.2008 | 12.9107 | 46.473 | 11.618 | 12079595520.0 | 98956046499840.0 |
| 750 | 0.1010 | 1376.0 | 5856.0 | 3.1072 | 12.9331 | 46.392 | 11.598 | 964.0 | 111104.0 |
| 1500 | 0.2020 | 580.0 | 3600.0 | 2.2189 | 12.9266 | 46.416 | 11.604 | 438.0 | 1020.0 |
| 2250 | 0.3030 | 376.0 | 1904.0 | 1.9249 | 12.9295 | 46.405 | 11.601 | 312.0 | 374.0 |
| 3000 | 0.4040 | 268.0 | 1080.0 | 1.6655 | 12.9091 | 46.479 | 11.62 | 238.0 | 208.0 |
| 3750 | 0.5051 | 211.0 | 732.0 | 1.4810 | 12.9336 | 46.391 | 11.598 | 172.0 | 217.0 |
| 4500 | 0.6061 | 167.0 | 580.0 | 1.2985 | 12.9202 | 46.439 | 11.61 | 143.0 | 146.0 |
| 5250 | 0.7071 | 135.0 | 486.0 | 1.1339 | 12.9225 | 46.431 | 11.608 | 112.5 | 133.0 |
| 6000 | 0.8081 | 124.5 | 452.0 | 1.0647 | 12.9107 | 46.473 | 11.618 | 101.5 | 125.5 |
| 6750 | 0.9091 | 118.5 | 436.0 | 1.0324 | 12.9153 | 46.456 | 11.614 | 97.5 | 120.0 |
| 7425 | 1.0 | 117.5 | 432.0 | 1.0264 | 13.1576 | 45.601 | 11.4 | 95.5 | 118.5 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0