---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.13_gpt2
  results: []
---

# distily_bench_obj_cross_v2.13_gpt2

This student model is distilled from the teacher model gpt2 using the dataset (unspecified).

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

- eval_enwikippl: 2144.0
- eval_frwikippl: 8704.0
- eval_zhwikippl: 148480.0
- eval_tinystoriesppl: 1648.0
- eval_loss: 3.2443
- eval_runtime: 12.9651
- eval_samples_per_second: 46.278
- eval_steps_per_second: 11.569
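For reference, a perplexity metric is the exponential of a mean cross-entropy (negative log-likelihood) loss. A minimal sketch of the relationship (note that the per-dataset `*ppl` metrics above are derived from separate per-dataset losses, so `eval_loss` itself does not reproduce them):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp(mean negative log-likelihood per token)."""
    return math.exp(mean_nll)

# The eval_loss of 3.2443 corresponds to a perplexity of about 25.6
# under this definition.
print(round(perplexity(3.2443), 1))  # → 25.6
```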

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=kl, layer_mapper=uniform_cons, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.5
- num_epochs: 1.0
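The core of the `loss_fn=kl` objective above is a KL divergence between the teacher's and the student's output distributions. The following is a minimal pure-Python sketch of that logits term only, not Distily's actual implementation (the configured objective also applies an equally weighted KL loss to hidden states via the `uniform_cons` layer mapper):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary distribution:
    zero when the student matches the teacher, positive otherwise."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero loss; diverging logits a positive one.
print(kl_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
```

In practice this term is computed per token position over the full GPT-2 vocabulary and averaged over the batch; the sketch shows a single position with a toy three-entry vocabulary.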

## Resource Usage

Peak GPU Memory: 8.0905 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 1821066133504.0 | 158329674399744.0 | 25.4650 | 12.9053 | 46.493 | 11.623 | 12079595520.0 | 98956046499840.0 |
| 750 | 0.1010 | 2144.0 | 8704.0 | 3.2443 | 12.9651 | 46.278 | 11.569 | 1648.0 | 148480.0 |
| 1500 | 0.2020 | 776.0 | 4800.0 | 2.2942 | 12.9521 | 46.325 | 11.581 | 568.0 | 4736.0 |
| 2250 | 0.3030 | 448.0 | 2832.0 | 1.9334 | 12.9674 | 46.27 | 11.568 | 358.0 | 592.0 |
| 3000 | 0.4040 | 318.0 | 1424.0 | 1.6643 | 13.0049 | 46.136 | 11.534 | 256.0 | 308.0 |
| 3750 | 0.5051 | 249.0 | 912.0 | 1.4761 | 12.9665 | 46.273 | 11.568 | 198.0 | 474.0 |
| 4500 | 0.6061 | 188.0 | 684.0 | 1.2711 | 12.9804 | 46.224 | 11.556 | 152.0 | 354.0 |
| 5250 | 0.7071 | 147.0 | 560.0 | 1.1017 | 12.9809 | 46.222 | 11.555 | 116.0 | 218.0 |
| 6000 | 0.8081 | 134.0 | 490.0 | 1.0242 | 12.9725 | 46.252 | 11.563 | 105.5 | 186.0 |
| 6750 | 0.9091 | 125.5 | 464.0 | 0.9844 | 12.9741 | 46.246 | 11.561 | 99.0 | 175.0 |
| 7425 | 1.0 | 124.0 | 462.0 | 0.9768 | 12.938 | 46.375 | 11.594 | 97.5 | 165.0 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0