---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 2192.0
  • eval_frwikippl: 11200.0
  • eval_zhwikippl: 93184.0
  • eval_tinystoriesppl: 1808.0
  • eval_loss: 2.6293
  • eval_runtime: 16.9228
  • eval_samples_per_second: 59.092
  • eval_steps_per_second: 7.386
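The per-dataset metrics above (eval_enwikippl, eval_frwikippl, eval_zhwikippl, eval_tinystoriesppl) are perplexities, which relate to the mean per-token negative log-likelihood by exponentiation. Note that eval_loss here is the distillation (KL) objective rather than a plain NLL, so it does not convert directly. A minimal sketch of the perplexity/NLL relationship:

```python
import math

# Perplexity <-> mean negative log-likelihood (natural log):
#   ppl = exp(nll)    and    nll = log(ppl)
def perplexity(mean_nll: float) -> float:
    return math.exp(mean_nll)

def mean_nll(ppl: float) -> float:
    return math.log(ppl)

# An eval_enwikippl of 2192.0 corresponds to a mean NLL of
# about 7.69 nats per token on that corpus.
print(round(mean_nll(2192.0), 2))  # 7.69
```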

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.0004
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.2
  • num_epochs: 1.0
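Per the distillation_objective above, only the logits component contributes to the loss (weight 1, loss_fn=kl); the hidden-state and attention components have weight 0. A minimal plain-Python sketch of a forward-KL logits loss for a single token position (Distily's actual implementation operates on batched PyTorch tensors and may differ in detail):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    """Forward KL divergence KL(teacher || student) between the
    teacher's and student's next-token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero loss; diverging logits give a positive loss.
print(kl_logits_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(kl_logits_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]) > 0)  # True
```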

Resource Usage

Peak GPU Memory: 7.9368 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 20.7680 | 16.794 | 59.545 | 7.443 | 4060086272.0 | 71468255805440.0 |
| 1000 | 0.0808 | 688.0 | 3728.0 | 1.9530 | 16.821 | 59.449 | 7.431 | 652.0 | 2784.0 |
| 2000 | 0.1616 | 1728.0 | 8256.0 | 2.4948 | 16.7878 | 59.567 | 7.446 | 1384.0 | 35584.0 |
| 3000 | 0.2424 | 2040.0 | 10112.0 | 2.6087 | 16.7522 | 59.694 | 7.462 | 1720.0 | 64256.0 |
| 4000 | 0.3232 | 2160.0 | 9280.0 | 2.6353 | 16.796 | 59.538 | 7.442 | 1816.0 | 57088.0 |
| 5000 | 0.4040 | 1904.0 | 9088.0 | 2.5782 | 16.8206 | 59.451 | 7.431 | 1848.0 | 61440.0 |
| 6000 | 0.4848 | 1840.0 | 8960.0 | 2.5344 | 16.7618 | 59.659 | 7.457 | 1592.0 | 69120.0 |
| 7000 | 0.5657 | 1808.0 | 8512.0 | 2.5269 | 16.7913 | 59.555 | 7.444 | 1648.0 | 60672.0 |
| 8000 | 0.6465 | 2096.0 | 8960.0 | 2.6404 | 16.8233 | 59.442 | 7.43 | 1928.0 | 137216.0 |
| 9000 | 0.7273 | 2192.0 | 11200.0 | 2.6293 | 16.9228 | 59.092 | 7.386 | 1808.0 | 93184.0 |
| 10000 | 0.8081 | 1944.0 | 9984.0 | 2.5759 | 16.857 | 59.323 | 7.415 | 1568.0 | 80896.0 |
| 11000 | 0.8889 | 1736.0 | 9344.0 | 2.5147 | 16.8438 | 59.369 | 7.421 | 1488.0 | 48640.0 |
| 12000 | 0.9697 | 2224.0 | 11840.0 | 2.6633 | 16.7839 | 59.581 | 7.448 | 1968.0 | 98816.0 |
| 12375 | 1.0 | 2432.0 | 11072.0 | 2.7197 | 16.7952 | 59.541 | 7.443 | 2176.0 | 109568.0 |
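The epoch column is consistent with a single pass over roughly 99,000 training samples: 12375 optimizer steps at a train_batch_size of 8 (assuming no gradient accumulation, which the hyperparameters above do not mention). A quick check:

```python
TOTAL_STEPS = 12375   # final step in the table, reached at epoch 1.0
BATCH_SIZE = 8        # train_batch_size from the hyperparameters

def epoch_at(step: int) -> float:
    # Fraction of the single training epoch completed at a given step.
    return step / TOTAL_STEPS

print(round(epoch_at(9000), 4))   # 0.7273, matching the table
print(TOTAL_STEPS * BATCH_SIZE)   # 99000 samples per epoch (assumed)
```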

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0