---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

- eval_enwikippl: 5056.0
- eval_frwikippl: 3696.0
- eval_zhwikippl: 29312.0
- eval_tinystoriesppl: 4672.0
- eval_loss: 1.3042
- eval_runtime: 16.773
- eval_samples_per_second: 59.62
- eval_steps_per_second: 7.452
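For reference, each `*ppl` metric above is a perplexity on the named held-out corpus, i.e. the exponential of the mean per-token negative log-likelihood in nats. Note that `eval_loss` here comes from the distillation objective rather than a plain language-modeling loss, so it is not simply the log of these perplexities. A minimal sketch of the relationship:

```python
import math

def perplexity(nll_per_token):
    # Perplexity = exp(mean negative log-likelihood per token, in nats).
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Hypothetical example: a model that assigns probability 1/5056 to every
# token would score a perplexity of 5056 (cf. eval_enwikippl above).
print(perplexity([math.log(5056.0)] * 3))
```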

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))`
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
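The configured objective puts all of its weight on a forward-KL loss over the student and teacher logits (the hidden-state and attention components have weight 0). As a rough, library-agnostic sketch of what such a loss computes — this is not Distily's actual implementation, and the function names are illustrative:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one logit vector.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_logits_loss(student_logits, teacher_logits):
    # Forward KL(teacher || student) for a single vocabulary
    # distribution: sum_v p_t(v) * (log p_t(v) - log p_s(v)).
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

# The loss is exactly zero when student and teacher agree,
# and strictly positive otherwise.
print(kl_logits_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kl_logits_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

In training, this per-position loss would be averaged over all token positions in the batch; minimizing it pushes the student's predicted token distribution toward the teacher's.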

## Resource Usage

Peak GPU Memory: 7.9368 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 2336462209024.0 | 122045790683136.0 | 24.1200 | 16.7622 | 59.658 | 7.457 | 4429185024.0 | 25975962206208.0 |
| 1000 | 0.0808 | 3312.0 | 3184.0 | 1.1229 | 16.7657 | 59.646 | 7.456 | 2720.0 | 2992.0 |
| 2000 | 0.1616 | 5248.0 | 4048.0 | 1.2528 | 16.7825 | 59.586 | 7.448 | 5408.0 | 9984.0 |
| 3000 | 0.2424 | 5600.0 | 3744.0 | 1.2812 | 16.7695 | 59.632 | 7.454 | 5312.0 | 23680.0 |
| 4000 | 0.3232 | 5408.0 | 3920.0 | 1.2832 | 16.8694 | 59.279 | 7.41 | 5440.0 | 33280.0 |
| 5000 | 0.4040 | 5376.0 | 3952.0 | 1.2841 | 16.7361 | 59.751 | 7.469 | 5408.0 | 27008.0 |
| 6000 | 0.4848 | 5344.0 | 3680.0 | 1.2770 | 16.7635 | 59.653 | 7.457 | 5440.0 | 29312.0 |
| 7000 | 0.5657 | 5024.0 | 3760.0 | 1.2800 | 16.7492 | 59.704 | 7.463 | 5184.0 | 39936.0 |
| 8000 | 0.6465 | 4992.0 | 3712.0 | 1.2922 | 16.7445 | 59.721 | 7.465 | 5088.0 | 26752.0 |
| 9000 | 0.7273 | 5056.0 | 3696.0 | 1.3042 | 16.773 | 59.62 | 7.452 | 4672.0 | 29312.0 |
| 10000 | 0.8081 | 5824.0 | 3648.0 | 1.3192 | 16.7669 | 59.641 | 7.455 | 5312.0 | 24448.0 |
| 11000 | 0.8889 | 5568.0 | 3872.0 | 1.3215 | 16.8215 | 59.448 | 7.431 | 5504.0 | 40704.0 |
| 12000 | 0.9697 | 5440.0 | 3792.0 | 1.3263 | 16.7825 | 59.586 | 7.448 | 5120.0 | 72704.0 |
| 12375 | 1.0 | 5696.0 | 3936.0 | 1.3389 | 16.7852 | 59.576 | 7.447 | 5696.0 | 40192.0 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0