base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.15_gpt2
    results: []

distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2 using the Distily library. The training dataset is not specified.

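It achieves the following results on the evaluation set. These values match the step 9000 row (epoch 0.3636) of the eval-phase metrics table, not the final step: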

eval_enwikippl: 99.0
eval_frwikippl: 408.0
eval_zhwikippl: 149.0
eval_tinystoriesppl: 74.5
eval_loss: 0.7768
eval_runtime: 16.7488
eval_samples_per_second: 59.706
eval_steps_per_second: 7.463
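
The throughput figures above can be cross-checked against each other: multiplying a rate by the runtime recovers the number of samples and steps in one evaluation pass. The following is a minimal sketch that uses only the numbers reported in this card; the variable names are illustrative and not part of Distily.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 16.7488            # seconds
eval_samples_per_second = 59.706
eval_steps_per_second = 7.463

# rate * time gives the counts processed during one evaluation pass.
n_samples = round(eval_samples_per_second * eval_runtime)
n_steps = round(eval_steps_per_second * eval_runtime)

# Samples per step, to compare with the reported eval_batch_size.
samples_per_step = n_samples / n_steps

print(n_samples, n_steps, samples_per_step)
```

The resulting samples-per-step figure is consistent with the eval_batch_size of 8 listed under the training hyperparameters.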

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.2
num_epochs: 1.0
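
The distillation_objective above puts all of the weight on a KL loss over logits and gives the hidden-state (hs) and attention (attn) components a weight of 0. The sketch below illustrates that weighting in plain Python for a single token position. It is not Distily's implementation: the function names, the KL direction (teacher relative to student), the absence of a temperature, and the toy logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def weighted_objective(component_losses, weights):
    """Sum the weighted components, skipping any with weight 0."""
    return sum(w * component_losses[name] for name, w in weights.items() if w != 0)


# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]

# Weights as reported in the card: logits=1, hs=0, attn=0.
weights = {"logits": 1, "hs": 0, "attn": 0}
losses = {"logits": kl_logits_loss(teacher, student), "hs": None, "attn": None}

total = weighted_objective(losses, weights)
print(total)
```

Because the hs and attn weights are 0, those components are never evaluated, so the total reduces to the logits KL term alone.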

Resource Usage

Peak GPU Memory: 7.4226 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.75	61.75					11.8125	19.125
0	0	10027008.0	5734400.0	11.3280	16.7303	59.772	7.471	5865472.0	3637248.0
1000	0.0404	382.0	2016.0	1.6270	16.7162	59.822	7.478	298.0	1200.0
2000	0.0808	270.0	988.0	1.4340	16.7294	59.775	7.472	227.0	390.0
3000	0.1212	218.0	736.0	1.2960	16.686	59.931	7.491	185.0	239.0
4000	0.1616	182.0	708.0	1.1629	16.7199	59.809	7.476	144.0	213.0
5000	0.2020	155.0	620.0	1.0601	16.7539	59.688	7.461	117.0	230.0
6000	0.2424	131.0	496.0	0.9643	16.7329	59.763	7.47	104.5	193.0
7000	0.2828	122.0	512.0	0.9001	16.7515	59.696	7.462	91.5	193.0
8000	0.3232	111.5	428.0	0.8336	16.677	59.963	7.495	82.5	161.0
9000	0.3636	99.0	408.0	0.7768	16.7488	59.706	7.463	74.5	149.0
10000	0.4040	89.5	386.0	0.7219	16.6219	60.162	7.52	71.5	140.0
11000	0.4444	82.5	332.0	0.6595	16.6753	59.969	7.496	67.0	148.0
12000	0.4848	78.0	300.0	0.6302	16.6631	60.013	7.502	60.75	131.0
13000	0.5253	73.5	292.0	0.6019	16.7166	59.821	7.478	59.5	117.0
14000	0.5657	75.0	284.0	0.5861	16.7002	59.88	7.485	59.5	137.0
15000	0.6061	71.5	252.0	0.5722	16.6732	59.976	7.497	55.25	130.0
16000	0.6465	70.0	250.0	0.5545	16.6934	59.904	7.488	57.75	104.5
17000	0.6869	70.0	272.0	0.5426	16.6888	59.92	7.49	55.5	130.0
18000	0.7273	70.0	248.0	0.5380	16.6762	59.966	7.496	53.75	124.0
19000	0.7677	68.5	227.0	0.5270	16.6682	59.994	7.499	53.5	96.5
20000	0.8081	65.5	219.0	0.5260	16.6778	59.96	7.495	52.0	129.0
21000	0.8485	68.0	228.0	0.5154	16.7388	59.741	7.468	52.0	140.0
22000	0.8889	68.5	246.0	0.5128	16.6637	60.011	7.501	51.75	216.0
23000	0.9293	64.5	245.0	0.5029	16.7201	59.808	7.476	52.5	146.0
24000	0.9697	66.5	230.0	0.5067	16.7059	59.859	7.482	51.25	168.0
24750	1.0	65.5	228.0	0.5042	16.685	59.934	7.492	51.25	100.5
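
To read the table against the teacher baseline, it can help to express the final student perplexities as a multiple of the teacher's. The sketch below copies the teacher row and the step 24750 row from the table above; the dictionary names are illustrative.

```python
# Perplexities copied from the "teacher eval" row of the table above.
teacher = {
    "enwikippl": 43.75,
    "frwikippl": 61.75,
    "tinystoriesppl": 11.8125,
    "zhwikippl": 19.125,
}

# Perplexities copied from the final row (step 24750, epoch 1.0).
student_final = {
    "enwikippl": 65.5,
    "frwikippl": 228.0,
    "tinystoriesppl": 51.25,
    "zhwikippl": 100.5,
}

# Ratio > 1 means the student is worse (higher perplexity) than the teacher.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: student/teacher = {ratio:.2f}")
```

On these numbers the student is closest to the teacher on English Wikipedia (about 1.5 times the teacher's perplexity) and furthest on Chinese Wikipedia (about 5.3 times).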

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0