distily_bench_obj_cross_v2.10_gpt2

This student model was distilled from the teacher model gpt2; the training dataset is unspecified.

The Distily library was used for this distillation.
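The student loads like any other causal LM checkpoint. A minimal usage sketch with transformers, assuming the repo id lapp0/distily_bench_obj_cross_v2.10_gpt2 (the student shares gpt2's architecture and tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled student; it shares gpt2's architecture and tokenizer.
repo_id = "lapp0/distily_bench_obj_cross_v2.10_gpt2"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```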

It achieves the following results on the evaluation set:

  • eval_enwikippl: 401.6902
  • eval_frwikippl: 385.9396
  • eval_zhwikippl: 137.9653
  • eval_tinystoriesppl: 881.4292
  • eval_loss: 0.7112
  • eval_runtime: 21.2483
  • eval_samples_per_second: 47.063
  • eval_steps_per_second: 11.766
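The *ppl metrics are perplexities on held-out enwiki, frwiki, zhwiki, and TinyStories text; eval_loss is the distillation objective on the evaluation set, and eval_runtime is in seconds. The exact evaluation harness is internal to Distily, but perplexity is conventionally computed as exp of the mean per-token negative log-likelihood, roughly as in this sketch (an illustration, not Distily's code):

```python
import torch

def perplexity(model, tokenizer, text, device="cpu"):
    # Perplexity = exp(mean per-token negative log-likelihood).
    model = model.to(device).eval()
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL over tokens
    return torch.exp(loss).item()
```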

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)) (see the sketch after this list)
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 1
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
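Per the distillation_objective above, only the logits component is active (weight 1, KL loss); the hidden-state and attention components have weight 0, so no layer mapping or projection is applied. A minimal PyTorch sketch of such a logits-level KL loss (the reduction and the absence of a temperature are assumptions, not Distily internals):

```python
import torch.nn.functional as F

def logits_kl_loss(student_logits, teacher_logits):
    # KL(teacher || student) over the vocabulary, averaged per token.
    # Logits have shape (batch, seq_len, vocab); flatten to (tokens, vocab).
    s = F.log_softmax(student_logits, dim=-1).flatten(0, 1)
    t = F.log_softmax(teacher_logits, dim=-1).flatten(0, 1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")
```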

Resource Usage

Peak GPU Memory: 3.9285 GB
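How this figure was captured is not stated; with PyTorch, peak allocated memory is typically read back as in this sketch (an assumed method, not necessarily Distily's):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training loop ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```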

Eval-Phase Metrics

step epoch enwikippl frwikippl loss runtime samples_per_second steps_per_second tinystoriesppl zhwikippl
teacher eval — 270.2348 76.8142 — — — — 671.1238 22.8030
0 0 120078.375 1867851235328.0 18.7920 21.2125 47.142 11.786 72.8770 4013754155008.0
5000 0.0505 621.5149 991.7020 1.3528 21.2177 47.13 11.783 980.0922 399.9691
10000 0.1010 574.4407 664.8521 1.1590 21.2225 47.12 11.78 1036.6780 493.8460
15000 0.1515 543.0890 635.0353 1.0360 21.2351 47.092 11.773 1033.2988 145.9157
20000 0.2020 509.8121 599.6746 0.9759 21.2099 47.148 11.787 985.1690 251.1274
25000 0.2525 448.2854 486.9003 0.8334 21.2284 47.107 11.777 923.3450 171.9567
30000 0.3030 420.2149 441.8981 0.7741 21.2742 47.005 11.751 893.9037 129.4944
35000 0.3535 417.6187 442.7548 0.7695 21.5924 46.313 11.578 884.2755 140.6411
40000 0.4040 419.8570 418.2776 0.7678 21.23 47.103 11.776 893.9774 162.6632
45000 0.4545 420.1905 413.8966 0.7576 21.2355 47.091 11.773 905.9177 154.8089
50000 0.5051 420.9561 426.7430 0.7544 21.2196 47.126 11.782 906.1800 147.5501
55000 0.5556 417.3034 409.1867 0.7509 21.2021 47.165 11.791 902.3304 143.7327
60000 0.6061 418.3230 413.0230 0.7525 21.2367 47.088 11.772 894.0145 156.6996
65000 0.6566 404.0308 404.5305 0.7221 21.2003 47.169 11.792 878.4468 136.2006
70000 0.7071 406.0154 392.1317 0.7194 21.2119 47.143 11.786 891.9106 137.0481
75000 0.7576 400.8665 383.9604 0.7188 21.2118 47.144 11.786 871.7914 140.4630
80000 0.8081 402.5625 387.4647 0.7168 21.2234 47.118 11.779 882.3771 141.0827
85000 0.8586 399.3479 385.9124 0.7123 21.2047 47.159 11.79 875.1130 140.0700
90000 0.9091 401.2549 386.7830 0.7117 21.2316 47.1 11.775 881.0649 138.5555
95000 0.9596 401.4725 386.1842 0.7112 21.2217 47.122 11.78 880.2640 138.0389
99000 1.0 401.6902 385.9396 0.7112 21.2483 47.063 11.766 881.4292 137.9653

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0