---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_gpt2_activation_loss_b
    results: []
---

# distily_bench_gpt2_activation_loss_b

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.
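The student can be used like any Hugging Face causal language model. A minimal sketch, assuming the checkpoint is published under the (hypothetical) repo id used below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id, assumed to match the model name above.
repo_id = "lapp0/distily_bench_gpt2_activation_loss_b"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Knowledge distillation is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```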

It achieves the following results on the evaluation set (see the perplexity sketch after this list):

- eval_enwikippl: 210.2820
- eval_frwikippl: 1274.1346
- eval_zhwikippl: 583.2827
- eval_loss: 1.2965
- eval_runtime: 17.2526
- eval_samples_per_second: 57.962
- eval_steps_per_second: 7.245
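The enwikippl, frwikippl, and zhwikippl metrics are perplexities on English, French, and Chinese Wikipedia text respectively. The exact evaluation corpus and chunking are not specified here, so the figures above will not reproduce exactly, but token-level perplexity can be measured along these lines (a minimal sketch; the repo id is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    # Perplexity = exp(mean token-level cross-entropy). The model's built-in
    # loss shifts the labels internally, so passing input_ids as labels is enough.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

repo_id = "lapp0/distily_bench_gpt2_activation_loss_b"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()
print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))
```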

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)) (see the loss sketch after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
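For illustration, the objective above combines a KL-divergence loss on the logits (weight 1) with an MSE loss on the hidden states (weight 2.0); the attention component has weight 0 and is inactive. The following is a minimal sketch of that combination, not Distily's actual implementation, whose reductions and KL direction may differ:

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=2.0):
    # Logits component (weight 1): KL divergence between the teacher's and the
    # student's next-token distributions.
    s_logp = F.log_softmax(student_out.logits, dim=-1)
    t_prob = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(s_logp, t_prob, reduction="batchmean")

    # Hidden-state component (weight 2.0): MSE between corresponding layers.
    # layer_mapper=None / projector=None means layers are compared one-to-one.
    hs_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ) / len(student_out.hidden_states)

    # The attention component has weight 0, so it is omitted here.
    return logits_weight * logits_loss + hs_weight * hs_loss
```

Both forward passes need output_hidden_states=True, and the teacher's pass should run under torch.no_grad(). The one-to-one zip assumes the student and teacher share depth and hidden size, as they do when both use the gpt2 architecture.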

## Resource Usage

Peak GPU Memory: 8.0904 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 58037.3203 | 58017.0117 | 6.0237 | 17.2607 | 57.935 | 7.242 | 56038.0625 |
| 1000 | 0.0808 | 715.0994 | 4658.6846 | 2.0131 | 17.1734 | 58.23 | 7.279 | 16350.8623 |
| 2000 | 0.1616 | 508.9246 | 3343.2109 | 1.8201 | 17.2004 | 58.138 | 7.267 | 3102.6990 |
| 3000 | 0.2424 | 419.7101 | 2552.4004 | 1.7020 | 17.1441 | 58.329 | 7.291 | 1042.4126 |
| 4000 | 0.3232 | 361.0421 | 2336.7490 | 1.6177 | 17.0616 | 58.611 | 7.326 | 911.8621 |
| 5000 | 0.4040 | 313.2633 | 1815.2219 | 1.5316 | 17.1786 | 58.212 | 7.276 | 863.9713 |
| 6000 | 0.4848 | 281.3860 | 1725.1301 | 1.4597 | 17.3168 | 57.747 | 7.218 | 705.6341 |
| 7000 | 0.5657 | 253.9131 | 1485.6165 | 1.3999 | 17.1434 | 58.332 | 7.291 | 605.2624 |
| 8000 | 0.6465 | 229.4073 | 1427.2965 | 1.3455 | 17.134 | 58.363 | 7.295 | 629.6656 |
| 9000 | 0.7273 | 210.2820 | 1274.1346 | 1.2965 | 17.2526 | 57.962 | 7.245 | 583.2827 |
| 10000 | 0.8081 | 194.6313 | 1199.3423 | 1.2490 | 17.1679 | 58.248 | 7.281 | 677.5621 |
| 11000 | 0.8889 | 180.3274 | 1160.25 | 1.1980 | 17.1591 | 58.278 | 7.285 | 758.1945 |
| 12000 | 0.9697 | 164.7045 | 1005.8066 | 1.1583 | 17.1824 | 58.199 | 7.275 | 600.1918 |
| 12375 | 1.0 | 161.0243 | 969.7354 | 1.1403 | 17.1939 | 58.16 | 7.27 | 632.9536 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- PyTorch 2.3.0
- Datasets 2.21.0