---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2 using an unspecified dataset.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 1784.0
  • eval_frwikippl: 9792.0
  • eval_zhwikippl: 72192.0
  • eval_tinystoriesppl: 1448.0
  • eval_loss: 2.5122
  • eval_runtime: 17.041
  • eval_samples_per_second: 58.682
  • eval_steps_per_second: 7.335
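
The per-corpus metrics (`eval_enwikippl`, `eval_frwikippl`, etc.) are perplexities, which relate to loss as the exponential of the mean per-token negative log-likelihood. A minimal sketch of that relation (the function name is illustrative, not part of Distily's API; note that `eval_loss` here is the distillation objective, not a pure cross-entropy, so it does not exponentiate to these perplexities directly):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model averaging ~2.5 nats of loss per token:
print(perplexity([2.4, 2.5, 2.6]))  # ≈ 12.18
```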

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.0004
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.2
  • num_epochs: 1.0
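
The `distillation_objective` above combines a KL-divergence loss on the logits (weight 1) with an MSE loss on the last hidden state (weight 1.0); the attention component has weight 0 and is inactive. A minimal pure-Python sketch of such a combined objective for a single token position (function names and signatures are illustrative, not Distily's actual API):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(teacher_logits, student_logits):
    """Forward KL(teacher || student) between the two output distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean squared error between two hidden-state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_logits, student_logits,
                      teacher_hs, student_hs,
                      logits_weight=1.0, hs_weight=1.0):
    """Weighted sum of KL on logits and MSE on the last hidden state,
    mirroring the objective above (the attn component, weight 0, is omitted)."""
    return (logits_weight * kl_div(teacher_logits, student_logits)
            + hs_weight * mse(teacher_hs, student_hs))
```

When student and teacher agree exactly, both components are zero, so the total loss is zero; any mismatch in either the output distribution or the hidden state increases it.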

## Resource Usage

Peak GPU Memory: 8.0892 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 2336462209024.0 | 122045790683136.0 | 22.4230 | 17.051 | 58.648 | 7.331 | 4429185024.0 | 25975962206208.0 |
| 1000 | 0.0808 | 588.0 | 3680.0 | 1.8545 | 17.0585 | 58.622 | 7.328 | 612.0 | 2880.0 |
| 2000 | 0.1616 | 988.0 | 5600.0 | 2.1657 | 17.0179 | 58.762 | 7.345 | 816.0 | 3200.0 |
| 3000 | 0.2424 | 1744.0 | 8640.0 | 2.5064 | 17.083 | 58.538 | 7.317 | 1544.0 | 46080.0 |
| 4000 | 0.3232 | 1864.0 | 8896.0 | 2.5506 | 17.0876 | 58.522 | 7.315 | 1544.0 | 63744.0 |
| 5000 | 0.4040 | 1728.0 | 8832.0 | 2.4970 | 17.0783 | 58.554 | 7.319 | 1520.0 | 51200.0 |
| 6000 | 0.4848 | 1936.0 | 9216.0 | 2.5779 | 17.0361 | 58.699 | 7.337 | 1688.0 | 63744.0 |
| 7000 | 0.5657 | 2224.0 | 9792.0 | 2.6441 | 17.0202 | 58.754 | 7.344 | 1832.0 | 82944.0 |
| 8000 | 0.6465 | 1936.0 | 8832.0 | 2.5707 | 17.057 | 58.627 | 7.328 | 1808.0 | 115200.0 |
| 9000 | 0.7273 | 1784.0 | 9792.0 | 2.5122 | 17.041 | 58.682 | 7.335 | 1448.0 | 72192.0 |
| 10000 | 0.8081 | 2064.0 | 9664.0 | 2.5934 | 17.147 | 58.319 | 7.29 | 1552.0 | 91648.0 |
| 11000 | 0.8889 | 2064.0 | 10240.0 | 2.6004 | 17.0431 | 58.675 | 7.334 | 1720.0 | 80896.0 |
| 12000 | 0.9697 | 2064.0 | 11008.0 | 2.6036 | 17.1142 | 58.431 | 7.304 | 1800.0 | 68608.0 |
| 12375 | 1.0 | 1992.0 | 9280.0 | 2.5963 | 17.0677 | 58.59 | 7.324 | 1752.0 | 70144.0 |

## Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0