metadata

base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross
    results: []

distily_bench_obj_cross

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the eval-phase metrics table):

eval_enwikippl: 3944.9531
eval_frwikippl: 30197.7344
eval_zhwikippl: 52496.3438
eval_tinystoriesppl: 1385.5492
eval_loss: 16.6107
eval_runtime: 66.9937
eval_samples_per_second: 74.634
eval_steps_per_second: 9.329
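
As a rough consistency check, the throughput figures above imply the size of the evaluation set: runtime multiplied by samples per second approximates the number of samples, and runtime multiplied by steps per second approximates the number of batches. The sketch below uses only the values reported above; the variable names are illustrative and not part of Distily.

```python
# Values copied from the evaluation results above.
eval_runtime = 66.9937             # seconds
eval_samples_per_second = 74.634
eval_steps_per_second = 9.329

# Throughput multiplied by wall-clock time gives the approximate totals.
eval_samples = eval_runtime * eval_samples_per_second
eval_steps = eval_runtime * eval_steps_per_second

# Samples per step; this should be close to the evaluation batch size.
samples_per_step = eval_samples / eval_steps

print(f"samples: {eval_samples:.0f}")
print(f"steps: {eval_steps:.0f}")
print(f"samples per step: {samples_per_step:.2f}")
```

With these numbers the estimate comes to roughly 5,000 samples in 625 steps, about 8 samples per step, which agrees with the eval_batch_size of 8 listed in the hyperparameters.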

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0004
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1.0
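
The distillation_objective above combines three loss components: a KL divergence on the logits (weight 1) and MSE terms on hidden states and attention (weight 10.0 each). The following is a minimal pure-Python sketch of such a weighted sum, not Distily's implementation. The function names, the direction of the KL term (teacher relative to student), the toy input vectors, and the treatment of raw_mse as plain mean squared error are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the softmax distributions (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher, student, weights):
    """Weighted sum of the logits, hidden-state, and attention components."""
    components = {
        "logits": kl_divergence(teacher["logits"], student["logits"]),
        "hs": mse(teacher["hs"], student["hs"]),
        "attn": mse(teacher["attn"], student["attn"]),
    }
    return sum(weights[name] * value for name, value in components.items())

# Weights taken from the distillation_objective hyperparameter above.
WEIGHTS = {"logits": 1.0, "hs": 10.0, "attn": 10.0}

# Toy teacher/student outputs (hypothetical values, for illustration only).
teacher = {"logits": [2.0, 0.5, -1.0], "hs": [0.1, -0.3, 0.7], "attn": [0.6, 0.3, 0.1]}
student = {"logits": [1.5, 0.7, -0.8], "hs": [0.0, -0.2, 0.9], "attn": [0.5, 0.4, 0.1]}

loss = distillation_loss(teacher, student, WEIGHTS)
print(f"combined loss: {loss:.4f}")
```

Because the hidden-state and attention terms are weighted ten times more heavily than the logits term, they can dominate the total whenever their raw magnitudes are comparable to the KL term.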

Resource Usage

Peak GPU Memory: 8.2677 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	21397.4785	57946.0117	18.3162	67.1093	74.505	9.313	12321.8145	60955.8008
3000	0.0485	3940.0654	30197.7344	16.6107	67.006	74.62	9.328	1383.7178	52496.3438
6000	0.0970	3937.6238	30180.7188	16.6119	67.1095	74.505	9.313	1383.9467	52496.3438
9000	0.1455	3944.9531	30197.7344	16.6107	66.9937	74.634	9.329	1385.5492	52496.3438
12000	0.1939	3944.9531	30197.7344	16.6115	67.0666	74.553	9.319	1384.8617	52496.3438
15000	0.2424	3937.6238	30180.7188	16.6121	66.6143	75.059	9.382	1385.3201	52524.3359
18000	0.2909	3937.6238	30180.7188	16.6117	66.7575	74.898	9.362	1384.1750	52552.3945
21000	0.3394	3944.9531	30197.7344	16.6115	66.679	74.986	9.373	1384.4041	52496.3438
24000	0.3879	3944.9531	30197.7344	16.6121	66.8908	74.749	9.344	1384.4041	52468.3164
27000	0.4364	3942.5085	30197.7344	16.6117	66.4311	75.266	9.408	1383.0317	52496.3438
30000	0.4848	3940.0654	30180.7188	16.6107	66.4762	75.215	9.402	1383.2599	52496.3438
33000	0.5333	3937.6238	30197.7344	16.6107	66.4814	75.209	9.401	1382.8029	52496.3438
36000	0.5818	3942.5085	30180.7188	16.6111	67.3001	74.294	9.287	1385.3201	52496.3438
39000	0.6303	3937.6238	30180.7188	16.6115	67.0065	74.62	9.327	1383.4888	52496.3438
42000	0.6788	3942.5085	30197.7344	16.6109	66.7444	74.913	9.364	1384.1750	52496.3438
45000	0.7273	3941.2869	30197.7344	16.6115	67.1516	74.458	9.307	1382.8029	52496.3438
48000	0.7758	3944.9531	30180.7188	16.6107	66.7762	74.877	9.36	1386.6947	52524.3359
51000	0.8242	3942.5085	30197.7344	16.6111	67.2623	74.336	9.292	1384.8617	52496.3438
54000	0.8727	3944.9531	30180.7188	16.6107	66.724	74.936	9.367	1385.3201	52496.3438
57000	0.9212	3941.2869	30197.7344	16.6115	67.0602	74.56	9.32	1382.8029	52468.3164
60000	0.9697	3942.5085	30197.7344	16.6119	67.4137	74.169	9.271	1382.8029	52468.3164
61875	1.0	3937.6238	30180.7188	16.6119	67.1794	74.428	9.303	1383.7178	52496.3438
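
To put the table in context, the sketch below compares the final student row against the teacher row and checks how the epoch column relates to the step count. The numbers are copied from the table above. The dictionary layout and helper name are illustrative, and the epoch calculation assumes the epoch value is simply the step divided by the 61875 total steps.

```python
TOTAL_STEPS = 61875  # final step in the table, reported as epoch 1.0

# Perplexities copied from the "teacher eval" row and the final (step 61875) row.
teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}
student_final = {
    "enwikippl": 3937.6238,
    "frwikippl": 30180.7188,
    "tinystoriesppl": 1383.7178,
    "zhwikippl": 52496.3438,
}

def ppl_ratio(student, teacher):
    """Student perplexity divided by teacher perplexity, per metric."""
    return {name: student[name] / teacher[name] for name in teacher}

ratios = ppl_ratio(student_final, teacher)
for name, ratio in ratios.items():
    print(f"{name}: student/teacher = {ratio:.2f}")

# Loss values copied from the step 3000 row and the final row.
loss_step_3000 = 16.6107
loss_final = 16.6119
print(f"loss change after step 3000: {loss_final - loss_step_3000:+.4f}")

# Epoch fraction implied by the step count (assumed: step / total steps).
for step in (3000, 6000, 61875):
    print(f"step {step}: epoch = {step / TOTAL_STEPS:.4f}")
```

On these numbers the student's TinyStories perplexity remains more than 300 times the teacher's, while the loss barely moves between step 3000 and the end of the epoch, which suggests that this run plateaued early.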

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0