---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross
    results: []
---

# distily_bench_obj_cross

This student model was distilled from the teacher model [roneneldan/TinyStories-33M](https://huggingface.co/roneneldan/TinyStories-33M) using the dataset (unspecified).

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

- eval_enwikippl: 158.7294
- eval_frwikippl: 15434.1611
- eval_zhwikippl: 106089.8359
- eval_tinystoriesppl: 15.5930
- eval_loss: 2.3671
- eval_runtime: 65.5679 s
- eval_samples_per_second: 76.257
- eval_steps_per_second: 9.532
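The `*ppl` metrics above are perplexities: the exponential of the mean token-level cross-entropy on each evaluation corpus (lower is better, 1.0 is perfect). A minimal sketch of the computation; the `nlls` values below are hypothetical and for illustration only, not taken from this run:

```python
import math

def perplexity(nlls):
    """Perplexity = exp of the mean per-token negative log-likelihood
    (natural log). A uniform guess over V tokens scores exactly V."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs (nats), for illustration only:
nlls = [2.1, 2.5, 2.3, 2.6]
print(round(perplexity(nlls), 4))
```

The large gap between `eval_tinystoriesppl` (~15.6) and `eval_zhwikippl` (~106k) reflects how far each corpus is from the TinyStories training distribution.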

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
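The `distillation_objective` above sums three terms: KL divergence on the logits (weight 1), MSE on hidden states (weight 10), and MSE on attention maps (weight 10). A minimal sketch of that weighted sum in plain Python, assuming the per-layer tensors have already been extracted and flattened; this is an illustration of the objective's structure, not Distily's actual implementation:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(teacher_logits, student_logits):
    """Forward KL(teacher || student) over one token's distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(t_logits, s_logits, t_hs, s_hs, t_attn, s_attn,
                      hs_weight=10.0, attn_weight=10.0):
    """Weighted sum mirroring the objective above:
    1 * KL(logits) + 10.0 * MSE(hidden states) + 10.0 * MSE(attentions)."""
    return (kl_div(t_logits, s_logits)
            + hs_weight * mse(t_hs, s_hs)
            + attn_weight * mse(t_attn, s_attn))
```

With `layer_mapper=None` and `projector=None`, hidden states and attentions are compared at matching shapes layer-by-layer, which is possible here because the student shares the teacher's architecture.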

## Resource Usage

Peak GPU Memory: 8.2677 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 65.3142 | 76.553 | 9.569 | 12321.8145 | 60955.8008 |
| 3000 | 0.0485 | 158.5205 | 15451.5547 | 2.3672 | 65.4452 | 76.4 | 9.55 | 15.5596 | 105976.6797 |
| 6000 | 0.0970 | 159.4627 | 15519.1768 | 2.3670 | 65.5893 | 76.232 | 9.529 | 15.6570 | 107916.8281 |
| 9000 | 0.1455 | 158.7294 | 15434.1611 | 2.3671 | 65.5679 | 76.257 | 9.532 | 15.5930 | 106089.8359 |
| 12000 | 0.1939 | 159.8214 | 15466.7988 | 2.3670 | 65.4602 | 76.382 | 9.548 | 15.6933 | 107457.1484 |
| 15000 | 0.2424 | 159.5122 | 15501.6934 | 2.3671 | 65.4126 | 76.438 | 9.555 | 15.6648 | 107916.8281 |
| 18000 | 0.2909 | 158.8771 | 15434.1611 | 2.3670 | 65.4041 | 76.448 | 9.556 | 15.5956 | 106033.1953 |
| 21000 | 0.3394 | 159.3146 | 15434.1611 | 2.3671 | 65.4872 | 76.351 | 9.544 | 15.6460 | 106089.8359 |
| 24000 | 0.3879 | 159.4504 | 15434.1611 | 2.3670 | 65.5249 | 76.307 | 9.538 | 15.6589 | 106089.8359 |
| 27000 | 0.4364 | 158.9386 | 15386.3984 | 2.3669 | 65.4767 | 76.363 | 9.545 | 15.6163 | 105581.5391 |
| 30000 | 0.4848 | 159.1728 | 15451.5547 | 2.3671 | 65.3648 | 76.494 | 9.562 | 15.6369 | 107342.5391 |
| 33000 | 0.5333 | 159.8709 | 15466.7988 | 2.3670 | 65.4363 | 76.41 | 9.551 | 15.6965 | 106942.4062 |
| 36000 | 0.5818 | 159.2097 | 15460.2656 | 2.3670 | 65.4686 | 76.373 | 9.547 | 15.6318 | 107629.3516 |
| 39000 | 0.6303 | 158.6066 | 15503.8809 | 2.3670 | 65.5342 | 76.296 | 9.537 | 15.5724 | 107744.2734 |
| 42000 | 0.6788 | 158.5205 | 15468.9824 | 2.3671 | 65.5105 | 76.324 | 9.54 | 15.5576 | 107399.8828 |
| 45000 | 0.7273 | 158.7909 | 15399.4043 | 2.3670 | 65.5316 | 76.299 | 9.537 | 15.6163 | 106089.8359 |
| 48000 | 0.7758 | 158.7909 | 15434.1611 | 2.3671 | 65.4706 | 76.37 | 9.546 | 15.6027 | 106373.1953 |
| 51000 | 0.8242 | 158.8033 | 15425.4648 | 2.3669 | 65.5734 | 76.25 | 9.531 | 15.6169 | 106089.8359 |
| 54000 | 0.8727 | 158.9263 | 15434.1611 | 2.3670 | 65.5021 | 76.333 | 9.542 | 15.6085 | 106486.7812 |
| 57000 | 0.9212 | 159.3887 | 15451.5547 | 2.3671 | 65.5842 | 76.238 | 9.53 | 15.6505 | 107342.5391 |
| 60000 | 0.9697 | 159.4874 | 15390.7422 | 2.3670 | 65.5517 | 76.276 | 9.534 | 15.6641 | 105581.5391 |
| 61875 | 1.0 | 159.6729 | 15492.9736 | 2.3671 | 65.3871 | 76.468 | 9.558 | 15.6926 | 107342.5391 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0