---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross
    results: []
---

distily_bench_obj_cross

This student model was distilled from the teacher model roneneldan/TinyStories-33M using the dataset (unspecified).

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 150.6954
  • eval_frwikippl: 20983.1934
  • eval_zhwikippl: 163274.0312
  • eval_tinystoriesppl: 13.6584
  • eval_loss: 2.1824
  • eval_runtime: 65.7475
  • eval_samples_per_second: 76.049
  • eval_steps_per_second: 9.506
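The eval_*ppl metrics above are perplexities on the respective corpora (English Wikipedia, French Wikipedia, Chinese Wikipedia, TinyStories). Perplexity is the exponential of the mean per-token cross-entropy in nats; a minimal sketch of that relationship (the exact tokenization and averaging Distily uses are not specified here):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)

# A mean cross-entropy of ~2.6142 nats corresponds to a perplexity near 13.66,
# the same order of magnitude as eval_tinystoriesppl above.
print(round(perplexity(2.6142), 2))
```

Note also that eval_runtime × eval_samples_per_second ≈ 65.75 × 76.05 ≈ 5000, so the evaluation set contains roughly 5,000 samples.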

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
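The distillation_objective above combines three weighted components: KL divergence on the logits (weight 1), MSE on the hidden states (weight 10), and MSE on the attention maps (weight 10), with no layer mapper or projector. A minimal PyTorch sketch of such a weighted objective — an illustration under assumed output names, not Distily's actual implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    """Weighted sum of a logits KL term and per-layer MSE terms on hidden
    states and attentions, mirroring the objective listed above.
    `student_out`/`teacher_out` are assumed dicts with "logits",
    "hidden_states", and "attentions" entries."""
    # KL divergence between teacher and student next-token distributions
    kl = F.kl_div(
        F.log_softmax(student_out["logits"], dim=-1),
        F.log_softmax(teacher_out["logits"], dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # MSE summed over every layer (layer_mapper=None, projector=None)
    hs_mse = sum(F.mse_loss(s, t) for s, t in
                 zip(student_out["hidden_states"], teacher_out["hidden_states"]))
    attn_mse = sum(F.mse_loss(s, t) for s, t in
                   zip(student_out["attentions"], teacher_out["attentions"]))
    return logits_weight * kl + hs_weight * hs_mse + attn_weight * attn_mse
```

With identical student and teacher outputs every component vanishes, so the loss is zero; in training, the heavy 10× weights push the student to match the teacher's intermediate activations, not just its output distribution.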

Resource Usage

Peak GPU Memory: 8.2666 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 15507.5488 | 57030.9961 | 5.9658 | 65.2151 | 76.669 | 9.584 | 7965.0278 | 102309.4219 |
| 3000 | 0.0485 | 150.3223 | 21042.3887 | 2.1825 | 65.2954 | 76.575 | 9.572 | 13.5953 | 164147.5625 |
| 6000 | 0.0970 | 150.5029 | 21042.3887 | 2.1824 | 65.675 | 76.132 | 9.517 | 13.6268 | 164498.2812 |
| 9000 | 0.1455 | 150.6954 | 20983.1934 | 2.1824 | 65.7475 | 76.049 | 9.506 | 13.6584 | 163274.0312 |
| 12000 | 0.1939 | 150.3514 | 20983.1934 | 2.1823 | 65.8055 | 75.981 | 9.498 | 13.6251 | 162882.4219 |
| 15000 | 0.2424 | 150.6196 | 21042.3887 | 2.1824 | 65.3959 | 76.457 | 9.557 | 13.6482 | 164147.5625 |
| 18000 | 0.2909 | 150.6487 | 21042.3887 | 2.1824 | 65.5594 | 76.267 | 9.533 | 13.6533 | 163797.5938 |
| 21000 | 0.3394 | 150.3980 | 21042.3887 | 2.1824 | 65.4773 | 76.362 | 9.545 | 13.6234 | 163972.4844 |
| 24000 | 0.3879 | 150.5495 | 21131.4961 | 2.1825 | 65.4348 | 76.412 | 9.551 | 13.6184 | 164849.75 |
| 27000 | 0.4364 | 150.7538 | 21042.3887 | 2.1822 | 65.4127 | 76.438 | 9.555 | 13.6635 | 162925.9219 |
| 30000 | 0.4848 | 150.6954 | 21042.3887 | 2.1825 | 65.4113 | 76.439 | 9.555 | 13.6510 | 164586.0 |
| 33000 | 0.5333 | 151.0109 | 20983.1934 | 2.1823 | 65.6274 | 76.188 | 9.523 | 13.6832 | 163186.8594 |
| 36000 | 0.5818 | 150.3514 | 21042.3887 | 2.1824 | 65.4107 | 76.44 | 9.555 | 13.6184 | 164586.0 |
| 39000 | 0.6303 | 150.6020 | 20983.1934 | 2.1823 | 65.415 | 76.435 | 9.554 | 13.6550 | 163884.9375 |
| 42000 | 0.6788 | 150.5495 | 21042.3887 | 2.1823 | 65.3696 | 76.488 | 9.561 | 13.6454 | 163186.8594 |
| 45000 | 0.7273 | 150.3223 | 20995.0234 | 2.1824 | 65.7092 | 76.093 | 9.512 | 13.6257 | 163274.0312 |
| 48000 | 0.7758 | 150.8706 | 21042.3887 | 2.1824 | 65.5511 | 76.276 | 9.535 | 13.6652 | 163186.8594 |
| 51000 | 0.8242 | 150.8940 | 21006.8594 | 2.1823 | 65.6118 | 76.206 | 9.526 | 13.6719 | 163186.8594 |
| 54000 | 0.8727 | 150.4738 | 20918.2773 | 2.1824 | 65.6557 | 76.155 | 9.519 | 13.6539 | 162925.9219 |
| 57000 | 0.9212 | 150.4446 | 21042.3887 | 2.1824 | 65.3885 | 76.466 | 9.558 | 13.6257 | 163622.8906 |
| 60000 | 0.9697 | 150.4097 | 20918.2773 | 2.1824 | 65.4087 | 76.442 | 9.555 | 13.6533 | 162795.4688 |
| 61875 | 1.0 | 150.6896 | 21042.3887 | 2.1825 | 65.8705 | 75.907 | 9.488 | 13.6533 | 163972.4844 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0