---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross
    results: []
---

# distily_bench_obj_cross

This student model is distilled from the teacher model [roneneldan/TinyStories-33M](https://huggingface.co/roneneldan/TinyStories-33M) using the dataset (unspecified).

The Distily library was used for this distillation.
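
The student is a standard causal language model and can be loaded with the usual `transformers` API. The sketch below assumes the model is published under the Hub id `lapp0/distily_bench_obj_cross` (inferred from this card's name and author; the actual repo id may differ):

```python
# Minimal usage sketch; the repo id is an assumption inferred from this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```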

It achieves the following results on the evaluation set:

- eval_enwikippl: 207.7861
- eval_frwikippl: 15066.8408
- eval_zhwikippl: 64727.0352
- eval_tinystoriesppl: 24.1522
- eval_loss: 13.2019
- eval_runtime: 65.4151
- eval_samples_per_second: 76.435
- eval_steps_per_second: 9.554
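
The `*ppl` entries are perplexities on English, French, and Chinese Wikipedia text and on TinyStories. For reference, the sketch below shows the conventional perplexity definition, exp of the mean token-level negative log-likelihood; Distily's exact evaluation pipeline is not reproduced here and may differ in details such as batching and sequence stride:

```python
# Sketch of the standard perplexity definition: ppl = exp(mean token NLL).
# Illustrative only; Distily's actual evaluation code may differ.
import torch
import torch.nn.functional as F

def perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity of a causal LM on one tokenized sequence (shape (1, seq_len))."""
    with torch.no_grad():
        logits = model(input_ids).logits       # (1, seq_len, vocab)
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return torch.exp(nll).item()
```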

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))` (a schematic implementation of this objective appears after this list)
- train_embeddings: True
- learning_rate: 0.001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1.0
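
The `distillation_objective` printed above combines three terms: a KL-divergence loss on logits (weight 1), a raw MSE loss on hidden states (weight 10), and an MSE loss on attention maps (weight 10), with no layer mapping or projection. The following is a schematic re-implementation of that weighted sum, not Distily's actual code. It assumes both forward passes were run with `output_hidden_states=True` and `output_attentions=True` and that student and teacher layers match in count and shape (consistent with `layer_mapper=None`, `projector=None`):

```python
# Schematic weighted distillation loss: KL(logits, w=1) + MSE(hidden states,
# w=10) + MSE(attentions, w=10). Not Distily's actual implementation.
# "raw_mse" is assumed here to mean plain, unnormalized MSE.
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_w=1.0, hs_w=10.0, attn_w=10.0):
    vocab = student_out.logits.size(-1)
    # KL divergence between student and teacher next-token distributions,
    # averaged over tokens.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1).reshape(-1, vocab),
        F.softmax(teacher_out.logits, dim=-1).reshape(-1, vocab),
        reduction="batchmean",
    )
    # MSE over each layer's hidden states (layer_mapper=None -> 1:1 pairing).
    hs = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ) / len(student_out.hidden_states)
    # MSE over each layer's attention maps.
    attn = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ) / len(student_out.attentions)
    return logits_w * kl + hs_w * hs + attn_w * attn
```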

### Resource Usage

Peak GPU Memory: 8.2677 GB
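
How this figure was measured is not stated in the card; one common way to read peak allocated GPU memory with PyTorch is via the CUDA allocator's counters, as in this sketch:

```python
# One way to read peak allocated GPU memory in GiB with PyTorch;
# whether this card's figure was produced this way is an assumption.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run training ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"Peak GPU Memory: {peak_gib:.4f} GB")
```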

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 21397.4785 | 57946.0117 | 18.3162 | 65.6143 | 76.203 | 9.525 | 12321.8145 | 60955.8008 |
| 3000 | 0.0485 | 207.9149 | 15083.8350 | 13.2031 | 65.3099 | 76.558 | 9.57 | 24.1822 | 63920.4375 |
| 6000 | 0.0970 | 207.6253 | 15109.3467 | 13.2019 | 65.2976 | 76.572 | 9.572 | 24.1253 | 65386.5469 |
| 9000 | 0.1455 | 207.7861 | 15066.8408 | 13.2019 | 65.4151 | 76.435 | 9.554 | 24.1522 | 64727.0352 |
| 12000 | 0.1939 | 207.4002 | 15100.8330 | 13.2016 | 65.3491 | 76.512 | 9.564 | 24.0894 | 65229.7188 |
| 15000 | 0.2424 | 207.8023 | 15100.8330 | 13.2017 | 65.4255 | 76.423 | 9.553 | 24.1442 | 65247.1406 |
| 18000 | 0.2909 | 208.3987 | 15075.3359 | 13.2031 | 65.4213 | 76.428 | 9.553 | 24.2462 | 64057.0078 |
| 21000 | 0.3394 | 208.0761 | 15100.8330 | 13.2026 | 65.2706 | 76.604 | 9.576 | 24.2142 | 64537.3164 |
| 24000 | 0.3879 | 207.9955 | 15100.8330 | 13.2027 | 65.2287 | 76.653 | 9.582 | 24.1822 | 64159.6602 |
| 27000 | 0.4364 | 208.3180 | 15058.3516 | 13.2033 | 65.1653 | 76.728 | 9.591 | 24.2272 | 63869.3125 |
| 30000 | 0.4848 | 207.1754 | 15100.8330 | 13.2016 | 65.1169 | 76.785 | 9.598 | 24.0546 | 65229.7188 |
| 33000 | 0.5333 | 208.0761 | 15083.8350 | 13.2026 | 65.2105 | 76.675 | 9.584 | 24.2412 | 64588.9727 |
| 36000 | 0.5818 | 207.1754 | 15066.8408 | 13.2023 | 65.3453 | 76.517 | 9.565 | 24.0715 | 65229.7188 |
| 39000 | 0.6303 | 207.1754 | 15100.8330 | 13.2017 | 65.2569 | 76.62 | 9.578 | 24.0695 | 65229.7188 |
| 42000 | 0.6788 | 207.3681 | 15058.3516 | 13.2021 | 65.2167 | 76.668 | 9.583 | 24.0954 | 64796.1484 |
| 45000 | 0.7273 | 207.9955 | 15100.8330 | 13.2026 | 65.2551 | 76.622 | 9.578 | 24.1982 | 64159.6602 |
| 48000 | 0.7758 | 207.7861 | 15092.3242 | 13.2017 | 65.3187 | 76.548 | 9.568 | 24.1412 | 64727.0352 |
| 51000 | 0.8242 | 208.2050 | 15058.3516 | 13.2029 | 65.2525 | 76.625 | 9.578 | 24.2262 | 64193.8711 |
| 54000 | 0.8727 | 207.7861 | 15100.8330 | 13.2027 | 65.2798 | 76.593 | 9.574 | 24.1362 | 64331.0312 |
| 57000 | 0.9212 | 207.7218 | 15100.8330 | 13.2017 | 65.2646 | 76.611 | 9.576 | 24.1163 | 65125.4180 |
| 60000 | 0.9697 | 208.5925 | 15092.3242 | 13.2034 | 65.2233 | 76.66 | 9.582 | 24.2653 | 63869.3125 |
| 61875 | 1.0 | 208.0116 | 15100.8330 | 13.2018 | 65.2936 | 76.577 | 9.572 | 24.2012 | 64917.2539 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
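
To check a local environment against these versions, something like the following works; note that `distily` as an installable distribution name is an assumption:

```python
# Print installed versions of the packages listed above.
# "distily" as the distribution name is an assumption and may raise
# PackageNotFoundError if the library is installed under another name.
from importlib.metadata import version

for pkg in ("distily", "transformers", "torch", "datasets"):
    print(pkg, version(pkg))
```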