metadata

base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2
    results: []

distily_bench_obj_cross_v2

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is not specified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (the values match the step 9000 row of the eval-phase metrics):

eval_enwikippl: 1882.2876
eval_frwikippl: 38923.2266
eval_zhwikippl: 63461.6641
eval_tinystoriesppl: 451.2739
eval_loss: 4.8257
eval_runtime: 13.1445
eval_samples_per_second: 76.078
eval_steps_per_second: 9.51
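
As a quick consistency check, the throughput figures above imply the size of the evaluation set. The sketch below multiplies the runtime by the two reported rates. The variable names are illustrative, and treating one eval step as one batch on a single device is an assumption, not something this card states.

```python
# Values copied from the evaluation results in this card.
eval_runtime = 13.1445             # seconds
eval_samples_per_second = 76.078
eval_steps_per_second = 9.51

# Implied totals, rounded because the reported rates are themselves rounded.
implied_samples = round(eval_runtime * eval_samples_per_second)
implied_steps = round(eval_runtime * eval_steps_per_second)

# Assumption: one eval step processes exactly one batch on a single device.
implied_batch_size = implied_samples / implied_steps

print(implied_samples, implied_steps, implied_batch_size)
```

Under that assumption the figures work out to about 1000 samples over 125 steps, which is consistent with the eval_batch_size of 8 listed in the hyperparameters.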

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0004
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_restarts
num_epochs: 1.0
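
The distillation_objective above combines a KL loss on logits (weight 1) with an MSE loss on hidden states (weight 2.0); the attention component has weight 0 and so contributes nothing. The following is a minimal pure-Python sketch of that weighted sum for a single token position and a single hidden-state vector. It is not Distily's implementation: the function names are hypothetical, the KL direction (teacher relative to student) is an assumption, and the card does not say how losses are reduced across tokens or layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student). The direction is an assumption."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_hs, student_hs):
    """Mean squared error between two equal-length vectors."""
    return sum((t - s) ** 2 for t, s in zip(teacher_hs, student_hs)) / len(teacher_hs)

def distillation_loss(teacher_logits, student_logits, teacher_hs, student_hs,
                      logits_weight=1.0, hs_weight=2.0):
    """Weighted sum using the weights listed in this card (attn weight 0 omitted)."""
    return (logits_weight * kl_divergence(teacher_logits, student_logits)
            + hs_weight * mse(teacher_hs, student_hs))

# Toy inputs, made up purely for illustration.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]
teacher_hs = [0.1, -0.2, 0.3, 0.0]
student_hs = [0.0, -0.1, 0.5, 0.1]

loss = distillation_loss(teacher_logits, student_logits, teacher_hs, student_hs)
print(loss)
```

When the student reproduces the teacher exactly, both terms vanish and the loss is zero; with hs_weight set to 2.0, a given hidden-state error counts twice as much as the same amount of KL divergence.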

Resource Usage

Peak GPU Memory: 8.1729 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	10909.4980	77116.0	6.3550	13.1937	75.794	9.474	4267.7983	73081.2031
1000	0.0808	1884.7683	38923.2266	4.8260	13.1354	76.13	9.516	453.2929	63529.4258
2000	0.1616	1882.5793	38923.2266	4.8257	13.2412	75.522	9.44	451.5352	63461.6641
3000	0.2424	1882.5793	38923.2266	4.8257	13.2384	75.538	9.442	451.6844	63461.6641
4000	0.3232	1881.7043	38923.2266	4.8257	13.2242	75.619	9.452	450.9009	63461.6641
5000	0.4040	1883.1630	38923.2266	4.8257	13.1558	76.012	9.501	451.8337	63461.6641
6000	0.4848	1883.1630	38923.2266	4.8257	13.2198	75.644	9.456	451.8337	63461.6641
7000	0.5657	1884.4762	38923.2266	4.8257	13.2183	75.653	9.457	452.8433	63529.4258
8000	0.6465	1882.5793	38923.2266	4.8257	13.1236	76.198	9.525	451.4604	63461.6641
9000	0.7273	1882.2876	38923.2266	4.8257	13.1445	76.078	9.51	451.2739	63461.6641
10000	0.8081	1880.2477	38923.2266	4.8257	13.2204	75.641	9.455	450.4167	63461.6641
11000	0.8889	1882.5793	38923.2266	4.8257	13.267	75.375	9.422	451.7592	63461.6641
12000	0.9697	1883.1630	38923.2266	4.8257	13.182	75.861	9.483	451.8337	63461.6641
12375	1.0	1883.1630	38923.2266	4.8257	13.202	75.746	9.468	451.8337	63461.6641
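
The table above can be cross-checked with a little arithmetic. The sketch below recomputes the epoch column from the step counts, derives an implied training-set size, and compares the final student perplexities with the teacher row. The variable names are illustrative, and the training-set size assumes no gradient accumulation and a single device, neither of which this card confirms.

```python
# Values copied from the eval-phase metrics table and the hyperparameters.
total_steps = 12375
train_batch_size = 8

# The epoch column appears to be step / total_steps, rounded to 4 places.
epochs = {step: round(step / total_steps, 4) for step in (1000, 2000, 7000, 9000)}

# Assumption: no gradient accumulation and a single device.
implied_train_samples = total_steps * train_batch_size

teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}
student_final = {  # step 12375 row
    "enwikippl": 1883.1630,
    "frwikippl": 38923.2266,
    "tinystoriesppl": 451.8337,
    "zhwikippl": 63461.6641,
}

# Ratio of student to teacher perplexity; values above 1 mean the student is worse.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

print(epochs)
print(implied_train_samples)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x teacher")
```

By this comparison the student's perplexity is roughly 11 times the teacher's on enwiki and more than 100 times on TinyStories, while its frwiki perplexity is lower than the teacher's.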

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.20.0