metadata
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2
  results: []
distily_bench_obj_cross_v2
This student model was distilled from the teacher model roneneldan/TinyStories-33M with the Distily library. The training dataset is not specified.
It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table):
- eval_enwikippl: 1882.2876
- eval_frwikippl: 38923.2266
- eval_zhwikippl: 63461.6641
- eval_tinystoriesppl: 451.2739
- eval_loss: 4.8257
- eval_runtime: 13.1445
- eval_samples_per_second: 76.078
- eval_steps_per_second: 9.51
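
The throughput figures above can be cross-checked against each other. The following is a minimal sketch, not part of the original run: it multiplies the reported rates by the reported runtime to estimate how many samples and steps one evaluation pass covered. The variable names are illustrative, and rounding to whole counts is an assumption, since the card does not state the evaluation set size.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 13.1445            # seconds
eval_samples_per_second = 76.078
eval_steps_per_second = 9.51

# rate * time gives an approximate count; rounding to an integer is an assumption.
approx_samples = round(eval_samples_per_second * eval_runtime)
approx_steps = round(eval_steps_per_second * eval_runtime)

# Samples processed per evaluation step implied by the two counts.
implied_batch_size = approx_samples / approx_steps

print(approx_samples, approx_steps, implied_batch_size)
```

The implied samples-per-step value can be compared with the eval_batch_size listed under the training hyperparameters.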
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_restarts
- num_epochs: 1.0
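
The distillation_objective above combines a KL term on the logits (weight 1) with an MSE term on the hidden states (weight 2.0); the attention term has weight 0 and contributes nothing. The following is a minimal pure-Python sketch of such a weighted sum for a single token position, not Distily's actual implementation. The function names, the KL direction (teacher relative to student), and the absence of any reduction over layers or positions are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (direction is an assumption)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_hidden, student_hidden):
    """Mean squared error between two hidden-state vectors of equal length."""
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n

def distillation_loss(t_logits, s_logits, t_hidden, s_hidden,
                      logits_weight=1.0, hs_weight=2.0):
    """Weighted sum mirroring the weights in the card; the attention term (weight 0) is omitted."""
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + hs_weight * mse(t_hidden, s_hidden))

# Hypothetical toy inputs: a 3-token vocabulary and 2-dimensional hidden states.
loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], [1.0, 2.0], [0.9, 2.1])
print(loss)
```

Because the hidden-state term carries twice the weight of the logits term, a given MSE contributes twice as much to the total as an equal KL value.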
Resource Usage
Peak GPU Memory: 8.1729 GB
Eval-Phase Metrics
step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
---|---|---|---|---|---|---|---|---|---|
teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
0 | 0 | 10909.4980 | 77116.0 | 6.3550 | 13.1937 | 75.794 | 9.474 | 4267.7983 | 73081.2031 |
1000 | 0.0808 | 1884.7683 | 38923.2266 | 4.8260 | 13.1354 | 76.13 | 9.516 | 453.2929 | 63529.4258 |
2000 | 0.1616 | 1882.5793 | 38923.2266 | 4.8257 | 13.2412 | 75.522 | 9.44 | 451.5352 | 63461.6641 |
3000 | 0.2424 | 1882.5793 | 38923.2266 | 4.8257 | 13.2384 | 75.538 | 9.442 | 451.6844 | 63461.6641 |
4000 | 0.3232 | 1881.7043 | 38923.2266 | 4.8257 | 13.2242 | 75.619 | 9.452 | 450.9009 | 63461.6641 |
5000 | 0.4040 | 1883.1630 | 38923.2266 | 4.8257 | 13.1558 | 76.012 | 9.501 | 451.8337 | 63461.6641 |
6000 | 0.4848 | 1883.1630 | 38923.2266 | 4.8257 | 13.2198 | 75.644 | 9.456 | 451.8337 | 63461.6641 |
7000 | 0.5657 | 1884.4762 | 38923.2266 | 4.8257 | 13.2183 | 75.653 | 9.457 | 452.8433 | 63529.4258 |
8000 | 0.6465 | 1882.5793 | 38923.2266 | 4.8257 | 13.1236 | 76.198 | 9.525 | 451.4604 | 63461.6641 |
9000 | 0.7273 | 1882.2876 | 38923.2266 | 4.8257 | 13.1445 | 76.078 | 9.51 | 451.2739 | 63461.6641 |
10000 | 0.8081 | 1880.2477 | 38923.2266 | 4.8257 | 13.2204 | 75.641 | 9.455 | 450.4167 | 63461.6641 |
11000 | 0.8889 | 1882.5793 | 38923.2266 | 4.8257 | 13.267 | 75.375 | 9.422 | 451.7592 | 63461.6641 |
12000 | 0.9697 | 1883.1630 | 38923.2266 | 4.8257 | 13.182 | 75.861 | 9.483 | 451.8337 | 63461.6641 |
12375 | 1.0 | 1883.1630 | 38923.2266 | 4.8257 | 13.202 | 75.746 | 9.468 | 451.8337 | 63461.6641 |
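
The step and epoch columns are related by simple arithmetic. The sketch below, which is illustrative and not taken from the training code, reproduces the epoch values from the step counts and estimates how many training samples one epoch covered. It assumes a single device and no gradient accumulation, neither of which is stated in this card.

```python
total_steps = 12375     # final step in the table above (epoch 1.0)
train_batch_size = 8    # from the training hyperparameters

# Assumption: one device, no gradient accumulation.
approx_train_samples = total_steps * train_batch_size

def epoch_at(step):
    """Fraction of the epoch completed at a given step, rounded to 4 places as in the table."""
    return round(step / total_steps, 4)

print(approx_train_samples)
print(epoch_at(1000), epoch_at(7000), epoch_at(12375))
```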
Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0
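
To compare a local environment against the versions listed above, a small check along the following lines may help. This is a sketch using only the standard library; the distribution names (notably `distily` and `torch`) are assumptions and may differ from how the packages are installed on a given system.

```python
from importlib import metadata

# Versions listed in this card. The distribution names are assumptions;
# adjust them to match your environment.
EXPECTED = {
    "distily": "0.2.0",
    "transformers": "4.44.0",
    "torch": "2.3.0",
    "datasets": "2.20.0",
}

def check_versions(expected=EXPECTED):
    """Return {package: (expected_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed under this name
        report[name] = (wanted, installed)
    return report

for name, (wanted, installed) in check_versions().items():
    print(f"{name}: expected {wanted}, installed {installed}")
```

Installed version strings can carry build suffixes, so the sketch only reports both values and leaves the comparison to the reader.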