distily_TinyStories-33M

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (the values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

  • eval_enwikippl: 5885.9341
  • eval_frwikippl: 24294.9414
  • eval_zhwikippl: 264331.3438
  • eval_loss: 0.3987
  • eval_runtime: 51.5838
  • eval_samples_per_second: 48.465
  • eval_steps_per_second: 6.068
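
The throughput figures above are mutually consistent, and multiplying them out gives a rough size for the evaluation set. The sketch below assumes that eval_runtime is in seconds (the usual Transformers Trainer convention) and uses the eval_batch_size of 8 listed under the training hyperparameters. The variable names are illustrative.

```python
import math

# Values copied from the evaluation results above.
eval_runtime_s = 51.5838      # assumed to be seconds
samples_per_second = 48.465
steps_per_second = 6.068
eval_batch_size = 8           # from the training hyperparameters

# Runtime times throughput recovers the approximate counts.
approx_samples = round(eval_runtime_s * samples_per_second)
approx_steps = round(eval_runtime_s * steps_per_second)

# With batch size 8, the sample count should need this many batches.
expected_steps = math.ceil(approx_samples / eval_batch_size)

print(approx_samples, approx_steps, expected_steps)
```

Under these assumptions the evaluation set holds about 2,500 samples, processed in 313 batches.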

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
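
The distillation_objective above puts all of the weight on a KL-divergence loss over the logits, with the hidden-state and attention components weighted 0. As a rough illustration of that kind of loss, here is a minimal pure-Python sketch for a single token position. It is not Distily's implementation: the function names, the KL(teacher || student) direction, and the absence of a temperature are assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over one vocabulary distribution.

    Assumption: the teacher distribution is the target. This direction
    is a common convention, not something the model card states.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]

print(kl_logits_loss(teacher, teacher))  # identical logits give zero loss
print(kl_logits_loss(teacher, student))  # mismatched logits give a positive loss
```

In a real training loop this quantity would typically be averaged over every token position in the batch.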

Resource Usage

Peak GPU Memory: 8.1416 GB

Eval-Phase Metrics

step epoch enwikippl frwikippl loss runtime samples_per_second steps_per_second zhwikippl
teacher eval - 20633.1680 131577.2812 - - - - 7615.4468
0 0 55266.375 57180.4375 6.2843 26.4237 94.612 11.845 56806.5430
1000 0.0323 11414.3389 87921.1172 0.7142 26.3405 94.911 11.883 611931.1875
2000 0.0646 8814.8682 53295.2305 0.6287 51.0412 48.98 6.132 507315.5625
3000 0.0970 8020.6040 41652.3320 0.5662 29.4187 84.98 10.639 268242.625
4000 0.1293 7153.7090 33178.5977 0.5197 40.0478 62.425 7.816 315367.9062
5000 0.1616 6865.2617 31042.1875 0.4833 36.655 68.203 8.539 372857.25
6000 0.1939 6828.5781 30924.2324 0.4539 47.1811 52.987 6.634 379690.5
7000 0.2263 6329.1855 28375.3984 0.4331 51.6027 48.447 6.066 325812.875
8000 0.2586 6229.7119 28592.2773 0.4123 51.6184 48.432 6.064 318159.5
9000 0.2909 5885.9341 24294.9414 0.3987 51.5838 48.465 6.068 264331.3438
10000 0.3232 5634.5898 24401.3828 0.3856 51.6233 48.428 6.063 248118.4062
11000 0.3555 5849.9346 26113.8555 0.3761 51.5949 48.454 6.066 255583.9844
12000 0.3879 5588.8325 23138.0430 0.3666 51.5384 48.508 6.073 255106.6875
13000 0.4202 5498.4355 23102.1699 0.3618 51.6778 48.377 6.057 244239.3125
14000 0.4525 5495.8716 24775.8398 0.3530 51.4537 48.587 6.083 271776.25
15000 0.4848 5449.1309 23173.9512 0.3490 51.6347 48.417 6.062 235716.0625
16000 0.5172 5464.8057 25348.3184 0.3430 48.3546 51.701 6.473 305992.3125
17000 0.5495 5289.8618 23652.6602 0.3426 45.4673 54.985 6.884 290930.0625
18000 0.5818 5362.6548 23393.9375 0.3378 42.8681 58.318 7.301 237739.0938
19000 0.6141 5970.6357 32165.1016 0.3332 38.4757 64.976 8.135 492760.0312
20000 0.6465 5680.7217 30225.7988 0.3322 31.9943 78.139 9.783 391742.4062
21000 0.6788 5494.1685 27750.1914 0.3288 49.7191 50.283 6.295 288762.6875
22000 0.7111 5693.0815 24919.4883 0.3272 49.6244 50.378 6.307 263274.4375
23000 0.7434 5303.4346 25441.4375 0.3230 50.6137 49.394 6.184 261801.9844
24000 0.7757 5458.4463 26499.6543 0.3217 51.4227 48.617 6.087 229626.5781
25000 0.8081 5728.1162 28263.5859 0.3203 51.6717 48.382 6.057 258605.3594
26000 0.8404 5226.1689 23493.1152 0.3186 51.4811 48.562 6.08 180660.6719
27000 0.8727 5192.1890 22039.3262 0.3165 51.6376 48.414 6.061 194013.875
28000 0.9050 5418.7476 22450.2344 0.3169 51.6539 48.399 6.06 182503.5312
29000 0.9374 5170.8613 23860.3691 0.3141 51.4944 48.549 6.078 197516.9531
30000 0.9697 5569.3379 25081.6641 0.3130 51.3337 48.701 6.097 160202.3281
30938 1.0 5306.7280 25078.125 0.3130 51.5266 48.519 6.075 179410.5625
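
The numeric rows above are whitespace-separated, so they are easy to load for a quick look at the trend. The sketch below parses a few rows copied from the table and checks that the epoch column is simply step divided by the final step count, 30938. The helper name parse_row and the choice of rows are illustrative, and the teacher row is left out because it is not a full numeric row.

```python
COLUMNS = ["step", "epoch", "enwikippl", "frwikippl", "loss", "runtime",
           "samples_per_second", "steps_per_second", "zhwikippl"]

# A few rows copied verbatim from the table above.
ROWS = """\
1000 0.0323 11414.3389 87921.1172 0.7142 26.3405 94.911 11.883 611931.1875
9000 0.2909 5885.9341 24294.9414 0.3987 51.5838 48.465 6.068 264331.3438
29000 0.9374 5170.8613 23860.3691 0.3141 51.4944 48.549 6.078 197516.9531
30938 1.0 5306.7280 25078.125 0.3130 51.5266 48.519 6.075 179410.5625
"""

def parse_row(line):
    """Map one whitespace-separated row onto the column names."""
    values = [float(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))

records = [parse_row(line) for line in ROWS.splitlines()]
total_steps = records[-1]["step"]

for rec in records:
    # The epoch column matches step / total_steps to four decimal places.
    assert abs(rec["step"] / total_steps - rec["epoch"]) < 5e-5

best = min(records, key=lambda rec: rec["enwikippl"])
print(int(best["step"]), best["enwikippl"])
```

Among the sampled rows, step 29000 has the lowest enwikippl (5170.8613), which is also the lowest value in the full table, while the loss keeps falling through the final step.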

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0

Model size: 68.5M params (Safetensors)
Tensor type: BF16