distily_bench_obj_cross

This student model is distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is not specified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step-9000 row of the eval-phase metrics, not the final step):

  • eval_enwikippl: 148.8680
  • eval_frwikippl: 21987.7637
  • eval_zhwikippl: 181662.0469
  • eval_tinystoriesppl: 12.2941
  • eval_loss: 25.4402
  • eval_runtime: 66.3462
  • eval_samples_per_second: 75.362
  • eval_steps_per_second: 9.42
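The throughput figures above can be cross-checked against each other. The following is a minimal sketch, assuming each eval step processes exactly one batch; it only uses the numbers listed above and recovers the approximate number of evaluation samples and steps.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 66.3462        # seconds
samples_per_second = 75.362
steps_per_second = 9.42

# Throughput multiplied by runtime recovers the approximate totals.
n_samples = samples_per_second * eval_runtime
n_steps = steps_per_second * eval_runtime

# ASSUMPTION: one step is one batch, so this ratio should match eval_batch_size.
samples_per_step = n_samples / n_steps

print(round(n_samples), round(n_steps), round(samples_per_step, 2))
```

This works out to roughly 5,000 samples over 625 steps, or 8 samples per step, which is consistent with the eval_batch_size of 8 reported under the training hyperparameters.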

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=kl, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.004
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
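The distillation_objective above combines three components with weights 1, 10, and 10. The sketch below illustrates one way such a weighted sum could be formed in plain Python. It is not Distily's implementation: the helper names, the KL direction (teacher relative to student), and the use of a softmax to turn hidden states into distributions before applying KL are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Weights taken from the distillation_objective listed above.
WEIGHTS = {"logits": 1.0, "hs": 10.0, "attn": 10.0}

def combined_loss(teacher, student):
    """Weighted sum of the three components (illustrative only)."""
    logits = kl_divergence(softmax(teacher["logits"]), softmax(student["logits"]))
    # ASSUMPTION: hidden states are softmax-normalized before KL is applied.
    hs = kl_divergence(softmax(teacher["hs"]), softmax(student["hs"]))
    attn = mse(teacher["attn"], student["attn"])
    return WEIGHTS["logits"] * logits + WEIGHTS["hs"] * hs + WEIGHTS["attn"] * attn

# Toy vectors standing in for real model outputs (hypothetical values).
teacher = {"logits": [2.0, 1.0, 0.1], "hs": [0.5, -0.3, 0.8], "attn": [0.7, 0.2, 0.1]}
student = {"logits": [1.5, 1.2, 0.3], "hs": [0.4, -0.1, 0.6], "attn": [0.6, 0.3, 0.1]}

print(combined_loss(teacher, student))
print(combined_loss(teacher, teacher))  # identical inputs give zero loss
```

With weights of 10 on the hidden-state and attention terms, those components can dominate the total when their raw values are comparable to the logits term.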

Resource Usage

Peak GPU Memory: 8.2666 GB
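For scale, the student's weights account for only a small part of that peak. The estimate below is a rough sketch using the 68.5M parameter count and BF16 tensor type reported for this model; it assumes 1 GB = 1e9 bytes and that weights are held in BF16 during training, neither of which the card confirms.

```python
n_params = 68.5e6      # "68.5M params" reported for this model
bytes_per_param = 2    # BF16 stores each value in 16 bits
peak_gpu_gb = 8.2666   # peak GPU memory reported above

# ASSUMPTION: decimal gigabytes (1e9 bytes); the report may use GiB instead.
weights_gb = n_params * bytes_per_param / 1e9
share = weights_gb / peak_gpu_gb

print(f"{weights_gb:.3f} GB of weights, {share:.1%} of the reported peak")
```

The rest of the peak would plausibly come from activations, gradients, optimizer state, and the teacher model, though the card does not break this down.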

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 8167.3613 | 48488.5742 | 38.4688 | 65.8686 | 75.909 | 9.489 | 3345.3254 | 73944.1484 |
| 3000 | 0.0485 | 145.5666 | 22094.8828 | 25.4406 | 65.8501 | 75.93 | 9.491 | 11.9420 | 179685.7344 |
| 6000 | 0.0970 | 146.8635 | 22376.7676 | 25.4394 | 66.4135 | 75.286 | 9.411 | 11.9667 | 183024.1719 |
| 9000 | 0.1455 | 148.8680 | 21987.7637 | 25.4402 | 66.3462 | 75.362 | 9.42 | 12.2941 | 181662.0469 |
| 12000 | 0.1939 | 151.0636 | 22504.7676 | 25.4400 | 66.2246 | 75.501 | 9.438 | 12.5052 | 181759.0938 |
| 15000 | 0.2424 | 146.5339 | 22604.8535 | 25.4392 | 66.1192 | 75.621 | 9.453 | 11.8540 | 189888.4375 |
| 18000 | 0.2909 | 147.3192 | 22481.0215 | 25.4400 | 66.2457 | 75.477 | 9.435 | 12.0058 | 183905.2969 |
| 21000 | 0.3394 | 150.9525 | 22555.5625 | 25.4390 | 66.2661 | 75.453 | 9.432 | 12.4310 | 188575.7344 |
| 24000 | 0.3879 | 149.8920 | 22155.6523 | 25.4404 | 66.2363 | 75.487 | 9.436 | 12.4593 | 177493.9531 |
| 27000 | 0.4364 | 147.1653 | 22531.7402 | 25.4398 | 66.3823 | 75.321 | 9.415 | 11.9514 | 183905.2969 |
| 30000 | 0.4848 | 150.4855 | 22580.9805 | 25.4400 | 66.2281 | 75.497 | 9.437 | 12.4172 | 183513.1875 |
| 33000 | 0.5333 | 145.7359 | 22307.5195 | 25.4400 | 66.4448 | 75.25 | 9.406 | 11.9159 | 180165.8438 |
| 36000 | 0.5818 | 148.7297 | 22495.2617 | 25.4396 | 66.2715 | 75.447 | 9.431 | 12.1426 | 186574.0156 |
| 39000 | 0.6303 | 147.5820 | 22807.9492 | 25.4406 | 66.6342 | 75.037 | 9.38 | 11.9944 | 187372.1406 |
| 42000 | 0.6788 | 150.2292 | 22193.125 | 25.4402 | 66.5873 | 75.089 | 9.386 | 12.5202 | 182050.1875 |
| 45000 | 0.7273 | 146.7725 | 22207.2051 | 25.4400 | 66.1476 | 75.589 | 9.449 | 11.9890 | 181468.2812 |
| 48000 | 0.7758 | 146.3014 | 22194.6914 | 25.4398 | 66.4166 | 75.282 | 9.41 | 11.9746 | 177588.7812 |
| 51000 | 0.8242 | 148.6375 | 22533.3301 | 25.4402 | 66.2612 | 75.459 | 9.432 | 12.1471 | 186275.5156 |
| 54000 | 0.8727 | 147.6220 | 22394.1035 | 25.4404 | 66.4085 | 75.292 | 9.411 | 12.1140 | 185581.1406 |
| 57000 | 0.9212 | 148.8161 | 22679.8047 | 25.4400 | 66.3328 | 75.377 | 9.422 | 12.1230 | 187872.7812 |
| 60000 | 0.9697 | 146.8180 | 22345.2695 | 25.4392 | 66.5261 | 75.158 | 9.395 | 12.0317 | 181371.5625 |
| 61875 | 1.0 | 149.0526 | 22099.5410 | 25.4400 | 66.498 | 75.19 | 9.399 | 12.3048 | 181371.5625 |
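To compare the final student checkpoint with the teacher, the perplexities can be expressed as student-to-teacher ratios. This is a small sketch under one assumption: the four values in the teacher eval row correspond, in order, to the four perplexity columns (enwikippl, frwikippl, tinystoriesppl, zhwikippl), since the teacher row reports no loss or runtime figures.

```python
# Teacher values from the "teacher eval" row.
# ASSUMPTION: the four values map, in order, to the four perplexity columns.
teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}

# Student values from the final row (step 61875).
student = {
    "enwikippl": 149.0526,
    "frwikippl": 22099.5410,
    "tinystoriesppl": 12.3048,
    "zhwikippl": 181371.5625,
}

# A ratio below 1 means the student has lower perplexity than the teacher.
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    direction = "lower" if ratio < 1 else "higher"
    print(f"{name}: student/teacher = {ratio:.2f} ({direction} than teacher)")
```

Under that column assumption, the student's perplexity is lower than the teacher's on English and French Wikipedia, about 3 times higher on TinyStories, and about 36 times higher on Chinese Wikipedia.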

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.20.0

Model size: 68.5M params (Safetensors)
Tensor type: BF16