lapp0 committed
Commit
5a46d24
Parent: 06c2c39

Training in progress, step 61875

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 4707.3955
- - eval_frwikippl: 38983.5547
- - eval_zhwikippl: 53243.9219
- - eval_tinystoriesppl: 1636.1367
- - eval_loss: 5.6090
- - eval_runtime: 33.6693
- - eval_samples_per_second: 74.252
- - eval_steps_per_second: 9.296
+ - eval_enwikippl: 26828.6738
+ - eval_frwikippl: 80365.2578
+ - eval_zhwikippl: 64005.7734
+ - eval_tinystoriesppl: 14879.9648
+ - eval_loss: 7.0927
+ - eval_runtime: 33.2949
+ - eval_samples_per_second: 75.087
+ - eval_steps_per_second: 9.401
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
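The `*ppl` metrics in this hunk are perplexities on evaluation text from English Wikipedia, French Wikipedia, Chinese Wikipedia, and TinyStories. As a rough sketch of how such a figure is produced, assuming the standard exp-of-mean-cross-entropy definition (Distily's exact datasets and windowing are not shown here, and the checkpoint path below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: substitute the actual distilled-student checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/distilled-student").eval()
tokenizer = AutoTokenizer.from_pretrained("path/to/distilled-student")

@torch.no_grad()
def perplexity(text: str) -> float:
    # Perplexity = exp(mean per-token cross-entropy of the LM on the text).
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
    return torch.exp(out.loss).item()

print(perplexity("Once upon a time, there was a little girl named Lily."))
```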
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 0.0004
+ - learning_rate: 4e-06
 - train_batch_size: 16
 - eval_batch_size: 8
 - seed: 42
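The `distillation_objective` entry encodes a three-term loss: KL divergence on the logits (weight 1), MSE on hidden states (weight 10.0), and raw MSE on attention maps (weight 10.0), with no layer mapping or projection. A minimal PyTorch sketch of such an objective; the real `DistillationObjective` lives in the Distily repo, and the pairing and reduction choices here are assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, hs_weight=10.0, attn_weight=10.0):
    """Both outputs must be produced with output_hidden_states=True and
    output_attentions=True; layer counts are assumed to match (layer_mapper=None
    is read here as a one-to-one pairing)."""
    # Logits term (weight 1): KL(teacher || student) over next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Hidden-state term: plain MSE over corresponding layers.
    hs = sum(F.mse_loss(s, t) for s, t in
             zip(student_out.hidden_states, teacher_out.hidden_states))
    # Attention term: "raw_mse" read as elementwise MSE on attention maps.
    attn = sum(F.mse_loss(s, t) for s, t in
               zip(student_out.attentions, teacher_out.attentions))
    return kl + hs_weight * hs + attn_weight * attn
```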
@@ -62,18 +62,18 @@ Peak GPU Memory: 16.2515 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.5606 | 74.492 | 9.326 | 15431.1592 | 64176.7344 |
- | 2000 | 0.1293 | 4707.3955 | 39038.5078 | 5.6090 | 33.5798 | 74.45 | 9.321 | 1635.3256 | 53243.9219 |
- | 4000 | 0.2586 | 4704.4829 | 38983.5547 | 5.6090 | 33.4351 | 74.772 | 9.361 | 1632.6235 | 53243.9219 |
- | 6000 | 0.3879 | 4717.6201 | 38972.5898 | 5.6090 | 33.4912 | 74.646 | 9.346 | 1638.3024 | 53243.9219 |
- | 8000 | 0.5172 | 4707.3955 | 38983.5547 | 5.6090 | 33.6693 | 74.252 | 9.296 | 1636.1367 | 53243.9219 |
- | 10000 | 0.6465 | 4707.3955 | 38994.5625 | 5.6090 | 33.5614 | 74.49 | 9.326 | 1636.4075 | 53243.9219 |
- | 12000 | 0.7757 | 4707.3955 | 39005.5352 | 5.6090 | 33.5763 | 74.457 | 9.322 | 1634.2444 | 53243.9219 |
- | 14000 | 0.9050 | 4708.1274 | 38994.5625 | 5.6090 | 33.4805 | 74.67 | 9.349 | 1636.4075 | 53243.9219 |
- | 15469 | 1.0 | 4704.4829 | 39027.5234 | 5.6090 | 33.5496 | 74.517 | 9.329 | 1632.6235 | 53243.9219 |
+ | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.2322 | 75.228 | 9.419 | 15431.1592 | 64176.7344 |
+ | 2000 | 0.1293 | 26828.6738 | 80365.2578 | 7.0927 | 33.0326 | 75.683 | 9.475 | 14879.9648 | 64005.7734 |
+ | 4000 | 0.2586 | 26828.6738 | 80365.2578 | 7.0927 | 33.2498 | 75.188 | 9.414 | 14879.9648 | 64005.7734 |
+ | 6000 | 0.3879 | 26828.6738 | 80365.2578 | 7.0927 | 33.1527 | 75.409 | 9.441 | 14879.9648 | 64005.7734 |
+ | 8000 | 0.5172 | 26828.6738 | 80365.2578 | 7.0927 | 33.2949 | 75.087 | 9.401 | 14879.9648 | 64005.7734 |
+ | 10000 | 0.6465 | 26828.6738 | 80365.2578 | 7.0927 | 32.9736 | 75.818 | 9.492 | 14879.9648 | 64005.7734 |
+ | 12000 | 0.7757 | 26828.6738 | 80365.2578 | 7.0927 | 33.0002 | 75.757 | 9.485 | 14879.9648 | 64005.7734 |
+ | 14000 | 0.9050 | 26828.6738 | 80365.2578 | 7.0927 | 33.0244 | 75.702 | 9.478 | 14879.9648 | 64005.7734 |
+ | 15469 | 1.0 | 26828.6738 | 80365.2578 | 7.0927 | 33.0358 | 75.675 | 9.475 | 14879.9648 | 64005.7734 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.21.0
+ - Datasets 2.20.0
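Two quick arithmetic checks on the table above: the epoch column is step divided by the 15469 steps in one epoch, and samples_per_second divided by steps_per_second recovers the eval batch size of 8 (values from the step-2000 and step-8000 rows of the new results):

```python
# Sanity checks on the eval-results table.
print(2000 / 15469)    # 0.1293 -> epoch column = step / steps_per_epoch
print(75.087 / 9.401)  # ~7.99  -> samples per step matches eval_batch_size = 8
```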
 
logs/attn_loss_fn=mse, attn_weight=10.0, hidden_weight=10.0, hs_loss_fn=raw_mse, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723766493.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8211110fbed13b0847779fe60e88eab28936dd52f234ee43dd30bd3664417593
+ size 6165
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723766870.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84283b041bc58105a70e83c603b4c177c60032f1454b10e6ec870894b0ea7cad
+ size 16923352
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=4e-06/events.out.tfevents.1723766171.93d6cbb3ad53 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:a68d1623296a8011dc3c5e5fe3f3b75dc729bb2759dc0a8b461138eea9669f99
- size 307
+ oid sha256:696496ca2470c450d90bdbf8b3a79234aef33fd4bc229801bb90d0c2be0767ee
+ size 578
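The `events.out.tfevents.*` files above are TensorBoard event logs, one per run configuration. A small sketch for inspecting one offline, assuming the standard `tensorboard` package is installed (the run directory is a placeholder and tag names vary per run, so none are hard-coded):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

run_dir = "logs/<run_name>"  # placeholder: any run directory listed above
acc = EventAccumulator(run_dir)
acc.Reload()  # parses the events.out.tfevents.* file(s) in run_dir
for tag in acc.Tags()["scalars"]:
    events = acc.Scalars(tag)
    print(tag, [(e.step, e.value) for e in events[:3]])  # first few points
```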
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:c72edab3245e3f2a55532c6506aa9508843fa472a1853ac282d63d887c20ae82
+ oid sha256:9505c0f32ea5812d4a2fba0aa92c57148f2a4589c13fbdefc77bdb4fa6713906
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:1870bf67e1c56310a08a6ab1cd1bd24e0af6a47271e2e9ff8f676e0a5597bee1
+ oid sha256:658966eef28ccc5ad224ee05199999e2c90265f455f817d1dd80e3722545273f
 size 1017948104
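`model.safetensors` and `training_args.bin`, like the logs, are tracked with Git LFS, so the diff shows three-line pointer stubs rather than binary contents: `version`, `oid` (the SHA-256 of the blob), and `size` in bytes. A sketch that verifies a downloaded blob against its pointer (file names in the usage comment are illustrative):

```python
import hashlib
import os

def verify_lfs_pointer(pointer_path: str, blob_path: str) -> bool:
    # Parse the three "key value" lines of a Git LFS pointer file.
    fields = dict(
        line.strip().split(" ", 1)
        for line in open(pointer_path)
        if line.strip()
    )
    expected_oid = fields["oid"].removeprefix("sha256:")
    expected_size = int(fields["size"])
    # Compare size first (cheap), then the SHA-256 of the blob contents.
    if os.path.getsize(blob_path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_oid

# e.g. verify_lfs_pointer("model.safetensors.pointer", "model.safetensors")
```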