lapp0 committed
Commit 026baad
1 Parent(s): 5b13da5

Training in progress, step 15469

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 12623.7412
- - eval_frwikippl: 52327.9961
- - eval_zhwikippl: 96451.4609
- - eval_tinystoriesppl: 6281.7222
- - eval_loss: 18.2212
- - eval_runtime: 32.9567
- - eval_samples_per_second: 75.857
- - eval_steps_per_second: 9.497
+ - eval_enwikippl: 5102.7344
+ - eval_frwikippl: 36133.4453
+ - eval_zhwikippl: 51745.4414
+ - eval_tinystoriesppl: 1872.7611
+ - eval_loss: 17.3780
+ - eval_runtime: 33.2598
+ - eval_samples_per_second: 75.166
+ - eval_steps_per_second: 9.411
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 4e-06
+ - learning_rate: 0.0004
 - train_batch_size: 16
 - eval_batch_size: 8
 - seed: 42
@@ -62,18 +62,18 @@ Peak GPU Memory: 16.2498 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 12930.5508 | 52883.7109 | 18.2508 | 32.8136 | 76.188 | 9.539 | 6471.4722 | 96760.75 |
- | 2000 | 0.1293 | 12623.7412 | 52327.9961 | 18.2212 | 32.8936 | 76.003 | 9.516 | 6281.7222 | 96451.4609 |
- | 4000 | 0.2586 | 12623.7412 | 52327.9961 | 18.2212 | 32.8034 | 76.212 | 9.542 | 6281.7222 | 96451.4609 |
- | 6000 | 0.3879 | 12623.7412 | 52327.9961 | 18.2212 | 32.8868 | 76.018 | 9.517 | 6281.7222 | 96451.4609 |
- | 8000 | 0.5172 | 12623.7412 | 52327.9961 | 18.2212 | 32.9567 | 75.857 | 9.497 | 6281.7222 | 96451.4609 |
- | 10000 | 0.6465 | 12623.7412 | 52327.9961 | 18.2212 | 32.9256 | 75.929 | 9.506 | 6281.7222 | 96451.4609 |
- | 12000 | 0.7757 | 12623.7412 | 52327.9961 | 18.2212 | 32.9173 | 75.948 | 9.509 | 6281.7222 | 96451.4609 |
- | 14000 | 0.9050 | 12623.7412 | 52327.9961 | 18.2212 | 33.1691 | 75.371 | 9.437 | 6281.7222 | 96451.4609 |
- | 15469 | 1.0 | 12623.7412 | 52327.9961 | 18.2212 | 33.0829 | 75.568 | 9.461 | 6281.7222 | 96451.4609 |
+ | 0 | 0 | 27673.0234 | 72839.8984 | 19.1728 | 33.1089 | 75.508 | 9.454 | 16450.7617 | 62570.5625 |
+ | 2000 | 0.1293 | 5104.3164 | 36174.1641 | 17.3784 | 33.2758 | 75.13 | 9.406 | 1874.0001 | 51745.4414 |
+ | 4000 | 0.2586 | 5104.3164 | 36214.9648 | 17.3776 | 33.0549 | 75.632 | 9.469 | 1873.0709 | 51773.0352 |
+ | 6000 | 0.3879 | 5104.3164 | 36133.4453 | 17.3784 | 33.0479 | 75.648 | 9.471 | 1874.3102 | 51717.8164 |
+ | 8000 | 0.5172 | 5102.7344 | 36133.4453 | 17.3780 | 33.2598 | 75.166 | 9.411 | 1872.7611 | 51745.4414 |
+ | 10000 | 0.6465 | 5105.8984 | 36133.4453 | 17.3784 | 33.2485 | 75.191 | 9.414 | 1875.8596 | 51745.4414 |
+ | 12000 | 0.7757 | 5105.8984 | 36133.4453 | 17.3780 | 33.0133 | 75.727 | 9.481 | 1874.9297 | 51717.8164 |
+ | 14000 | 0.9050 | 5104.3164 | 36214.9648 | 17.3776 | 33.2272 | 75.24 | 9.42 | 1872.7611 | 51745.4414 |
+ | 15469 | 1.0 | 5104.3164 | 36133.4453 | 17.3784 | 33.002 | 75.753 | 9.484 | 1874.3102 | 51745.4414 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.20.0
+ - Datasets 2.21.0
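The `distillation_objective` hyperparameter above combines three weighted loss terms: KL divergence on the logits (weight 1) plus raw MSE on the hidden states and on the attention maps (weight 10.0 each), with `layer_mapper=None` pairing student and teacher layers one-to-one. A minimal PyTorch sketch of such a combined objective, assuming student and teacher forward passes return `transformers`-style outputs with `logits`, `hidden_states`, and `attentions` populated (illustrative only, not Distily's actual implementation):

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    # KL divergence between student and teacher next-token distributions (weight 1)
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Raw (unnormalized) MSE over layer-aligned hidden states (weight 10.0)
    hs_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    )
    # Raw MSE over layer-aligned attention maps (weight 10.0)
    attn_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    )
    return logits_loss + 10.0 * hs_loss + 10.0 * attn_loss
```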
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.0004/events.out.tfevents.1723762065.b7d545513dcf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6518ff87f675a6b3e5adae33b7e064317bf9583faeec3900450f036594b27b50
+ size 4177663
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.0004/events.out.tfevents.1723761879.b7d545513dcf CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:39caabfb83489b9cb0240c2fdf65bc7c3d658ecbe6d525426ca6ba71cc5cc15a
- size 307
+ oid sha256:0c9f5b4b57dd364d6d66043f5138e2c74ea477fa4fedb9d680bc6a2544f0d10e
+ size 578
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:e2f3aa452fa56b42240ee51b50d8be7acdbed06dd7363b534af5d456d9810f25
+ oid sha256:c72edab3245e3f2a55532c6506aa9508843fa472a1853ac282d63d887c20ae82
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:f55d3533da93ae93135f2e1e4741afe9f327280493c21e3a1afb6cebed627fce
+ oid sha256:1870bf67e1c56310a08a6ab1cd1bd24e0af6a47271e2e9ff8f676e0a5597bee1
 size 1017948104
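The `model.safetensors`, `training_args.bin`, and log files in this commit are stored with Git LFS, so the diff shows only pointer files: a spec `version`, the SHA-256 (`oid`) of the actual blob, and its `size` in bytes. A small sketch for checking that a downloaded file matches its pointer (the filename is illustrative):

```python
import hashlib

def lfs_oid(path: str) -> str:
    """SHA-256 of the file contents, as recorded in the oid field of an LFS pointer."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

print(lfs_oid("model.safetensors"))  # should match the oid in the pointer above
```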