lapp0 committed on
Commit
a384cac
1 parent: 7f0bf3f

Training in progress, step 61875

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
- - eval_enwikippl: 158.7294
- - eval_frwikippl: 15434.1611
- - eval_zhwikippl: 106089.8359
- - eval_tinystoriesppl: 15.5930
- - eval_loss: 2.3671
- - eval_runtime: 65.5679
- - eval_samples_per_second: 76.257
- - eval_steps_per_second: 9.532
+ - eval_enwikippl: 165.1131
+ - eval_frwikippl: 53475.5117
+ - eval_zhwikippl: 433274.5625
+ - eval_tinystoriesppl: 9.8234
+ - eval_loss: 1.2315
+ - eval_runtime: 66.333
+ - eval_samples_per_second: 75.377
+ - eval_steps_per_second: 9.422
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
  The following hyperparameters were used during training:
  - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 0.001
+ - learning_rate: 0.004
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
@@ -63,31 +63,31 @@ Peak GPU Memory: 8.2677 GB
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 65.3142 | 76.553 | 9.569 | 12321.8145 | 60955.8008 |
- | 3000 | 0.0485 | 158.5205 | 15451.5547 | 2.3672 | 65.4452 | 76.4 | 9.55 | 15.5596 | 105976.6797 |
- | 6000 | 0.0970 | 159.4627 | 15519.1768 | 2.3670 | 65.5893 | 76.232 | 9.529 | 15.6570 | 107916.8281 |
- | 9000 | 0.1455 | 158.7294 | 15434.1611 | 2.3671 | 65.5679 | 76.257 | 9.532 | 15.5930 | 106089.8359 |
- | 12000 | 0.1939 | 159.8214 | 15466.7988 | 2.3670 | 65.4602 | 76.382 | 9.548 | 15.6933 | 107457.1484 |
- | 15000 | 0.2424 | 159.5122 | 15501.6934 | 2.3671 | 65.4126 | 76.438 | 9.555 | 15.6648 | 107916.8281 |
- | 18000 | 0.2909 | 158.8771 | 15434.1611 | 2.3670 | 65.4041 | 76.448 | 9.556 | 15.5956 | 106033.1953 |
- | 21000 | 0.3394 | 159.3146 | 15434.1611 | 2.3671 | 65.4872 | 76.351 | 9.544 | 15.6460 | 106089.8359 |
- | 24000 | 0.3879 | 159.4504 | 15434.1611 | 2.3670 | 65.5249 | 76.307 | 9.538 | 15.6589 | 106089.8359 |
- | 27000 | 0.4364 | 158.9386 | 15386.3984 | 2.3669 | 65.4767 | 76.363 | 9.545 | 15.6163 | 105581.5391 |
- | 30000 | 0.4848 | 159.1728 | 15451.5547 | 2.3671 | 65.3648 | 76.494 | 9.562 | 15.6369 | 107342.5391 |
- | 33000 | 0.5333 | 159.8709 | 15466.7988 | 2.3670 | 65.4363 | 76.41 | 9.551 | 15.6965 | 106942.4062 |
- | 36000 | 0.5818 | 159.2097 | 15460.2656 | 2.3670 | 65.4686 | 76.373 | 9.547 | 15.6318 | 107629.3516 |
- | 39000 | 0.6303 | 158.6066 | 15503.8809 | 2.3670 | 65.5342 | 76.296 | 9.537 | 15.5724 | 107744.2734 |
- | 42000 | 0.6788 | 158.5205 | 15468.9824 | 2.3671 | 65.5105 | 76.324 | 9.54 | 15.5576 | 107399.8828 |
- | 45000 | 0.7273 | 158.7909 | 15399.4043 | 2.3670 | 65.5316 | 76.299 | 9.537 | 15.6163 | 106089.8359 |
- | 48000 | 0.7758 | 158.7909 | 15434.1611 | 2.3671 | 65.4706 | 76.37 | 9.546 | 15.6027 | 106373.1953 |
- | 51000 | 0.8242 | 158.8033 | 15425.4648 | 2.3669 | 65.5734 | 76.25 | 9.531 | 15.6169 | 106089.8359 |
- | 54000 | 0.8727 | 158.9263 | 15434.1611 | 2.3670 | 65.5021 | 76.333 | 9.542 | 15.6085 | 106486.7812 |
- | 57000 | 0.9212 | 159.3887 | 15451.5547 | 2.3671 | 65.5842 | 76.238 | 9.53 | 15.6505 | 107342.5391 |
- | 60000 | 0.9697 | 159.4874 | 15390.7422 | 2.3670 | 65.5517 | 76.276 | 9.534 | 15.6641 | 105581.5391 |
- | 61875 | 1.0 | 159.6729 | 15492.9736 | 2.3671 | 65.3871 | 76.468 | 9.558 | 15.6926 | 107342.5391 |
+ | 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 66.2144 | 75.512 | 9.439 | 12321.8145 | 60955.8008 |
+ | 3000 | 0.0485 | 163.4777 | 52357.5 | 1.2318 | 66.2157 | 75.511 | 9.439 | 9.7461 | 405426.5312 |
+ | 6000 | 0.0970 | 164.5832 | 52313.2773 | 1.2313 | 66.4422 | 75.253 | 9.407 | 9.8299 | 399840.8438 |
+ | 9000 | 0.1455 | 165.1131 | 53475.5117 | 1.2315 | 66.333 | 75.377 | 9.422 | 9.8234 | 433274.5625 |
+ | 12000 | 0.1939 | 175.9144 | 53347.6094 | 1.2314 | 66.1602 | 75.574 | 9.447 | 10.8017 | 431314.25 |
+ | 15000 | 0.2424 | 166.3006 | 53279.9844 | 1.2312 | 66.2525 | 75.469 | 9.434 | 9.8988 | 439327.75 |
+ | 18000 | 0.2909 | 165.6127 | 53520.7148 | 1.2316 | 66.1729 | 75.56 | 9.445 | 9.8144 | 435128.4375 |
+ | 21000 | 0.3394 | 164.5322 | 54035.7852 | 1.2317 | 66.1602 | 75.574 | 9.447 | 9.7405 | 440971.5312 |
+ | 24000 | 0.3879 | 175.3294 | 52764.6719 | 1.2315 | 66.2724 | 75.446 | 9.431 | 10.8317 | 413399.9688 |
+ | 27000 | 0.4364 | 176.5424 | 52542.1719 | 1.2318 | 66.1192 | 75.621 | 9.453 | 10.8923 | 411858.9688 |
+ | 30000 | 0.4848 | 165.8760 | 53490.5586 | 1.2310 | 66.2275 | 75.497 | 9.437 | 9.8575 | 418728.4062 |
+ | 33000 | 0.5333 | 165.8246 | 53823.1172 | 1.2314 | 66.2901 | 75.426 | 9.428 | 9.8718 | 418728.4062 |
+ | 36000 | 0.5818 | 164.1630 | 51814.5781 | 1.2318 | 66.2021 | 75.526 | 9.441 | 9.7994 | 426621.9688 |
+ | 39000 | 0.6303 | 176.7887 | 53868.6172 | 1.2316 | 66.2783 | 75.439 | 9.43 | 10.8604 | 428103.875 |
+ | 42000 | 0.6788 | 176.1667 | 52483.0273 | 1.2317 | 66.1638 | 75.57 | 9.446 | 10.8981 | 430854.2188 |
+ | 45000 | 0.7273 | 165.7860 | 54833.2031 | 1.2316 | 66.226 | 75.499 | 9.437 | 9.7437 | 433274.5625 |
+ | 48000 | 0.7758 | 177.6536 | 54096.7305 | 1.2314 | 66.2463 | 75.476 | 9.434 | 10.8174 | 436174.1875 |
+ | 51000 | 0.8242 | 164.2648 | 53400.2422 | 1.2316 | 66.365 | 75.341 | 9.418 | 9.7212 | 425257.9062 |
+ | 54000 | 0.8727 | 177.6812 | 53792.7930 | 1.2314 | 66.3929 | 75.309 | 9.414 | 10.8147 | 419846.8125 |
+ | 57000 | 0.9212 | 169.2639 | 54421.5391 | 1.2310 | 66.2287 | 75.496 | 9.437 | 10.0082 | 441678.1875 |
+ | 60000 | 0.9697 | 163.0982 | 52077.9805 | 1.2316 | 66.2184 | 75.508 | 9.438 | 9.7240 | 401765.375 |
+ | 61875 | 1.0 | 164.1884 | 52549.5898 | 1.2312 | 66.1617 | 75.572 | 9.447 | 9.8116 | 419846.8125 |
 
  ### Framework versions
  - Distily 0.2.0
  - Transformers 4.44.0
  - Pytorch 2.3.0
- - Datasets 2.21.0
+ - Datasets 2.20.0
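The `distillation_objective` entry above is a weighted sum of three terms: KL divergence on the logits (weight 1) plus MSE on the hidden states and on the attention maps (weight 10 each); `layer_mapper=None` and `projector=None` imply student and teacher layers are compared one-to-one. Below is a minimal PyTorch sketch of what such an objective computes; it is an illustration, not Distily's actual code, and `student_out`/`teacher_out` stand for `transformers` causal-LM outputs obtained with `output_hidden_states=True` and `output_attentions=True`.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    """Weighted sum of KL(logits) + MSE(hidden states) + MSE(attentions).

    Illustrative sketch. Assumes the teacher forward pass ran under
    torch.no_grad() and that both models expose the same number of layers
    with matching widths (layer_mapper=None, projector=None).
    """
    # KL divergence from the teacher's to the student's token distribution.
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.log_softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    # Mean MSE across every layer's hidden states and attention maps.
    hs_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    attn_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return logits_weight * logits_loss + hs_weight * hs_loss + attn_weight * attn_loss
```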
 
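The `*ppl` columns (enwikippl, frwikippl, zhwikippl, tinystoriesppl) are presumably perplexities on English, French, and Chinese Wikipedia and on TinyStories text; `loss` is the distillation loss above, not a language-model cross-entropy, which is why exp(loss) does not match the perplexities. Perplexity is conventionally the exponential of the mean token cross-entropy; here is a sketch under that assumption (`model` and `tokenizer` are placeholders, and Distily's own evaluation code may differ):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    """exp(mean token cross-entropy): the usual definition behind *ppl metrics."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # transformers causal LMs return the shifted cross-entropy when labels are given.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```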
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=cos, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723806741.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c5808af816d629478fd1f0f14f2940ca43d57f5483d9725d4bc5c468d132e20
+ size 16923348
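The three-line blocks in this and the following diffs are Git LFS pointer files: the repository tracks only the spec version, a `sha256` object id, and the byte size, while the blob itself lives in LFS storage. Parsing one is a one-liner (the helper name is illustrative):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer ('key value' per line) into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

# On a checkout without LFS smudging, the tracked file *is* the pointer text,
# so e.g. parse_lfs_pointer(pointer_text)["size"] -> "16923348".
```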
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723806543.93d6cbb3ad53 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:59c189d9b938217f6fb57af494a20fb22bf2717b1bf394e46f3f0c25d0c9f70f
- size 312
+ oid sha256:64807b2ee2daeb4cffe0f85ae30a4ab9ad7fb4d9036c81d29254cf9c1b4fba56
+ size 588
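The `events.out.tfevents.*` files are TensorBoard event logs of the training run. One way to read the logged scalars offline, assuming the `tensorboard` package is installed (a sketch, not part of this repository):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Path taken verbatim from the diff above; any local copy of the file works.
acc = EventAccumulator("logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, "
                       "hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/"
                       "events.out.tfevents.1723806543.93d6cbb3ad53")
acc.Reload()                    # parse the event file
print(acc.Tags()["scalars"])    # names of the logged scalar series
```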
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:331a8ab929b55895bcffd8c0c5219a8b4a3bb88818c1a6dca552d4a8164c7664
+ oid sha256:1e70b8f8cdbd3dfa80135450fd849dc798769a33ae1dc5b193ef3658b30e881e
  size 137033984
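`model.safetensors` holds the student's weights; the size is unchanged between revisions because only the parameter values differ. The tensors can be inspected directly with the `safetensors` library (a sketch):

```python
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")   # maps tensor name -> torch.Tensor
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```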
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d140612a7dd97149912b4523c511f5472fbf76630c103e3935cf205281819d6b
+ oid sha256:2dcb263d55f6965884f8fa3afad966b9db28704ad230bb9ff93a7464658c2df1
  size 1017948104
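`training_args.bin` is, by `transformers` convention, a pickled `TrainingArguments`-style object; the unusually large size here presumably comes from extra state Distily serializes alongside it. Loading it means unpickling, so only do this on a trusted checkout (a sketch):

```python
import torch

# A pickle, not a tensor file: weights_only=False is required and implies trust.
args = torch.load("training_args.bin", weights_only=False)
print(args.learning_rate)   # 0.004 after this commit
```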