Training in progress, step 61875
- README.md +32 -32
- logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723786733.93d6cbb3ad53 +3 -0
- logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723786534.93d6cbb3ad53 +2 -2
- model.safetensors +1 -1
- training_args.bin +1 -1
README.md
CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_tinystoriesppl:
-- eval_loss:
-- eval_runtime: 65.
-- eval_samples_per_second:
-- eval_steps_per_second: 9.
+- eval_enwikippl: 141.7497
+- eval_frwikippl: 27160.7188
+- eval_zhwikippl: 182390.6094
+- eval_tinystoriesppl: 11.3304
+- eval_loss: 7.2221
+- eval_runtime: 65.8262
+- eval_samples_per_second: 75.958
+- eval_steps_per_second: 9.495
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.
+- learning_rate: 0.004
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
@@ -63,31 +63,31 @@ Peak GPU Memory: 8.2677 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
-| 0 | 0 | 21397.4785 | 57946.0117 | 18.3162 | 65.
-| 3000 | 0.0485 |
-| 6000 | 0.0970 |
-| 9000 | 0.1455 |
-| 12000 | 0.1939 |
-| 15000 | 0.2424 |
-| 18000 | 0.2909 |
-| 21000 | 0.3394 |
-| 24000 | 0.3879 |
-| 27000 | 0.4364 |
-| 30000 | 0.4848 |
-| 33000 | 0.5333 |
-| 36000 | 0.5818 |
-| 39000 | 0.6303 |
-| 42000 | 0.6788 |
-| 45000 | 0.7273 |
-| 48000 | 0.7758 |
-| 51000 | 0.8242 |
-| 54000 | 0.8727 |
-| 57000 | 0.9212 |
-| 60000 | 0.9697 |
-| 61875 | 1.0 |
+| 0 | 0 | 21397.4785 | 57946.0117 | 18.3162 | 65.8981 | 75.875 | 9.484 | 12321.8145 | 60955.8008 |
+| 3000 | 0.0485 | 142.1731 | 27314.1816 | 7.2225 | 65.7757 | 76.016 | 9.502 | 11.3529 | 185037.1875 |
+| 6000 | 0.0970 | 137.4355 | 27373.8730 | 7.2213 | 65.7399 | 76.057 | 9.507 | 10.6638 | 187922.7812 |
+| 9000 | 0.1455 | 141.7497 | 27160.7188 | 7.2221 | 65.8262 | 75.958 | 9.495 | 11.3304 | 182390.6094 |
+| 12000 | 0.1939 | 137.9795 | 27281.5098 | 7.2216 | 65.8535 | 75.926 | 9.491 | 10.8366 | 186325.2656 |
+| 15000 | 0.2424 | 137.0103 | 27513.0293 | 7.2218 | 65.7764 | 76.015 | 9.502 | 10.6413 | 189332.0312 |
+| 18000 | 0.2909 | 142.2282 | 27191.3516 | 7.2224 | 65.9523 | 75.812 | 9.477 | 11.3351 | 182877.7812 |
+| 21000 | 0.3394 | 135.1864 | 27656.7969 | 7.2224 | 65.9704 | 75.792 | 9.474 | 10.4478 | 187422.1875 |
+| 24000 | 0.3879 | 142.1621 | 27468.5117 | 7.2223 | 65.8641 | 75.914 | 9.489 | 11.3332 | 180117.75 |
+| 27000 | 0.4364 | 134.9144 | 27293.0391 | 7.2231 | 65.8626 | 75.916 | 9.489 | 10.4872 | 182050.1875 |
+| 30000 | 0.4848 | 140.8412 | 27042.3691 | 7.2219 | 65.826 | 75.958 | 9.495 | 11.2056 | 181468.2812 |
+| 33000 | 0.5333 | 136.8459 | 27680.2012 | 7.2217 | 65.8058 | 75.981 | 9.498 | 10.6224 | 187022.5938 |
+| 36000 | 0.5818 | 136.5546 | 26858.2676 | 7.2218 | 65.7491 | 76.047 | 9.506 | 10.7000 | 182244.5625 |
+| 39000 | 0.6303 | 135.1864 | 27323.7949 | 7.2218 | 65.914 | 75.856 | 9.482 | 10.4642 | 185185.4844 |
+| 42000 | 0.6788 | 141.3933 | 26540.4746 | 7.2219 | 65.7482 | 76.048 | 9.506 | 11.3379 | 179925.625 |
+| 45000 | 0.7273 | 142.8466 | 28055.0605 | 7.2226 | 65.741 | 76.056 | 9.507 | 11.2944 | 186922.7344 |
+| 48000 | 0.7758 | 136.5335 | 27478.1797 | 7.2224 | 65.8834 | 75.892 | 9.486 | 10.5581 | 186225.9531 |
+| 51000 | 0.8242 | 142.2612 | 27429.8477 | 7.2221 | 65.8847 | 75.89 | 9.486 | 11.3215 | 182877.7812 |
+| 54000 | 0.8727 | 137.5739 | 27848.3633 | 7.2220 | 66.0422 | 75.709 | 9.464 | 10.6466 | 187622.2969 |
+| 57000 | 0.9212 | 141.6180 | 27561.5352 | 7.2221 | 65.9877 | 75.772 | 9.471 | 11.1880 | 191772.1875 |
+| 60000 | 0.9697 | 141.9915 | 27429.8477 | 7.2230 | 65.9205 | 75.849 | 9.481 | 11.3163 | 182585.3594 |
+| 61875 | 1.0 | 138.3541 | 27281.5098 | 7.2215 | 65.6926 | 76.112 | 9.514 | 10.8914 | 184839.8281 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
-- Datasets 2.
+- Datasets 2.20.0
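The distillation_objective line above encodes the full training loss: KL divergence on the logits (weight 1) plus MSE terms on the hidden states and attention maps (weight 10.0 each), with no layer mapper or projector, so layers pair one-to-one. The following is a minimal PyTorch sketch of that weighted combination, not Distily's actual implementation; the reduction choices are assumptions, and the distinction between Distily's mse and raw_mse variants is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    # KL divergence between student and teacher next-token distributions
    # (the logits_loss_component with loss_fn=kl, weight=1).
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # MSE over each hidden-state layer, paired one-to-one since
    # layer_mapper=None (hs_loss_component, weight=10.0).
    hs_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    )
    # MSE over each attention map (attn_loss_component, weight=10.0).
    attn_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    )
    return logits_weight * logits_loss + hs_weight * hs_loss + attn_weight * attn_loss
```

For student_out and teacher_out to carry the tensors used above, both forward passes would need output_hidden_states=True and output_attentions=True.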
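In the table, the epoch column is step / 61875 (e.g., 3000 / 61875 ≈ 0.0485), and each *ppl column is a perplexity on a different evaluation corpus (English, French, and Chinese Wikipedia, plus TinyStories). Perplexity of this kind is conventionally the exponential of the mean token-level cross-entropy; below is a minimal sketch with transformers, using a hypothetical checkpoint path and sample text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "student_checkpoint" is a placeholder path, not the actual repo id.
model = AutoModelForCausalLM.from_pretrained("student_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("student_checkpoint")

text = "Once upon a time, there was a little robot."  # sample eval text
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean next-token
    # cross-entropy; perplexity is its exponential.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(torch.exp(loss).item())
```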
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723786733.93d6cbb3ad53
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:254cb33f3bd8083cf8618875d03b952fc1f28659dcbbf0199eed50014a6783f3
+size 16923348
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723786534.93d6cbb3ad53
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:3b1611898542cb6cdaa388d6affcf9523ee2558dd901a270beecceee4a8da26c
+size 588
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b2b31538cce0ce4b165827bc5f17f54668fbd3fb115028a3f94e3732aaf22719
 size 137033984
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a8e83afba9c435ea9ad8bd8c09e27f039619a31ac399d42958a8e043e1426bd5
 size 1017948104
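The binary entries above (the event logs, model.safetensors, and training_args.bin) are Git LFS pointer files rather than the payloads themselves: each records a version URL, a sha256 oid, and a byte size. After fetching the real blobs (e.g., with git lfs pull), a download can be checked against its pointer; here is a minimal standard-library sketch, with a hypothetical local path.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large blobs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected values copied from the model.safetensors pointer above.
expected_oid = "b2b31538cce0ce4b165827bc5f17f54668fbd3fb115028a3f94e3732aaf22719"
expected_size = 137033984

blob = Path("model.safetensors")  # hypothetical local copy of the payload
assert blob.stat().st_size == expected_size
assert sha256_of(str(blob)) == expected_oid
```

The size check is a cheap first filter; the hash comparison is what actually proves the payload matches the pointer.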