lapp0 committed
Commit 3546915
1 Parent(s): 4f7c44e

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 4707.3955
- - eval_frwikippl: 38983.5547
- - eval_zhwikippl: 53243.9219
- - eval_tinystoriesppl: 1636.1367
- - eval_loss: 5.6090
- - eval_runtime: 33.6693
- - eval_samples_per_second: 74.252
- - eval_steps_per_second: 9.296
+ - eval_enwikippl: 3694.4192
+ - eval_frwikippl: 30929.5703
+ - eval_zhwikippl: 45501.2617
+ - eval_tinystoriesppl: 1160.7031
+ - eval_loss: 15.6337
+ - eval_runtime: 66.5963
+ - eval_samples_per_second: 75.079
+ - eval_steps_per_second: 9.385
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
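
(The eval_*ppl values above are corpus perplexities, presumably on English/French/Chinese Wikipedia and TinyStories given the names; lower is better. Perplexity is conventionally the exponential of the mean per-token negative log-likelihood, as in the generic sketch below — this is not Distily's evaluation code. Note also that eval_loss is the distillation objective, not a language-modeling loss, so these perplexities are not simply exp(eval_loss).)

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model averaging ~8.457 nats per token on a corpus scores ppl ~4707,
# matching the scale of the old eval_enwikippl figure above.
print(perplexity([8.457, 8.457]))  # ≈ 4707
```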
@@ -45,10 +45,10 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 0.0004
- - train_batch_size: 16
+ - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
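
(The distillation_objective change is the substance of this commit: the hidden-state (hs) loss function switches from mse to raw_mse and the attention (attn) loss from raw_mse to mse, with weights unchanged. In broad strokes, such an objective is a weighted sum of a KL term on logits and MSE terms on intermediate activations. The PyTorch sketch below illustrates that weighted sum under assumed tensor layouts; it is not Distily's actual implementation, and it glosses over the raw_mse vs. mse distinction, which is presumably a normalization difference.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student: dict, teacher: dict,
                      logits_w=1.0, hs_w=10.0, attn_w=10.0) -> torch.Tensor:
    """Weighted sum of logit KL + hidden-state MSE + attention MSE.

    `student`/`teacher` are assumed to be dicts of tensors:
    logits (batch, seq, vocab); hs/attn stacked per-layer activations.
    """
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student["logits"], dim=-1),
        F.log_softmax(teacher["logits"], dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # MSE on intermediate activations (layer_mapper=None -> match layers 1:1).
    hs_mse = F.mse_loss(student["hs"], teacher["hs"])
    attn_mse = F.mse_loss(student["attn"], teacher["attn"])
    return logits_w * kl + hs_w * hs_mse + attn_w * attn_mse
```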
@@ -56,21 +56,34 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 16.2515 GB
+ Peak GPU Memory: 8.2666 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.5606 | 74.492 | 9.326 | 15431.1592 | 64176.7344 |
- | 2000 | 0.1293 | 4707.3955 | 39038.5078 | 5.6090 | 33.5798 | 74.45 | 9.321 | 1635.3256 | 53243.9219 |
- | 4000 | 0.2586 | 4704.4829 | 38983.5547 | 5.6090 | 33.4351 | 74.772 | 9.361 | 1632.6235 | 53243.9219 |
- | 6000 | 0.3879 | 4717.6201 | 38972.5898 | 5.6090 | 33.4912 | 74.646 | 9.346 | 1638.3024 | 53243.9219 |
- | 8000 | 0.5172 | 4707.3955 | 38983.5547 | 5.6090 | 33.6693 | 74.252 | 9.296 | 1636.1367 | 53243.9219 |
- | 10000 | 0.6465 | 4707.3955 | 38994.5625 | 5.6090 | 33.5614 | 74.49 | 9.326 | 1636.4075 | 53243.9219 |
- | 12000 | 0.7757 | 4707.3955 | 39005.5352 | 5.6090 | 33.5763 | 74.457 | 9.322 | 1634.2444 | 53243.9219 |
- | 14000 | 0.9050 | 4708.1274 | 38994.5625 | 5.6090 | 33.4805 | 74.67 | 9.349 | 1636.4075 | 53243.9219 |
- | 15469 | 1.0 | 4704.4829 | 39027.5234 | 5.6090 | 33.5496 | 74.517 | 9.329 | 1632.6235 | 53243.9219 |
+ | 0 | 0 | 19278.7617 | 60268.5703 | 17.3716 | 66.6062 | 75.068 | 9.384 | 9660.0908 | 53858.2383 |
+ | 3000 | 0.0485 | 3702.4434 | 30929.5703 | 15.6332 | 66.4884 | 75.201 | 9.4 | 1163.3928 | 45525.5703 |
+ | 6000 | 0.0970 | 3702.4434 | 30929.5703 | 15.6346 | 67.0021 | 74.625 | 9.328 | 1163.0084 | 45525.5703 |
+ | 9000 | 0.1455 | 3694.4192 | 30929.5703 | 15.6337 | 66.5963 | 75.079 | 9.385 | 1160.7031 | 45501.2617 |
+ | 12000 | 0.1939 | 3696.7100 | 30981.8828 | 15.6348 | 66.5218 | 75.163 | 9.395 | 1161.0868 | 45525.5703 |
+ | 15000 | 0.2424 | 3697.8560 | 30929.5703 | 15.6342 | 66.6948 | 74.968 | 9.371 | 1161.8550 | 45525.5703 |
+ | 18000 | 0.2909 | 3696.7100 | 30981.8828 | 15.6348 | 66.2258 | 75.499 | 9.437 | 1161.2789 | 45525.5703 |
+ | 21000 | 0.3394 | 3697.8560 | 30946.9785 | 15.6344 | 66.5835 | 75.094 | 9.387 | 1161.4711 | 45525.5703 |
+ | 24000 | 0.3879 | 3697.8560 | 30929.5703 | 15.6334 | 66.8279 | 74.819 | 9.352 | 1162.0472 | 45525.5703 |
+ | 27000 | 0.4364 | 3697.8560 | 30981.8828 | 15.6346 | 66.5691 | 75.11 | 9.389 | 1161.6627 | 45525.5703 |
+ | 30000 | 0.4848 | 3696.7100 | 30946.9785 | 15.6346 | 66.7012 | 74.961 | 9.37 | 1160.7031 | 45525.5703 |
+ | 33000 | 0.5333 | 3696.1389 | 30981.8828 | 15.6346 | 66.5211 | 75.164 | 9.396 | 1160.1277 | 45525.5703 |
+ | 36000 | 0.5818 | 3700.1489 | 30929.5703 | 15.6331 | 66.5006 | 75.187 | 9.398 | 1162.6237 | 45525.5703 |
+ | 39000 | 0.6303 | 3694.4192 | 30964.4258 | 15.6344 | 66.3802 | 75.324 | 9.415 | 1160.5111 | 45501.2617 |
+ | 42000 | 0.6788 | 3696.7100 | 30946.9785 | 15.6346 | 66.6702 | 74.996 | 9.375 | 1160.7031 | 45525.5703 |
+ | 45000 | 0.7273 | 3696.7100 | 30981.8828 | 15.6347 | 66.7768 | 74.876 | 9.36 | 1161.0868 | 45525.5703 |
+ | 48000 | 0.7758 | 3694.4192 | 30929.5703 | 15.6331 | 66.6573 | 75.011 | 9.376 | 1160.7031 | 45525.5703 |
+ | 51000 | 0.8242 | 3692.7039 | 30981.8828 | 15.6344 | 66.8297 | 74.817 | 9.352 | 1159.7439 | 45501.2617 |
+ | 54000 | 0.8727 | 3692.1333 | 30946.9785 | 15.6344 | 66.8788 | 74.762 | 9.345 | 1158.7859 | 45501.2617 |
+ | 57000 | 0.9212 | 3696.7100 | 30946.9785 | 15.6346 | 66.8322 | 74.814 | 9.352 | 1160.7031 | 45501.2617 |
+ | 60000 | 0.9697 | 3707.0330 | 30929.5703 | 15.6328 | 66.9377 | 74.696 | 9.337 | 1165.3177 | 45525.5703 |
+ | 61875 | 1.0 | 3702.4434 | 30929.5703 | 15.6331 | 66.5826 | 75.095 | 9.387 | 1163.3928 | 45501.2617 |
 
 ### Framework versions
 - Distily 0.2.0
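
(One side effect visible in the diff above: cutting train_batch_size from 16 to 8 roughly halved peak GPU memory, 16.2515 GB → 8.2666 GB. If the goal were only to reduce memory while keeping the effective batch at 16, gradient accumulation would be the usual lever; below is a generic transformers-style sketch with hypothetical settings — Distily's own entry point and configuration may differ.)

```python
from transformers import TrainingArguments

# Hypothetical settings: per-device batch 8 with 2 accumulation steps
# reproduces an effective batch of 16 at roughly half the activation memory.
args = TrainingArguments(
    output_dir="distily_TinyStories-33M",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=4e-4,
    num_train_epochs=1.0,
    seed=42,
)
```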
 
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.0004, warmup_ratio=0/events.out.tfevents.1723776757.b7d545513dcf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:097e6d52c1b4dd5d6eb0fc4c0858d22c2bdc5f3d78102c9def7dd7f526676cd7
+ size 312
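
(The added log file is stored through Git LFS, so the diff shows the three-line pointer stub rather than the TensorBoard event data itself: a spec version, the SHA-256 of the real object, and that object's size — 312 bytes here. A standard-library sketch of how such a pointer can be parsed and checked against a downloaded blob; the helper names are hypothetical.)

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a git-lfs v1 pointer file."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

def verify(pointer_path: str, blob_path: str) -> bool:
    """Check a blob against the oid and size recorded in the pointer."""
    meta = parse_lfs_pointer(Path(pointer_path).read_text())
    blob = Path(blob_path).read_bytes()
    return (
        meta["oid"] == "sha256:" + hashlib.sha256(blob).hexdigest()
        and int(meta["size"]) == len(blob)
    )
```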