lapp0 committed on
Commit
5e0d2f0
1 Parent(s): 8d47ec7

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

  It achieves the following results on the evaluation set:
- - eval_enwikippl: 22177.1309
- - eval_frwikippl: 73852.3203
- - eval_zhwikippl: 62537.2148
- - eval_tinystoriesppl: 11654.0615
- - eval_loss: 6.9194
- - eval_runtime: 32.7141
- - eval_samples_per_second: 76.42
- - eval_steps_per_second: 9.568
+ - eval_enwikippl: 198.0270
+ - eval_frwikippl: 17127.3379
+ - eval_zhwikippl: 63614.1797
+ - eval_tinystoriesppl: 21.7514
+ - eval_loss: 12.7123
+ - eval_runtime: 65.1143
+ - eval_samples_per_second: 76.788
+ - eval_steps_per_second: 9.599

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -45,10 +45,10 @@ More information needed
  ### Training hyperparameters

  The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 4e-05
- - train_batch_size: 16
+ - learning_rate: 0.001
+ - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
@@ -56,21 +56,34 @@ The following hyperparameters were used during training:
  - num_epochs: 1.0

  ### Resource Usage
- Peak GPU Memory: 16.2515 GB
+ Peak GPU Memory: 8.2666 GB

  ### Eval-Phase Metrics
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 32.6791 | 76.501 | 9.578 | 15431.1592 | 64176.7344 |
- | 2000 | 0.1293 | 22177.1309 | 73852.3203 | 6.9194 | 32.645 | 76.582 | 9.588 | 11654.0615 | 62537.2148 |
- | 4000 | 0.2586 | 22177.1309 | 73852.3203 | 6.9194 | 32.649 | 76.572 | 9.587 | 11654.0615 | 62537.2148 |
- | 6000 | 0.3879 | 22177.1309 | 73852.3203 | 6.9194 | 32.8099 | 76.196 | 9.54 | 11654.0615 | 62537.2148 |
- | 8000 | 0.5172 | 22177.1309 | 73852.3203 | 6.9194 | 32.7141 | 76.42 | 9.568 | 11654.0615 | 62537.2148 |
- | 10000 | 0.6465 | 22177.1309 | 73852.3203 | 6.9194 | 32.6497 | 76.57 | 9.587 | 11654.0615 | 62537.2148 |
- | 12000 | 0.7757 | 22177.1309 | 73852.3203 | 6.9194 | 32.6408 | 76.591 | 9.589 | 11654.0615 | 62537.2148 |
- | 14000 | 0.9050 | 22177.1309 | 73852.3203 | 6.9194 | 32.6631 | 76.539 | 9.583 | 11654.0615 | 62537.2148 |
- | 15469 | 1.0 | 22177.1309 | 73852.3203 | 6.9194 | 32.6508 | 76.568 | 9.586 | 11654.0615 | 62537.2148 |
+ | 0 | 0 | 14504.7236 | 73076.2578 | 17.7824 | 65.8473 | 75.933 | 9.492 | 6091.4858 | 69506.0391 |
+ | 3000 | 0.0485 | 198.0270 | 17127.3379 | 12.7127 | 65.1257 | 76.775 | 9.597 | 21.7514 | 63648.1641 |
+ | 6000 | 0.0970 | 197.0935 | 17088.7852 | 12.7124 | 65.0473 | 76.867 | 9.608 | 21.6429 | 63750.0977 |
+ | 9000 | 0.1455 | 198.0270 | 17127.3379 | 12.7123 | 65.1143 | 76.788 | 9.599 | 21.7514 | 63614.1797 |
+ | 12000 | 0.1939 | 197.4985 | 17098.4199 | 12.7128 | 65.0689 | 76.842 | 9.605 | 21.7173 | 63886.3086 |
+ | 15000 | 0.2424 | 197.3838 | 17122.5215 | 12.7128 | 65.1155 | 76.787 | 9.598 | 21.6850 | 63954.5195 |
+ | 18000 | 0.2909 | 197.9733 | 17108.0586 | 12.7130 | 65.0798 | 76.829 | 9.604 | 21.7865 | 63818.1680 |
+ | 21000 | 0.3394 | 197.1698 | 17103.2305 | 12.7134 | 65.0799 | 76.829 | 9.604 | 21.6322 | 64022.8672 |
+ | 24000 | 0.3879 | 197.8507 | 17117.7051 | 12.7131 | 65.1719 | 76.72 | 9.59 | 21.7757 | 63716.1211 |
+ | 27000 | 0.4364 | 197.6975 | 17127.3379 | 12.7131 | 65.1124 | 76.79 | 9.599 | 21.7191 | 63954.5195 |
+ | 30000 | 0.4848 | 197.3150 | 17079.1562 | 12.7131 | 65.1085 | 76.795 | 9.599 | 21.6904 | 63920.4375 |
+ | 33000 | 0.5333 | 197.6209 | 17103.2305 | 12.7129 | 65.2897 | 76.582 | 9.573 | 21.7191 | 63750.0977 |
+ | 36000 | 0.5818 | 198.3110 | 17127.3379 | 12.7122 | 65.5537 | 76.273 | 9.534 | 21.7883 | 63614.1797 |
+ | 39000 | 0.6303 | 198.2802 | 17127.3379 | 12.7128 | 65.2001 | 76.687 | 9.586 | 21.7721 | 63648.1641 |
+ | 42000 | 0.6788 | 197.9580 | 17127.3379 | 12.7130 | 65.4586 | 76.384 | 9.548 | 21.7433 | 63512.4648 |
+ | 45000 | 0.7273 | 198.2802 | 17108.0586 | 12.7129 | 65.2819 | 76.591 | 9.574 | 21.8009 | 63614.1797 |
+ | 48000 | 0.7758 | 197.5979 | 17098.4199 | 12.7125 | 65.1997 | 76.688 | 9.586 | 21.6940 | 63648.1641 |
+ | 51000 | 0.8242 | 198.2802 | 17127.3379 | 12.7120 | 65.5503 | 76.277 | 9.535 | 21.7892 | 63512.4648 |
+ | 54000 | 0.8727 | 198.2189 | 17127.3379 | 12.7129 | 65.3863 | 76.469 | 9.559 | 21.7811 | 63716.1211 |
+ | 57000 | 0.9212 | 199.1886 | 17136.9941 | 12.7133 | 65.1649 | 76.728 | 9.591 | 21.8759 | 63343.2148 |
+ | 60000 | 0.9697 | 197.3226 | 17122.5215 | 12.7127 | 65.2079 | 76.678 | 9.585 | 21.6886 | 63648.1641 |
+ | 61875 | 1.0 | 198.9419 | 17127.3379 | 12.7117 | 65.2766 | 76.597 | 9.575 | 21.8469 | 63648.1641 |

  ### Framework versions
  - Distily 0.2.0
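For context on the README change above: relative to the previous run, this commit swaps the hidden-state and attention loss functions (hs: mse to raw_mse, attn: raw_mse to mse), raises the learning rate from 4e-05 to 0.001, and halves the train batch size from 16 to 8. Below is a minimal sketch of the new objective, reconstructed only from the repr in the diff; the import path and the string-valued `loss_fn` arguments are assumptions, not confirmed Distily API.

```python
# Sketch of the post-commit distillation objective, based solely on the repr
# shown in the README diff. The module path below is an assumption.
from distily.objectives import DistillationObjective, LossComponent  # assumed path

objective = DistillationObjective(
    # KL divergence between student and teacher logits, weight 1 (unchanged).
    logits_loss_component=LossComponent(
        label="logits", weight=1, loss_fn="kl", layer_mapper=None, projector=None
    ),
    # Hidden-state loss: raw_mse after this commit (was mse), weight 10.0.
    hs_loss_component=LossComponent(
        label="hs", weight=10.0, loss_fn="raw_mse", layer_mapper=None, projector=None
    ),
    # Attention loss: mse after this commit (was raw_mse), weight 10.0.
    attn_loss_component=LossComponent(
        label="attn", weight=10.0, loss_fn="mse", layer_mapper=None, projector=None
    ),
)
```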
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.001, warmup_ratio=0/events.out.tfevents.1723783790.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cf049bc8c5609dd342a0f94201a27d4f72d1883217a827d7aa47a5c5bc50a784
+ size 312
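The ADDED log file is stored as a Git LFS pointer rather than as the TensorBoard event file itself: the three lines record the pointer-spec version, the SHA-256 of the real object, and its size in bytes (312 here). A minimal sketch of reading such a pointer, assuming the file has been checked out locally; the helper function is hypothetical:

```python
# Parse the space-separated "key value" lines of a Git LFS pointer file.
def read_lfs_pointer(path: str) -> dict[str, str]:
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    return fields

pointer = read_lfs_pointer("events.out.tfevents.1723783790.5f530b1cf724")
assert pointer["version"] == "https://git-lfs.github.com/spec/v1"
print(pointer["oid"])   # sha256:cf049bc8...
print(pointer["size"])  # 312
```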