End of training
README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl:
-- eval_frwikippl:
-- eval_zhwikippl:
-- eval_tinystoriesppl:
-- eval_loss:
-- eval_runtime: 32.
-- eval_samples_per_second: 76.
-- eval_steps_per_second: 9.
+- eval_enwikippl: 22177.1309
+- eval_frwikippl: 73852.3203
+- eval_zhwikippl: 62537.2148
+- eval_tinystoriesppl: 11654.0615
+- eval_loss: 6.9194
+- eval_runtime: 32.7141
+- eval_samples_per_second: 76.42
+- eval_steps_per_second: 9.568
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -45,7 +45,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=
+- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 4e-05
 - train_batch_size: 16
@@ -56,21 +56,21 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
-Peak GPU Memory: 16.
+Peak GPU Memory: 16.2515 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
-| 0 | 0 |
-| 2000 | 0.1293 |
-| 4000 | 0.2586 |
-| 6000 | 0.3879 |
-| 8000 | 0.5172 |
-| 10000 | 0.6465 |
-| 12000 | 0.7757 |
-| 14000 | 0.9050 |
-| 15469 | 1.0 |
+| 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 32.6791 | 76.501 | 9.578 | 15431.1592 | 64176.7344 |
+| 2000 | 0.1293 | 22177.1309 | 73852.3203 | 6.9194 | 32.645 | 76.582 | 9.588 | 11654.0615 | 62537.2148 |
+| 4000 | 0.2586 | 22177.1309 | 73852.3203 | 6.9194 | 32.649 | 76.572 | 9.587 | 11654.0615 | 62537.2148 |
+| 6000 | 0.3879 | 22177.1309 | 73852.3203 | 6.9194 | 32.8099 | 76.196 | 9.54 | 11654.0615 | 62537.2148 |
+| 8000 | 0.5172 | 22177.1309 | 73852.3203 | 6.9194 | 32.7141 | 76.42 | 9.568 | 11654.0615 | 62537.2148 |
+| 10000 | 0.6465 | 22177.1309 | 73852.3203 | 6.9194 | 32.6497 | 76.57 | 9.587 | 11654.0615 | 62537.2148 |
+| 12000 | 0.7757 | 22177.1309 | 73852.3203 | 6.9194 | 32.6408 | 76.591 | 9.589 | 11654.0615 | 62537.2148 |
+| 14000 | 0.9050 | 22177.1309 | 73852.3203 | 6.9194 | 32.6631 | 76.539 | 9.583 | 11654.0615 | 62537.2148 |
+| 15469 | 1.0 | 22177.1309 | 73852.3203 | 6.9194 | 32.6508 | 76.568 | 9.586 | 11654.0615 | 62537.2148 |
 
 ### Framework versions
 - Distily 0.2.0
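
The `distillation_objective` entry combines three weighted loss terms: KL divergence on the logits (weight 1), MSE on the hidden states (weight 10.0), and `raw_mse` on the attentions (weight 10.0), with no layer mapper or projector. The following is a minimal PyTorch sketch of what such an objective computes, not Distily's actual implementation; it assumes `raw_mse` means plain MSE over the attention tensors and that the student and teacher expose matching layer counts and widths.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    """Hypothetical sketch of the configured objective (not Distily's API).

    Expects model outputs produced with output_hidden_states=True and
    output_attentions=True.
    """
    # logits_loss_component: KL divergence between next-token distributions.
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # hs_loss_component: MSE over hidden states, layers compared 1:1
    # (layer_mapper=None assumes equal depth; projector=None, equal width).
    hs_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    # attn_loss_component: "raw_mse" is read here as plain MSE on the
    # attention matrices; Distily's exact definition may differ.
    attn_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return 1.0 * logits_loss + 10.0 * hs_loss + 10.0 * attn_loss
```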
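The `eval_*ppl` results and the `*ppl` table columns are perplexities on the named corpora (English, French, and Chinese Wikipedia, plus TinyStories), with the **teacher eval** row giving the teacher's scores for comparison. Perplexity is conventionally the exponential of the mean token-level cross-entropy; exp(eval_loss) = exp(6.9194) ≈ 1011 matches none of the perplexity columns, which suggests `eval_loss` here is the distillation objective rather than the language-modeling loss. A minimal sketch of measuring perplexity with the `transformers` API, using a placeholder checkpoint path:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id; substitute this repo's actual student checkpoint.
CKPT = "path/to/student"
model = AutoModelForCausalLM.from_pretrained(CKPT)
tokenizer = AutoTokenizer.from_pretrained(CKPT)

def perplexity(text: str) -> float:
    """exp(mean next-token cross-entropy) of the model on `text`."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the shifted
        # language-modeling cross-entropy as `.loss`.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())
```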
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=4e-05/events.out.tfevents.1723766051.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:90e143c8f8a2001681b81ace6d7867a2978c824702d66a7c9fbaef8f0c6ef47b
+size 307
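
The added `events.out.tfevents.*` entry is a Git LFS pointer rather than the log itself; the pointer records a 307-byte TensorBoard event file held in LFS storage. Once the real file is fetched (for example with `git lfs pull`), the scalars logged during this run can be read back with TensorBoard's `EventAccumulator`; a short sketch:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Path to the events file added in this commit (fetch via `git lfs pull`).
path = (
    "logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, "
    "hs_weight=10.0, learning_rate=4e-05/"
    "events.out.tfevents.1723766051.5f530b1cf724"
)
acc = EventAccumulator(path)
acc.Reload()  # parse the event file from disk
for tag in acc.Tags()["scalars"]:
    for event in acc.Scalars(tag):
        print(tag, event.step, event.value)
```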