lapp0 committed
Commit 965ce1c
1 Parent(s): 74cac1a

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 24580.0566
- - eval_frwikippl: 58429.5703
- - eval_zhwikippl: 90638.1875
- - eval_tinystoriesppl: 13633.8428
- - eval_loss: 18.8988
- - eval_runtime: 32.6253
- - eval_samples_per_second: 76.628
- - eval_steps_per_second: 9.594
+ - eval_enwikippl: 22177.1309
+ - eval_frwikippl: 73852.3203
+ - eval_zhwikippl: 62537.2148
+ - eval_tinystoriesppl: 11654.0615
+ - eval_loss: 6.9194
+ - eval_runtime: 32.7141
+ - eval_samples_per_second: 76.42
+ - eval_steps_per_second: 9.568
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
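Note on the metrics above: the `*ppl` fields are perplexities on the named evaluation corpora (enwiki, frwiki, zhwiki, TinyStories), while `eval_loss` tracks the distillation objective defined below (it falls from ~18.90 to ~6.92 in this commit while the perplexities move far less), so it is not simply the log of those perplexities. As a reminder of how a language-modeling loss maps to perplexity, a minimal sketch (the 10.0 nats/token input is illustrative only, not a number from this card):

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    # Perplexity is the exponential of the mean per-token
    # negative log-likelihood (cross-entropy), measured in nats.
    return math.exp(mean_nll_nats)

# A mean cross-entropy of 10.0 nats/token corresponds to
# perplexity(10.0) ~= 22026.5, the same order of magnitude as
# the enwikippl values reported in this card.
```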
@@ -45,7 +45,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 4e-05
 - train_batch_size: 16
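The one substantive change in this commit is `hs_loss_component`'s `loss_fn` switching from `raw_mse` to `mse` (the attention component keeps `raw_mse`). As a rough sketch of what an objective shaped like this computes (an illustration of the KL-plus-feature-matching recipe, not Distily's actual implementation; in particular, the normalization distinguishing `mse` from `raw_mse` below is an assumption):

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits, teacher_logits):
    # KL divergence from the teacher's to the student's
    # next-token distribution (logits_loss_component, weight=1).
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )

def raw_mse(student_t, teacher_t):
    # Plain MSE, sensitive to the absolute scale of the activations.
    return F.mse_loss(student_t, teacher_t)

def mse(student_t, teacher_t):
    # Assumed variant: MSE normalized by the teacher's mean square,
    # which makes the term insensitive to activation scale.
    return F.mse_loss(student_t, teacher_t) / teacher_t.pow(2).mean().clamp_min(1e-8)

def distillation_objective(student_out, teacher_out):
    # Expects transformers-style outputs produced with
    # output_hidden_states=True and output_attentions=True.
    # Weights mirror the hyperparameters above: 1 on logits KL,
    # 10.0 on hidden-state MSE, 10.0 on attention raw MSE.
    loss = kl_logits_loss(student_out.logits, teacher_out.logits)
    loss = loss + 10.0 * sum(
        mse(s, t) for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    )
    loss = loss + 10.0 * sum(
        raw_mse(s, t) for s, t in zip(student_out.attentions, teacher_out.attentions)
    )
    return loss
```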
@@ -56,21 +56,21 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 16.2498 GB
+ Peak GPU Memory: 16.2515 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 30500.8262 | 64429.8789 | 19.1222 | 32.5358 | 76.838 | 9.62 | 17883.2402 | 92396.1641 |
- | 2000 | 0.1293 | 24580.0566 | 58429.5703 | 18.8980 | 32.4735 | 76.986 | 9.639 | 13633.8428 | 90638.1875 |
- | 4000 | 0.2586 | 24580.0566 | 58429.5703 | 18.8980 | 32.5203 | 76.875 | 9.625 | 13633.8428 | 90638.1875 |
- | 6000 | 0.3879 | 24580.0566 | 58429.5703 | 18.8988 | 32.628 | 76.621 | 9.593 | 13633.8428 | 90638.1875 |
- | 8000 | 0.5172 | 24580.0566 | 58429.5703 | 18.8988 | 32.6253 | 76.628 | 9.594 | 13633.8428 | 90638.1875 |
- | 10000 | 0.6465 | 24580.0566 | 58429.5703 | 18.8988 | 32.4883 | 76.951 | 9.634 | 13633.8428 | 90638.1875 |
- | 12000 | 0.7757 | 24580.0566 | 58429.5703 | 18.8980 | 32.4949 | 76.935 | 9.632 | 13633.8428 | 90638.1875 |
- | 14000 | 0.9050 | 24580.0566 | 58429.5703 | 18.8988 | 32.507 | 76.906 | 9.629 | 13633.8428 | 90638.1875 |
- | 15469 | 1.0 | 24580.0566 | 58429.5703 | 18.8988 | 32.6353 | 76.604 | 9.591 | 13633.8428 | 90638.1875 |
+ | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 32.6791 | 76.501 | 9.578 | 15431.1592 | 64176.7344 |
+ | 2000 | 0.1293 | 22177.1309 | 73852.3203 | 6.9194 | 32.645 | 76.582 | 9.588 | 11654.0615 | 62537.2148 |
+ | 4000 | 0.2586 | 22177.1309 | 73852.3203 | 6.9194 | 32.649 | 76.572 | 9.587 | 11654.0615 | 62537.2148 |
+ | 6000 | 0.3879 | 22177.1309 | 73852.3203 | 6.9194 | 32.8099 | 76.196 | 9.54 | 11654.0615 | 62537.2148 |
+ | 8000 | 0.5172 | 22177.1309 | 73852.3203 | 6.9194 | 32.7141 | 76.42 | 9.568 | 11654.0615 | 62537.2148 |
+ | 10000 | 0.6465 | 22177.1309 | 73852.3203 | 6.9194 | 32.6497 | 76.57 | 9.587 | 11654.0615 | 62537.2148 |
+ | 12000 | 0.7757 | 22177.1309 | 73852.3203 | 6.9194 | 32.6408 | 76.591 | 9.589 | 11654.0615 | 62537.2148 |
+ | 14000 | 0.9050 | 22177.1309 | 73852.3203 | 6.9194 | 32.6631 | 76.539 | 9.583 | 11654.0615 | 62537.2148 |
+ | 15469 | 1.0 | 22177.1309 | 73852.3203 | 6.9194 | 32.6508 | 76.568 | 9.586 | 11654.0615 | 62537.2148 |
 
 ### Framework versions
 - Distily 0.2.0
 
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=4e-05/events.out.tfevents.1723766051.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90e143c8f8a2001681b81ace6d7867a2978c824702d66a7c9fbaef8f0c6ef47b
+ size 307
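The added TensorBoard event file is stored through Git LFS, so the diff records only the three-line pointer (spec version, sha256 object id, and the stored object's size in bytes) rather than the log contents. A minimal sketch of decoding such a pointer (the `parse_lfs_pointer` helper is hypothetical, written for illustration):

```python
def parse_lfs_pointer(text: str) -> dict:
    # A Git LFS pointer is a tiny key/value file; the real object lives
    # in LFS storage and is addressed by its sha256 oid.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "oid": fields["oid"].removeprefix("sha256:"),
        "size_bytes": int(fields["size"]),  # size of the stored object
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:90e143c8f8a2001681b81ace6d7867a2978c824702d66a7c9fbaef8f0c6ef47b
size 307
"""
print(parse_lfs_pointer(pointer)["size_bytes"])  # 307
```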