lapp0 committed
Commit 026baad
1 Parent(s): 5b13da5

Training in progress, step 15469

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 12623.7412
- - eval_frwikippl: 52327.9961
- - eval_zhwikippl: 96451.4609
- - eval_tinystoriesppl: 6281.7222
- - eval_loss: 18.2212
- - eval_runtime: 32.9567
- - eval_samples_per_second: 75.857
- - eval_steps_per_second: 9.497
+ - eval_enwikippl: 5102.7344
+ - eval_frwikippl: 36133.4453
+ - eval_zhwikippl: 51745.4414
+ - eval_tinystoriesppl: 1872.7611
+ - eval_loss: 17.3780
+ - eval_runtime: 33.2598
+ - eval_samples_per_second: 75.166
+ - eval_steps_per_second: 9.411
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 4e-06
+ - learning_rate: 0.0004
 - train_batch_size: 16
 - eval_batch_size: 8
 - seed: 42
@@ -62,18 +62,18 @@ Peak GPU Memory: 16.2498 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 12930.5508 | 52883.7109 | 18.2508 | 32.8136 | 76.188 | 9.539 | 6471.4722 | 96760.75 |
- | 2000 | 0.1293 | 12623.7412 | 52327.9961 | 18.2212 | 32.8936 | 76.003 | 9.516 | 6281.7222 | 96451.4609 |
- | 4000 | 0.2586 | 12623.7412 | 52327.9961 | 18.2212 | 32.8034 | 76.212 | 9.542 | 6281.7222 | 96451.4609 |
- | 6000 | 0.3879 | 12623.7412 | 52327.9961 | 18.2212 | 32.8868 | 76.018 | 9.517 | 6281.7222 | 96451.4609 |
- | 8000 | 0.5172 | 12623.7412 | 52327.9961 | 18.2212 | 32.9567 | 75.857 | 9.497 | 6281.7222 | 96451.4609 |
- | 10000 | 0.6465 | 12623.7412 | 52327.9961 | 18.2212 | 32.9256 | 75.929 | 9.506 | 6281.7222 | 96451.4609 |
- | 12000 | 0.7757 | 12623.7412 | 52327.9961 | 18.2212 | 32.9173 | 75.948 | 9.509 | 6281.7222 | 96451.4609 |
- | 14000 | 0.9050 | 12623.7412 | 52327.9961 | 18.2212 | 33.1691 | 75.371 | 9.437 | 6281.7222 | 96451.4609 |
- | 15469 | 1.0 | 12623.7412 | 52327.9961 | 18.2212 | 33.0829 | 75.568 | 9.461 | 6281.7222 | 96451.4609 |
+ | 0 | 0 | 27673.0234 | 72839.8984 | 19.1728 | 33.1089 | 75.508 | 9.454 | 16450.7617 | 62570.5625 |
+ | 2000 | 0.1293 | 5104.3164 | 36174.1641 | 17.3784 | 33.2758 | 75.13 | 9.406 | 1874.0001 | 51745.4414 |
+ | 4000 | 0.2586 | 5104.3164 | 36214.9648 | 17.3776 | 33.0549 | 75.632 | 9.469 | 1873.0709 | 51773.0352 |
+ | 6000 | 0.3879 | 5104.3164 | 36133.4453 | 17.3784 | 33.0479 | 75.648 | 9.471 | 1874.3102 | 51717.8164 |
+ | 8000 | 0.5172 | 5102.7344 | 36133.4453 | 17.3780 | 33.2598 | 75.166 | 9.411 | 1872.7611 | 51745.4414 |
+ | 10000 | 0.6465 | 5105.8984 | 36133.4453 | 17.3784 | 33.2485 | 75.191 | 9.414 | 1875.8596 | 51745.4414 |
+ | 12000 | 0.7757 | 5105.8984 | 36133.4453 | 17.3780 | 33.0133 | 75.727 | 9.481 | 1874.9297 | 51717.8164 |
+ | 14000 | 0.9050 | 5104.3164 | 36214.9648 | 17.3776 | 33.2272 | 75.24 | 9.42 | 1872.7611 | 51745.4414 |
+ | 15469 | 1.0 | 5104.3164 | 36133.4453 | 17.3784 | 33.002 | 75.753 | 9.484 | 1874.3102 | 51745.4414 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.20.0
+ - Datasets 2.21.0
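The `distillation_objective` hyperparameter above combines three weighted loss terms: KL divergence on the logits (weight 1) plus raw MSE on the hidden states and on the attention maps (weight 10.0 each), with `layer_mapper=None` pairing student and teacher layers one-to-one. A minimal PyTorch sketch of such a combined objective, assuming student and teacher forward passes return `transformers`-style outputs with `logits`, `hidden_states`, and `attentions` populated (illustrative only, not Distily's actual implementation):

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out):
    # KL divergence between student and teacher next-token distributions (weight 1)
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Raw (unnormalized) MSE over layer-aligned hidden states (weight 10.0)
    hs_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    )
    # Raw MSE over layer-aligned attention maps (weight 10.0)
    attn_loss = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    )
    return logits_loss + 10.0 * hs_loss + 10.0 * attn_loss
```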
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.0004/events.out.tfevents.1723762065.b7d545513dcf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6518ff87f675a6b3e5adae33b7e064317bf9583faeec3900450f036594b27b50
+ size 4177663
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.0004/events.out.tfevents.1723761879.b7d545513dcf CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:39caabfb83489b9cb0240c2fdf65bc7c3d658ecbe6d525426ca6ba71cc5cc15a
- size 307
+ oid sha256:0c9f5b4b57dd364d6d66043f5138e2c74ea477fa4fedb9d680bc6a2544f0d10e
+ size 578
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:e2f3aa452fa56b42240ee51b50d8be7acdbed06dd7363b534af5d456d9810f25
+ oid sha256:c72edab3245e3f2a55532c6506aa9508843fa472a1853ac282d63d887c20ae82
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:f55d3533da93ae93135f2e1e4741afe9f327280493c21e3a1afb6cebed627fce
+ oid sha256:1870bf67e1c56310a08a6ab1cd1bd24e0af6a47271e2e9ff8f676e0a5597bee1
 size 1017948104
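The `model.safetensors`, `training_args.bin`, and log files in this commit are stored with Git LFS, so the diff shows only pointer files: a spec `version`, the SHA-256 (`oid`) of the actual blob, and its `size` in bytes. A small sketch for checking that a downloaded file matches its pointer (the filename is illustrative):

```python
import hashlib

def lfs_oid(path: str) -> str:
    """SHA-256 of the file contents, as recorded in the oid field of an LFS pointer."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

print(lfs_oid("model.safetensors"))  # should match the oid in the pointer above
```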