lapp0 committed on
Commit
a384cac
1 parent: 7f0bf3f

Training in progress, step 61875

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
- - eval_enwikippl: 158.7294
- - eval_frwikippl: 15434.1611
- - eval_zhwikippl: 106089.8359
- - eval_tinystoriesppl: 15.5930
- - eval_loss: 2.3671
- - eval_runtime: 65.5679
- - eval_samples_per_second: 76.257
- - eval_steps_per_second: 9.532
+ - eval_enwikippl: 165.1131
+ - eval_frwikippl: 53475.5117
+ - eval_zhwikippl: 433274.5625
+ - eval_tinystoriesppl: 9.8234
+ - eval_loss: 1.2315
+ - eval_runtime: 66.333
+ - eval_samples_per_second: 75.377
+ - eval_steps_per_second: 9.422
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
  The following hyperparameters were used during training:
  - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 0.001
+ - learning_rate: 0.004
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
@@ -63,31 +63,31 @@ Peak GPU Memory: 8.2677 GB
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 65.3142 | 76.553 | 9.569 | 12321.8145 | 60955.8008 |
- | 3000 | 0.0485 | 158.5205 | 15451.5547 | 2.3672 | 65.4452 | 76.4 | 9.55 | 15.5596 | 105976.6797 |
- | 6000 | 0.0970 | 159.4627 | 15519.1768 | 2.3670 | 65.5893 | 76.232 | 9.529 | 15.6570 | 107916.8281 |
- | 9000 | 0.1455 | 158.7294 | 15434.1611 | 2.3671 | 65.5679 | 76.257 | 9.532 | 15.5930 | 106089.8359 |
- | 12000 | 0.1939 | 159.8214 | 15466.7988 | 2.3670 | 65.4602 | 76.382 | 9.548 | 15.6933 | 107457.1484 |
- | 15000 | 0.2424 | 159.5122 | 15501.6934 | 2.3671 | 65.4126 | 76.438 | 9.555 | 15.6648 | 107916.8281 |
- | 18000 | 0.2909 | 158.8771 | 15434.1611 | 2.3670 | 65.4041 | 76.448 | 9.556 | 15.5956 | 106033.1953 |
- | 21000 | 0.3394 | 159.3146 | 15434.1611 | 2.3671 | 65.4872 | 76.351 | 9.544 | 15.6460 | 106089.8359 |
- | 24000 | 0.3879 | 159.4504 | 15434.1611 | 2.3670 | 65.5249 | 76.307 | 9.538 | 15.6589 | 106089.8359 |
- | 27000 | 0.4364 | 158.9386 | 15386.3984 | 2.3669 | 65.4767 | 76.363 | 9.545 | 15.6163 | 105581.5391 |
- | 30000 | 0.4848 | 159.1728 | 15451.5547 | 2.3671 | 65.3648 | 76.494 | 9.562 | 15.6369 | 107342.5391 |
- | 33000 | 0.5333 | 159.8709 | 15466.7988 | 2.3670 | 65.4363 | 76.41 | 9.551 | 15.6965 | 106942.4062 |
- | 36000 | 0.5818 | 159.2097 | 15460.2656 | 2.3670 | 65.4686 | 76.373 | 9.547 | 15.6318 | 107629.3516 |
- | 39000 | 0.6303 | 158.6066 | 15503.8809 | 2.3670 | 65.5342 | 76.296 | 9.537 | 15.5724 | 107744.2734 |
- | 42000 | 0.6788 | 158.5205 | 15468.9824 | 2.3671 | 65.5105 | 76.324 | 9.54 | 15.5576 | 107399.8828 |
- | 45000 | 0.7273 | 158.7909 | 15399.4043 | 2.3670 | 65.5316 | 76.299 | 9.537 | 15.6163 | 106089.8359 |
- | 48000 | 0.7758 | 158.7909 | 15434.1611 | 2.3671 | 65.4706 | 76.37 | 9.546 | 15.6027 | 106373.1953 |
- | 51000 | 0.8242 | 158.8033 | 15425.4648 | 2.3669 | 65.5734 | 76.25 | 9.531 | 15.6169 | 106089.8359 |
- | 54000 | 0.8727 | 158.9263 | 15434.1611 | 2.3670 | 65.5021 | 76.333 | 9.542 | 15.6085 | 106486.7812 |
- | 57000 | 0.9212 | 159.3887 | 15451.5547 | 2.3671 | 65.5842 | 76.238 | 9.53 | 15.6505 | 107342.5391 |
- | 60000 | 0.9697 | 159.4874 | 15390.7422 | 2.3670 | 65.5517 | 76.276 | 9.534 | 15.6641 | 105581.5391 |
- | 61875 | 1.0 | 159.6729 | 15492.9736 | 2.3671 | 65.3871 | 76.468 | 9.558 | 15.6926 | 107342.5391 |
+ | 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 66.2144 | 75.512 | 9.439 | 12321.8145 | 60955.8008 |
+ | 3000 | 0.0485 | 163.4777 | 52357.5 | 1.2318 | 66.2157 | 75.511 | 9.439 | 9.7461 | 405426.5312 |
+ | 6000 | 0.0970 | 164.5832 | 52313.2773 | 1.2313 | 66.4422 | 75.253 | 9.407 | 9.8299 | 399840.8438 |
+ | 9000 | 0.1455 | 165.1131 | 53475.5117 | 1.2315 | 66.333 | 75.377 | 9.422 | 9.8234 | 433274.5625 |
+ | 12000 | 0.1939 | 175.9144 | 53347.6094 | 1.2314 | 66.1602 | 75.574 | 9.447 | 10.8017 | 431314.25 |
+ | 15000 | 0.2424 | 166.3006 | 53279.9844 | 1.2312 | 66.2525 | 75.469 | 9.434 | 9.8988 | 439327.75 |
+ | 18000 | 0.2909 | 165.6127 | 53520.7148 | 1.2316 | 66.1729 | 75.56 | 9.445 | 9.8144 | 435128.4375 |
+ | 21000 | 0.3394 | 164.5322 | 54035.7852 | 1.2317 | 66.1602 | 75.574 | 9.447 | 9.7405 | 440971.5312 |
+ | 24000 | 0.3879 | 175.3294 | 52764.6719 | 1.2315 | 66.2724 | 75.446 | 9.431 | 10.8317 | 413399.9688 |
+ | 27000 | 0.4364 | 176.5424 | 52542.1719 | 1.2318 | 66.1192 | 75.621 | 9.453 | 10.8923 | 411858.9688 |
+ | 30000 | 0.4848 | 165.8760 | 53490.5586 | 1.2310 | 66.2275 | 75.497 | 9.437 | 9.8575 | 418728.4062 |
+ | 33000 | 0.5333 | 165.8246 | 53823.1172 | 1.2314 | 66.2901 | 75.426 | 9.428 | 9.8718 | 418728.4062 |
+ | 36000 | 0.5818 | 164.1630 | 51814.5781 | 1.2318 | 66.2021 | 75.526 | 9.441 | 9.7994 | 426621.9688 |
+ | 39000 | 0.6303 | 176.7887 | 53868.6172 | 1.2316 | 66.2783 | 75.439 | 9.43 | 10.8604 | 428103.875 |
+ | 42000 | 0.6788 | 176.1667 | 52483.0273 | 1.2317 | 66.1638 | 75.57 | 9.446 | 10.8981 | 430854.2188 |
+ | 45000 | 0.7273 | 165.7860 | 54833.2031 | 1.2316 | 66.226 | 75.499 | 9.437 | 9.7437 | 433274.5625 |
+ | 48000 | 0.7758 | 177.6536 | 54096.7305 | 1.2314 | 66.2463 | 75.476 | 9.434 | 10.8174 | 436174.1875 |
+ | 51000 | 0.8242 | 164.2648 | 53400.2422 | 1.2316 | 66.365 | 75.341 | 9.418 | 9.7212 | 425257.9062 |
+ | 54000 | 0.8727 | 177.6812 | 53792.7930 | 1.2314 | 66.3929 | 75.309 | 9.414 | 10.8147 | 419846.8125 |
+ | 57000 | 0.9212 | 169.2639 | 54421.5391 | 1.2310 | 66.2287 | 75.496 | 9.437 | 10.0082 | 441678.1875 |
+ | 60000 | 0.9697 | 163.0982 | 52077.9805 | 1.2316 | 66.2184 | 75.508 | 9.438 | 9.7240 | 401765.375 |
+ | 61875 | 1.0 | 164.1884 | 52549.5898 | 1.2312 | 66.1617 | 75.572 | 9.447 | 9.8116 | 419846.8125 |
 
  ### Framework versions
  - Distily 0.2.0
  - Transformers 4.44.0
  - Pytorch 2.3.0
- - Datasets 2.21.0
+ - Datasets 2.20.0
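The `distillation_objective` entry above is a weighted sum of three terms: KL divergence on the logits (weight 1) plus MSE on the hidden states and on the attention maps (weight 10 each); `layer_mapper=None` and `projector=None` imply student and teacher layers are compared one-to-one. Below is a minimal PyTorch sketch of what such an objective computes; it is an illustration, not Distily's actual code, and `student_out`/`teacher_out` stand for `transformers` causal-LM outputs obtained with `output_hidden_states=True` and `output_attentions=True`.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    """Weighted sum of KL(logits) + MSE(hidden states) + MSE(attentions).

    Illustrative sketch. Assumes the teacher forward pass ran under
    torch.no_grad() and that both models expose the same number of layers
    with matching widths (layer_mapper=None, projector=None).
    """
    # KL divergence from the teacher's to the student's token distribution.
    logits_loss = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.log_softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    # Mean MSE across every layer's hidden states and attention maps.
    hs_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    attn_loss = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return logits_weight * logits_loss + hs_weight * hs_loss + attn_weight * attn_loss
```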
 
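The `*ppl` columns (enwikippl, frwikippl, zhwikippl, tinystoriesppl) are presumably perplexities on English, French, and Chinese Wikipedia and on TinyStories text; `loss` is the distillation loss above, not a language-model cross-entropy, which is why exp(loss) does not match the perplexities. Perplexity is conventionally the exponential of the mean token cross-entropy; here is a sketch under that assumption (`model` and `tokenizer` are placeholders, and Distily's own evaluation code may differ):

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    """exp(mean token cross-entropy): the usual definition behind *ppl metrics."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # transformers causal LMs return the shifted cross-entropy when labels are given.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```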
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=cos, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723806741.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c5808af816d629478fd1f0f14f2940ca43d57f5483d9725d4bc5c468d132e20
+ size 16923348
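The three-line blocks in this and the following diffs are Git LFS pointer files: the repository tracks only the spec version, a `sha256` object id, and the byte size, while the blob itself lives in LFS storage. Parsing one is a one-liner (the helper name is illustrative):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer ('key value' per line) into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

# On a checkout without LFS smudging, the tracked file *is* the pointer text,
# so e.g. parse_lfs_pointer(pointer_text)["size"] -> "16923348".
```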
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/events.out.tfevents.1723806543.93d6cbb3ad53 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:59c189d9b938217f6fb57af494a20fb22bf2717b1bf394e46f3f0c25d0c9f70f
- size 312
+ oid sha256:64807b2ee2daeb4cffe0f85ae30a4ab9ad7fb4d9036c81d29254cf9c1b4fba56
+ size 588
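The `events.out.tfevents.*` files are TensorBoard event logs of the training run. One way to read the logged scalars offline, assuming the `tensorboard` package is installed (a sketch, not part of this repository):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Path taken verbatim from the diff above; any local copy of the file works.
acc = EventAccumulator("logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, "
                       "hs_weight=10.0, learning_rate=0.004, warmup_ratio=0.1/"
                       "events.out.tfevents.1723806543.93d6cbb3ad53")
acc.Reload()                    # parse the event file
print(acc.Tags()["scalars"])    # names of the logged scalar series
```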
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:331a8ab929b55895bcffd8c0c5219a8b4a3bb88818c1a6dca552d4a8164c7664
+ oid sha256:1e70b8f8cdbd3dfa80135450fd849dc798769a33ae1dc5b193ef3658b30e881e
  size 137033984
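`model.safetensors` holds the student's weights; the size is unchanged between revisions because only the parameter values differ. The tensors can be inspected directly with the `safetensors` library (a sketch):

```python
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")   # maps tensor name -> torch.Tensor
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```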
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d140612a7dd97149912b4523c511f5472fbf76630c103e3935cf205281819d6b
+ oid sha256:2dcb263d55f6965884f8fa3afad966b9db28704ad230bb9ff93a7464658c2df1
  size 1017948104
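`training_args.bin` is, by `transformers` convention, a pickled `TrainingArguments`-style object; the unusually large size here presumably comes from extra state Distily serializes alongside it. Loading it means unpickling, so only do this on a trusted checkout (a sketch):

```python
import torch

# A pickle, not a tensor file: weights_only=False is required and implies trust.
args = torch.load("training_args.bin", weights_only=False)
print(args.learning_rate)   # 0.004 after this commit
```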