lapp0 committed
Commit 740f2d7
1 Parent(s): 2603c2a

Training in progress, step 61875

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 165.1131
- - eval_frwikippl: 53475.5117
- - eval_zhwikippl: 433274.5625
- - eval_tinystoriesppl: 9.8234
- - eval_loss: 1.2315
- - eval_runtime: 66.0943
- - eval_samples_per_second: 75.65
- - eval_steps_per_second: 9.456
+ - eval_enwikippl: 207.7861
+ - eval_frwikippl: 15066.8408
+ - eval_zhwikippl: 64727.0352
+ - eval_tinystoriesppl: 24.1522
+ - eval_loss: 13.2019
+ - eval_runtime: 65.4151
+ - eval_samples_per_second: 76.435
+ - eval_steps_per_second: 9.554
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -45,14 +45,15 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 0.004
+ - learning_rate: 0.001
 - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: constant
+ - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1.0
 
 ### Resource Usage
@@ -62,31 +63,31 @@ Peak GPU Memory: 8.2677 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 21397.4785 | 57946.0117 | 6.1625 | 66.2174 | 75.509 | 9.439 | 12321.8145 | 60955.8008 |
- | 3000 | 0.0485 | 163.4777 | 52357.5 | 1.2318 | 66.2571 | 75.464 | 9.433 | 9.7461 | 405426.5312 |
- | 6000 | 0.0970 | 164.5832 | 52313.2773 | 1.2313 | 66.0272 | 75.726 | 9.466 | 9.8299 | 399840.8438 |
- | 9000 | 0.1455 | 165.1131 | 53475.5117 | 1.2315 | 66.0943 | 75.65 | 9.456 | 9.8234 | 433274.5625 |
- | 12000 | 0.1939 | 175.9144 | 53347.6094 | 1.2314 | 66.2812 | 75.436 | 9.43 | 10.8017 | 431314.25 |
- | 15000 | 0.2424 | 166.3006 | 53279.9844 | 1.2312 | 66.2142 | 75.513 | 9.439 | 9.8988 | 439327.75 |
- | 18000 | 0.2909 | 165.6127 | 53520.7148 | 1.2316 | 66.0569 | 75.692 | 9.462 | 9.8144 | 435128.4375 |
- | 21000 | 0.3394 | 164.5322 | 54035.7852 | 1.2317 | 66.1313 | 75.607 | 9.451 | 9.7405 | 440971.5312 |
- | 24000 | 0.3879 | 175.3294 | 52764.6719 | 1.2315 | 66.0715 | 75.676 | 9.459 | 10.8317 | 413399.9688 |
- | 27000 | 0.4364 | 176.5424 | 52542.1719 | 1.2318 | 66.0569 | 75.692 | 9.462 | 10.8923 | 411858.9688 |
- | 30000 | 0.4848 | 165.8760 | 53490.5586 | 1.2310 | 66.1506 | 75.585 | 9.448 | 9.8575 | 418728.4062 |
- | 33000 | 0.5333 | 165.8246 | 53823.1172 | 1.2314 | 66.2171 | 75.509 | 9.439 | 9.8718 | 418728.4062 |
- | 36000 | 0.5818 | 164.1630 | 51814.5781 | 1.2318 | 66.0204 | 75.734 | 9.467 | 9.7994 | 426621.9688 |
- | 39000 | 0.6303 | 176.7887 | 53868.6172 | 1.2316 | 66.1946 | 75.535 | 9.442 | 10.8604 | 428103.875 |
- | 42000 | 0.6788 | 176.1667 | 52483.0273 | 1.2317 | 66.1202 | 75.62 | 9.452 | 10.8981 | 430854.2188 |
- | 45000 | 0.7273 | 165.7860 | 54833.2031 | 1.2316 | 66.2376 | 75.486 | 9.436 | 9.7437 | 433274.5625 |
- | 48000 | 0.7758 | 177.6536 | 54096.7305 | 1.2314 | 66.1756 | 75.557 | 9.445 | 10.8174 | 436174.1875 |
- | 51000 | 0.8242 | 164.2648 | 53400.2422 | 1.2316 | 66.2723 | 75.446 | 9.431 | 9.7212 | 425257.9062 |
- | 54000 | 0.8727 | 177.6812 | 53792.7930 | 1.2314 | 66.4654 | 75.227 | 9.403 | 10.8147 | 419846.8125 |
- | 57000 | 0.9212 | 169.2639 | 54421.5391 | 1.2310 | 66.3397 | 75.37 | 9.421 | 10.0082 | 441678.1875 |
- | 60000 | 0.9697 | 163.0982 | 52077.9805 | 1.2316 | 66.1784 | 75.553 | 9.444 | 9.7240 | 401765.375 |
- | 61875 | 1.0 | 164.1884 | 52549.5898 | 1.2312 | 66.3619 | 75.344 | 9.418 | 9.8116 | 419846.8125 |
+ | 0 | 0 | 21397.4785 | 57946.0117 | 18.3162 | 65.6143 | 76.203 | 9.525 | 12321.8145 | 60955.8008 |
+ | 3000 | 0.0485 | 207.9149 | 15083.8350 | 13.2031 | 65.3099 | 76.558 | 9.57 | 24.1822 | 63920.4375 |
+ | 6000 | 0.0970 | 207.6253 | 15109.3467 | 13.2019 | 65.2976 | 76.572 | 9.572 | 24.1253 | 65386.5469 |
+ | 9000 | 0.1455 | 207.7861 | 15066.8408 | 13.2019 | 65.4151 | 76.435 | 9.554 | 24.1522 | 64727.0352 |
+ | 12000 | 0.1939 | 207.4002 | 15100.8330 | 13.2016 | 65.3491 | 76.512 | 9.564 | 24.0894 | 65229.7188 |
+ | 15000 | 0.2424 | 207.8023 | 15100.8330 | 13.2017 | 65.4255 | 76.423 | 9.553 | 24.1442 | 65247.1406 |
+ | 18000 | 0.2909 | 208.3987 | 15075.3359 | 13.2031 | 65.4213 | 76.428 | 9.553 | 24.2462 | 64057.0078 |
+ | 21000 | 0.3394 | 208.0761 | 15100.8330 | 13.2026 | 65.2706 | 76.604 | 9.576 | 24.2142 | 64537.3164 |
+ | 24000 | 0.3879 | 207.9955 | 15100.8330 | 13.2027 | 65.2287 | 76.653 | 9.582 | 24.1822 | 64159.6602 |
+ | 27000 | 0.4364 | 208.3180 | 15058.3516 | 13.2033 | 65.1653 | 76.728 | 9.591 | 24.2272 | 63869.3125 |
+ | 30000 | 0.4848 | 207.1754 | 15100.8330 | 13.2016 | 65.1169 | 76.785 | 9.598 | 24.0546 | 65229.7188 |
+ | 33000 | 0.5333 | 208.0761 | 15083.8350 | 13.2026 | 65.2105 | 76.675 | 9.584 | 24.2412 | 64588.9727 |
+ | 36000 | 0.5818 | 207.1754 | 15066.8408 | 13.2023 | 65.3453 | 76.517 | 9.565 | 24.0715 | 65229.7188 |
+ | 39000 | 0.6303 | 207.1754 | 15100.8330 | 13.2017 | 65.2569 | 76.62 | 9.578 | 24.0695 | 65229.7188 |
+ | 42000 | 0.6788 | 207.3681 | 15058.3516 | 13.2021 | 65.2167 | 76.668 | 9.583 | 24.0954 | 64796.1484 |
+ | 45000 | 0.7273 | 207.9955 | 15100.8330 | 13.2026 | 65.2551 | 76.622 | 9.578 | 24.1982 | 64159.6602 |
+ | 48000 | 0.7758 | 207.7861 | 15092.3242 | 13.2017 | 65.3187 | 76.548 | 9.568 | 24.1412 | 64727.0352 |
+ | 51000 | 0.8242 | 208.2050 | 15058.3516 | 13.2029 | 65.2525 | 76.625 | 9.578 | 24.2262 | 64193.8711 |
+ | 54000 | 0.8727 | 207.7861 | 15100.8330 | 13.2027 | 65.2798 | 76.593 | 9.574 | 24.1362 | 64331.0312 |
+ | 57000 | 0.9212 | 207.7218 | 15100.8330 | 13.2017 | 65.2646 | 76.611 | 9.576 | 24.1163 | 65125.4180 |
+ | 60000 | 0.9697 | 208.5925 | 15092.3242 | 13.2034 | 65.2233 | 76.66 | 9.582 | 24.2653 | 63869.3125 |
+ | 61875 | 1.0 | 208.0116 | 15100.8330 | 13.2018 | 65.2936 | 76.577 | 9.572 | 24.2012 | 64917.2539 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.20.0
+ - Datasets 2.21.0
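For context on the `distillation_objective` recorded in the diff above (KL on logits with weight 1, MSE on hidden states and attention maps with weight 10.0 each), here is a minimal PyTorch-style sketch of how such a composite loss can be combined. It is an illustration only, not Distily's actual implementation; the function name and output attributes are assumptions.

```python
import torch.nn.functional as F


def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    """Weighted sum of a KL term on logits and MSE terms on hidden states and
    attention maps, mirroring the weights recorded in the diff above.
    `student_out` / `teacher_out` are assumed to be Hugging Face model outputs
    produced with output_hidden_states=True and output_attentions=True."""
    # KL divergence between student and teacher token distributions (weight 1).
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Mean-squared error averaged over all hidden-state layers (weight 10.0).
    hs = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ) / len(student_out.hidden_states)
    # Mean-squared error averaged over all attention maps (weight 10.0).
    attn = sum(
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ) / len(student_out.attentions)
    return logits_weight * kl + hs_weight * hs + attn_weight * attn
```

In practice the teacher forward pass would be run under `torch.no_grad()` so that only the student receives gradients.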
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=0.001, warmup_ratio=0/events.out.tfevents.1723793817.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a271e261243f07a895516bac5465c54600c35e229cac1acdfba7a67e948a94a6
+ size 16923348
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.001, warmup_ratio=0.1/events.out.tfevents.1723793623.5f530b1cf724 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:3399554b3290cdc2f6e9aefc5517c45c6373a47fc48eb89da7e3857bb74515ed
- size 312
+ oid sha256:9dbaea00a6a5062cb64decac9ae1794a3f95f4ca221e1bd46b7fae0031090e84
+ size 588
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:b2b31538cce0ce4b165827bc5f17f54668fbd3fb115028a3f94e3732aaf22719
+ oid sha256:331a8ab929b55895bcffd8c0c5219a8b4a3bb88818c1a6dca552d4a8164c7664
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:a8e83afba9c435ea9ad8bd8c09e27f039619a31ac399d42958a8e043e1426bd5
+ oid sha256:06603e22afc968d0e36b2e11bd6961be40ef09276a0f214f8780b1fe64eed6cd
 size 1017948104
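The `model.safetensors`, `training_args.bin`, and log entries above are Git LFS pointer files (version, oid, size), so the diff records only new object hashes; the binary payloads live in LFS storage. A minimal sketch of retrieving the actual weights file with `huggingface_hub` follows; the repo id is a placeholder, not taken from this page.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id for this distilled model; substitute the real repository.
weights_path = hf_hub_download(
    repo_id="lapp0/<distilled-model-repo>",
    filename="model.safetensors",
)
print(weights_path)
```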