lapp0 committed
Commit
5a46d24
Parent: 06c2c39

Training in progress, step 61875

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 4707.3955
- - eval_frwikippl: 38983.5547
- - eval_zhwikippl: 53243.9219
- - eval_tinystoriesppl: 1636.1367
- - eval_loss: 5.6090
- - eval_runtime: 33.6693
- - eval_samples_per_second: 74.252
- - eval_steps_per_second: 9.296
+ - eval_enwikippl: 26828.6738
+ - eval_frwikippl: 80365.2578
+ - eval_zhwikippl: 64005.7734
+ - eval_tinystoriesppl: 14879.9648
+ - eval_loss: 7.0927
+ - eval_runtime: 33.2949
+ - eval_samples_per_second: 75.087
+ - eval_steps_per_second: 9.401
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
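The `*ppl` metrics in this hunk are perplexities on evaluation text from English Wikipedia, French Wikipedia, Chinese Wikipedia, and TinyStories. As a rough sketch of how such a figure is produced, assuming the standard exp-of-mean-cross-entropy definition (Distily's exact datasets and windowing are not shown here, and the checkpoint path below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: substitute the actual distilled-student checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/distilled-student").eval()
tokenizer = AutoTokenizer.from_pretrained("path/to/distilled-student")

@torch.no_grad()
def perplexity(text: str) -> float:
    # Perplexity = exp(mean per-token cross-entropy of the LM on the text).
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
    return torch.exp(out.loss).item()

print(perplexity("Once upon a time, there was a little girl named Lily."))
```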
@@ -47,7 +47,7 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 0.0004
+ - learning_rate: 4e-06
 - train_batch_size: 16
 - eval_batch_size: 8
 - seed: 42
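The `distillation_objective` entry encodes a three-term loss: KL divergence on the logits (weight 1), MSE on hidden states (weight 10.0), and raw MSE on attention maps (weight 10.0), with no layer mapping or projection. A minimal PyTorch sketch of such an objective; the real `DistillationObjective` lives in the Distily repo, and the pairing and reduction choices here are assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, hs_weight=10.0, attn_weight=10.0):
    """Both outputs must be produced with output_hidden_states=True and
    output_attentions=True; layer counts are assumed to match (layer_mapper=None
    is read here as a one-to-one pairing)."""
    # Logits term (weight 1): KL(teacher || student) over next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Hidden-state term: plain MSE over corresponding layers.
    hs = sum(F.mse_loss(s, t) for s, t in
             zip(student_out.hidden_states, teacher_out.hidden_states))
    # Attention term: "raw_mse" read as elementwise MSE on attention maps.
    attn = sum(F.mse_loss(s, t) for s, t in
               zip(student_out.attentions, teacher_out.attentions))
    return kl + hs_weight * hs + attn_weight * attn
```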
@@ -62,18 +62,18 @@ Peak GPU Memory: 16.2515 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.5606 | 74.492 | 9.326 | 15431.1592 | 64176.7344 |
- | 2000 | 0.1293 | 4707.3955 | 39038.5078 | 5.6090 | 33.5798 | 74.45 | 9.321 | 1635.3256 | 53243.9219 |
- | 4000 | 0.2586 | 4704.4829 | 38983.5547 | 5.6090 | 33.4351 | 74.772 | 9.361 | 1632.6235 | 53243.9219 |
- | 6000 | 0.3879 | 4717.6201 | 38972.5898 | 5.6090 | 33.4912 | 74.646 | 9.346 | 1638.3024 | 53243.9219 |
- | 8000 | 0.5172 | 4707.3955 | 38983.5547 | 5.6090 | 33.6693 | 74.252 | 9.296 | 1636.1367 | 53243.9219 |
- | 10000 | 0.6465 | 4707.3955 | 38994.5625 | 5.6090 | 33.5614 | 74.49 | 9.326 | 1636.4075 | 53243.9219 |
- | 12000 | 0.7757 | 4707.3955 | 39005.5352 | 5.6090 | 33.5763 | 74.457 | 9.322 | 1634.2444 | 53243.9219 |
- | 14000 | 0.9050 | 4708.1274 | 38994.5625 | 5.6090 | 33.4805 | 74.67 | 9.349 | 1636.4075 | 53243.9219 |
- | 15469 | 1.0 | 4704.4829 | 39027.5234 | 5.6090 | 33.5496 | 74.517 | 9.329 | 1632.6235 | 53243.9219 |
+ | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.2322 | 75.228 | 9.419 | 15431.1592 | 64176.7344 |
+ | 2000 | 0.1293 | 26828.6738 | 80365.2578 | 7.0927 | 33.0326 | 75.683 | 9.475 | 14879.9648 | 64005.7734 |
+ | 4000 | 0.2586 | 26828.6738 | 80365.2578 | 7.0927 | 33.2498 | 75.188 | 9.414 | 14879.9648 | 64005.7734 |
+ | 6000 | 0.3879 | 26828.6738 | 80365.2578 | 7.0927 | 33.1527 | 75.409 | 9.441 | 14879.9648 | 64005.7734 |
+ | 8000 | 0.5172 | 26828.6738 | 80365.2578 | 7.0927 | 33.2949 | 75.087 | 9.401 | 14879.9648 | 64005.7734 |
+ | 10000 | 0.6465 | 26828.6738 | 80365.2578 | 7.0927 | 32.9736 | 75.818 | 9.492 | 14879.9648 | 64005.7734 |
+ | 12000 | 0.7757 | 26828.6738 | 80365.2578 | 7.0927 | 33.0002 | 75.757 | 9.485 | 14879.9648 | 64005.7734 |
+ | 14000 | 0.9050 | 26828.6738 | 80365.2578 | 7.0927 | 33.0244 | 75.702 | 9.478 | 14879.9648 | 64005.7734 |
+ | 15469 | 1.0 | 26828.6738 | 80365.2578 | 7.0927 | 33.0358 | 75.675 | 9.475 | 14879.9648 | 64005.7734 |
 
 ### Framework versions
 - Distily 0.2.0
 - Transformers 4.44.0
 - Pytorch 2.3.0
- - Datasets 2.21.0
+ - Datasets 2.20.0
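Two quick arithmetic checks on the table above: the epoch column is step divided by the 15469 steps in one epoch, and samples_per_second divided by steps_per_second recovers the eval batch size of 8 (values from the step-2000 and step-8000 rows of the new results):

```python
# Sanity checks on the eval-results table.
print(2000 / 15469)    # 0.1293 -> epoch column = step / steps_per_epoch
print(75.087 / 9.401)  # ~7.99  -> samples per step matches eval_batch_size = 8
```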
 
logs/attn_loss_fn=mse, attn_weight=10.0, hidden_weight=10.0, hs_loss_fn=raw_mse, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723766493.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8211110fbed13b0847779fe60e88eab28936dd52f234ee43dd30bd3664417593
+ size 6165
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.004, warmup_ratio=0/events.out.tfevents.1723766870.93d6cbb3ad53 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84283b041bc58105a70e83c603b4c177c60032f1454b10e6ec870894b0ea7cad
+ size 16923352
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=4e-06/events.out.tfevents.1723766171.93d6cbb3ad53 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:a68d1623296a8011dc3c5e5fe3f3b75dc729bb2759dc0a8b461138eea9669f99
- size 307
+ oid sha256:696496ca2470c450d90bdbf8b3a79234aef33fd4bc229801bb90d0c2be0767ee
+ size 578
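The `events.out.tfevents.*` files above are TensorBoard event logs, one per run configuration. A small sketch for inspecting one offline, assuming the standard `tensorboard` package is installed (the run directory is a placeholder and tag names vary per run, so none are hard-coded):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

run_dir = "logs/<run_name>"  # placeholder: any run directory listed above
acc = EventAccumulator(run_dir)
acc.Reload()  # parses the events.out.tfevents.* file(s) in run_dir
for tag in acc.Tags()["scalars"]:
    events = acc.Scalars(tag)
    print(tag, [(e.step, e.value) for e in events[:3]])  # first few points
```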
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:c72edab3245e3f2a55532c6506aa9508843fa472a1853ac282d63d887c20ae82
+ oid sha256:9505c0f32ea5812d4a2fba0aa92c57148f2a4589c13fbdefc77bdb4fa6713906
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:1870bf67e1c56310a08a6ab1cd1bd24e0af6a47271e2e9ff8f676e0a5597bee1
+ oid sha256:658966eef28ccc5ad224ee05199999e2c90265f455f817d1dd80e3722545273f
 size 1017948104
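`model.safetensors` and `training_args.bin`, like the logs, are tracked with Git LFS, so the diff shows three-line pointer stubs rather than binary contents: `version`, `oid` (the SHA-256 of the blob), and `size` in bytes. A sketch that verifies a downloaded blob against its pointer (file names in the usage comment are illustrative):

```python
import hashlib
import os

def verify_lfs_pointer(pointer_path: str, blob_path: str) -> bool:
    # Parse the three "key value" lines of a Git LFS pointer file.
    fields = dict(
        line.strip().split(" ", 1)
        for line in open(pointer_path)
        if line.strip()
    )
    expected_oid = fields["oid"].removeprefix("sha256:")
    expected_size = int(fields["size"])
    # Compare size first (cheap), then the SHA-256 of the blob contents.
    if os.path.getsize(blob_path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_oid

# e.g. verify_lfs_pointer("model.safetensors.pointer", "model.safetensors")
```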