lapp0 committed
Commit 3546915
1 Parent(s): 4f7c44e

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 4707.3955
- - eval_frwikippl: 38983.5547
- - eval_zhwikippl: 53243.9219
- - eval_tinystoriesppl: 1636.1367
- - eval_loss: 5.6090
- - eval_runtime: 33.6693
- - eval_samples_per_second: 74.252
- - eval_steps_per_second: 9.296
+ - eval_enwikippl: 3694.4192
+ - eval_frwikippl: 30929.5703
+ - eval_zhwikippl: 45501.2617
+ - eval_tinystoriesppl: 1160.7031
+ - eval_loss: 15.6337
+ - eval_runtime: 66.5963
+ - eval_samples_per_second: 75.079
+ - eval_steps_per_second: 9.385
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
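
(The eval_*ppl values above are corpus perplexities, presumably on English/French/Chinese Wikipedia and TinyStories given the names; lower is better. Perplexity is conventionally the exponential of the mean per-token negative log-likelihood, as in the generic sketch below — this is not Distily's evaluation code. Note also that eval_loss is the distillation objective, not a language-modeling loss, so these perplexities are not simply exp(eval_loss).)

```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model averaging ~8.457 nats per token on a corpus scores ppl ~4707,
# matching the scale of the old eval_enwikippl figure above.
print(perplexity([8.457, 8.457]))  # ≈ 4707
```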
@@ -45,10 +45,10 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 0.0004
- - train_batch_size: 16
+ - train_batch_size: 8
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
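
(The distillation_objective change is the substance of this commit: the hidden-state (hs) loss function switches from mse to raw_mse and the attention (attn) loss from raw_mse to mse, with weights unchanged. In broad strokes, such an objective is a weighted sum of a KL term on logits and MSE terms on intermediate activations. The PyTorch sketch below illustrates that weighted sum under assumed tensor layouts; it is not Distily's actual implementation, and it glosses over the raw_mse vs. mse distinction, which is presumably a normalization difference.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student: dict, teacher: dict,
                      logits_w=1.0, hs_w=10.0, attn_w=10.0) -> torch.Tensor:
    """Weighted sum of logit KL + hidden-state MSE + attention MSE.

    `student`/`teacher` are assumed to be dicts of tensors:
    logits (batch, seq, vocab); hs/attn stacked per-layer activations.
    """
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student["logits"], dim=-1),
        F.log_softmax(teacher["logits"], dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # MSE on intermediate activations (layer_mapper=None -> match layers 1:1).
    hs_mse = F.mse_loss(student["hs"], teacher["hs"])
    attn_mse = F.mse_loss(student["attn"], teacher["attn"])
    return logits_w * kl + hs_w * hs_mse + attn_w * attn_mse
```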
@@ -56,21 +56,34 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 16.2515 GB
+ Peak GPU Memory: 8.2666 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 33.5606 | 74.492 | 9.326 | 15431.1592 | 64176.7344 |
- | 2000 | 0.1293 | 4707.3955 | 39038.5078 | 5.6090 | 33.5798 | 74.45 | 9.321 | 1635.3256 | 53243.9219 |
- | 4000 | 0.2586 | 4704.4829 | 38983.5547 | 5.6090 | 33.4351 | 74.772 | 9.361 | 1632.6235 | 53243.9219 |
- | 6000 | 0.3879 | 4717.6201 | 38972.5898 | 5.6090 | 33.4912 | 74.646 | 9.346 | 1638.3024 | 53243.9219 |
- | 8000 | 0.5172 | 4707.3955 | 38983.5547 | 5.6090 | 33.6693 | 74.252 | 9.296 | 1636.1367 | 53243.9219 |
- | 10000 | 0.6465 | 4707.3955 | 38994.5625 | 5.6090 | 33.5614 | 74.49 | 9.326 | 1636.4075 | 53243.9219 |
- | 12000 | 0.7757 | 4707.3955 | 39005.5352 | 5.6090 | 33.5763 | 74.457 | 9.322 | 1634.2444 | 53243.9219 |
- | 14000 | 0.9050 | 4708.1274 | 38994.5625 | 5.6090 | 33.4805 | 74.67 | 9.349 | 1636.4075 | 53243.9219 |
- | 15469 | 1.0 | 4704.4829 | 39027.5234 | 5.6090 | 33.5496 | 74.517 | 9.329 | 1632.6235 | 53243.9219 |
+ | 0 | 0 | 19278.7617 | 60268.5703 | 17.3716 | 66.6062 | 75.068 | 9.384 | 9660.0908 | 53858.2383 |
+ | 3000 | 0.0485 | 3702.4434 | 30929.5703 | 15.6332 | 66.4884 | 75.201 | 9.4 | 1163.3928 | 45525.5703 |
+ | 6000 | 0.0970 | 3702.4434 | 30929.5703 | 15.6346 | 67.0021 | 74.625 | 9.328 | 1163.0084 | 45525.5703 |
+ | 9000 | 0.1455 | 3694.4192 | 30929.5703 | 15.6337 | 66.5963 | 75.079 | 9.385 | 1160.7031 | 45501.2617 |
+ | 12000 | 0.1939 | 3696.7100 | 30981.8828 | 15.6348 | 66.5218 | 75.163 | 9.395 | 1161.0868 | 45525.5703 |
+ | 15000 | 0.2424 | 3697.8560 | 30929.5703 | 15.6342 | 66.6948 | 74.968 | 9.371 | 1161.8550 | 45525.5703 |
+ | 18000 | 0.2909 | 3696.7100 | 30981.8828 | 15.6348 | 66.2258 | 75.499 | 9.437 | 1161.2789 | 45525.5703 |
+ | 21000 | 0.3394 | 3697.8560 | 30946.9785 | 15.6344 | 66.5835 | 75.094 | 9.387 | 1161.4711 | 45525.5703 |
+ | 24000 | 0.3879 | 3697.8560 | 30929.5703 | 15.6334 | 66.8279 | 74.819 | 9.352 | 1162.0472 | 45525.5703 |
+ | 27000 | 0.4364 | 3697.8560 | 30981.8828 | 15.6346 | 66.5691 | 75.11 | 9.389 | 1161.6627 | 45525.5703 |
+ | 30000 | 0.4848 | 3696.7100 | 30946.9785 | 15.6346 | 66.7012 | 74.961 | 9.37 | 1160.7031 | 45525.5703 |
+ | 33000 | 0.5333 | 3696.1389 | 30981.8828 | 15.6346 | 66.5211 | 75.164 | 9.396 | 1160.1277 | 45525.5703 |
+ | 36000 | 0.5818 | 3700.1489 | 30929.5703 | 15.6331 | 66.5006 | 75.187 | 9.398 | 1162.6237 | 45525.5703 |
+ | 39000 | 0.6303 | 3694.4192 | 30964.4258 | 15.6344 | 66.3802 | 75.324 | 9.415 | 1160.5111 | 45501.2617 |
+ | 42000 | 0.6788 | 3696.7100 | 30946.9785 | 15.6346 | 66.6702 | 74.996 | 9.375 | 1160.7031 | 45525.5703 |
+ | 45000 | 0.7273 | 3696.7100 | 30981.8828 | 15.6347 | 66.7768 | 74.876 | 9.36 | 1161.0868 | 45525.5703 |
+ | 48000 | 0.7758 | 3694.4192 | 30929.5703 | 15.6331 | 66.6573 | 75.011 | 9.376 | 1160.7031 | 45525.5703 |
+ | 51000 | 0.8242 | 3692.7039 | 30981.8828 | 15.6344 | 66.8297 | 74.817 | 9.352 | 1159.7439 | 45501.2617 |
+ | 54000 | 0.8727 | 3692.1333 | 30946.9785 | 15.6344 | 66.8788 | 74.762 | 9.345 | 1158.7859 | 45501.2617 |
+ | 57000 | 0.9212 | 3696.7100 | 30946.9785 | 15.6346 | 66.8322 | 74.814 | 9.352 | 1160.7031 | 45501.2617 |
+ | 60000 | 0.9697 | 3707.0330 | 30929.5703 | 15.6328 | 66.9377 | 74.696 | 9.337 | 1165.3177 | 45525.5703 |
+ | 61875 | 1.0 | 3702.4434 | 30929.5703 | 15.6331 | 66.5826 | 75.095 | 9.387 | 1163.3928 | 45501.2617 |
 
 ### Framework versions
 - Distily 0.2.0
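
(One side effect visible in the diff above: cutting train_batch_size from 16 to 8 roughly halved peak GPU memory, 16.2515 GB → 8.2666 GB. If the goal were only to reduce memory while keeping the effective batch at 16, gradient accumulation would be the usual lever; below is a generic transformers-style sketch with hypothetical settings — Distily's own entry point and configuration may differ.)

```python
from transformers import TrainingArguments

# Hypothetical settings: per-device batch 8 with 2 accumulation steps
# reproduces an effective batch of 16 at roughly half the activation memory.
args = TrainingArguments(
    output_dir="distily_TinyStories-33M",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=4e-4,
    num_train_epochs=1.0,
    seed=42,
)
```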
 
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.0004, warmup_ratio=0/events.out.tfevents.1723776757.b7d545513dcf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:097e6d52c1b4dd5d6eb0fc4c0858d22c2bdc5f3d78102c9def7dd7f526676cd7
+ size 312
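
(The added log file is stored through Git LFS, so the diff shows the three-line pointer stub rather than the TensorBoard event data itself: a spec version, the SHA-256 of the real object, and that object's size — 312 bytes here. A standard-library sketch of how such a pointer can be parsed and checked against a downloaded blob; the helper names are hypothetical.)

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a git-lfs v1 pointer file."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

def verify(pointer_path: str, blob_path: str) -> bool:
    """Check a blob against the oid and size recorded in the pointer."""
    meta = parse_lfs_pointer(Path(pointer_path).read_text())
    blob = Path(blob_path).read_bytes()
    return (
        meta["oid"] == "sha256:" + hashlib.sha256(blob).hexdigest()
        and int(meta["size"]) == len(blob)
    )
```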