lapp0 committed
Commit 5e729fe
1 Parent(s): 926ac2d

End of training

README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 1784.0
-- eval_frwikippl: 9792.0
-- eval_zhwikippl: 72192.0
-- eval_tinystoriesppl: 1448.0
-- eval_loss: 2.5122
-- eval_runtime: 17.041
-- eval_samples_per_second: 58.682
-- eval_steps_per_second: 7.335
+- eval_enwikippl: 84.5
+- eval_frwikippl: 356.0
+- eval_zhwikippl: 135.0
+- eval_tinystoriesppl: 72.0
+- eval_loss: 0.6795
+- eval_runtime: 16.7299
+- eval_samples_per_second: 59.773
+- eval_steps_per_second: 7.472
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -46,10 +46,10 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
+- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.0004
-- train_batch_size: 8
+- learning_rate: 0.0001
+- train_batch_size: 4
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
@@ -58,26 +58,38 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
-Peak GPU Memory: 8.0892 GB
+Peak GPU Memory: 7.4226 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
-| 0 | 0 | 2336462209024.0 | 122045790683136.0 | 22.4230 | 17.051 | 58.648 | 7.331 | 4429185024.0 | 25975962206208.0 |
-| 1000 | 0.0808 | 588.0 | 3680.0 | 1.8545 | 17.0585 | 58.622 | 7.328 | 612.0 | 2880.0 |
-| 2000 | 0.1616 | 988.0 | 5600.0 | 2.1657 | 17.0179 | 58.762 | 7.345 | 816.0 | 3200.0 |
-| 3000 | 0.2424 | 1744.0 | 8640.0 | 2.5064 | 17.083 | 58.538 | 7.317 | 1544.0 | 46080.0 |
-| 4000 | 0.3232 | 1864.0 | 8896.0 | 2.5506 | 17.0876 | 58.522 | 7.315 | 1544.0 | 63744.0 |
-| 5000 | 0.4040 | 1728.0 | 8832.0 | 2.4970 | 17.0783 | 58.554 | 7.319 | 1520.0 | 51200.0 |
-| 6000 | 0.4848 | 1936.0 | 9216.0 | 2.5779 | 17.0361 | 58.699 | 7.337 | 1688.0 | 63744.0 |
-| 7000 | 0.5657 | 2224.0 | 9792.0 | 2.6441 | 17.0202 | 58.754 | 7.344 | 1832.0 | 82944.0 |
-| 8000 | 0.6465 | 1936.0 | 8832.0 | 2.5707 | 17.057 | 58.627 | 7.328 | 1808.0 | 115200.0 |
-| 9000 | 0.7273 | 1784.0 | 9792.0 | 2.5122 | 17.041 | 58.682 | 7.335 | 1448.0 | 72192.0 |
-| 10000 | 0.8081 | 2064.0 | 9664.0 | 2.5934 | 17.147 | 58.319 | 7.29 | 1552.0 | 91648.0 |
-| 11000 | 0.8889 | 2064.0 | 10240.0 | 2.6004 | 17.0431 | 58.675 | 7.334 | 1720.0 | 80896.0 |
-| 12000 | 0.9697 | 2064.0 | 11008.0 | 2.6036 | 17.1142 | 58.431 | 7.304 | 1800.0 | 68608.0 |
-| 12375 | 1.0 | 1992.0 | 9280.0 | 2.5963 | 17.0677 | 58.59 | 7.324 | 1752.0 | 70144.0 |
+| 0 | 0 | 3126736191488.0 | 129742372077568.0 | 20.7540 | 16.7407 | 59.735 | 7.467 | 6677331968.0 | 80264348827648.0 |
+| 1000 | 0.0404 | 320.0 | 1504.0 | 1.5025 | 16.7846 | 59.579 | 7.447 | 245.0 | 280.0 |
+| 2000 | 0.0808 | 220.0 | 800.0 | 1.3040 | 16.7756 | 59.61 | 7.451 | 189.0 | 201.0 |
+| 3000 | 0.1212 | 180.0 | 648.0 | 1.1450 | 16.7863 | 59.572 | 7.447 | 153.0 | 149.0 |
+| 4000 | 0.1616 | 148.0 | 552.0 | 1.0301 | 16.7242 | 59.794 | 7.474 | 121.5 | 153.0 |
+| 5000 | 0.2020 | 129.0 | 452.0 | 0.9348 | 16.7817 | 59.589 | 7.449 | 105.0 | 176.0 |
+| 6000 | 0.2424 | 115.5 | 442.0 | 0.8587 | 16.8358 | 59.397 | 7.425 | 86.0 | 139.0 |
+| 7000 | 0.2828 | 103.0 | 432.0 | 0.8002 | 16.7689 | 59.634 | 7.454 | 78.5 | 139.0 |
+| 8000 | 0.3232 | 96.5 | 418.0 | 0.7424 | 16.7778 | 59.602 | 7.45 | 73.5 | 126.0 |
+| 9000 | 0.3636 | 84.5 | 356.0 | 0.6795 | 16.7299 | 59.773 | 7.472 | 72.0 | 135.0 |
+| 10000 | 0.4040 | 81.5 | 304.0 | 0.6324 | 16.7186 | 59.813 | 7.477 | 66.0 | 125.5 |
+| 11000 | 0.4444 | 77.5 | 282.0 | 0.5972 | 16.777 | 59.605 | 7.451 | 59.25 | 121.5 |
+| 12000 | 0.4848 | 72.5 | 288.0 | 0.5723 | 16.7347 | 59.756 | 7.47 | 56.75 | 118.0 |
+| 13000 | 0.5253 | 69.5 | 256.0 | 0.5577 | 16.7525 | 59.693 | 7.462 | 55.5 | 141.0 |
+| 14000 | 0.5657 | 68.5 | 237.0 | 0.5389 | 16.7317 | 59.767 | 7.471 | 54.75 | 286.0 |
+| 15000 | 0.6061 | 67.5 | 252.0 | 0.5187 | 16.7326 | 59.764 | 7.47 | 52.25 | 98.5 |
+| 16000 | 0.6465 | 69.0 | 235.0 | 0.5174 | 16.8095 | 59.49 | 7.436 | 54.75 | 125.5 |
+| 17000 | 0.6869 | 67.0 | 231.0 | 0.5048 | 16.7326 | 59.764 | 7.47 | 50.5 | 116.0 |
+| 18000 | 0.7273 | 66.0 | 225.0 | 0.4909 | 16.7575 | 59.675 | 7.459 | 49.75 | 132.0 |
+| 19000 | 0.7677 | 66.5 | 247.0 | 0.4894 | 16.8313 | 59.413 | 7.427 | 49.75 | 112.0 |
+| 20000 | 0.8081 | 66.5 | 233.0 | 0.4870 | 16.7365 | 59.75 | 7.469 | 51.5 | 103.5 |
+| 21000 | 0.8485 | 65.0 | 221.0 | 0.4831 | 16.703 | 59.869 | 7.484 | 50.75 | 181.0 |
+| 22000 | 0.8889 | 65.5 | 199.0 | 0.4740 | 16.7629 | 59.656 | 7.457 | 49.5 | 95.5 |
+| 23000 | 0.9293 | 67.0 | 223.0 | 0.4752 | 16.7201 | 59.808 | 7.476 | 46.5 | 174.0 |
+| 24000 | 0.9697 | 65.0 | 207.0 | 0.4700 | 16.8026 | 59.515 | 7.439 | 46.75 | 98.5 |
+| 24750 | 1.0 | 67.0 | 207.0 | 0.4672 | 16.7876 | 59.568 | 7.446 | 47.0 | 185.0 |
 
 ### Framework versions
 - Distily 0.2.0
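
The `distillation_objective` change above turns off the hidden-state MSE component (weight 1.0 → 0) and leaves only the KL-divergence loss on the teacher and student logits. As a minimal sketch of what a logits-only KL objective looks like in plain PyTorch (illustrative only, not Distily's actual implementation; the `temperature` parameter is an assumption and does not appear in the config string):

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per token."""
    vocab = student_logits.size(-1)
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean"
    # averages over every token position rather than just the batch dim.
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # log_target=True: both arguments are log-probabilities.
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature**2
```

With `weight=1` on this component and `weight=0` on the hidden-state and attention components, the total training loss reduces to this single term.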
logs/hs_layer_mapper=last, hs_loss_fn=mse, hs_weight=1.0/completed.flag ADDED
File without changes
logs/hs_layer_mapper=last, hs_loss_fn=mse, hs_weight=1.0/events.out.tfevents.1724120424.02dbb11e2dcc CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:17fa436ebc1eb4365904057f5c458c8b6f4cf88a344275c2f2d39b2e7adbee57
-size 307
+oid sha256:f0548ed2cc98e960d04aaad36c1656a5ed5efec67db1194915cd0dec5f97a398
+size 578
logs/learning_rate=0.0001, per_device_train_batch_size=4/events.out.tfevents.1724120712.02dbb11e2dcc ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5e9c8969c4f9fe244feb19558b61c74aba8824ef7e3537cd0de7874987e7c981
+size 11775797
logs/learning_rate=0.0001, per_device_train_batch_size=4/events.out.tfevents.1724126717.02dbb11e2dcc ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4a5bc57413201154a418160e43c6fab883cbe496ab8ceedae8968250c513ada
+size 312
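
The files under `logs/` are TensorBoard event files, stored as Git LFS pointers. After cloning the repo with LFS, the logged scalars can be inspected with something like the following (a sketch assuming the `tensorboard` package is installed; the run directory name is taken from this commit):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at one of the run directories added in this commit.
acc = EventAccumulator("logs/learning_rate=0.0001, per_device_train_batch_size=4")
acc.Reload()  # parse the events.out.tfevents.* files in that directory

# Print the final logged value for each scalar series.
for tag in acc.Tags()["scalars"]:
    last = acc.Scalars(tag)[-1]
    print(f"{tag}: step={last.step} value={last.value:.4f}")
```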
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:37c3b92069a0289ce0fe5fa4499a9cebb361e815f330c008341ac16f01d9011b
+oid sha256:8d5d3a044ceaf584c050ee3384ec8ee57d7df959abf234e7bbf8033c970b2dc6
 size 248894656
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9531816dc55f316fb63cc72d98043291696e2ead9cdabe9d3689be3f93afea91
+oid sha256:16aa1decdeea3f49b2565ec28903e638e5000d81c2746c93df0e3698f552931e
 size 1017899144
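
`model.safetensors` and `training_args.bin` are likewise LFS pointers; the actual weights are fetched at download time. A hedged sketch of loading this exact revision of the student with 🤗 Transformers (the repo id below is a placeholder, not given on this page; substitute the real model repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/<this-model-repo>"  # placeholder; use the actual repo id
# revision pins the download to the commit shown above.
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="5e729fe")
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="5e729fe")
```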