lapp0 committed
Commit 8d47ec7
1 Parent(s): 3546915

Training in progress, step 61875
README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 3694.4192
- - eval_frwikippl: 30929.5703
- - eval_zhwikippl: 45501.2617
- - eval_tinystoriesppl: 1160.7031
- - eval_loss: 15.6337
- - eval_runtime: 66.5963
- - eval_samples_per_second: 75.079
- - eval_steps_per_second: 9.385
+ - eval_enwikippl: 22177.1309
+ - eval_frwikippl: 73852.3203
+ - eval_zhwikippl: 62537.2148
+ - eval_tinystoriesppl: 11654.0615
+ - eval_loss: 6.9194
+ - eval_runtime: 32.7141
+ - eval_samples_per_second: 76.42
+ - eval_steps_per_second: 9.568
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -45,10 +45,10 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=10.0, loss_fn=mse, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 0.0004
- - train_batch_size: 8
+ - learning_rate: 4e-05
+ - train_batch_size: 16
  - eval_batch_size: 8
  - seed: 42
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
@@ -56,34 +56,21 @@ The following hyperparameters were used during training:
  - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 8.2666 GB
+ Peak GPU Memory: 16.2515 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 19278.7617 | 60268.5703 | 17.3716 | 66.6062 | 75.068 | 9.384 | 9660.0908 | 53858.2383 |
- | 3000 | 0.0485 | 3702.4434 | 30929.5703 | 15.6332 | 66.4884 | 75.201 | 9.4 | 1163.3928 | 45525.5703 |
- | 6000 | 0.0970 | 3702.4434 | 30929.5703 | 15.6346 | 67.0021 | 74.625 | 9.328 | 1163.0084 | 45525.5703 |
- | 9000 | 0.1455 | 3694.4192 | 30929.5703 | 15.6337 | 66.5963 | 75.079 | 9.385 | 1160.7031 | 45501.2617 |
- | 12000 | 0.1939 | 3696.7100 | 30981.8828 | 15.6348 | 66.5218 | 75.163 | 9.395 | 1161.0868 | 45525.5703 |
- | 15000 | 0.2424 | 3697.8560 | 30929.5703 | 15.6342 | 66.6948 | 74.968 | 9.371 | 1161.8550 | 45525.5703 |
- | 18000 | 0.2909 | 3696.7100 | 30981.8828 | 15.6348 | 66.2258 | 75.499 | 9.437 | 1161.2789 | 45525.5703 |
- | 21000 | 0.3394 | 3697.8560 | 30946.9785 | 15.6344 | 66.5835 | 75.094 | 9.387 | 1161.4711 | 45525.5703 |
- | 24000 | 0.3879 | 3697.8560 | 30929.5703 | 15.6334 | 66.8279 | 74.819 | 9.352 | 1162.0472 | 45525.5703 |
- | 27000 | 0.4364 | 3697.8560 | 30981.8828 | 15.6346 | 66.5691 | 75.11 | 9.389 | 1161.6627 | 45525.5703 |
- | 30000 | 0.4848 | 3696.7100 | 30946.9785 | 15.6346 | 66.7012 | 74.961 | 9.37 | 1160.7031 | 45525.5703 |
- | 33000 | 0.5333 | 3696.1389 | 30981.8828 | 15.6346 | 66.5211 | 75.164 | 9.396 | 1160.1277 | 45525.5703 |
- | 36000 | 0.5818 | 3700.1489 | 30929.5703 | 15.6331 | 66.5006 | 75.187 | 9.398 | 1162.6237 | 45525.5703 |
- | 39000 | 0.6303 | 3694.4192 | 30964.4258 | 15.6344 | 66.3802 | 75.324 | 9.415 | 1160.5111 | 45501.2617 |
- | 42000 | 0.6788 | 3696.7100 | 30946.9785 | 15.6346 | 66.6702 | 74.996 | 9.375 | 1160.7031 | 45525.5703 |
- | 45000 | 0.7273 | 3696.7100 | 30981.8828 | 15.6347 | 66.7768 | 74.876 | 9.36 | 1161.0868 | 45525.5703 |
- | 48000 | 0.7758 | 3694.4192 | 30929.5703 | 15.6331 | 66.6573 | 75.011 | 9.376 | 1160.7031 | 45525.5703 |
- | 51000 | 0.8242 | 3692.7039 | 30981.8828 | 15.6344 | 66.8297 | 74.817 | 9.352 | 1159.7439 | 45501.2617 |
- | 54000 | 0.8727 | 3692.1333 | 30946.9785 | 15.6344 | 66.8788 | 74.762 | 9.345 | 1158.7859 | 45501.2617 |
- | 57000 | 0.9212 | 3696.7100 | 30946.9785 | 15.6346 | 66.8322 | 74.814 | 9.352 | 1160.7031 | 45501.2617 |
- | 60000 | 0.9697 | 3707.0330 | 30929.5703 | 15.6328 | 66.9377 | 74.696 | 9.337 | 1165.3177 | 45525.5703 |
- | 61875 | 1.0 | 3702.4434 | 30929.5703 | 15.6331 | 66.5826 | 75.095 | 9.387 | 1163.3928 | 45501.2617 |
+ | 0 | 0 | 27548.9473 | 81896.5703 | 7.1177 | 32.6791 | 76.501 | 9.578 | 15431.1592 | 64176.7344 |
+ | 2000 | 0.1293 | 22177.1309 | 73852.3203 | 6.9194 | 32.645 | 76.582 | 9.588 | 11654.0615 | 62537.2148 |
+ | 4000 | 0.2586 | 22177.1309 | 73852.3203 | 6.9194 | 32.649 | 76.572 | 9.587 | 11654.0615 | 62537.2148 |
+ | 6000 | 0.3879 | 22177.1309 | 73852.3203 | 6.9194 | 32.8099 | 76.196 | 9.54 | 11654.0615 | 62537.2148 |
+ | 8000 | 0.5172 | 22177.1309 | 73852.3203 | 6.9194 | 32.7141 | 76.42 | 9.568 | 11654.0615 | 62537.2148 |
+ | 10000 | 0.6465 | 22177.1309 | 73852.3203 | 6.9194 | 32.6497 | 76.57 | 9.587 | 11654.0615 | 62537.2148 |
+ | 12000 | 0.7757 | 22177.1309 | 73852.3203 | 6.9194 | 32.6408 | 76.591 | 9.589 | 11654.0615 | 62537.2148 |
+ | 14000 | 0.9050 | 22177.1309 | 73852.3203 | 6.9194 | 32.6631 | 76.539 | 9.583 | 11654.0615 | 62537.2148 |
+ | 15469 | 1.0 | 22177.1309 | 73852.3203 | 6.9194 | 32.6508 | 76.568 | 9.586 | 11654.0615 | 62537.2148 |
 
 ### Framework versions
 - Distily 0.2.0
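
The `distillation_objective` entry in the hyperparameter diff above encodes a three-part loss: KL divergence on logits (weight 1), MSE on hidden states (weight 10.0), and MSE on attentions (weight 10.0); the diff swaps which of the latter two uses `raw_mse`. Distily's actual `kl`, `mse`, and `raw_mse` implementations are not reproduced in this card, so the following is only a minimal sketch of such a composite objective, assuming Hugging Face-style model outputs produced with `output_hidden_states=True` and `output_attentions=True`; the function name and the use of plain `F.mse_loss` for both MSE variants are illustrative assumptions, not Distily's API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out,
                      logits_weight=1.0, hs_weight=10.0, attn_weight=10.0):
    """Illustrative composite objective: weighted KL on logits plus
    MSE terms on hidden states and attention maps."""
    # KL divergence between student and teacher next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.softmax(teacher_out.logits, dim=-1),
        reduction="batchmean",
    )
    # Per-layer MSE on hidden states; layer_mapper=None in the card suggests
    # a one-to-one pairing of student and teacher layers.
    hs_mse = torch.stack([
        F.mse_loss(s, t.detach())
        for s, t in zip(student_out.hidden_states, teacher_out.hidden_states)
    ]).mean()
    # Per-layer MSE on attention maps.
    attn_mse = torch.stack([
        F.mse_loss(s, t.detach())
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return logits_weight * kl + hs_weight * hs_mse + attn_weight * attn_mse
```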
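
The `*ppl` columns in the metrics above are perplexities, presumably over English, French, and Chinese Wikipedia plus TinyStories evaluation sets; perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch of how such a metric is typically computed, assuming Hugging Face-style eval batches where the model's `loss` output is the mean cross-entropy over unmasked label tokens (this is not Distily's evaluation code):

```python
import math
import torch

@torch.no_grad()
def eval_perplexity(model, batches):
    """Exponentiated mean per-token cross-entropy over an eval set."""
    total_nll, total_tokens = 0.0, 0
    for batch in batches:
        out = model(**batch)  # causal LMs return mean cross-entropy as out.loss
        # Approximate token count (ignores the one-position label shift).
        n = (batch["labels"] != -100).sum().item()
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```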
logs/attn_loss_fn=mse, attn_weight=10.0, hidden_weight=10.0, hs_loss_fn=raw_mse, learning_rate=0.0001, warmup_ratio=0.1/events.out.tfevents.1723766650.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f2dbc4a399e0f4219822c3bbcd57f2589851012339194f88c722fb44cfa57431
+ size 6171
logs/attn_loss_fn=mse, attn_weight=10.0, hidden_weight=10.0, hs_loss_fn=raw_mse, learning_rate=0.0001, warmup_ratio=0/events.out.tfevents.1723766444.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0aaaa8c7a7099fcfe4908336b715c23bff916c0e09271348a1e0e1eef510b605
+ size 6167
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.0001, warmup_ratio=0/events.out.tfevents.1723766882.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:33bea79c8ea59d0503ab05c1894665f73409e819abfdb91aba1438fd92208488
+ size 12618293
logs/attn_loss_fn=mse, attn_weight=10.0, hs_loss_fn=raw_mse, hs_weight=10.0, learning_rate=0.001, warmup_ratio=0/events.out.tfevents.1723774186.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:13c80ee2512c681d998e5d2644b7aa3def790eec14b37709a43a8f9af6ba9b0c
+ size 16923352
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=cos, hs_weight=10.0, learning_rate=4e-05/events.out.tfevents.1723766213.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f78a48faaed4ffc0fdc873e040be8a57c8246d408cf8e51c4cdfb494fcd68f5
+ size 6148
logs/attn_loss_fn=raw_mse, attn_weight=10.0, hs_loss_fn=mse, hs_weight=10.0, learning_rate=4e-05/events.out.tfevents.1723766051.5f530b1cf724 CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:90e143c8f8a2001681b81ace6d7867a2978c824702d66a7c9fbaef8f0c6ef47b
- size 307
+ oid sha256:3059893a9acea1ac2f590bead7c9a0f3117e43ed5a0d1f4a41b5ab589b2f1c5c
+ size 578
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:9218a59ae4f858d40b4417133a24c66b9633847dc9c92574f7cbd3f72b848f35
+ oid sha256:e536919613b3b4f9e76519f3ca3e4effec3088c0d4b2ca7613f361ffc185cf9b
 size 137033984
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:c94dfda9104b5cee2c69f29fd41ba84612c310dd645555e3e221c4ae07c87b0c
+ oid sha256:5e9fc10eac5690311752511e3268f064327bfb0dcabcd1d7e7eb6dbe9be1169a
 size 1017948104
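
The ADDED and CHANGED file entries above are Git LFS pointer files rather than the binaries themselves: three `key value` lines giving the pointer-spec version, the SHA-256 of the stored object, and its size in bytes. A minimal sketch of reading one; the helper name is illustrative, not part of git or Distily:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields
    (version, oid, size)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])  # byte size of the real object
    return fields

# Example: the model.safetensors pointer above parses to
# {'version': 'https://git-lfs.github.com/spec/v1',
#  'oid': 'sha256:e5369196...', 'size': 137033984}
```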