lapp0 committed
Commit 5e729fe
1 Parent(s): 926ac2d

End of training

README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
-- eval_enwikippl: 1784.0
-- eval_frwikippl: 9792.0
-- eval_zhwikippl: 72192.0
-- eval_tinystoriesppl: 1448.0
-- eval_loss: 2.5122
-- eval_runtime: 17.041
-- eval_samples_per_second: 58.682
-- eval_steps_per_second: 7.335
+- eval_enwikippl: 84.5
+- eval_frwikippl: 356.0
+- eval_zhwikippl: 135.0
+- eval_tinystoriesppl: 72.0
+- eval_loss: 0.6795
+- eval_runtime: 16.7299
+- eval_samples_per_second: 59.773
+- eval_steps_per_second: 7.472
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -46,10 +46,10 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
+- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
-- learning_rate: 0.0004
-- train_batch_size: 8
+- learning_rate: 0.0001
+- train_batch_size: 4
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
@@ -58,26 +58,38 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
-Peak GPU Memory: 8.0892 GB
+Peak GPU Memory: 7.4226 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
-| 0 | 0 | 2336462209024.0 | 122045790683136.0 | 22.4230 | 17.051 | 58.648 | 7.331 | 4429185024.0 | 25975962206208.0 |
-| 1000 | 0.0808 | 588.0 | 3680.0 | 1.8545 | 17.0585 | 58.622 | 7.328 | 612.0 | 2880.0 |
-| 2000 | 0.1616 | 988.0 | 5600.0 | 2.1657 | 17.0179 | 58.762 | 7.345 | 816.0 | 3200.0 |
-| 3000 | 0.2424 | 1744.0 | 8640.0 | 2.5064 | 17.083 | 58.538 | 7.317 | 1544.0 | 46080.0 |
-| 4000 | 0.3232 | 1864.0 | 8896.0 | 2.5506 | 17.0876 | 58.522 | 7.315 | 1544.0 | 63744.0 |
-| 5000 | 0.4040 | 1728.0 | 8832.0 | 2.4970 | 17.0783 | 58.554 | 7.319 | 1520.0 | 51200.0 |
-| 6000 | 0.4848 | 1936.0 | 9216.0 | 2.5779 | 17.0361 | 58.699 | 7.337 | 1688.0 | 63744.0 |
-| 7000 | 0.5657 | 2224.0 | 9792.0 | 2.6441 | 17.0202 | 58.754 | 7.344 | 1832.0 | 82944.0 |
-| 8000 | 0.6465 | 1936.0 | 8832.0 | 2.5707 | 17.057 | 58.627 | 7.328 | 1808.0 | 115200.0 |
-| 9000 | 0.7273 | 1784.0 | 9792.0 | 2.5122 | 17.041 | 58.682 | 7.335 | 1448.0 | 72192.0 |
-| 10000 | 0.8081 | 2064.0 | 9664.0 | 2.5934 | 17.147 | 58.319 | 7.29 | 1552.0 | 91648.0 |
-| 11000 | 0.8889 | 2064.0 | 10240.0 | 2.6004 | 17.0431 | 58.675 | 7.334 | 1720.0 | 80896.0 |
-| 12000 | 0.9697 | 2064.0 | 11008.0 | 2.6036 | 17.1142 | 58.431 | 7.304 | 1800.0 | 68608.0 |
-| 12375 | 1.0 | 1992.0 | 9280.0 | 2.5963 | 17.0677 | 58.59 | 7.324 | 1752.0 | 70144.0 |
+| 0 | 0 | 3126736191488.0 | 129742372077568.0 | 20.7540 | 16.7407 | 59.735 | 7.467 | 6677331968.0 | 80264348827648.0 |
+| 1000 | 0.0404 | 320.0 | 1504.0 | 1.5025 | 16.7846 | 59.579 | 7.447 | 245.0 | 280.0 |
+| 2000 | 0.0808 | 220.0 | 800.0 | 1.3040 | 16.7756 | 59.61 | 7.451 | 189.0 | 201.0 |
+| 3000 | 0.1212 | 180.0 | 648.0 | 1.1450 | 16.7863 | 59.572 | 7.447 | 153.0 | 149.0 |
+| 4000 | 0.1616 | 148.0 | 552.0 | 1.0301 | 16.7242 | 59.794 | 7.474 | 121.5 | 153.0 |
+| 5000 | 0.2020 | 129.0 | 452.0 | 0.9348 | 16.7817 | 59.589 | 7.449 | 105.0 | 176.0 |
+| 6000 | 0.2424 | 115.5 | 442.0 | 0.8587 | 16.8358 | 59.397 | 7.425 | 86.0 | 139.0 |
+| 7000 | 0.2828 | 103.0 | 432.0 | 0.8002 | 16.7689 | 59.634 | 7.454 | 78.5 | 139.0 |
+| 8000 | 0.3232 | 96.5 | 418.0 | 0.7424 | 16.7778 | 59.602 | 7.45 | 73.5 | 126.0 |
+| 9000 | 0.3636 | 84.5 | 356.0 | 0.6795 | 16.7299 | 59.773 | 7.472 | 72.0 | 135.0 |
+| 10000 | 0.4040 | 81.5 | 304.0 | 0.6324 | 16.7186 | 59.813 | 7.477 | 66.0 | 125.5 |
+| 11000 | 0.4444 | 77.5 | 282.0 | 0.5972 | 16.777 | 59.605 | 7.451 | 59.25 | 121.5 |
+| 12000 | 0.4848 | 72.5 | 288.0 | 0.5723 | 16.7347 | 59.756 | 7.47 | 56.75 | 118.0 |
+| 13000 | 0.5253 | 69.5 | 256.0 | 0.5577 | 16.7525 | 59.693 | 7.462 | 55.5 | 141.0 |
+| 14000 | 0.5657 | 68.5 | 237.0 | 0.5389 | 16.7317 | 59.767 | 7.471 | 54.75 | 286.0 |
+| 15000 | 0.6061 | 67.5 | 252.0 | 0.5187 | 16.7326 | 59.764 | 7.47 | 52.25 | 98.5 |
+| 16000 | 0.6465 | 69.0 | 235.0 | 0.5174 | 16.8095 | 59.49 | 7.436 | 54.75 | 125.5 |
+| 17000 | 0.6869 | 67.0 | 231.0 | 0.5048 | 16.7326 | 59.764 | 7.47 | 50.5 | 116.0 |
+| 18000 | 0.7273 | 66.0 | 225.0 | 0.4909 | 16.7575 | 59.675 | 7.459 | 49.75 | 132.0 |
+| 19000 | 0.7677 | 66.5 | 247.0 | 0.4894 | 16.8313 | 59.413 | 7.427 | 49.75 | 112.0 |
+| 20000 | 0.8081 | 66.5 | 233.0 | 0.4870 | 16.7365 | 59.75 | 7.469 | 51.5 | 103.5 |
+| 21000 | 0.8485 | 65.0 | 221.0 | 0.4831 | 16.703 | 59.869 | 7.484 | 50.75 | 181.0 |
+| 22000 | 0.8889 | 65.5 | 199.0 | 0.4740 | 16.7629 | 59.656 | 7.457 | 49.5 | 95.5 |
+| 23000 | 0.9293 | 67.0 | 223.0 | 0.4752 | 16.7201 | 59.808 | 7.476 | 46.5 | 174.0 |
+| 24000 | 0.9697 | 65.0 | 207.0 | 0.4700 | 16.8026 | 59.515 | 7.439 | 46.75 | 98.5 |
+| 24750 | 1.0 | 67.0 | 207.0 | 0.4672 | 16.7876 | 59.568 | 7.446 | 47.0 | 185.0 |
 
 ### Framework versions
 - Distily 0.2.0
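
The `distillation_objective` change above turns off the hidden-state MSE component (weight 1.0 → 0) and leaves only the KL-divergence loss on the teacher and student logits. As a minimal sketch of what a logits-only KL objective looks like in plain PyTorch (illustrative only, not Distily's actual implementation; the `temperature` parameter is an assumption and does not appear in the config string):

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per token."""
    vocab = student_logits.size(-1)
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean"
    # averages over every token position rather than just the batch dim.
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # log_target=True: both arguments are log-probabilities.
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature**2
```

With `weight=1` on this component and `weight=0` on the hidden-state and attention components, the total training loss reduces to this single term.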
logs/hs_layer_mapper=last, hs_loss_fn=mse, hs_weight=1.0/completed.flag ADDED
File without changes
logs/hs_layer_mapper=last, hs_loss_fn=mse, hs_weight=1.0/events.out.tfevents.1724120424.02dbb11e2dcc CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:17fa436ebc1eb4365904057f5c458c8b6f4cf88a344275c2f2d39b2e7adbee57
-size 307
+oid sha256:f0548ed2cc98e960d04aaad36c1656a5ed5efec67db1194915cd0dec5f97a398
+size 578
logs/learning_rate=0.0001, per_device_train_batch_size=4/events.out.tfevents.1724120712.02dbb11e2dcc ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5e9c8969c4f9fe244feb19558b61c74aba8824ef7e3537cd0de7874987e7c981
+size 11775797
logs/learning_rate=0.0001, per_device_train_batch_size=4/events.out.tfevents.1724126717.02dbb11e2dcc ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4a5bc57413201154a418160e43c6fab883cbe496ab8ceedae8968250c513ada
+size 312
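
The files under `logs/` are TensorBoard event files, stored as Git LFS pointers. After cloning the repo with LFS, the logged scalars can be inspected with something like the following (a sketch assuming the `tensorboard` package is installed; the run directory name is taken from this commit):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at one of the run directories added in this commit.
acc = EventAccumulator("logs/learning_rate=0.0001, per_device_train_batch_size=4")
acc.Reload()  # parse the events.out.tfevents.* files in that directory

# Print the final logged value for each scalar series.
for tag in acc.Tags()["scalars"]:
    last = acc.Scalars(tag)[-1]
    print(f"{tag}: step={last.step} value={last.value:.4f}")
```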
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:37c3b92069a0289ce0fe5fa4499a9cebb361e815f330c008341ac16f01d9011b
+oid sha256:8d5d3a044ceaf584c050ee3384ec8ee57d7df959abf234e7bbf8033c970b2dc6
 size 248894656
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9531816dc55f316fb63cc72d98043291696e2ead9cdabe9d3689be3f93afea91
+oid sha256:16aa1decdeea3f49b2565ec28903e638e5000d81c2746c93df0e3698f552931e
 size 1017899144
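
`model.safetensors` and `training_args.bin` are likewise LFS pointers; the actual weights are fetched at download time. A hedged sketch of loading this exact revision of the student with 🤗 Transformers (the repo id below is a placeholder, not given on this page; substitute the real model repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/<this-model-repo>"  # placeholder; use the actual repo id
# revision pins the download to the commit shown above.
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="5e729fe")
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="5e729fe")
```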