lapp0 committed
Commit f71c803
1 Parent(s): 1038895

End of training

README.md CHANGED
@@ -16,14 +16,14 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 5056.0
- - eval_frwikippl: 3696.0
- - eval_zhwikippl: 29312.0
- - eval_tinystoriesppl: 4672.0
- - eval_loss: 1.3042
- - eval_runtime: 16.773
- - eval_samples_per_second: 59.62
- - eval_steps_per_second: 7.452
+ - eval_enwikippl: 560.0
+ - eval_frwikippl: 644.0
+ - eval_zhwikippl: 488.0
+ - eval_tinystoriesppl: 284.0
+ - eval_loss: 0.6086
+ - eval_runtime: 16.7587
+ - eval_samples_per_second: 59.67
+ - eval_steps_per_second: 7.459
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -48,8 +48,8 @@ More information needed
 The following hyperparameters were used during training:
 - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
- - learning_rate: 0.0004
- - train_batch_size: 8
+ - learning_rate: 0.0001
+ - train_batch_size: 4
 - eval_batch_size: 8
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
@@ -58,26 +58,38 @@ The following hyperparameters were used during training:
 - num_epochs: 1.0
 
 ### Resource Usage
- Peak GPU Memory: 7.9368 GB
+ Peak GPU Memory: 7.4226 GB
 
 ### Eval-Phase Metrics
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
- | 0 | 0 | 2336462209024.0 | 122045790683136.0 | 24.1200 | 16.7622 | 59.658 | 7.457 | 4429185024.0 | 25975962206208.0 |
- | 1000 | 0.0808 | 3312.0 | 3184.0 | 1.1229 | 16.7657 | 59.646 | 7.456 | 2720.0 | 2992.0 |
- | 2000 | 0.1616 | 5248.0 | 4048.0 | 1.2528 | 16.7825 | 59.586 | 7.448 | 5408.0 | 9984.0 |
- | 3000 | 0.2424 | 5600.0 | 3744.0 | 1.2812 | 16.7695 | 59.632 | 7.454 | 5312.0 | 23680.0 |
- | 4000 | 0.3232 | 5408.0 | 3920.0 | 1.2832 | 16.8694 | 59.279 | 7.41 | 5440.0 | 33280.0 |
- | 5000 | 0.4040 | 5376.0 | 3952.0 | 1.2841 | 16.7361 | 59.751 | 7.469 | 5408.0 | 27008.0 |
- | 6000 | 0.4848 | 5344.0 | 3680.0 | 1.2770 | 16.7635 | 59.653 | 7.457 | 5440.0 | 29312.0 |
- | 7000 | 0.5657 | 5024.0 | 3760.0 | 1.2800 | 16.7492 | 59.704 | 7.463 | 5184.0 | 39936.0 |
- | 8000 | 0.6465 | 4992.0 | 3712.0 | 1.2922 | 16.7445 | 59.721 | 7.465 | 5088.0 | 26752.0 |
- | 9000 | 0.7273 | 5056.0 | 3696.0 | 1.3042 | 16.773 | 59.62 | 7.452 | 4672.0 | 29312.0 |
- | 10000 | 0.8081 | 5824.0 | 3648.0 | 1.3192 | 16.7669 | 59.641 | 7.455 | 5312.0 | 24448.0 |
- | 11000 | 0.8889 | 5568.0 | 3872.0 | 1.3215 | 16.8215 | 59.448 | 7.431 | 5504.0 | 40704.0 |
- | 12000 | 0.9697 | 5440.0 | 3792.0 | 1.3263 | 16.7825 | 59.586 | 7.448 | 5120.0 | 72704.0 |
- | 12375 | 1.0 | 5696.0 | 3936.0 | 1.3389 | 16.7852 | 59.576 | 7.447 | 5696.0 | 40192.0 |
+ | 0 | 0 | 1408749273088.0 | 96207267430400.0 | 20.4380 | 16.6447 | 60.079 | 7.51 | 7482638336.0 | 43430709297152.0 |
+ | 1000 | 0.0404 | 1408.0 | 1432.0 | 0.8546 | 16.7128 | 59.835 | 7.479 | 788.0 | 1056.0 |
+ | 2000 | 0.0808 | 988.0 | 928.0 | 0.7631 | 16.6827 | 59.942 | 7.493 | 520.0 | 302.0 |
+ | 3000 | 0.1212 | 836.0 | 760.0 | 0.7155 | 16.633 | 60.121 | 7.515 | 402.0 | 196.0 |
+ | 4000 | 0.1616 | 732.0 | 676.0 | 0.6800 | 16.6836 | 59.939 | 7.492 | 378.0 | 157.0 |
+ | 5000 | 0.2020 | 676.0 | 668.0 | 0.6574 | 16.6514 | 60.055 | 7.507 | 322.0 | 227.0 |
+ | 6000 | 0.2424 | 648.0 | 732.0 | 0.6383 | 16.6833 | 59.94 | 7.493 | 286.0 | 190.0 |
+ | 7000 | 0.2828 | 612.0 | 632.0 | 0.6373 | 16.8106 | 59.486 | 7.436 | 286.0 | 169.0 |
+ | 8000 | 0.3232 | 588.0 | 704.0 | 0.6243 | 16.6588 | 60.028 | 7.504 | 266.0 | 596.0 |
+ | 9000 | 0.3636 | 560.0 | 644.0 | 0.6086 | 16.7587 | 59.67 | 7.459 | 284.0 | 488.0 |
+ | 10000 | 0.4040 | 532.0 | 564.0 | 0.5994 | 16.6696 | 59.989 | 7.499 | 256.0 | 142.0 |
+ | 11000 | 0.4444 | 544.0 | 628.0 | 0.5916 | 16.7004 | 59.879 | 7.485 | 252.0 | 153.0 |
+ | 12000 | 0.4848 | 540.0 | 612.0 | 0.5828 | 16.7602 | 59.665 | 7.458 | 252.0 | 568.0 |
+ | 13000 | 0.5253 | 528.0 | 612.0 | 0.5735 | 16.6596 | 60.025 | 7.503 | 260.0 | 160.0 |
+ | 14000 | 0.5657 | 528.0 | 576.0 | 0.5628 | 16.7207 | 59.806 | 7.476 | 246.0 | 250.0 |
+ | 15000 | 0.6061 | 478.0 | 524.0 | 0.5511 | 16.736 | 59.752 | 7.469 | 232.0 | 170.0 |
+ | 16000 | 0.6465 | 442.0 | 552.0 | 0.5270 | 16.7225 | 59.8 | 7.475 | 228.0 | 214.0 |
+ | 17000 | 0.6869 | 420.0 | 524.0 | 0.4692 | 16.6506 | 60.058 | 7.507 | 212.0 | 174.0 |
+ | 18000 | 0.7273 | 384.0 | 478.0 | 0.4115 | 16.7225 | 59.8 | 7.475 | 208.0 | 144.0 |
+ | 19000 | 0.7677 | 362.0 | 400.0 | 0.3610 | 16.6691 | 59.991 | 7.499 | 195.0 | 128.0 |
+ | 20000 | 0.8081 | 344.0 | 346.0 | 0.3370 | 16.6695 | 59.99 | 7.499 | 184.0 | 107.5 |
+ | 21000 | 0.8485 | 306.0 | 302.0 | 0.3061 | 16.7054 | 59.861 | 7.483 | 161.0 | 110.5 |
+ | 22000 | 0.8889 | 300.0 | 318.0 | 0.2974 | 16.6709 | 59.985 | 7.498 | 160.0 | 84.0 |
+ | 23000 | 0.9293 | 290.0 | 298.0 | 0.2890 | 16.7049 | 59.863 | 7.483 | 162.0 | 103.0 |
+ | 24000 | 0.9697 | 300.0 | 290.0 | 0.2970 | 16.6771 | 59.963 | 7.495 | 164.0 | 85.5 |
+ | 24750 | 1.0 | 280.0 | 290.0 | 0.2782 | 16.74 | 59.737 | 7.467 | 162.0 | 91.5 |
 
 ### Framework versions
  - Distily 0.2.0
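
The `distillation_objective` recorded above applies only a KL-divergence loss between the student's and teacher's logits (weight 1); the hidden-state (`hs`) and attention (`attn`) components are disabled (weight 0). The sketch below illustrates what such a logits-only KL objective computes. It is not Distily's actual implementation: the function names, the temperature argument, and the KL direction are assumptions made for illustration.

```python
# Illustrative sketch of a logits-only KL distillation objective
# (hs and attn loss components disabled, as in the config above).
# Not Distily's actual code; names and temperature handling are assumptions.
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    # Soften both distributions and compute KL(teacher || student),
    # normalised by batch size ("batchmean").
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, log_p_teacher,
                    log_target=True, reduction="batchmean") * temperature ** 2

def distillation_step(student, teacher, batch):
    # The teacher runs without gradients; the student is updated to match its logits.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    return kl_logits_loss(student_logits, teacher_logits)  # weight = 1 in this run
```

The `eval_*ppl` columns in the table are perplexities on the respective held-out corpora (enwiki, frwiki, zhwiki, TinyStories), so lower is better; the **teacher eval** row shows the teacher's perplexity on the same sets for comparison.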
logs/dataset_subset=default, dataset_uri=distily_c4_multilingual_1M, learning_rate=0.0001, per_device_train_batch_size=4/events.out.tfevents.1724143970.02dbb11e2dcc ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c5f2d8b25dfeef10e710f50dd3b67c11b795b907fed8a1442f9bc2d6103ea7c
+ size 312
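
The TensorBoard log file added in this commit is stored with Git LFS, so the repository itself holds only the three-line pointer shown above (spec version, SHA-256 object id, and size in bytes); the event data lives in LFS storage. The sketch below shows how such a pointer can be reproduced from a local file; the `lfs_pointer` helper is purely illustrative (git-lfs writes these pointers itself on `git add`).

```python
# Illustrative: rebuild a Git LFS pointer (like the three lines above) for a local file.
import hashlib
from pathlib import Path

def lfs_pointer(path: str) -> str:
    data = Path(path).read_bytes()
    oid = hashlib.sha256(data).hexdigest()  # LFS object id = SHA-256 of the file contents
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

# Hypothetical usage:
# print(lfs_pointer("events.out.tfevents.1724143970.02dbb11e2dcc"))
```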