lapp0 committed on
Commit 560827c
1 Parent(s): e268b55

End of training

README.md CHANGED
@@ -16,13 +16,13 @@ This student model is distilled from the teacher model [gpt2](https://huggingfac
 The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
 It achieves the following results on the evaluation set:
- - eval_enwikippl: 284.4397
- - eval_frwikippl: 1798.9139
- - eval_zhwikippl: 1751.0190
- - eval_loss: 1.6942
- - eval_runtime: 17.3017
- - eval_samples_per_second: 57.798
- - eval_steps_per_second: 7.225
+ - eval_enwikippl: 361.1823
+ - eval_frwikippl: 2386.7026
+ - eval_zhwikippl: 4744.1943
+ - eval_loss: 1.9157
+ - eval_runtime: 17.7641
+ - eval_samples_per_second: 56.293
+ - eval_steps_per_second: 7.037
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -45,7 +45,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
- - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=kl, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
+ - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=jsd, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
 - train_embeddings: True
 - learning_rate: 4e-05
 - train_batch_size: 8
@@ -62,20 +62,20 @@ Peak GPU Memory: 8.0903 GB
 | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | **teacher eval** | | 30.2086 | 57.2728 | | | | | 18.1784 |
- | 0 | 0 | 55429.6875 | 57698.8047 | 18.2420 | 17.3446 | 57.655 | 7.207 | 56988.9141 |
- | 1000 | 0.0808 | 1049.7837 | 5325.7505 | 2.5718 | 17.2452 | 57.987 | 7.248 | 23942.7031 |
- | 2000 | 0.1616 | 708.8521 | 4184.6489 | 2.3106 | 17.24 | 58.005 | 7.251 | 6811.0176 |
- | 3000 | 0.2424 | 565.6559 | 3478.3425 | 2.1518 | 17.2499 | 57.971 | 7.246 | 4053.6387 |
- | 4000 | 0.3232 | 491.7915 | 3097.2148 | 2.0371 | 17.2528 | 57.962 | 7.245 | 1955.7367 |
- | 5000 | 0.4040 | 422.5880 | 2832.7336 | 1.9384 | 17.2681 | 57.91 | 7.239 | 1698.9636 |
- | 6000 | 0.4848 | 368.1481 | 2462.5984 | 1.8605 | 17.2945 | 57.822 | 7.228 | 1325.2998 |
- | 7000 | 0.5657 | 335.3666 | 2266.9673 | 1.7960 | 17.2752 | 57.886 | 7.236 | 1323.1775 |
- | 8000 | 0.6465 | 307.7420 | 1958.8436 | 1.7437 | 17.2402 | 58.004 | 7.25 | 4681.2642 |
- | 9000 | 0.7273 | 284.4397 | 1798.9139 | 1.6942 | 17.3017 | 57.798 | 7.225 | 1751.0190 |
- | 10000 | 0.8081 | 267.8474 | 1581.1575 | 1.6577 | 17.3091 | 57.773 | 7.222 | 2190.8152 |
- | 11000 | 0.8889 | 255.3764 | 1641.3605 | 1.6201 | 17.2717 | 57.898 | 7.237 | 1329.1990 |
- | 12000 | 0.9697 | 236.4612 | 1527.0337 | 1.5844 | 17.3033 | 57.793 | 7.224 | 930.5598 |
- | 12375 | 1.0 | 230.9265 | 1449.1996 | 1.5675 | 17.274 | 57.89 | 7.236 | 871.8522 |
+ | 0 | 0 | 55429.6875 | 57698.8047 | 175.5960 | 17.7132 | 56.455 | 7.057 | 56988.9141 |
+ | 1000 | 0.0808 | 1378.2804 | 7985.9370 | 2.9261 | 17.7408 | 56.367 | 7.046 | 50670.5117 |
+ | 2000 | 0.1616 | 847.0040 | 4771.7144 | 2.5598 | 17.7417 | 56.364 | 7.046 | 14207.9209 |
+ | 3000 | 0.2424 | 688.7217 | 4189.9639 | 2.4037 | 17.7422 | 56.363 | 7.045 | 14306.9199 |
+ | 4000 | 0.3232 | 592.5426 | 3683.2917 | 2.2774 | 17.7332 | 56.391 | 7.049 | 10059.0352 |
+ | 5000 | 0.4040 | 523.8404 | 3382.5713 | 2.1775 | 17.7584 | 56.311 | 7.039 | 5015.1992 |
+ | 6000 | 0.4848 | 467.2578 | 3189.8525 | 2.0989 | 17.7486 | 56.343 | 7.043 | 3572.5811 |
+ | 7000 | 0.5657 | 428.3366 | 2903.5076 | 2.0325 | 17.7274 | 56.41 | 7.051 | 4927.5718 |
+ | 8000 | 0.6465 | 387.8995 | 2702.7917 | 1.9715 | 17.7229 | 56.424 | 7.053 | 5649.6235 |
+ | 9000 | 0.7273 | 361.1823 | 2386.7026 | 1.9157 | 17.7641 | 56.293 | 7.037 | 4744.1943 |
+ | 10000 | 0.8081 | 339.6383 | 2152.0254 | 1.8789 | 17.7218 | 56.428 | 7.053 | 2744.7366 |
+ | 11000 | 0.8889 | 329.6602 | 2210.7751 | 1.8416 | 17.7377 | 56.377 | 7.047 | 3214.8923 |
+ | 12000 | 0.9697 | 309.4193 | 2047.5265 | 1.8192 | 17.7465 | 56.349 | 7.044 | 2113.7966 |
+ | 12375 | 1.0 | 306.6684 | 1978.8336 | 1.8053 | 17.7467 | 56.348 | 7.044 | 1737.9734 |
 
 ### Framework versions
 - Distily 0.2.0
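
The only substantive hyperparameter change in this commit is the hidden-state loss function, which moves from `kl` to `jsd` in `hs_loss_component` (weight 2.0). The snippet below is a minimal PyTorch sketch of what those two divergences look like when applied to softmax-normalized teacher and student activations; it is an illustration of the concept, not Distily's actual implementation.

```python
# Illustrative only: NOT Distily's implementation, just the two divergences
# named in the diff (kl vs. jsd), applied to softmax-normalized activations.
import torch
import torch.nn.functional as F


def kl_loss(student_acts: torch.Tensor, teacher_acts: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch."""
    student_logp = F.log_softmax(student_acts, dim=-1)
    teacher_p = F.softmax(teacher_acts, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")


def jsd_loss(student_acts: torch.Tensor, teacher_acts: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence: symmetric, bounded alternative to KL."""
    p = F.softmax(student_acts, dim=-1)
    q = F.softmax(teacher_acts, dim=-1)
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * (
        F.kl_div(m.log(), p, reduction="batchmean")
        + F.kl_div(m.log(), q, reduction="batchmean")
    )
```

Unlike KL, JSD is symmetric and bounded, which can make the hidden-state matching term less sensitive to near-zero probabilities in either distribution.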
logs/hs_loss_fn=jsd, hs_weight=2.0/events.out.tfevents.1723679648.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95c137ce056c52c10799055a81ed8b06f4192f82ba39feb67746645152b8a9b3
+ size 249
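
For reference, the `eval_enwikippl`, `eval_frwikippl`, and `eval_zhwikippl` columns appear to be perplexities on English, French, and Chinese Wikipedia text, i.e. the exponential of the mean cross-entropy of the student model on that text. A generic sketch of that computation follows; it is not Distily's evaluation code, and the model path is a placeholder.

```python
# Generic perplexity sketch -- not Distily's evaluation code.
# "path/to/student-model" is a placeholder for the distilled checkpoint.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/student-model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")


def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy loss
        # over predicted tokens; perplexity is its exponential.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


print(perplexity("The quick brown fox jumps over the lazy dog."))
```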