---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_activation_loss_b
  results: []
---

# distily_bench_gpt2_activation_loss_b

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using an unspecified dataset.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 225.9773
- eval_frwikippl: 1391.1320
- eval_zhwikippl: 821.2236
- eval_loss: 19.6630
- eval_runtime: 17.2806
- eval_samples_per_second: 57.868
- eval_steps_per_second: 7.234
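
A minimal usage sketch is shown below, assuming the checkpoint is published as a standard GPT-2-architecture causal LM (the repo id in the code is an assumption inferred from the model name, not confirmed by this card). The exact evaluation dataset and sequence length behind the perplexity figures above are not specified here, so this only illustrates the metric, not the reported evaluation:

```python
# Hedged sketch: load the distilled student and compute perplexity on one sample.
# The repo id below is an assumption (inferred from the model name); adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_gpt2_activation_loss_b"  # assumed, not confirmed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids yields the mean token-level cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```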

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
-->

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a code sketch of the distillation objective follows this list):
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
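
The `distillation_objective` above combines a KL-divergence loss on the logits (weight 1) with a cross-entropy-style loss on the hidden states (weight 2.0); the attention component has weight 0 and is effectively disabled. A hedged sketch of that combination in plain PyTorch follows; it is not Distily's implementation, and the soft cross entropy over hidden states is an assumed interpretation of `loss_fn=ce`:

```python
# Hedged sketch of the combined objective: KL on logits (weight 1) plus a
# cross-entropy-style term on hidden states (weight 2.0). This is NOT Distily's
# code; shapes, reductions, and the hidden-state loss are assumed interpretations.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hs, teacher_hs,
                      logits_weight=1.0, hs_weight=2.0):
    # KL divergence between teacher and student next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Soft cross entropy between paired hidden states (layer_mapper=None is read
    # here as a one-to-one pairing of student and teacher layers).
    hs_loss = sum(
        torch.sum(-F.softmax(t, dim=-1) * F.log_softmax(s, dim=-1), dim=-1).mean()
        for s, t in zip(student_hs, teacher_hs)
    ) / len(student_hs)
    # The attention component has weight 0 in this run, so it is omitted here.
    return logits_weight * kl + hs_weight * hs_loss
```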

### Resource Usage
Peak GPU Memory: 8.0903 GB

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** |  | 30.2086 | 57.2728 |  |  |  |  | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 24.5150 | 17.2943 | 57.823 | 7.228 | 56988.9141 |
| 1000 | 0.0808 | 713.7677 | 4453.7666 | 20.3910 | 17.3531 | 57.627 | 7.203 | 17866.8926 |
| 2000 | 0.1616 | 521.2028 | 3308.0386 | 20.2010 | 17.3798 | 57.538 | 7.192 | 2471.2515 |
| 3000 | 0.2424 | 433.2541 | 2722.2993 | 20.1000 | 17.3672 | 57.58 | 7.197 | 1283.4985 |
| 4000 | 0.3232 | 387.5081 | 2569.3728 | 20.0170 | 17.3651 | 57.587 | 7.198 | 1167.0867 |
| 5000 | 0.4040 | 332.2302 | 2197.1006 | 19.9310 | 17.283 | 57.86 | 7.233 | 1141.8051 |
| 6000 | 0.4848 | 292.5944 | 1835.8154 | 19.8590 | 17.2939 | 57.824 | 7.228 | 905.3102 |
| 7000 | 0.5657 | 266.3748 | 1648.5508 | 19.7820 | 17.3184 | 57.742 | 7.218 | 844.8045 |
| 8000 | 0.6465 | 244.8321 | 1513.9550 | 19.7310 | 17.3028 | 57.794 | 7.224 | 1150.9904 |
| 9000 | 0.7273 | 225.9773 | 1391.1320 | 19.6630 | 17.2806 | 57.868 | 7.234 | 821.2236 |
| 10000 | 0.8081 | 209.6788 | 1266.0754 | 19.6040 | 17.3446 | 57.655 | 7.207 | 718.9499 |
| 11000 | 0.8889 | 196.7588 | 1248.5234 | 19.5620 | 17.3611 | 57.6 | 7.2 | 611.5998 |
| 12000 | 0.9697 | 179.4194 | 1137.2484 | 19.5120 | 17.3767 | 57.548 | 7.194 | 572.3267 |
| 12375 | 1.0 | 175.7241 | 1080.9574 | 19.4920 | 17.3076 | 57.778 | 7.222 | 584.9987 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0