lapp0 committed
Commit f181b18
1 Parent(s): 39101e0

End of training

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+benchmarks.shelve.dat filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,158 @@
---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- bitnet
- 1.58b
- generated_from_trainer
model-index:
- name: distily_multi_attn_experiment_ortho
  results: []
---


# Summary

Distilled with the [Distily](https://github.com/lapp0/distily) library
using teacher model [gpt2](https://huggingface.co/gpt2)
on the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.
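
For quick orientation, a minimal usage sketch is below. The repo id is an assumption inferred from the `model-index` name above and may not match where this checkpoint is actually published.

```python
# Minimal usage sketch -- the repo id is an assumption inferred from the
# model-index name above; adjust it to the actual location of this checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_multi_attn_experiment_ortho"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Knowledge distillation compresses a teacher model by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```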

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

# Model description

More information needed

# Intended uses & limitations

More information needed
-->

# Model Architecture:
- **Architecture**: `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808
- **Data Type (dtype)**: torch.bfloat16
- **Model Size**: 0.24 GB

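The size figure follows directly from the parameter count and dtype; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the model size listed above.
n_params = 124_439_808   # total parameters
bytes_per_param = 2      # torch.bfloat16
print(f"{n_params * bytes_per_param / 1e9:.2f} GB")
# ~0.25 GB (about 0.23 GiB); the ~0.24 GB above sits between the two conventions.
```
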
# Benchmark Metrics Comparison

| Metric | student (attn_layer_mapper=layer-2, attn_loss_fn=raw_mse, attn_projector=orthogonal, attn_weight=25.0) | teacher |
| :--- | :--- | :--- |
| ai2_arc (acc) | 0.305 | 0.354 |
| ai2_arc (acc_norm) | 0.302 | 0.339 |
| arc_challenge (acc) | 0.173 | 0.188 |
| arc_challenge (acc_norm) | 0.223 | 0.222 |
| arc_easy (acc) | 0.37 | 0.436 |
| arc_easy (acc_norm) | 0.34 | 0.396 |
| boolq (acc) | 0.387 | 0.51 |
| cola (mcc) | 0.044 | 0.01 |
| glue (acc) | 0.412 | 0.403 |
| glue (f1) | 0.451 | 0.529 |
| glue (mcc) | 0.044 | 0.01 |
| hellaswag (acc) | 0.315 | 0.343 |
| hellaswag (acc_norm) | 0.344 | 0.393 |
| mnli (acc) | 0.338 | 0.338 |
| mnli_mismatch (acc) | 0.351 | 0.346 |
| mrpc (acc) | 0.353 | 0.515 |
| mrpc (f1) | 0.143 | 0.631 |
| qnli (acc) | 0.497 | 0.491 |
| qqp (acc) | 0.406 | 0.367 |
| qqp (f1) | 0.501 | 0.512 |
| rte (acc) | 0.549 | 0.516 |
| sst2 (acc) | 0.545 | 0.511 |
| wikitext (bits_per_byte) | 1.127 | 0.98 |
| wikitext (byte_perplexity) | 2.184 | 1.973 |
| wikitext (word_perplexity) | 65.25 | 37.82 |
| wnli (acc) | 0.451 | 0.451 |

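The three wikitext rows are different views of the same log-likelihood; in particular, `bits_per_byte` is `log2(byte_perplexity)`, which the reported values bear out:

```python
# Consistency check: bits_per_byte == log2(byte_perplexity) for the wikitext rows.
import math

for label, byte_ppl, reported_bpb in [("student", 2.184, 1.127), ("teacher", 1.973, 0.98)]:
    print(label, round(math.log2(byte_ppl), 3), "reported:", reported_bpb)
# student 1.127 reported: 1.127
# teacher 0.98 reported: 0.98
```
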
# Resource Usage Comparison

- VRAM Use: 7.7830 GB

# Distillation (Teacher -> Student) Architecture Difference:

- **Architecture**: `GPT2LMHeadModel` -> `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808 -> 124,439,808
- **Data Type (dtype)**: torch.bfloat16 -> torch.bfloat16
- **Model Size**: 0.24 GB -> 0.24 GB

<details>
<summary>Module Diff Details</summary>

```diff

```

</details>
<br/>

# Train Dataset
Trained on 145,724,804 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- Num Samples: `247,500`
- Subset: `20231101.en`
- Split: `train`

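A sketch of pulling the same data with the `datasets` library is below; Distily's own loading pipeline may differ in detail, but the sample counts line up with the hyperparameters listed further down (`dataset_sample_size: 250000`, `dataset_test_size: 0.01`).

```python
# Sketch of loading the training data described above; Distily's own pipeline
# may differ (sampling, shuffling, tokenization), so treat this as illustrative.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
ds = ds.shuffle(seed=42).select(range(250_000))          # dataset_sample_size
split = ds.train_test_split(test_size=0.01, seed=42)     # dataset_test_size
print(split["train"].num_rows, split["test"].num_rows)   # 247500 2500 -> matches "Num Samples: 247,500"
```
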
# Training Objective

```
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))
```

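Conceptually, this objective combines a KL term over student/teacher logits (weight 1) with a raw MSE term over layer-mapped attention tensors (weight 25). The sketch below illustrates that combination only; it is not Distily's implementation, and the `layer-2` mapper and orthogonal projector are deliberately simplified away.

```python
# Illustrative sketch of the objective above -- not Distily's actual code.
# Both forward passes must be run with output_attentions=True.
import torch
import torch.nn.functional as F


def distillation_loss(student_out, teacher_out, attn_weight=25.0):
    # Logits component: KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student_out.logits, dim=-1),
        F.log_softmax(teacher_out.logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Attention component: raw MSE over paired attention maps. The layer-2
    # mapper and orthogonal projector are omitted; layers are paired one-to-one.
    attn_mse = torch.stack([
        F.mse_loss(s, t)
        for s, t in zip(student_out.attentions, teacher_out.attentions)
    ]).mean()
    return kl + attn_weight * attn_mse
```
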
# Hyperparameters
The following hyperparameters were used during training:

<details>
<summary>Expand</summary>

- learning_rate: `0.0001`
- train_batch_size: `4`
- eval_batch_size: `8`
- seed: `42`
- optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
- lr_scheduler_type: `cosine_with_min_lr`
- lr_scheduler_warmup_ratio: `0.5`
- num_epochs: `1.0`
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=25.0, loss_fn=raw_mse, layer_mapper=layer-2))`
- train_embeddings: `True`
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f0d1223cb50>`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `None`
- reinitialize_weights: `None`
- copy_teacher_modules: `[('lm_head', False)]`
- student_model_as_bitnet: `True`
- student_model_compile: `False`
- dropout: `None`
- teacher_model_name_or_path: `gpt2`
- teacher_load_in_8bit: `False`
- teacher_load_in_4bit: `False`
- teacher_model_compile: `False`
- dataset_uri: `wikimedia/wikipedia`
- dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- dataset_sample_size: `250000`
- dataset_test_size: `0.01`
- gradient_accumulation_steps: `1`
- weight_decay: `0.0`
- max_grad_norm: `1.0`
- warmup_ratio: `0.5`
- warmup_steps: `0`
- gradient_checkpointing: `True`

</details>
<br/>

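Most of these map onto standard `transformers.TrainingArguments` fields; a rough equivalent is sketched below for orientation only (Distily wires up its own trainer and distillation-specific options).

```python
# Rough TrainingArguments equivalent of the hyperparameters above -- a sketch;
# Distily builds its own trainer, so names and handling may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distily_multi_attn_experiment_ortho",  # assumed output dir
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="cosine_with_min_lr",
    warmup_ratio=0.5,
    num_train_epochs=1.0,
    gradient_accumulation_steps=1,
    weight_decay=0.0,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,  # matches the torch.bfloat16 dtype listed above
)
```
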
# Framework Versions
- Distily 0.3.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
benchmarks.shelve.bak CHANGED
@@ -0,0 +1,2 @@
+'teacher', (0, 26029753)
+'attn_layer_mapper=layer-2, attn_loss_fn=raw_mse, attn_projector=orthogonal, attn_weight=25.0', (26030080, 26029753)
benchmarks.shelve.dat CHANGED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b08da2c1102a7b8635c1aac31997fbdc32e594beca1614e4a38096dec1f9bf07
+size 52059833
benchmarks.shelve.dir CHANGED
@@ -0,0 +1,2 @@
+'teacher', (0, 26029753)
+'attn_layer_mapper=layer-2, attn_loss_fn=raw_mse, attn_projector=orthogonal, attn_weight=25.0', (26030080, 26029753)
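
The `.bak`, `.dat`, and `.dir` files look like the backing files of a Python `shelve` store (dbm.dumb backend) keyed by run name, as the index entries above suggest. If that reading is correct, the raw benchmark records could be loaded back roughly like this (the value format is not documented here):

```python
# Assumes benchmarks.shelve.{bak,dat,dir} are a dbm.dumb-backed shelve store
# keyed by run name; note the .dat payload is stored via Git LFS in this repo.
import shelve

with shelve.open("benchmarks.shelve") as db:
    for run_name in db:
        print(run_name, type(db[run_name]))
```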
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.44.0"
}
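
This is GPT-2's standard special-token setup; the same object can be built directly with `transformers.GenerationConfig`:

```python
# Equivalent GenerationConfig to the JSON above (bos/eos are GPT-2's <|endoftext|> id).
from transformers import GenerationConfig

gen_config = GenerationConfig(bos_token_id=50256, eos_token_id=50256)
```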
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 50256,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 50256,