lapp0 committed on
Commit eaee4c9
1 Parent(s): 068b652

End of training

README.md ADDED
---
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- wikimedia/wikipedia
library_name: Distily
license: creativeml-openrail-m
tags:
- generated_from_trainer
- Distily
base_model_relation: finetune
model-index:
- name: distily_profile_smollm_tritoned
  results: []
---

# Summary

Distilled with the [Distily](https://github.com/lapp0/distily) library,
using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
on the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

# Model description

More information needed

# Intended uses & limitations

More information needed
-->

# Model Architecture:
- **Architecture**: `LlamaForCausalLM`
- **Total Parameters**: 81,413,568
- **Data Type (dtype)**: torch.bfloat16
- **Model Size**: 0.15 GB

<details>
<summary>Student Model Details</summary>

```
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-14): 15 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
        )
        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
      )
    )
    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
```

</details>
<br/>

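The card does not include a usage snippet; the sketch below shows one way to load the distilled student with `transformers`. The repo id `lapp0/distily_profile_smollm_tritoned` is an assumption inferred from the model-index name above; adjust it if the model is hosted elsewhere.

```python
# Minimal sketch of loading the distilled student for generation.
# Assumption: the model is published as "lapp0/distily_profile_smollm_tritoned".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_profile_smollm_tritoned"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
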
# Resource Usage

- Max Train VRAM Use: 12.7772 GB
- Available VRAM: 23.4329 GB
- GPUs:
  - 1x NVIDIA GeForce RTX 4090
- CPUs: 64
- CPU Memory: 251.7299 GB
- CPU Memory Bandwidth: 1600 GB/s

# Distillation (Teacher -> Student) Architecture Difference:

- **Architecture**: `LlamaForCausalLM` -> `LlamaForCausalLM`
- **Total Parameters**: 134,515,008 -> 81,413,568
- **Data Type (dtype)**: torch.bfloat16 -> torch.bfloat16
- **Model Size**: 0.25 GB -> 0.15 GB

<details>
<summary>Module Diff Details</summary>

```diff
--- teacher model modules
+++ student model modules
@@ -2,7 +2,7 @@
   (model): LlamaModel(
     (embed_tokens): Embedding(49152, 576)
     (layers): ModuleList(
-      (0-29): 30 x LlamaDecoderLayer(
+      (0-14): 15 x LlamaDecoderLayer(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
@@ -10,17 +10,16 @@
           (o_proj): Linear(in_features=576, out_features=576, bias=False)
           (rotary_emb): LlamaRotaryEmbedding()
         )
-        (mlp): LlamaMLP(
+        (mlp): LigerSwiGLUMLP(
           (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
           (up_proj): Linear(in_features=576, out_features=1536, bias=False)
           (down_proj): Linear(in_features=1536, out_features=576, bias=False)
-          (act_fn): SiLU()
         )
-        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
       )
     )
-    (norm): LlamaRMSNorm((576,), eps=1e-05)
+    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
     (rotary_emb): LlamaRotaryEmbedding()
   )
   (lm_head): Linear(in_features=576, out_features=49152, bias=False)
```

</details>
<br/>

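In short, the student keeps the teacher's layout but with 15 of the 30 decoder layers and Liger-kernel modules swapped in. A rough sketch of configuring such a student with `transformers` follows; it is illustrative only, not Distily's internal initialization, and the Liger substitutions (handled by `student_use_liger_kernel`) are not shown.

```python
# Illustrative sketch: a 15-layer, SmolLM-135M-shaped student config.
# Assumption: this mirrors the architecture change, not Distily's actual code.
from transformers import AutoConfig, AutoModelForCausalLM

teacher_id = "HuggingFaceTB/SmolLM-135M"
config = AutoConfig.from_pretrained(teacher_id)
config.num_hidden_layers = 15                        # teacher has 30 decoder layers
student = AutoModelForCausalLM.from_config(config)   # randomly initialized here; Distily may initialize differently

print(sum(p.numel() for p in student.parameters()))  # ~81.4M parameters
```
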
# Train Dataset
Trained on 44,060,170 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- Num Samples: `49,900`
- Subset: `20231101.en`
- Split: `train`

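For reference, the same data can be pulled with the `datasets` library as sketched below. The sample count and test split follow the hyperparameters listed further down; the tokenization settings are assumptions, not values from the card.

```python
# Illustrative data-loading sketch (assumption: mirrors, but is not, Distily's pipeline).
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
dataset = dataset.select(range(50_000))                      # dataset_sample_size: 50000, no shuffle
splits = dataset.train_test_split(test_size=0.002, seed=42)  # dataset_test_size: 0.002

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")

def tokenize(batch):
    # max_length is an assumption; the card does not state the sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = splits["train"].map(tokenize, batched=True, remove_columns=splits["train"].column_names)
```
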
# Training Objective

```
DistillationObjective(
    logits_loss_component=LossComponent(
        weight=1,
        loss_fn='kl'
    ),
    hs_loss_component=LossComponent(
        weight=0
    ),
    attn_loss_component=LossComponent(
        weight=0
    )
)
```

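Only the logits component is active (weight 1, KL loss); the hidden-state and attention components are disabled. A minimal sketch of a KL-based logits distillation loss is shown below; Distily's exact formulation (temperature, masking, reduction) may differ.

```python
# Minimal KL logits-distillation loss sketch (assumption: not Distily's exact implementation).
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, for logits of shape (batch, seq, vocab)."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    # 'batchmean' sums over tokens and vocabulary, then divides by the batch size.
    return F.kl_div(student_log_probs, teacher_log_probs, log_target=True, reduction="batchmean")
```
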
# Hyperparameters
The following hyperparameters were used during training:

<details>
<summary>Expand</summary>

- learning_rate: `0.0002`
- train_batch_size: `4`
- eval_batch_size: `2`
- seed: `42`
- optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
- lr_scheduler_type: `polynomial`
- num_epochs: `1.0`
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(weight=1, loss_fn='kl'), hs_loss_component=LossComponent(weight=0), attn_loss_component=LossComponent(weight=0))`
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7c610d513ac0>`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `{'num_hidden_layers': 15}`
- reinitialize_weights: `None`
- copy_teacher_modules: `[('lm_head', False)]`
- student_model_as_bitnet: `False`
- student_use_liger_kernel: `True`
- teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
- teacher_load_in_8bit: `False`
- teacher_load_in_4bit: `False`
- dataset_uri: `wikimedia/wikipedia`
- dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- dataset_sample_size: `50000`
- dataset_test_size: `0.002`
- dataset_shuffle: `False`
- dataset_shuffle_seed: `42`
- dataset_trust_remote_code: `False`
- gradient_accumulation_steps: `1`
- weight_decay: `0.0`
- max_grad_norm: `1.0`
- warmup_ratio: `0.0`
- warmup_steps: `0`
- gradient_checkpointing: `True`

</details>
<br/>

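As a concrete reading of the optimizer and scheduler entries above, a sketch follows. It is illustrative only (the trainer configures these internally), and the step count is a rough placeholder derived from the sample count and batch size.

```python
# Illustrative optimizer/scheduler setup matching the listed hyperparameters
# (assumption: mirrors, but is not, the trainer's internal configuration).
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

optimizer = torch.optim.Adam(
    student.parameters(),          # `student` from the earlier sketch
    lr=2e-4,                       # learning_rate: 0.0002
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
)
num_training_steps = 49_900 // 4   # ~1 epoch at train_batch_size=4 (rough placeholder)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,            # warmup_steps: 0
    num_training_steps=num_training_steps,
)
```
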
# Framework Versions
- Distily 0.5.0
- Transformers 4.45.0.dev0
- Pytorch 2.5.0.dev20240910+cu121
- Datasets 2.21.0
benchmarks.shelve.bak ADDED
File without changes
benchmarks.shelve.dat ADDED
File without changes
benchmarks.shelve.dir ADDED
File without changes
generation_config.json ADDED
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.45.0.dev0",
  "use_cache": false
}
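The exported generation config disables the KV cache (`use_cache: false`), which is likely a carry-over from training with gradient checkpointing. As a sketch, it can be overridden at inference time; `model` and `inputs` refer to the earlier loading sketch, and the repo id remains an assumption.

```python
# Illustrative: re-enable the KV cache for faster generation (override of the exported config).
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("lapp0/distily_profile_smollm_tritoned")  # assumed repo id
gen_config.use_cache = True   # the exported config ships use_cache: false
outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=32)
```
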
logs/attn_weight=0, per_device_train_batch_size=4, run_name=baseline/events.out.tfevents.1726164356.1c1a426a2fee CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b7b579e05a5281f8954021c690032486d68594fd3598ccf86b45add299dd5858
-size 275442
+oid sha256:03b0454e6c3ed15f74f50c5d0a0be6bde7b0cbbde8cc85931804492521932ff2
+size 343081
logs/attn_weight=0, per_device_train_batch_size=4, run_name=baseline/events.out.tfevents.1726167546.1c1a426a2fee ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4856a8ec2d2045cc9d92705c3b3fcd1b667cf56a64105ca2578a7836f042e93a
size 249
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cb06bb1c75f6e8583645ed45e18f2d9cce2f49141d1432e3b2f2a5f597901035
+oid sha256:ac00b5ce3744b461a31c74a8fd2b26f2abdc48b37f3067d6596640211a08902a
 size 162842416