---
base_model: meta-llama/Meta-Llama-3-70B
inference: false
model_creator: astronomer-io
model_name: Meta-Llama-3-70B
model_type: llama
pipeline_tag: text-generation
license: other
license_name: llama-3
license_link: https://huggingface.co/meta-llama/Meta-Llama-3-70B/blob/main/README.md
tags:
- llama
- llama-3
- facebook
- meta
- astronomer
- pretrained
- finetuned
- autotrain_compatible
- endpoints_compatible
---
<!-- header start -->
<div style="width: auto; margin-left: auto; margin-right: auto">
<img src="https://www.astronomer.io/logo/astronomer-logo-RGB-standard-1200px.png" alt="Astronomer" style="width: 60%; min-width: 400px; display: block; margin: auto;">
</div>
<div style="margin-top: 1.0em; margin-bottom: 1.0em;"></div>

<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">This model is generously created and made open source by <a href="https://astronomer.io">Astronomer</a>.</p></div>
<div style="text-align:center; margin-top: 0em; margin-bottom: 0em"><p style="margin-top: 0.25em; margin-bottom: 0em;">Astronomer is the de facto company for <a href="https://airflow.apache.org/">Apache Airflow</a>, the most trusted open-source framework for data orchestration and MLOps.</p></div>
<hr style="margin-top: 1.0em; margin-bottom: 1.0em;">
<!-- header end -->

# Llama-3-70B-Special-Tokens-Adjusted
- Ideal and stable Llama-3-70B for fine-tuning.
- Original model creator: [Meta](https://huggingface.co/meta-llama)
- Original model: [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
- The usage of this model must abide by the [Llama 3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-70B/blob/main/LICENSE).
- Built with Meta Llama 3
- Created by [David Xue](https://www.linkedin.com/in/david-xue-uva/) from [Astronomer](https://astronomer.io)

## Description
This model is identical to [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B), except that the rows of the input embedding matrix and of the LM head (output embedding) matrix that correspond to untrained tokens have been set to the mean of the trained token embeddings. Those untrained tokens caused widespread issues for people attempting to fine-tune the base model, whether by adding their own tokens or by using the existing special tokens.

## Why We Made This Model

The Llama 3 base (non-instruct) model, while powerful, shipped with a significant oversight: some of the special tokens used for instruction following were left untrained, potentially derailing further fine-tuning. This was first noted by [Daniel Han on X](https://twitter.com/danielhanchen/status/1781395882925343058), highlighting a critical but fixable flaw in a widely used model.

<img src="https://cdn-uploads.huggingface.co/production/uploads/655ad0f8727df37c77a09cb9/1U2rRrx60p1pNeeAZw8Rd.png" alt="graph" width="400"/>

The primary goal of releasing a patched version of this model is to address this issue so that the community can use Llama 3 without facing training instabilities, such as sudden gradient explosions or `NaN` gradients, and without having to go through a complicated repair process themselves before fine-tuning.

Note: for the 70B model specifically, the untrained special tokens did not have all-zero embedding weights, so the problem may be less severe than it is for the base 8B model. This model was made anyway at the community's request, though in theory fine-tuning the original directly should be fine.

## Details of the Adjustment

The [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) model was pulled directly from Hugging Face and loaded using `transformers`. The input and output embedding values were then retrieved using `model.get_input_embeddings().weight.data` and `model.get_output_embeddings().weight.data`. These two matrices are identical in shape, with each row representing a token id and each column representing an embedding feature.

The special (untrained and problematic) tokens can be found by locating the rows where the embedding values across the entire row are ~~all zeros~~ all less than 9e-7 in magnitude (for the 70B model, no row had all zeros, so the 9e-7 threshold was used to find under-trained tokens), which implies they were not trained during Meta's pretraining of the model. Such untrained tokens can lead to severe numerical issues, like gradient explosions or `NaN` gradients, during downstream fine-tuning on specific tasks.
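
For reference, a minimal sketch of how such rows can be located is shown below. This is an illustrative reconstruction, not the exact script used to produce this model; the model id and the 9e-7 threshold come from the description above, and loading the 70B checkpoint requires substantial memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Rows of these matrices correspond to token ids, columns to embedding features.
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data

# A row counts as "untrained" when every entry is below the threshold in
# magnitude; 9e-7 is the threshold described above for the 70B model.
threshold = 9e-7
untrained_mask = (input_embeddings.abs() < threshold).all(dim=1)
untrained_token_ids = untrained_mask.nonzero().flatten().tolist()
print(tokenizer.convert_ids_to_tokens(untrained_token_ids))
```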

<details>
<summary>See here for a list of the tokens we found that fit the "untrained" profile described:</summary>
['À',
'Á',
'õ',
'ö',
'÷',
'ø',
'ù',
'ú',
'û',
'ü',
'ý',
'þ',
'ÿ',
'">ččĊ',
';čččĊ',
'ĉTokenNameIdentifier',
'ĠForCanBeConverted',
'ĠForCanBeConvertedToF',
'PostalCodesNL',
'$PostalCodesNL',
'useRalative',
'Û±Û',
'аÑĢакÑĤ',
'аÑĤиÑģÑı',
'иÑĤÐ��ÑģÑı',
'ávajÃŃcÃŃ',
'Ä°TESÄ°',
'илакÑĤи',
'илаÑģÑı',
'ÑĭÑŁN',
'ÐİÑĭÑŁN',
'ılmaktadır',
'ÐİÑĭÑŁNÐİÑĭÑŁN',
'ıldıģında',
'<|reserved_special_token_0|>',
'<|reserved_special_token_1|>',
'<|reserved_special_token_2|>',
'<|reserved_special_token_3|>',
'<|start_header_id|>',
'<|end_header_id|>',
'<|reserved_special_token_4|>',
'<|eot_id|>',
'<|reserved_special_token_5|>',
'<|reserved_special_token_6|>',
'<|reserved_special_token_7|>',
'<|reserved_special_token_8|>',
'<|reserved_special_token_9|>',
'<|reserved_special_token_10|>',
'<|reserved_special_token_11|>',
'<|reserved_special_token_12|>',
'<|reserved_special_token_13|>',
'<|reserved_special_token_14|>',
'<|reserved_special_token_15|>',
'<|reserved_special_token_16|>',
'<|reserved_special_token_17|>',
'<|reserved_special_token_18|>',
'<|reserved_special_token_19|>',
'<|reserved_special_token_20|>',
'<|reserved_special_token_21|>',
'<|reserved_special_token_22|>',
'<|reserved_special_token_23|>',
'<|reserved_special_token_24|>',
'<|reserved_special_token_25|>',
'<|reserved_special_token_26|>',
'<|reserved_special_token_27|>',
'<|reserved_special_token_28|>',
'<|reserved_special_token_29|>',
'<|reserved_special_token_30|>',
'<|reserved_special_token_31|>',
'<|reserved_special_token_32|>',
'<|reserved_special_token_33|>',
'<|reserved_special_token_34|>',
'<|reserved_special_token_35|>',
'<|reserved_special_token_36|>',
'<|reserved_special_token_37|>',
'<|reserved_special_token_38|>',
'<|reserved_special_token_39|>',
'<|reserved_special_token_40|>',
'<|reserved_special_token_41|>',
'<|reserved_special_token_42|>',
'<|reserved_special_token_43|>',
'<|reserved_special_token_44|>',
'<|reserved_special_token_45|>',
'<|reserved_special_token_46|>',
'<|reserved_special_token_47|>',
'<|reserved_special_token_48|>',
'<|reserved_special_token_49|>',
'<|reserved_special_token_50|>',
'<|reserved_special_token_51|>',
'<|reserved_special_token_52|>',
'<|reserved_special_token_53|>',
'<|reserved_special_token_54|>',
'<|reserved_special_token_55|>',
'<|reserved_special_token_56|>',
'<|reserved_special_token_57|>',
'<|reserved_special_token_58|>',
'<|reserved_special_token_59|>',
'<|reserved_special_token_60|>',
'<|reserved_special_token_61|>',
'<|reserved_special_token_62|>',
'<|reserved_special_token_63|>',
'<|reserved_special_token_64|>',
'<|reserved_special_token_65|>',
'<|reserved_special_token_66|>',
'<|reserved_special_token_67|>',
'<|reserved_special_token_68|>',
'<|reserved_special_token_69|>',
'<|reserved_special_token_70|>',
'<|reserved_special_token_71|>',
'<|reserved_special_token_72|>',
'<|reserved_special_token_73|>',
'<|reserved_special_token_74|>',
'<|reserved_special_token_75|>',
'<|reserved_special_token_76|>',
'<|reserved_special_token_77|>',
'<|reserved_special_token_78|>',
'<|reserved_special_token_79|>',
'<|reserved_special_token_80|>',
'<|reserved_special_token_81|>',
'<|reserved_special_token_82|>',
'<|reserved_special_token_83|>',
'<|reserved_special_token_84|>',
'<|reserved_special_token_85|>',
'<|reserved_special_token_86|>',
'<|reserved_special_token_87|>',
'<|reserved_special_token_88|>',
'<|reserved_special_token_89|>',
'<|reserved_special_token_90|>',
'<|reserved_special_token_91|>',
'<|reserved_special_token_92|>',
'<|reserved_special_token_93|>',
'<|reserved_special_token_94|>',
'<|reserved_special_token_95|>',
'<|reserved_special_token_96|>',
'<|reserved_special_token_97|>',
'<|reserved_special_token_98|>',
'<|reserved_special_token_99|>',
'<|reserved_special_token_100|>',
'<|reserved_special_token_101|>',
'<|reserved_special_token_102|>',
'<|reserved_special_token_103|>',
'<|reserved_special_token_104|>',
'<|reserved_special_token_105|>',
'<|reserved_special_token_106|>',
'<|reserved_special_token_107|>',
'<|reserved_special_token_108|>',
'<|reserved_special_token_109|>',
'<|reserved_special_token_110|>',
'<|reserved_special_token_111|>',
'<|reserved_special_token_112|>',
'<|reserved_special_token_113|>',
'<|reserved_special_token_114|>',
'<|reserved_special_token_115|>',
'<|reserved_special_token_116|>',
'<|reserved_special_token_117|>',
'<|reserved_special_token_118|>',
'<|reserved_special_token_119|>',
'<|reserved_special_token_120|>',
'<|reserved_special_token_121|>',
'<|reserved_special_token_122|>',
'<|reserved_special_token_123|>',
'<|reserved_special_token_124|>',
'<|reserved_special_token_125|>',
'<|reserved_special_token_126|>',
'<|reserved_special_token_127|>',
'<|reserved_special_token_128|>',
'<|reserved_special_token_129|>',
'<|reserved_special_token_130|>',
'<|reserved_special_token_131|>',
'<|reserved_special_token_132|>',
'<|reserved_special_token_133|>',
'<|reserved_special_token_134|>',
'<|reserved_special_token_135|>',
'<|reserved_special_token_136|>',
'<|reserved_special_token_137|>',
'<|reserved_special_token_138|>',
'<|reserved_special_token_139|>',
'<|reserved_special_token_140|>',
'<|reserved_special_token_141|>',
'<|reserved_special_token_142|>',
'<|reserved_special_token_143|>',
'<|reserved_special_token_144|>',
'<|reserved_special_token_145|>',
'<|reserved_special_token_146|>',
'<|reserved_special_token_147|>',
'<|reserved_special_token_148|>',
'<|reserved_special_token_149|>',
'<|reserved_special_token_150|>',
'<|reserved_special_token_151|>',
'<|reserved_special_token_152|>',
'<|reserved_special_token_153|>',
'<|reserved_special_token_154|>',
'<|reserved_special_token_155|>',
'<|reserved_special_token_156|>',
'<|reserved_special_token_157|>',
'<|reserved_special_token_158|>',
'<|reserved_special_token_159|>',
'<|reserved_special_token_160|>',
'<|reserved_special_token_161|>',
'<|reserved_special_token_162|>',
'<|reserved_special_token_163|>',
'<|reserved_special_token_164|>',
'<|reserved_special_token_165|>',
'<|reserved_special_token_166|>',
'<|reserved_special_token_167|>',
'<|reserved_special_token_168|>',
'<|reserved_special_token_169|>',
'<|reserved_special_token_170|>',
'<|reserved_special_token_171|>',
'<|reserved_special_token_172|>',
'<|reserved_special_token_173|>',
'<|reserved_special_token_174|>',
'<|reserved_special_token_175|>',
'<|reserved_special_token_176|>',
'<|reserved_special_token_177|>',
'<|reserved_special_token_178|>',
'<|reserved_special_token_179|>',
'<|reserved_special_token_180|>',
'<|reserved_special_token_181|>',
'<|reserved_special_token_182|>',
'<|reserved_special_token_183|>',
'<|reserved_special_token_184|>',
'<|reserved_special_token_185|>',
'<|reserved_special_token_186|>',
'<|reserved_special_token_187|>',
'<|reserved_special_token_188|>',
'<|reserved_special_token_189|>',
'<|reserved_special_token_190|>',
'<|reserved_special_token_191|>',
'<|reserved_special_token_192|>',
'<|reserved_special_token_193|>',
'<|reserved_special_token_194|>',
'<|reserved_special_token_195|>',
'<|reserved_special_token_196|>',
'<|reserved_special_token_197|>',
'<|reserved_special_token_198|>',
'<|reserved_special_token_199|>',
'<|reserved_special_token_200|>',
'<|reserved_special_token_201|>',
'<|reserved_special_token_202|>',
'<|reserved_special_token_203|>',
'<|reserved_special_token_204|>',
'<|reserved_special_token_205|>',
'<|reserved_special_token_206|>',
'<|reserved_special_token_207|>',
'<|reserved_special_token_208|>',
'<|reserved_special_token_209|>',
'<|reserved_special_token_210|>',
'<|reserved_special_token_211|>',
'<|reserved_special_token_212|>',
'<|reserved_special_token_213|>',
'<|reserved_special_token_214|>',
'<|reserved_special_token_215|>',
'<|reserved_special_token_216|>',
'<|reserved_special_token_217|>',
'<|reserved_special_token_218|>',
'<|reserved_special_token_219|>',
'<|reserved_special_token_220|>',
'<|reserved_special_token_221|>',
'<|reserved_special_token_222|>',
'<|reserved_special_token_223|>',
'<|reserved_special_token_224|>',
'<|reserved_special_token_225|>',
'<|reserved_special_token_226|>',
'<|reserved_special_token_227|>',
'<|reserved_special_token_228|>',
'<|reserved_special_token_229|>',
'<|reserved_special_token_230|>',
'<|reserved_special_token_231|>',
'<|reserved_special_token_232|>',
'<|reserved_special_token_233|>',
'<|reserved_special_token_234|>',
'<|reserved_special_token_235|>',
'<|reserved_special_token_236|>',
'<|reserved_special_token_237|>',
'<|reserved_special_token_238|>',
'<|reserved_special_token_239|>',
'<|reserved_special_token_240|>',
'<|reserved_special_token_241|>',
'<|reserved_special_token_242|>',
'<|reserved_special_token_243|>',
'<|reserved_special_token_244|>',
'<|reserved_special_token_245|>',
'<|reserved_special_token_246|>',
'<|reserved_special_token_247|>',
'<|reserved_special_token_248|>',
'<|reserved_special_token_249|>',
'<|reserved_special_token_250|>']
</details>

Once the untrained tokens are identified, the mean of the trained tokens' embeddings is computed: for each feature (column), the embedding values of all trained tokens are summed and divided by the number of trained tokens. This is done for both the input and output matrices.

Lastly, the rows for the problematic tokens in the two embedding matrices are set to the computed mean, completing the adjustment.
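
Continuing the sketch above (again an illustrative reconstruction that reuses `untrained_mask`, `input_embeddings`, and `output_embeddings` from the previous snippet; the output path is hypothetical):

```python
trained_mask = ~untrained_mask
num_trained = trained_mask.sum()

# Per-feature (column-wise) mean of the trained rows: the sum of the trained
# tokens' embedding values divided by the number of trained tokens.
input_mean = input_embeddings[trained_mask].sum(dim=0) / num_trained
output_mean = output_embeddings[trained_mask].sum(dim=0) / num_trained

# Overwrite every untrained row in both matrices with the corresponding mean.
input_embeddings[untrained_mask] = input_mean
output_embeddings[untrained_mask] = output_mean

model.save_pretrained("Llama-3-70B-Special-Tokens-Adjusted")  # hypothetical path
```

Setting the rows to the mean of the trained embeddings keeps them at a typical scale rather than near zero, which is what is intended to prevent the exploding or `NaN` gradients described above.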

## Contributors
- [David Xue](https://www.linkedin.com/in/david-xue-uva/), Machine Learning Engineer from [Astronomer](https://astronomer.io)