Triangle104 committed on
Commit
46c3971
1 Parent(s): 2675755

Update README.md

Files changed (1)
  1. README.md +345 -0
README.md CHANGED
@@ -33,6 +33,351 @@ tags:
 This model was converted to GGUF format from [`PrimeIntellect/INTELLECT-1-Instruct`](https://huggingface.co/PrimeIntellect/INTELLECT-1-Instruct) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
 Refer to the [original model card](https://huggingface.co/PrimeIntellect/INTELLECT-1-Instruct) for more details on the model.

+ ---
+ Model details:
+
+ INTELLECT-1 is the first collaboratively trained 10 billion parameter language model, trained from scratch on 1 trillion tokens of English text and code.
+
+ This is an instruct model. The base model associated with it is INTELLECT-1.
+
+ INTELLECT-1 was trained on up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent community contributors providing compute.
+ The training code utilizes the prime framework, a scalable distributed training framework designed for fault-tolerant, dynamically scaling, high-performance training on unreliable, globally distributed workers.
+ The key abstraction that allows dynamic scaling is the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node.
+ The model was trained using the DiLoCo algorithm with 100 inner steps. The global all-reduce was done with custom int8 all-reduce kernels to reduce the communication payload required, greatly reducing the communication overhead by a factor of 400x.
+
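+ As a concrete illustration of the inner/outer structure described above, here is a minimal DiLoCo-style step in plain PyTorch. It is a sketch under simplifying assumptions (a standard torch.distributed process group, a generic model, an outer Nesterov-SGD optimizer), not the prime framework's actual implementation; the custom int8 all-reduce kernels and the ElasticDeviceMesh fault tolerance are only noted in comments.
+
+ ```python
+ import torch
+ import torch.distributed as dist
+
+ def diloco_outer_step(model, global_params, inner_opt, outer_opt, batches):
+     # Inner phase: ~100 local AdamW steps with no cross-node communication.
+     for inputs, labels in batches:
+         loss = torch.nn.functional.cross_entropy(model(inputs), labels)
+         inner_opt.zero_grad()
+         loss.backward()
+         inner_opt.step()
+
+     # Outer phase: all-reduce the pseudo-gradient (global minus local weights).
+     # The real system ships this with custom int8 kernels to shrink the payload.
+     for local_p, global_p in zip(model.parameters(), global_params):
+         pseudo_grad = global_p.data - local_p.data
+         dist.all_reduce(pseudo_grad, op=dist.ReduceOp.AVG)
+         global_p.grad = pseudo_grad
+
+     # Outer optimizer (e.g. torch.optim.SGD(..., momentum=0.9, nesterov=True))
+     # updates the global weights, which then seed the next inner phase.
+     outer_opt.step()
+     outer_opt.zero_grad()
+     for local_p, global_p in zip(model.parameters(), global_params):
+         local_p.data.copy_(global_p.data)
+ ```
+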
+ For more detailed technical insights, please refer to our technical paper.
+
+ Note: You must add a BOS token at the beginning of each sample. Performance may be impacted otherwise.
+
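+ A quick, illustrative way to confirm the BOS token is being added: with `add_special_tokens=True` (the tokenizer default, which the Usage example below relies on), the first token id should be the BOS id. If you assemble token ids yourself, prepend `tokenizer.bos_token_id` manually.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("PrimeIntellect/INTELLECT-1-Instruct")
+ ids = tokenizer("Hello, world!", add_special_tokens=True)["input_ids"]
+ print(ids[0] == tokenizer.bos_token_id)  # expected: True
+ ```
+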
+ Usage
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ torch.set_default_device("cuda")
+ model = AutoModelForCausalLM.from_pretrained("PrimeIntellect/INTELLECT-1-Instruct")
+ tokenizer = AutoTokenizer.from_pretrained("PrimeIntellect/INTELLECT-1-Instruct")
+
+ input_text = "What is the Metamorphosis of Prime Intellect about?"
+ input_ids = tokenizer.encode(input_text, return_tensors="pt")
+ output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
+ output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+ print(output_text)
+ ```
+
+ Example text generation pipeline
+
+ ```python
+ import torch
+ from transformers import pipeline
+ torch.set_default_device("cuda")
+
+ pipe = pipeline("text-generation", model="PrimeIntellect/INTELLECT-1")
+ print(pipe("What is prime intellect?"))
+ ```
+
+ Model Details
+
+ - Compute Contributors: Prime Intellect, Arcee AI, kotaro, skre_0, marlo, rodeo, Herb, Olas, superchillen, Hugging Face, mev_pete, 0xfr_, dj, primeprimeint1234, Marco Giglio, realtek, Hyperbolic, hecataeus, NWO, Virtual Machine, droll, SemiAnalysis, waiting_, toptickcrypto, sto, Johannes, washout_segment_0b, klee
+ - Release Date: 29 Nov 2024
+ - Model License: Apache 2.0
+
+ Technical Specifications
+
+ | Parameter                 | Value  |
+ |---------------------------|--------|
+ | Parameter Size            | 10B    |
+ | Number of Layers          | 42     |
+ | Number of Attention Heads | 32     |
+ | Hidden Size               | 4096   |
+ | Context Length            | 8192   |
+ | Vocabulary Size           | 128256 |
+
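+ Since this repository hosts the GGUF conversion, the specifications above map directly onto a llama-cpp-python setup. The sketch below is illustrative only: the repo id and quant filename are assumptions to replace with the actual GGUF file listed in this repo, and n_ctx simply follows the 8192 context length in the table.
+
+ ```python
+ from llama_cpp import Llama
+
+ # Assumed repo id and filename pattern; substitute the GGUF file shipped here.
+ llm = Llama.from_pretrained(
+     repo_id="Triangle104/INTELLECT-1-Instruct-GGUF",
+     filename="*q4_k_m.gguf",
+     n_ctx=8192,
+ )
+ out = llm("What is prime intellect?", max_tokens=64)
+ print(out["choices"][0]["text"])
+ ```
+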
+ Training Details:
+
+ - Dataset: 55% fineweb-edu, 10% fineweb, 20% Stack V1, 10% dclm-baseline, 5% open-web-math (a mixing sketch follows this list)
+ - Tokens: 1 Trillion
+ - Optimizer: DiLoCo/LocalSGD - Inner Optimizer: AdamW, Outer Optimizer: Nesterov SGD
+
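+ The proportions above describe the pretraining token mix. As a rough illustration (not the actual INTELLECT-1 data pipeline), a stream with those proportions could be assembled with the `datasets` library; the hub dataset ids below are assumptions mapping the names in the list to public datasets.
+
+ ```python
+ from datasets import load_dataset, interleave_datasets
+
+ # Assumed hub ids for the sources named above, with their mixing weights.
+ sources = {
+     "HuggingFaceFW/fineweb-edu": 0.55,
+     "HuggingFaceFW/fineweb": 0.10,
+     "bigcode/the-stack": 0.20,
+     "mlfoundations/dclm-baseline-1.0": 0.10,
+     "open-web-math/open-web-math": 0.05,
+ }
+
+ streams = [load_dataset(name, split="train", streaming=True) for name in sources]
+ mixed = interleave_datasets(streams, probabilities=list(sources.values()), seed=42)
+ print(next(iter(mixed)))  # one sample drawn from the weighted mixture
+ ```
+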
+ Post-training
+
+ The post-training has been handled by Arcee AI.
+
+ After completing the globally distributed pretraining phase, we applied several post-training techniques to enhance INTELLECT-1's capabilities and task-specific performance. Our post-training methodology consisted of three main phases.
+
+ First, we conducted an extensive series of 16 Supervised Fine-Tuning (SFT) trainings, with individual runs ranging from 1 to 3.3 billion tokens each. The most successful configuration used 2.4 billion training tokens over 3 epochs. We used MergeKit, EvolKit, and DistillKit from Arcee AI to combine the models, generate the datasets, and distill the logits, respectively (a sketch of the logit-distillation idea appears after the dataset lists below). For training data, we used a diverse set of high-quality datasets:
+
+ New Datasets (released with INTELLECT-1):
+ - arcee-ai/EvolKit-75k (generated via EvolKit)
+ - arcee-ai/Llama-405B-Logits
+ - arcee-ai/The-Tomb
+
+ Instruction Following:
+ - mlabonne/open-perfectblend-fixed (generalist capabilities)
+ - microsoft/orca-agentinstruct-1M-v1-cleaned (Chain-of-Thought)
+ - Post-training-Data-Flywheel/AutoIF-instruct-61k-with-funcs
+
+ Domain-Specific:
+ - Team-ACE/ToolACE (function calling)
+ - Synthia coder (programming)
+ - ServiceNow-AI/M2Lingual (multilingual)
+ - AI-MO/NuminaMath-TIR (mathematics)
+
+ Tulu-3 Persona Datasets:
+ - allenai/tulu-3-sft-personas-code
+ - allenai/tulu-3-sft-personas-math
+ - allenai/tulu-3-sft-personas-math-grade
+ - allenai/tulu-3-sft-personas-algebra
+
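+ As referenced above, the SFT runs also distilled logits from a larger teacher. Below is a minimal, illustrative sketch of that idea in plain PyTorch; the actual runs used DistillKit, and the temperature and tensor shapes here are arbitrary assumptions.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def distillation_loss(student_logits, teacher_logits, temperature=2.0):
+     # Soften both distributions, then penalize divergence from the teacher.
+     s = F.log_softmax(student_logits / temperature, dim=-1)
+     t = F.softmax(teacher_logits / temperature, dim=-1)
+     return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
+
+ # Dummy check: a batch of 2 positions over the 128256-token vocabulary.
+ student = torch.randn(2, 128256)
+ teacher = torch.randn(2, 128256)
+ print(distillation_loss(student, teacher))
+ ```
+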
+ Second, we executed 8 distinct Direct Preference Optimization (DPO) runs with various combinations of datasets to enhance specific performance metrics and align the model with human preferences. A key advantage in our post-training process was INTELLECT-1's use of the Llama-3 tokenizer, which allowed us to utilize logits from Llama-3.1-405B to heal and maintain precision during the post-training process via DistillKit.
+
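+ For reference, the objective behind those DPO runs can be written compactly. The function below is an illustrative sketch of the standard DPO loss, not the tooling actually used; the beta value and dummy numbers are assumptions.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def dpo_loss(policy_chosen_logps, policy_rejected_logps,
+              ref_chosen_logps, ref_rejected_logps, beta=0.1):
+     # Implicit rewards are log-prob ratios of the policy against a frozen reference.
+     chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
+     rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
+     # Maximize the margin between chosen and rejected responses.
+     return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
+
+ # Dummy summed log-probabilities for a batch of two preference pairs.
+ loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
+                 torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -10.5]))
+ print(loss)
+ ```
+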
+ Finally, we performed 16 strategic merges between candidate models using MergeKit to create superior combined models that leverage the strengths of different training runs. During the post-training phase, we observed that when using a ChatML template without an explicit BOS (begin-of-sequence) token, the initial loss was approximately 15. However, when switching to the Llama 3.1 chat template, the loss for these trainings started much lower at approximately 1.1, indicating better alignment with the underlying Llama 3 tokenizer.
+
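+ In practice this means prompts for the instruct model should be built with the tokenizer's own chat template, which already prepends the BOS token. A small, illustrative check (the expected output is an assumption based on the Llama 3.1 template described above):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("PrimeIntellect/INTELLECT-1-Instruct")
+ messages = [{"role": "user", "content": "What is prime intellect?"}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ print(prompt.startswith(tokenizer.bos_token))  # expected: True
+ print(prompt[:80])                             # inspect the rendered template
+ ```
+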
+ The combination of these post-training techniques resulted in significant improvements in various benchmarks, particularly in knowledge retrieval, grade school math, instruction following and reasoning.
+
+ Performance on benchmarks
+
+ | Model              | Size | Tokens | MMLU  | GPQA  | GSM8K | ARC-C | Hellaswag |
+ |--------------------|------|--------|-------|-------|-------|-------|-----------|
+ | INTELLECT-Instruct | 10B  | 1T     | 49.89 | 28.32 | 38.58 | 54.52 | 71.42     |
+ | MPT-7B-Chat        | 7B   | 1T     | 36.29 | 26.79 | 8.26  | 51.02 | 75.88     |
+ | Falcon-7B-Instruct | 7B   | 1.5T   | 25.21 | 26.34 | 4.93  | 45.82 | 70.61     |
+ | LLM360-AmberChat   | 7B   | 1.4T   | 36.02 | 27.23 | 6.14  | 43.94 | 73.94     |
+ | LLaMA2-7B-Chat     | 7B   | 2T     | 47.20 | 28.57 | 23.96 | 53.33 | 78.69     |
+ | LLaMA2-13B-Chat    | 13B  | 2T     | 53.51 | 28.35 | 37.15 | 59.73 | 82.47     |
+
+ Citations
+
+ If you use this model in your research, please cite it as follows:
+
+ ```bibtex
+ @article{jaghouar2024intellect,
+   title={INTELLECT-1 Technical Report},
+   author={Jaghouar, Sami and Ong, Jack Min and Basra, Manveer and Obeid, Fares and Straube, Jannik and Keiblinger, Michael and Bakouch, Elie and Atkins, Lucas and Panahi, Maziyar and Goddard, Charles and Ryabinin, Max and Hagemann, Johannes},
+   journal={arXiv preprint},
+   year={2024}
+ }
+ ```
+
+ ---
 ## Use with llama.cpp
 Install llama.cpp through brew (works on Mac and Linux)