Triangle104 committed on
Commit ca080dc (parent 46c3971)

Update README.md

Files changed (1): README.md (+12, -165)
README.md CHANGED
@@ -40,13 +40,8 @@ INTELLECT-1 is the first collaboratively trained 10
 billion parameter language model trained from scratch on 1 trillion
 tokens of English text and code.
 
-
-
-
-
 This is an instruct model. The base model associated with it is INTELLECT-1.
 
-
 INTELLECT-1 was trained on up to 14 concurrent nodes
 distributed across 3 continents, with contributions from 30 independent
 community contributors providing compute.
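
The next hunk names the DiLoCo training scheme, and the Training Details further down in this diff specify DiLoCo/LocalSGD with an inner AdamW and an outer Nesterov SGD optimizer. As a reading aid only, here is a minimal, hedged sketch of that inner/outer structure in PyTorch; the function name, hyperparameters, and single-process setup are illustrative assumptions, not the project's actual training code.

```python
import torch


def diloco_round(model, data_iter, loss_fn, outer_opt, inner_steps=500, inner_lr=1e-4):
    """One DiLoCo/LocalSGD round: many local AdamW steps, then one outer step (sketch)."""
    # Snapshot the synchronized parameters at the start of the round.
    start_params = [p.detach().clone() for p in model.parameters()]

    # Inner phase: ordinary AdamW training on local data.
    inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

    # Outer phase: the "pseudo-gradient" is this round's parameter delta.
    # In the real multi-node run this delta is what gets all-reduced across nodes
    # (the card's custom int8 kernels compress exactly this payload); omitted here.
    for p, p0 in zip(model.parameters(), start_params):
        p.grad = p0 - p.detach()

    # Rewind to the synchronized starting point and apply one outer (Nesterov SGD) step.
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), start_params):
            p.copy_(p0)
    outer_opt.step()
    outer_opt.zero_grad()
```

Here `outer_opt` would be something like `torch.optim.SGD(model.parameters(), momentum=0.9, nesterov=True)` (learning rate not taken from the card). Because only the round-level delta crosses the network, and it can be quantized to int8, the communication savings the card describes become plausible.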
@@ -63,18 +58,10 @@ The model was trained using the DiLoCo
 custom int8 all-reduce kernels to reduce the communication payload
 required, reducing the communication overhead by a factor of 400x.
 
-
 For more detailed technical insights, please refer to our technical paper.
 
-
 Note: You must add a BOS token at the beginning of each sample. Performance may be impacted otherwise.
 
-
-
-
-
-
-
 Usage
 
 
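This diff only shows fragments of the Usage code, so here is a hedged, self-contained version of the generate-and-decode path it appears to describe, with the BOS requirement from the note above handled explicitly. The model ID, dtype, and generation settings are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model_id = "PrimeIntellect/INTELLECT-1-Instruct"  # assumed repo ID for the instruct model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "What is prime intellect?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Per the note above, make sure the sequence starts with the BOS token.
if tokenizer.bos_token_id is not None and input_ids[0, 0] != tokenizer.bos_token_id:
    bos = torch.tensor([[tokenizer.bos_token_id]], dtype=input_ids.dtype)
    input_ids = torch.cat([bos, input_ids], dim=1)

output_ids = model.generate(input_ids, max_new_tokens=128)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
```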
@@ -94,13 +81,6 @@ output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
 
 print(output_text)
 
-
-
-
-
-
-
-
 Example text generation pipeline
 
 
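The pipeline example in the next hunk is shown without its imports in this view; a hedged, self-contained version, assuming only the standard transformers pipeline API, would be:

```python
import torch
from transformers import pipeline

torch.set_default_device("cuda")
pipe = pipeline("text-generation", model="PrimeIntellect/INTELLECT-1")
print(pipe("What is prime intellect ?"))
```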
@@ -113,13 +93,6 @@ torch.set_default_device("cuda")
 pipe = pipeline("text-generation", model="PrimeIntellect/INTELLECT-1")
 print(pipe("What is prime intellect ?"))
 
-
-
-
-
-
-
-
 Model Details
 
 
@@ -132,12 +105,6 @@ Hyperbolic, hecataeus, NWO, Virtual Machine, droll, SemiAnalysis, waiting_, topt
 Release Date: 29 Nov 2024
 Model License: Apache 2.0
 
-
-
-
-
-
-
 Technical Specifications
 
 
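The next hunk lists the model's architectural parameters. Purely to illustrate how those numbers map onto a config object, here is a hedged sketch using a Llama-style transformers config; the architecture family is an assumption (the card only mentions the Llama-3 tokenizer elsewhere), and any field not set below falls back to library defaults rather than values from this card.

```python
from transformers import LlamaConfig

# Values quoted from the specification table; everything else is a library default,
# not taken from the model card.
config = LlamaConfig(
    vocab_size=128256,             # Vocabulary Size
    hidden_size=4096,              # Hidden Size
    num_hidden_layers=42,          # Number of Layers
    num_attention_heads=32,        # Number of Attention Heads
    max_position_embeddings=8192,  # Context Length
)
print(config)
```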
@@ -146,52 +113,33 @@ Model License: Apache 2.0
 
 
 
- Parameter
+ Parameter:
 Value
-
-
 
- Parameter Size
+ Parameter Size:
 10B
 
-
- Number of Layers
+ Number of Layers:
 42
 
-
- Number of Attention Heads
+ Number of Attention Heads:
 32
 
-
- Hidden Size
+ Hidden Size:
 4096
 
-
- Context Length
+ Context Length:
 8192
 
-
- Vocabulary Size
+ Vocabulary Size:
 128256
 
-
-
-
-
-
 Training Details:
-
-
+ -
 Dataset: 55% fineweb-edu, 10% fineweb, 20% Stack V1, 10% dclm-baseline, 5% open-web-math
 Tokens: 1 Trillion
 Optimizer: DiLoCo/LocalSGD - Inner Optimizer: AdamW, Outer Optimizer: Nesterov SGD
 
-
-
-
-
-
-
 Post-training
 
 
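The next hunk enumerates the post-training data sources. As a hedged convenience sketch only, this is how a few of them could be pulled with the Hugging Face datasets library; the repo IDs are quoted from the list below, while the split name (and the assumption that each entry is a public Hugging Face dataset) may not hold for every source.

```python
from datasets import load_dataset

# Repo IDs quoted from the data list in this card; the "train" split is an assumption.
sources = [
    "arcee-ai/EvolKit-75k",
    "mlabonne/open-perfectblend-fixed",
    "allenai/tulu-3-sft-personas-code",
]

for name in sources:
    ds = load_dataset(name, split="train")
    print(name, len(ds), ds.column_names)
```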
@@ -214,43 +162,32 @@ Arcee AI to combine the models, generate the data sets, and distill the
 logits, respectively. For training data, we used a diverse set of
 high-quality datasets:
 
-
 New Datasets (released with INTELLECT-1):
-
-
+ -
 arcee-ai/EvolKit-75k (generated via EvolKit)
 arcee-ai/Llama-405B-Logits
 arcee-ai/The-Tomb
 
-
 Instruction Following:
-
-
+ -
 mlabonne/open-perfectblend-fixed (generalist capabilities)
 microsoft/orca-agentinstruct-1M-v1-cleaned (Chain-of-Thought)
 Post-training-Data-Flywheel/AutoIF-instruct-61k-with-funcs
 
-
 Domain-Specific:
-
-
+ -
 Team-ACE/ToolACE (function calling)
 Synthia coder (programming)
 ServiceNow-AI/M2Lingual (multilingual)
 AI-MO/NuminaMath-TIR (mathematics)
 
-
 Tulu-3 Persona Datasets:
-
-
+ -
 allenai/tulu-3-sft-personas-code
 allenai/tulu-3-sft-personas-math
 allenai/tulu-3-sft-personas-math-grade
 allenai/tulu-3-sft-personas-algebra
 
-
-
-
 Second, we executed 8 distinct Direct Preference Optimization (DPO)
 runs with various combinations of data sets to enhance specific
 performance metrics and align the model with human preferences. A key
@@ -259,7 +196,6 @@ Llama-3 tokenizer, which allowed us to utilize logits from
 Llama-3.1-405B to heal and maintain precision during the post-training
 process via DistillKit.
 
-
 Finally, we performed 16 strategic merges between candidate models
 using MergeKit to create superior combined models that leverage the
 strengths of different training runs. During the post-training phase, we
@@ -269,99 +205,11 @@ However, when switching to the Llama 3.1 chat template, the loss for
 these trainings started much lower at approximately 1.1, indicating
 better alignment with the underlying Llama 3 tokenizer.
 
-
 The combination of these post-training techniques resulted in
 significant improvements in various benchmarks, particularly in
 knowledge retrieval, grade school math, instruction following and
 reasoning.
 
-
- Performance on benchmarks
-
- | Model              | Size | Tokens | MMLU  | GPQA  | GSM8K | ARC-C | Hellaswag |
- |--------------------|------|--------|-------|-------|-------|-------|-----------|
- | INTELLECT-Instruct | 10B  | 1T     | 49.89 | 28.32 | 38.58 | 54.52 | 71.42     |
- | MPT-7B-Chat        | 7B   | 1T     | 36.29 | 26.79 | 8.26  | 51.02 | 75.88     |
- | Falcon-7B-Instruct | 7B   | 1.5T   | 25.21 | 26.34 | 4.93  | 45.82 | 70.61     |
- | LLM360-AmberChat   | 7B   | 1.4T   | 36.02 | 27.23 | 6.14  | 43.94 | 73.94     |
- | LLaMA2-7B-Chat     | 7B   | 2T     | 47.20 | 28.57 | 23.96 | 53.33 | 78.69     |
- | LLaMA2-13B-Chat    | 13B  | 2T     | 53.51 | 28.35 | 37.15 | 59.73 | 82.47     |
 Citations
 
 
@@ -369,7 +217,6 @@ LLaMA2-13B-Chat
 
 If you use this model in your research, please cite it as follows:
 
-
 @article{jaghouar2024intellect,
   title={INTELLECT-1 Technical Report.},
   author={Jaghouar, Sami and Ong, Jack Min and Basra, Manveer and Obeid, Fares and Straube, Jannik and Keiblinger, Michael and Bakouch, Elie and Atkins, Lucas and Panahi, Maziyar and Goddard, Charles and Ryabinin, Max and Hagemann, Johannes},
 