Commit ca080dc by Triangle104 (parent: 46c3971): Update README.md
INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.
This is an instruct model. The base model associated with it is INTELLECT-1.
INTELLECT-1 was trained on up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent community contributors providing compute.
The model was trained using the DiLoCo [...] custom int8 all-reduce kernels to reduce the communication payload required, greatly reducing the communication overhead by a factor of 400x.
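As a rough sketch of where the 400x figure comes from: int8 halves the payload 4x versus float32 per element, and DiLoCo-style infrequent synchronization multiplies that further. The symmetric quantization scheme and all numbers below are illustrative assumptions, not the project's actual kernels.

```python
# Illustrative only: symmetric int8 quantization of a pseudo-gradient
# and the resulting payload reduction. The real kernels differ.

def quantize_int8(values):
    """Map floats to int8 with a single shared scale (symmetric quantization)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [x * scale for x in q]

pseudo_gradient = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(pseudo_gradient)
restored = dequantize(q, scale)

# int8 sends 1 byte per element instead of 4 (float32): a 4x payload cut.
# Synchronizing, say, every 100 inner steps instead of every step would
# multiply that to roughly 400x less communication overall.
float32_bytes = 4 * len(pseudo_gradient)
int8_bytes = 1 * len(pseudo_gradient)
print(float32_bytes // int8_bytes)  # 4
```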
For more detailed technical insights, please refer to our technical paper.
Note: You must add a BOS token at the beginning of each sample. Performance may be impacted otherwise.
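At the string level, the BOS requirement just means every sample must begin with the BOS token. A minimal sketch; the `<|begin_of_text|>` literal is an assumption (Llama-3 style), and the real value should be read from `tokenizer.bos_token`:

```python
# Sketch: ensure each sample starts with BOS before tokenization.
# "<|begin_of_text|>" is an assumed Llama-3-style literal; in practice,
# read the actual value from tokenizer.bos_token instead of hard-coding it.
BOS = "<|begin_of_text|>"

def with_bos(sample: str) -> str:
    """Prepend the BOS token unless the sample already starts with it."""
    return sample if sample.startswith(BOS) else BOS + sample

print(with_bos("What is prime intellect ?"))
```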
Usage
```python
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
```
Example text generation pipeline
```python
import torch
from transformers import pipeline

# Place model and tensors on the GPU by default.
torch.set_default_device("cuda")

pipe = pipeline("text-generation", model="PrimeIntellect/INTELLECT-1")
print(pipe("What is prime intellect ?"))
```
Model Details
Hyperbolic, hecataeus, NWO, Virtual Machine, droll, SemiAnalysis, waiting_, topt [...]
Release Date: 29 Nov 2024
Model License: Apache 2.0
Technical Specifications
| Parameter | Value |
|---|---|
| Parameter Size | 10B |
| Number of Layers | 42 |
| Number of Attention Heads | 32 |
| Hidden Size | 4096 |
| Context Length | 8192 |
| Vocabulary Size | 128256 |
Training Details:
- Dataset: 55% fineweb-edu, 10% fineweb, 20% Stack V1, 10% dclm-baseline, 5% open-web-math
- Tokens: 1 Trillion
- Optimizer: DiLoCo/LocalSGD (inner optimizer: AdamW, outer optimizer: Nesterov SGD)
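The inner/outer optimizer split can be sketched as follows. This is a simplified scalar toy (plain SGD stands in for AdamW, and the learning rates, momentum, and gradients are made-up values), not the actual training loop:

```python
# Simplified DiLoCo/LocalSGD round: each worker runs many local inner
# steps, then the outer optimizer applies Nesterov-momentum SGD to the
# averaged "pseudo-gradient" (shared weights minus new local weights).

def inner_steps(w, grads, lr=0.1):
    """Stand-in for AdamW: a few plain SGD steps on local data."""
    for g in grads:
        w -= lr * g
    return w

def diloco_round(w_global, worker_grads, momentum, outer_lr=0.7, beta=0.9):
    # Each worker starts from the shared weights and trains locally.
    local_ws = [inner_steps(w_global, grads) for grads in worker_grads]
    # Pseudo-gradient: average displacement away from the shared weights.
    pseudo_grad = sum(w_global - w for w in local_ws) / len(local_ws)
    # Nesterov SGD outer update on the pseudo-gradient.
    momentum = beta * momentum + pseudo_grad
    w_global -= outer_lr * (pseudo_grad + beta * momentum)
    return w_global, momentum

w, m = 1.0, 0.0
w, m = diloco_round(w, worker_grads=[[0.2, 0.1], [0.3, 0.2]], momentum=m)
print(w)
```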
Post-training
[...] Arcee AI to combine the models, generate the data sets, and distill the logits, respectively. For training data, we used a diverse set of high-quality datasets:
New Datasets (released with INTELLECT-1):
- arcee-ai/EvolKit-75k (generated via EvolKit)
- arcee-ai/Llama-405B-Logits
- arcee-ai/The-Tomb

Instruction Following:

- mlabonne/open-perfectblend-fixed (generalist capabilities)
- microsoft/orca-agentinstruct-1M-v1-cleaned (Chain-of-Thought)
- Post-training-Data-Flywheel/AutoIF-instruct-61k-with-funcs

Domain-Specific:

- Team-ACE/ToolACE (function calling)
- Synthia coder (programming)
- ServiceNow-AI/M2Lingual (multilingual)
- AI-MO/NuminaMath-TIR (mathematics)

Tulu-3 Persona Datasets:

- allenai/tulu-3-sft-personas-code
- allenai/tulu-3-sft-personas-math
- allenai/tulu-3-sft-personas-math-grade
- allenai/tulu-3-sft-personas-algebra
Second, we execute 8 distinct Direct Preference Optimization (DPO) runs with various combinations of data sets to enhance specific performance metrics and align the model with human preferences. A key [...] Llama-3 tokenizer, which allowed us to utilize logits from Llama-3.1-405B to heal and maintain precision during the post-training process via DistillKit.
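For context, DPO minimizes a logistic loss on the margin between the policy's and a reference model's preference for the chosen answer over the rejected one. A scalar sketch with made-up log-probabilities and beta, not the project's training code:

```python
import math

# DPO loss for one preference pair: push the policy's margin on the
# chosen vs. rejected answer above the reference model's margin.
def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(beta * margin)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does:
low = dpo_loss(-5.0, -9.0, -6.0, -8.0)
# Policy prefers the rejected answer: the loss is higher.
high = dpo_loss(-9.0, -5.0, -6.0, -8.0)
print(low < high)  # True
```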
Finally, we performed 16 strategic merges between candidate models using MergeKit to create superior combined models that leverage the strengths of different training runs. During the post-training phase, we [...] However, when switching to the Llama 3.1 chat template, the loss for these trainings started much lower at approximately 1.1, indicating better alignment with the underlying Llama 3 tokenizer.
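At its simplest, a merge is a weighted average of aligned parameters. A toy sketch with dicts of floats standing in for tensors and uniform weights; MergeKit's actual merge methods (e.g. TIES, SLERP) are more sophisticated:

```python
# Toy model merge: average the parameters of candidate models that share
# an architecture. Real merges use richer schemes; this is the simplest
# linear case only.

def merge(models, weights=None):
    """Weighted average of parameter dicts; defaults to uniform weights."""
    n = len(models)
    weights = weights or [1.0 / n] * n
    return {k: sum(w * m[k] for w, m in zip(weights, models))
            for k in models[0].keys()}

dpo_run = {"layer.0.weight": 0.25, "layer.1.weight": -1.0}
sft_run = {"layer.0.weight": 0.75, "layer.1.weight": -0.5}
merged = merge([dpo_run, sft_run])
print(merged)  # {'layer.0.weight': 0.5, 'layer.1.weight': -0.75}
```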
The combination of these post-training techniques resulted in significant improvements in various benchmarks, particularly in knowledge retrieval, grade school math, instruction following and reasoning.
Performance on benchmarks

| Model | Size | Tokens | MMLU | GPQA | GSM8K | ARC-C | Hellaswag |
|---|---|---|---|---|---|---|---|
| INTELLECT-Instruct | 10B | 1T | 49.89 | 28.32 | 38.58 | 54.52 | 71.42 |
| MPT-7B-Chat | 7B | 1T | 36.29 | 26.79 | 8.26 | 51.02 | 75.88 |
| Falcon-7B-Instruct | 7B | 1.5T | 25.21 | 26.34 | 4.93 | 45.82 | 70.61 |
| LLM360-AmberChat | 7B | 1.4T | 36.02 | 27.23 | 6.14 | 43.94 | 73.94 |
| LLaMA2-7B-Chat | 7B | 2T | 47.20 | 28.57 | 23.96 | 53.33 | 78.69 |
| LLaMA2-13B-Chat | 13B | 2T | 53.51 | 28.35 | 37.15 | 59.73 | 82.47 |
Citations
If you use this model in your research, please cite it as follows:

```
@article{jaghouar2024intellect,
  title={INTELLECT-1 Technical Report},
  author={Jaghouar, Sami and Ong, Jack Min and Basra, Manveer and Obeid, Fares and Straube, Jannik and Keiblinger, Michael and Bakouch, Elie and Atkins, Lucas and Panahi, Maziyar and Goddard, Charles and Ryabinin, Max and Hagemann, Johannes},
```