leo-pekelis-gradient committed
Commit: 5861db3
Parent(s): 698e052
Update README.md

README.md CHANGED
@@ -9,7 +9,7 @@ license: llama3
 ---
 <img src="https://cdn-uploads.huggingface.co/production/uploads/655bb613e8a8971e89944f3e/TSa3V8YpoVagnTYgxiLaO.png" width="200"/>
 
-# Llama-3 8B Instruct 1048k
+# Llama-3 8B Gradient Instruct 1048k
 Gradient incorporates your data to deploy autonomous assistants that power critical operations across your business. To learn more or collaborate on a custom model, drop us a message at contact@gradient.ai.
 
 This model extends Llama-3 8B's context length from 8k to > 1040K, developed by Gradient, sponsored by compute from [Crusoe Energy](https://huggingface.co/crusoeai). It demonstrates that SOTA LLMs can learn to operate on long context with minimal training by appropriately adjusting RoPE theta. We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.
@@ -39,13 +39,13 @@ For training data, we generate long contexts by augmenting [SlimPajama](https://
 | Initialize From | LLaMA-3 8B | 65K | 262K | 524K |
 | Sequence Length 2^N | 16 | 18 | 19 | 20 |
 | RoPE theta | 15.3M | 207.1M | 1.06B | 2.80B |
-
-
+| Batch Size | 1 | 1 | 2 | 2 |
+| Gradient Accumulation Steps | 32 | 16 | 1 | 1 |
 | Steps | 30 | 24 | 50 | 50 |
 | Total Tokens | 62914560 | 100663296 | 419430400 | 838860800 |
-
+| Learning Rate | 2.00E-05 | 2.00E-05 | 2.00E-05 | 2.00E-05 |
 | # GPUs | 8 | 32 | 512 | 512 |
-| Ring
+| Ring parallelism | 1 | 1 | 8 | 8 |
 | GPU Type | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S |
 | Minutes to Train (Wall) | 202 | 555 | 61 | 87 |
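The context extension described in the updated card is carried entirely by the enlarged RoPE theta stored in the model config, so no custom inference code should be needed. Below is a minimal sketch of loading the checkpoint and inspecting that value with the standard `transformers` API; the repo id `gradientai/Llama-3-8B-Instruct-1048k` is an assumption on our part and does not appear in this diff.

```python
# Minimal sketch (assumptions noted above): load the config, check the RoPE base
# frequency and context window, then load the model for generation.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "gradientai/Llama-3-8B-Instruct-1048k"  # assumed repo id, not stated in the diff

config = AutoConfig.from_pretrained(model_id)
print(config.rope_theta)               # enlarged RoPE base, vs. 500,000 in base Llama-3 8B
print(config.max_position_embeddings)  # extended context window

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package; drop it for a plain CPU load.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```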
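As a quick consistency check on the training table added above, the Total Tokens row matches sequence length × batch size × gradient accumulation steps × steps for the first two stages; the last two stages additionally match a factor equal to the ring-parallelism degree, which is our reading of the table rather than anything the card states. A small sketch of that arithmetic:

```python
# Sanity-check the "Total Tokens" row against the other hyperparameters.
# The extra multiplication by the ring-parallelism degree for the last two
# stages is an interpretation on our part, not stated in the card.
stages = [
    # (seq_len exponent N, batch, grad accum, steps, ring degree, reported total tokens)
    (16, 1, 32, 30, 1,  62_914_560),
    (18, 1, 16, 24, 1, 100_663_296),
    (19, 2,  1, 50, 8, 419_430_400),
    (20, 2,  1, 50, 8, 838_860_800),
]

for n, batch, accum, steps, ring, reported in stages:
    computed = (2 ** n) * batch * accum * steps * ring
    print(f"2^{n}: {computed:,} computed vs {reported:,} reported -> match={computed == reported}")
```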