gradientai/Llama-3-8B-Instruct-262k

leo-pekelis-gradient committed on Apr 25

Commit 9411de7 • 1 Parent(s): ab6d0a2

Update README.md

Files changed (1)

README.md +8 -6

@@ -7,9 +7,10 @@ tags:
 - llama-3
 ---
-**[NIAH eval figure here]**
-This model extends LLama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/F2WLF8_jOx_gttxbPtLK1.png)
+This model extends LLama-3 8B's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 **Approach:**
@@ -25,17 +26,18 @@ We build on top of the EasyContext Blockwise RingAttention library [3] to scalab
 For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
+**Progressive Training Details:**
 | Parameter                   | 65K        | 262K       |
 |-----------------------------|------------|------------|
-| Initialize From             | LLaMA-3 7B | 65K        |
+| Initialize From             | LLaMA-3 8B | 65K        |
 | Sequence Length             | 2^16       | 2^18       |
 | RoPE theta                  | 15.3 M     | 207.1 M    |
-| batch_size                  | 1          | 1          |
-| gradient_accumulation_steps | 32         | 16         |
+| Batch Size                  | 1          | 1          |
+| Gradient Accumulation Steps | 32         | 16         |
 | Steps                       | 30         | 24         |
 | Total Tokens                | 63 M       | 101 M      |
-| learning_rate               | 2.00E-05   | 2.00E-05   |
+| Learning Rate               | 2.00E-05   | 2.00E-05   |
 | # GPUs                      | 8          | 8          |
 | GPU Type                    | NVIDIA L40S| NVIDIA L40S|
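
The "Total Tokens" row in the training table can be cross-checked against the other rows. The sketch below is not from the README; it assumes that total tokens equal sequence length × batch size × gradient accumulation steps × optimizer steps, and that the batch size is global rather than per GPU (plausible if each sequence is sharded across the 8 GPUs, but the README does not state this). The helper `total_tokens` and the `stages` dictionary are hypothetical names used only for illustration.

```python
def total_tokens(seq_len: int, batch_size: int, grad_accum_steps: int, steps: int) -> int:
    """Tokens consumed by a training stage.

    Assumption (not stated in the README): batch_size is the global batch,
    so the GPU count does not multiply the token count.
    """
    return seq_len * batch_size * grad_accum_steps * steps


# Values copied from the training table; the keys are the table's column names.
stages = {
    "65K": dict(seq_len=2**16, batch_size=1, grad_accum_steps=32, steps=30),
    "262K": dict(seq_len=2**18, batch_size=1, grad_accum_steps=16, steps=24),
}

totals = {name: total_tokens(**cfg) for name, cfg in stages.items()}

for name, n in totals.items():
    print(f"{name}: {n:,} tokens (~{n / 1e6:.0f} M)")

print(f"combined: {sum(totals.values()):,} tokens")
```

Under these assumptions the two stages come to roughly 63 M and 101 M tokens, matching the table, and their sum stays below the "< 200M tokens" figure quoted in the model description.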