leo-pekelis-gradient committed
Commit 7976e88 • Parent(s): 1ab8322
Update README.md

README.md CHANGED
@@ -7,18 +7,23 @@ tags:
 - llama-3
 ---
 
-**
-
-[NIAH eval figure here]
+**[NIAH eval figure here]**
 
 This model extends Llama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 
-
+Approach:
+
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
 - progressive training on increasing context lengths, similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]
 
-
+Infra:
+
+We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high-performance L40S cluster.
+
+Data:
+
+For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
 | Parameter | 65K | 262K |
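
The approach bullets above initialize RoPE theta with NTK-aware interpolation [1] before a data-driven optimization pass. A minimal sketch, assuming the standard NTK-aware rule theta' = theta * s^(d/(d-2)) with Llama-3's published base theta (500,000) and head dimension (128); the data-driven schedule that follows it is not reproduced here:

```python
# NTK-aware RoPE theta initialization [1]: scale the rotary base rather than
# interpolating position ids. Sketch of the standard rule only; the card's
# data-driven theta optimization is a separate step not shown here.

def ntk_rope_theta(base_theta: float, head_dim: int, scale: float) -> float:
    # s^(d / (d - 2)) stretches the lowest rotary frequency by roughly `scale`,
    # keeping positions up to scale * native_context in-distribution.
    return base_theta * scale ** (head_dim / (head_dim - 2))

# Llama-3-8B: base theta 500,000, head_dim 128, native context 8,192 tokens.
for target_len in (65_536, 262_144):
    theta = ntk_rope_theta(500_000.0, 128, target_len / 8_192)
    print(f"{target_len:>7} tokens -> initial rope_theta ~ {theta:,.0f}")
```

The two target lengths mirror the 65K and 262K columns of the parameter table above.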
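On the infra side, Blockwise RingAttention [3] shards the sequence across devices and passes key/value blocks around a ring, combining partial results with an online-softmax update. A toy single-query NumPy sketch of that update (this illustrates the underlying numerics, not EasyContext's actual API):

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=4):
    """Online-softmax attention for one query over streamed k/v blocks.

    In RingAttention each block would live on a different device and rotate
    around the ring; here the blocks are just slices of local arrays.
    """
    d = q.shape[-1]
    m = -np.inf                      # running max of logits
    l = 0.0                          # running softmax denominator
    acc = np.zeros(v.shape[-1])      # running weighted-value sum
    for i in range(0, len(k), block_size):
        s = (k[i:i + block_size] @ q) / np.sqrt(d)  # this block's logits
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)  # re-normalize previous partial results
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ v[i:i + block_size]
        m = m_new
    return acc / l

# Matches full softmax attention regardless of block size:
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(16,)), rng.normal(size=(64, 16)), rng.normal(size=(64, 8))
w = np.exp(k @ q / 4 - (k @ q / 4).max()); w /= w.sum()
assert np.allclose(blockwise_attention(q, k, v), w @ v)
```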
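The data section says long contexts come from augmenting SlimPajama, but the recipe is not part of this diff. One common approach, shown only as an illustration (the `pack_long_context` helper is hypothetical, not from the card), is packing tokenized documents into fixed-length long sequences:

```python
# Illustrative long-context packing over SlimPajama; the exact augmentation
# used for this model is not specified in the card, so treat as an assumption.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
slim = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_long_context(dataset, target_len=65_536):
    """Concatenate tokenized documents until each sample reaches target_len."""
    buffer = []
    for example in dataset:
        buffer.extend(tokenizer(example["text"])["input_ids"])
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

# First 65k-token training sample:
sample = next(pack_long_context(slim))
print(len(sample))  # 65536
```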