michaelfeil committed
Commit 5e409cf
Parent(s): 42919cd

Update Readme.md (#4)
README.md

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/hiHWva3CbsrnPvZTp5-lu.png)

This model, developed by Gradient with compute sponsored by [Crusoe Energy](https://huggingface.co/crusoeai), extends Llama-3 8B's context length from 8k to over 160k tokens. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
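
As background (an illustration, not part of the original card): in the Hugging Face `transformers` Llama implementation, RoPE theta is the `rope_theta` field of the model config, so "adjusting RoPE theta" amounts to training with a larger value than the base model ships with. A rough sketch; the theta below is a made-up placeholder, and the real values are in this checkpoint's `config.json`:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(config.rope_theta)  # 500000.0 for the 8k-context base model

# A long-context variant raises rope_theta so the rotary frequencies
# cover the longer range, then trains the weights to match.
config.rope_theta = 8_000_000.0          # hypothetical value, for illustration
config.max_position_embeddings = 262144
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", config=config
)
```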
**Approach:**

- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique (a sketch of the interpolation step follows this list)
- Progressive training on increasing context lengths, similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2] (see details below)
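
For readers new to NTK-aware interpolation, a minimal sketch of the initialization idea as commonly described in [1]: scale the RoPE base so that the lowest rotary frequency is stretched by the context-extension factor. This shows only the standard formula, not necessarily the exact schedule used for this model:

```python
def ntk_scaled_rope_theta(base: float, head_dim: int, scale: float) -> float:
    """RoPE base (theta) adjusted for a `scale`-times longer context [1]."""
    return base * scale ** (head_dim / (head_dim - 2))

# Llama-3 8B: head_dim = 128, base theta = 500000; extending 8k -> 256k
print(ntk_scaled_rope_theta(500_000.0, 128, 256 / 8))  # one possible initialization
```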
**Infra:**

We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts of up to 262144 tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster.
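
As a conceptual aid (this is not EasyContext's actual API; the real library shards the sequence across devices and overlaps ring communication with compute), here is a single-process sketch of the blockwise ring-attention idea: each participant owns one query block, key/value blocks rotate around the ring, and partial results are folded in with a running log-sum-exp so the full attention matrix is never materialized. Causal masking is omitted for brevity.

```python
import torch

def ring_attention_sketch(q_blocks, k_blocks, v_blocks):
    # q_blocks[i], k_blocks[i], v_blocks[i]: (block_len, head_dim) tensors
    # owned by ring participant i; the ring is simulated with a loop here.
    n = len(q_blocks)
    outs = []
    for i in range(n):                     # participant i keeps its q block
        q = q_blocks[i]
        acc = torch.zeros_like(q)          # running, renormalized output
        lse = torch.full(q.shape[:-1], float("-inf"))  # running log-sum-exp
        for hop in range(n):               # kv blocks arrive hop by hop
            k = k_blocks[(i + hop) % n]
            v = v_blocks[(i + hop) % n]
            scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
            blk_lse = torch.logsumexp(scores, dim=-1)
            blk_out = torch.softmax(scores, dim=-1) @ v
            # fold this block into the running softmax exactly
            new_lse = torch.logaddexp(lse, blk_lse)
            acc = (acc * (lse - new_lse).exp().unsqueeze(-1)
                   + blk_out * (blk_lse - new_lse).exp().unsqueeze(-1))
            lse = new_lse
        outs.append(acc)
    return torch.cat(outs, dim=0)          # full-sequence attention output
```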
**Data:**