michaelfeil committed
Commit 5e409cf
Parent(s): 42919cd

Update Readme.md (#4)
README.md

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/hiHWva3CbsrnPvZTp5-lu.png)

This model, developed by Gradient with compute sponsored by [Crusoe Energy](https://huggingface.co/crusoeai), extends Llama-3 8B's context length from 8k to over 160k tokens. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
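
As background (an illustration, not part of the original card): in the Hugging Face `transformers` Llama implementation, RoPE theta is the `rope_theta` field of the model config, so "adjusting RoPE theta" amounts to training with a larger value than the base model ships with. A rough sketch; the theta below is a made-up placeholder, and the real values are in this checkpoint's `config.json`:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(config.rope_theta)  # 500000.0 for the 8k-context base model

# A long-context variant raises rope_theta so the rotary frequencies
# cover the longer range, then trains the weights to match.
config.rope_theta = 8_000_000.0          # hypothetical value, for illustration
config.max_position_embeddings = 262144
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", config=config
)
```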
**Approach:**

- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique (a sketch of the interpolation step follows this list)
- Progressive training on increasing context lengths, similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2] (see details below)
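
For readers new to NTK-aware interpolation, a minimal sketch of the initialization idea as commonly described in [1]: scale the RoPE base so that the lowest rotary frequency is stretched by the context-extension factor. This shows only the standard formula, not necessarily the exact schedule used for this model:

```python
def ntk_scaled_rope_theta(base: float, head_dim: int, scale: float) -> float:
    """RoPE base (theta) adjusted for a `scale`-times longer context [1]."""
    return base * scale ** (head_dim / (head_dim - 2))

# Llama-3 8B: head_dim = 128, base theta = 500000; extending 8k -> 256k
print(ntk_scaled_rope_theta(500_000.0, 128, 256 / 8))  # one possible initialization
```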
**Infra:**

We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts of up to 262144 tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster.
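
As a conceptual aid (this is not EasyContext's actual API; the real library shards the sequence across devices and overlaps ring communication with compute), here is a single-process sketch of the blockwise ring-attention idea: each participant owns one query block, key/value blocks rotate around the ring, and partial results are folded in with a running log-sum-exp so the full attention matrix is never materialized. Causal masking is omitted for brevity.

```python
import torch

def ring_attention_sketch(q_blocks, k_blocks, v_blocks):
    # q_blocks[i], k_blocks[i], v_blocks[i]: (block_len, head_dim) tensors
    # owned by ring participant i; the ring is simulated with a loop here.
    n = len(q_blocks)
    outs = []
    for i in range(n):                     # participant i keeps its q block
        q = q_blocks[i]
        acc = torch.zeros_like(q)          # running, renormalized output
        lse = torch.full(q.shape[:-1], float("-inf"))  # running log-sum-exp
        for hop in range(n):               # kv blocks arrive hop by hop
            k = k_blocks[(i + hop) % n]
            v = v_blocks[(i + hop) % n]
            scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
            blk_lse = torch.logsumexp(scores, dim=-1)
            blk_out = torch.softmax(scores, dim=-1) @ v
            # fold this block into the running softmax exactly
            new_lse = torch.logaddexp(lse, blk_lse)
            acc = (acc * (lse - new_lse).exp().unsqueeze(-1)
                   + blk_out * (blk_lse - new_lse).exp().unsqueeze(-1))
            lse = new_lse
        outs.append(acc)
    return torch.cat(outs, dim=0)          # full-sequence attention output
```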
**Data:**