leo-pekelis-gradient committed
Commit 7976e88 • Parent(s): 1ab8322
Update README.md

README.md CHANGED
@@ -7,18 +7,23 @@ tags:
 - llama-3
 ---
 
-**
-
-[NIAH eval figure here]
+**[NIAH eval figure here]**
 
 This model extends Llama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
 
-
+Approach:
+
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
 - progressive training on increasing context lengths, similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]
 
-
+Infra:
+
+We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high-performance L40S cluster.
+
+Data:
+
+For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
 | Parameter | 65K | 262K |
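
The approach bullets above initialize RoPE theta with NTK-aware interpolation [1] before a data-driven optimization pass. A minimal sketch, assuming the standard NTK-aware rule theta' = theta * s^(d/(d-2)) with Llama-3's published base theta (500,000) and head dimension (128); the data-driven schedule that follows it is not reproduced here:

```python
# NTK-aware RoPE theta initialization [1]: scale the rotary base rather than
# interpolating position ids. Sketch of the standard rule only; the card's
# data-driven theta optimization is a separate step not shown here.

def ntk_rope_theta(base_theta: float, head_dim: int, scale: float) -> float:
    # s^(d / (d - 2)) stretches the lowest rotary frequency by roughly `scale`,
    # keeping positions up to scale * native_context in-distribution.
    return base_theta * scale ** (head_dim / (head_dim - 2))

# Llama-3-8B: base theta 500,000, head_dim 128, native context 8,192 tokens.
for target_len in (65_536, 262_144):
    theta = ntk_rope_theta(500_000.0, 128, target_len / 8_192)
    print(f"{target_len:>7} tokens -> initial rope_theta ~ {theta:,.0f}")
```

The two target lengths mirror the 65K and 262K columns of the parameter table above.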
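On the infra side, Blockwise RingAttention [3] shards the sequence across devices and passes key/value blocks around a ring, combining partial results with an online-softmax update. A toy single-query NumPy sketch of that update (this illustrates the underlying numerics, not EasyContext's actual API):

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=4):
    """Online-softmax attention for one query over streamed k/v blocks.

    In RingAttention each block would live on a different device and rotate
    around the ring; here the blocks are just slices of local arrays.
    """
    d = q.shape[-1]
    m = -np.inf                      # running max of logits
    l = 0.0                          # running softmax denominator
    acc = np.zeros(v.shape[-1])      # running weighted-value sum
    for i in range(0, len(k), block_size):
        s = (k[i:i + block_size] @ q) / np.sqrt(d)  # this block's logits
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)  # re-normalize previous partial results
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ v[i:i + block_size]
        m = m_new
    return acc / l

# Matches full softmax attention regardless of block size:
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(16,)), rng.normal(size=(64, 16)), rng.normal(size=(64, 8))
w = np.exp(k @ q / 4 - (k @ q / 4).max()); w /= w.sum()
assert np.allclose(blockwise_attention(q, k, v), w @ v)
```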
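The data section says long contexts come from augmenting SlimPajama, but the recipe is not part of this diff. One common approach, shown only as an illustration (the `pack_long_context` helper is hypothetical, not from the card), is packing tokenized documents into fixed-length long sequences:

```python
# Illustrative long-context packing over SlimPajama; the exact augmentation
# used for this model is not specified in the card, so treat as an assumption.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
slim = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_long_context(dataset, target_len=65_536):
    """Concatenate tokenized documents until each sample reaches target_len."""
    buffer = []
    for example in dataset:
        buffer.extend(tokenizer(example["text"])["input_ids"])
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

# First 65k-token training sample:
sample = next(pack_long_context(slim))
print(len(sample))  # 65536
```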