Update README.md
README.md CHANGED
@@ -1,26 +1,28 @@
 ---
 license: apache-2.0
 tags:
 - generated_from_trainer
 metrics:
 - accuracy
-model-index:
-- name: jamba-H1024_L12-v0.12-fineweb-100k-xlong_16k-knowledge-inoc-concat-v1-vN
-  results: []
 ---
 
-It achieves the following results on the evaluation set:
+# jamba-H1024_L12-v0.13-KIx2
+
+This is a pretraining experiment on the `jamba` arch as a "smol MoE". Details:
+
+- pretrained at context length 16384
+- seen approx 20b tokens
+- uses the Claude3 tokenizer (as an hf GPT2 tokenizer)
+- hidden size 1024, 12 layers, 8 experts
+
+Trained on the most recent dataset, it achieves the following results on the evaluation set:
 - Loss: 3.0366
 - Accuracy: 0.4514
 - Num Input Tokens Seen: 1975517184
 
+If I pretrain it further, other versions will go in new repos with an incremented version number (this is v0.13).
 
 ## Quick eval
 
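For a rough sense of scale, the reported eval loss maps directly to a perplexity (perplexity = exp(loss)); a quick back-of-the-envelope check:

```python
import math

eval_loss = 3.0366          # eval loss reported above (cross-entropy, nats per token)
print(math.exp(eval_loss))  # ~20.8, the implied eval-set perplexity
```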
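A minimal sketch of what a quick eval could look like, assuming a placeholder repo id (`user/jamba-H1024_L12-v0.13-KIx2` below is hypothetical, substitute the real one) and that the checkpoint loads through `transformers` via `AutoModelForCausalLM` / `AutoTokenizer` (`trust_remote_code=True` is included only in case the small jamba variant ships custom modeling code):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "user/jamba-H1024_L12-v0.13-KIx2"  # hypothetical repo id

# the card notes the Claude3 tokenizer is packaged as an hf GPT2 tokenizer,
# so AutoTokenizer should resolve it without extra arguments
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

# score any held-out text; the model was pretrained at a 16384-token context
text = "Some held-out evaluation text goes here. " * 200
ids = tok(text, return_tensors="pt", truncation=True, max_length=16384).input_ids

with torch.no_grad():
    out = model(ids, labels=ids)  # labels=input_ids gives the standard shifted LM loss

print(f"loss={out.loss.item():.4f}  ppl={math.exp(out.loss.item()):.2f}")
```

A single short sample will not reproduce the table above, which was computed over the full evaluation set; this is only a smoke test that the checkpoint loads and scores text.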