Update English benchmark results and format of data table
README.md CHANGED

@@ -44,7 +44,7 @@ The evaluation was done **five-shot** with native prompts and only a sample of 1
 
 **BHASA**
 
-
+To be released soon
 
 **English**
 
@@ -64,23 +64,23 @@ The evaluation was done **five-shot** with native prompts and only a sample of 1
 
 LLaMA3 8B SEA-LIONv2 base model was continued pre-trained on 48B tokens of the following data:
 
-| Data Source | Unique Tokens | Multiplier | Total Tokens | Percentage |
-
-| Dolma RefinedWeb - English| 7.
-| Dolma C4 - English | 1.
-| Dolma Reddit - English | 1.
-| Dolma Semantic Scholar | 0.
-| Dolma arXiv | 0.
-| Dolma StarCoder | 4.
-| SEA-LION Pile - Indonesian| 3.
-| Wiki* - Indonesian | 0.
-| SEA-LION Pile - Tamil | 5.
-| Wiki* + News - Tamil | 0.
-| SEA-LION Pile - Thai | 2.
-| WangChanBERTa - Thai |
-| Wiki* - Thai | 0.
-| SEA-LION Pile - Vietnamese| 6.
-| Wiki* - Vietnamese | 0.
+| Data Source               | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
+|---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
+| Dolma RefinedWeb - English| 7.650             | 1          | 7.650            | 15.90          |
+| Dolma C4 - English        | 1.160             | 1          | 1                | 9.21           |
+| Dolma Reddit - English    | 1.339             | 1          | 14.7             | 2.42           |
+| Dolma Semantic Scholar    | 0.959             | 1          | 2.9              | 2.79           |
+| Dolma arXiv               | 0.469             | 1          | 5.3              | 1.99           |
+| Dolma StarCoder           | 4.422             | 1          | 4.9              | 0.98           |
+| SEA-LION Pile - Indonesian| 3.4               | 1          | 6.8              | 14.17          |
+| Wiki* - Indonesian        | 0.3               | 4          | 1.2              | 2.50           |
+| SEA-LION Pile - Tamil     | 5.6               | 1          | 5.6              | 11.67          |
+| Wiki* + News - Tamil      | 0.6               | 4          | 2.4              | 5.00           |
+| SEA-LION Pile - Thai      | 2.28              | 1          | 2.28             | 4.75           |
+| WangChanBERTa - Thai      | 5                 | 1          | 5                | 10.42          |
+| Wiki* - Thai              | 0.18              | 4          | 0.72             | 1.50           |
+| SEA-LION Pile - Vietnamese| 6.76              | 1          | 6.76             | 14.08          |
+| Wiki* - Vietnamese        | 0.31              | 4          | 1.24             | 2.58           |
 
 Note:
 - All token counts are counted using LLaMA3 tokenizer
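For readers of the reworked table, the column layout suggests the readings Total Tokens = Unique Tokens × Multiplier and Percentage = Total Tokens / 48B × 100, where 48B is the continued pre-training budget stated above the table. The sketch below recomputes both from the committed values as a sanity check; these relationships are an inferred reading rather than something the commit states, and a few committed rows deviate from them, which the check simply reports.

```python
# Illustrative sketch only (not part of the commit). It checks the committed
# table against the inferred relationships:
#   Total Tokens = Unique Tokens x Multiplier
#   Percentage   = Total Tokens / 48B * 100
# where 48B is the stated continued pre-training budget.

BUDGET_B = 48.0  # total continued pre-training tokens, in billions

# (source, unique_tokens_B, multiplier, total_tokens_B, percentage) as committed
ROWS = [
    ("Dolma RefinedWeb - English", 7.650, 1, 7.650, 15.90),
    ("Dolma C4 - English",         1.160, 1, 1.0,   9.21),
    ("Dolma Reddit - English",     1.339, 1, 14.7,  2.42),
    ("Dolma Semantic Scholar",     0.959, 1, 2.9,   2.79),
    ("Dolma arXiv",                0.469, 1, 5.3,   1.99),
    ("Dolma StarCoder",            4.422, 1, 4.9,   0.98),
    ("SEA-LION Pile - Indonesian", 3.4,   1, 6.8,   14.17),
    ("Wiki* - Indonesian",         0.3,   4, 1.2,   2.50),
    ("SEA-LION Pile - Tamil",      5.6,   1, 5.6,   11.67),
    ("Wiki* + News - Tamil",       0.6,   4, 2.4,   5.00),
    ("SEA-LION Pile - Thai",       2.28,  1, 2.28,  4.75),
    ("WangChanBERTa - Thai",       5.0,   1, 5.0,   10.42),
    ("Wiki* - Thai",               0.18,  4, 0.72,  1.50),
    ("SEA-LION Pile - Vietnamese", 6.76,  1, 6.76,  14.08),
    ("Wiki* - Vietnamese",         0.31,  4, 1.24,  2.58),
]

for name, unique_b, mult, total_b, pct in ROWS:
    expected_total = unique_b * mult
    expected_pct = total_b / BUDGET_B * 100
    if abs(expected_total - total_b) > 1e-6:
        print(f"{name}: total {total_b}B vs unique x multiplier = {expected_total:.3f}B")
    if abs(expected_pct - pct) > 0.05:
        print(f"{name}: percentage {pct}% vs total / 48B = {expected_pct:.2f}%")

# The committed percentages sum to roughly 100% of the 48B budget.
print(f"Percentages sum to {sum(r[4] for r in ROWS):.2f}%")
```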
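The closing note says all token counts are computed with the LLaMA3 tokenizer. As a rough illustration of how such a count is obtained, here is a minimal sketch using the Hugging Face `transformers` AutoTokenizer; the model id below and the choice to exclude special tokens are assumptions for illustration, not details stated in the commit.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and access
# to a LLaMA3 tokenizer. The model id is an illustrative assumption; any
# LLaMA3-family tokenizer would count tokens the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

def count_tokens(text: str) -> int:
    # Count content tokens only, excluding BOS/EOS special tokens.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

print(count_tokens("Selamat pagi, Singapura!"))
```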