RaymondAISG
committed on
Commit 5c5fb7c · 1 Parent(s): 20b39c9
Update README.md
README.md CHANGED
@@ -49,11 +49,11 @@ Llama3 8B CPT SEA-LIONv2 base model was continued pre-trained on 48B tokens of t
 | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
 |---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
 | Dolma RefinedWeb - English| 7.650 | 1 | 7.650 | 15.90 |
-| Dolma C4 - English | 1.160 | 1 |
-| Dolma Reddit - English | 1.339 | 1 |
-| Dolma Semantic Scholar | 0.959 | 1 |
-| Dolma arXiv | 0.469 | 1 |
-| Dolma StarCoder | 4.422 | 1 |
+| Dolma C4 - English | 1.160 | 1 | 1.16 | 9.21 |
+| Dolma Reddit - English | 1.339 | 1 | 1.339 | 2.42 |
+| Dolma Semantic Scholar | 0.959 | 1 | 0.959 | 2.79 |
+| Dolma arXiv | 0.469 | 1 | 0.469 | 1.99 |
+| Dolma StarCoder | 4.422 | 1 | 4.422 | 0.98 |
 | SEA-LION Pile - Indonesian| 3.4 | 2 | 6.8 | 14.17 |
 | Wiki* - Indonesian | 0.3 | 4 | 1.2 | 2.50 |
 | SEA-LION Pile - Tamil | 5.6 | 1 | 5.6 | 11.67 |
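
The table's derived columns appear to follow simple arithmetic: Total Tokens = Unique Tokens × Multiplier, and Percentage = Total Tokens as a share of the whole corpus. The sketch below is a rough cross-check of that reading, not the authors' tooling. The 48B corpus total is taken from the hunk header's "48B tokens" and is an assumption here, and the helper names `total_tokens` and `percentage` are hypothetical.

```python
# Assumption: the full corpus is about 48B tokens, per the hunk header.
ASSUMED_CORPUS_TOTAL_B = 48.0


def total_tokens(unique_b: float, multiplier: int) -> float:
    """Total tokens (B) = unique tokens (B) times the repetition multiplier."""
    return unique_b * multiplier


def percentage(total_b: float, corpus_total_b: float = ASSUMED_CORPUS_TOTAL_B) -> float:
    """Share of the corpus in percent, rounded to two decimals like the table."""
    return round(100 * total_b / corpus_total_b, 2)


# (data source, unique tokens in B, multiplier) copied from rows of the table
rows = [
    ("Dolma C4 - English", 1.160, 1),
    ("SEA-LION Pile - Indonesian", 3.4, 2),
    ("Wiki* - Indonesian", 0.3, 4),
    ("SEA-LION Pile - Tamil", 5.6, 1),
]

for name, unique_b, multiplier in rows:
    total_b = total_tokens(unique_b, multiplier)
    print(f"{name}: total={total_b:.3f}B share={percentage(total_b):.2f}%")
```

Under the 48B assumption, the Indonesian and Tamil rows reproduce the listed 14.17, 2.50, and 11.67. The same check gives 2.42 for Dolma C4 (1.16 / 48), which differs from the 9.21 in the added line, so the percentages filled in for the Dolma rows may be worth re-verifying against the upstream README.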