Update README.md
README.md
CHANGED
@@ -27,7 +27,7 @@ mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/
 
 
 
-**Sampling ratios**
+**Sampling ratios for Finnish**
 
 |Dataset | Chars | Ratio | Weight | W.Ratio |
 |----------|--------|---------|--------|---------|
@@ -43,3 +43,5 @@ mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/
 |Suomi24 | 20.6B | 9.9\% | 1.0 | 8.9\%|
 |Reddit-Fi | 0.7B | 0.4\% | 1.0 | 0.3\%|
 |**TOTAL** | **207.0B** | **100.0\%** | **N/A** | **100.0\%** |
+
+And for whole continued pretraining, ROOTS is mixed in.