---
license: bigscience-bloom-rail-1.0
---
A multilingual generative pretrained transformer with 176B parameters and added capacity for Finnish.
The model builds on pretrained [BLOOM](https://huggingface.co/bigscience/bloom), which was further pretrained for 40B tokens on a combined ROOTS + Finnish dataset (without weighting).


**Datasets**

We used a combination of multiple Finnish resources.

* Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
* mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
* Common Crawl Finnish https://TODO
* Finnish Wikipedia https://fi.wikipedia.org/wiki
* Lönnrot Projekti Lönnrot http://www.lonnrot.net/
* ePub National library ”epub” collection 
* National library ”lehdet” collection 
* Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
* Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
* STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
* Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
* Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
* Yle News Archive Easy-to-read Finnish 2011-2018 http://urn.fi/urn:nbn:fi:lb-2019050901
* Yle News Archive Easy-to-read Finnish 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050701
* [ROOTS](https://arxiv.org/abs/2303.03915) - original BLOOM training corpus
 


**Sampling ratios for Finnish**

|Dataset   |  Chars |  Ratio  | Weight | Weighted ratio |
|----------|--------|---------|--------|---------|
|Parsebank |  35.0B |  16.9\% |    1.5 |   22.7\%| 
|mC4-Fi    |  46.3B |  22.4\% |    1.0 |   20.0\%| 
|CC-Fi     |  79.6B |  38.5\% |    1.0 |   34.4\%| 
|Fiwiki    |   0.8B |   0.4\% |    3.0 |    1.0\%| 
|Lönnrot   |   0.8B |   0.4\% |    3.0 |    1.0\%| 
|Yle       |   1.6B |   0.8\% |    2.0 |    1.4\%| 
|STT       |   2.2B |   1.1\% |    2.0 |    1.9\%| 
|ePub      |  13.5B |   6.5\% |    1.0 |    5.8\%| 
|Lehdet    |   5.8B |   2.8\% |    1.0 |    2.5\%| 
|Suomi24   |  20.6B |   9.9\% |    1.0 |    8.9\%| 
|Reddit-Fi |   0.7B |   0.4\% |    1.0 |    0.3\%|
|**TOTAL**    | **207.0B** | **100.0\%** | **N/A** |  **100.0\%** |

For the continued pretraining as a whole, ROOTS is mixed in with the Finnish data.
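
The weighted ratio column in the table appears consistent with multiplying each dataset's character count by its weight and normalizing over the total. The sketch below reproduces that arithmetic. The formula is an assumption inferred from the table values, not a description of the actual sampling code, and the function name `weighted_ratios` is hypothetical.

```python
# Character counts (in billions) and upsampling weights, copied from the table.
DATASETS = {
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}


def weighted_ratios(datasets):
    """Return each dataset's share after weighting (assumed: chars * weight, normalized)."""
    weighted = {name: chars * weight for name, (chars, weight) in datasets.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


ratios = weighted_ratios(DATASETS)
for name, ratio in ratios.items():
    print(f"{name:<10} {100 * ratio:5.1f}%")
```

Under this assumption, small but high-quality sources such as Fiwiki and Lönnrot are upsampled relative to their raw size, while the large web-crawl sources shrink slightly as a share of the mix.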