---
language:
- fi
pipeline_tag: text-generation
license: apache-2.0
---
Generative Pretrained Transformer with 186M parameters for Finnish.

The TurkuNLP Finnish GPT-3 models are a family of pretrained monolingual GPT-style language models based on the BLOOM architecture.
Note that the models are pure language models: they are not [instruction finetuned](https://arxiv.org/abs/2203.02155) for dialogue
or answering questions.

These models are intended as foundation models that can, for example, be instruction finetuned to serve as modern chat models.
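
Because the models are plain language models, one common way to query them without finetuning is few-shot prompting: show a few worked examples as plain text and let the model continue the pattern. The sketch below is illustrative only. The helper names `build_few_shot_prompt` and `complete` are hypothetical, the `Kysymys:`/`Vastaus:` layout is just one possible format, and the repository id `TurkuNLP/gpt3-finnish-small` is an assumption inferred from the leaderboard link in this card. The generation helper uses the standard Hugging Face `transformers` API and needs `transformers` and `torch` installed.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs plus one open question as plain text."""
    blocks = [f"Kysymys: {q}\nVastaus: {a}" for q, a in examples]
    # Leave the last answer empty so the model continues from here.
    blocks.append(f"Kysymys: {question}\nVastaus:")
    return "\n\n".join(blocks)


def complete(prompt, model_id="TurkuNLP/gpt3-finnish-small", max_new_tokens=20):
    """Generate a continuation for `prompt` (model_id is an assumed repo id)."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


examples = [
    ("Mikä on Suomen pääkaupunki?", "Helsinki"),
    ("Mikä on Ruotsin pääkaupunki?", "Tukholma"),
]
prompt = build_few_shot_prompt(examples, "Mikä on Norjan pääkaupunki?")
print(prompt)
```

Calling `complete(prompt)` downloads the weights on first use and returns the model's continuation. With a model this small, the output should be treated as a raw continuation rather than a reliable answer.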

All models are trained for 300B tokens.


**Parameters**

| Model  | Layers | Dim  | Heads | Params |
|--------|--------|------|-------|--------|
| Small  | 12     | 768  | 12    | 186M   |
| Medium | 24     | 1024 | 16    | 437M   |
| Large  | 24     | 1536 | 16    | 881M   |
| XL     | 24     | 2064 | 24    | 1.5B   |
| "3B"   | 32     | 2560 | 32    | 2.8B   |
| "8B"   | 32     | 4096 | 32    | 7.5B   |
| "13B"  | 40     | 5120 | 40    | 13.3B  |
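
As a rough way to read the table, the sketch below derives the per-head dimension (`Dim / Heads`) and a back-of-the-envelope size for the transformer blocks. The `12 * layers * dim^2` figure is a common rule of thumb that assumes a 4x feed-forward expansion and ignores embeddings, biases and layer norms. It is not the authors' parameter count, so it comes out below the Params column. The function names are hypothetical.

```python
# (layers, hidden dim, attention heads) copied from the Parameters table.
configs = {
    "Small": (12, 768, 12),
    "Medium": (24, 1024, 16),
    "Large": (24, 1536, 16),
    "XL": (24, 2064, 24),
    "3B": (32, 2560, 32),
    "8B": (32, 4096, 32),
    "13B": (40, 5120, 40),
}


def head_dim(dim, heads):
    """Width of a single attention head; dim must divide evenly by heads."""
    if dim % heads != 0:
        raise ValueError(f"dim {dim} is not divisible by heads {heads}")
    return dim // heads


def approx_block_params(layers, dim):
    """Rule-of-thumb block size: about 4*d^2 (attention) + 8*d^2 (MLP) per layer."""
    return 12 * layers * dim * dim


for name, (layers, dim, heads) in configs.items():
    blocks_m = approx_block_params(layers, dim) / 1e6
    print(f"{name:<6} head_dim={head_dim(dim, heads):>3}  blocks~{blocks_m:,.0f}M")
```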


**Datasets**

We used a combination of multiple Finnish resources.

* Finnish Internet Parsebank https://turkunlp.org/finnish_nlp.html
* mC4 multilingual colossal, cleaned Common Crawl https://huggingface.co/datasets/mc4
* Common Crawl Finnish https://TODO
* Finnish Wikipedia https://fi.wikipedia.org/wiki
* Lönnrot Projekti Lönnrot http://www.lonnrot.net/
* ePub National library "epub" collection
* National library "lehdet" collection
* Suomi24 The Suomi 24 Corpus 2001-2020 http://urn.fi/urn:nbn:fi:lb-2021101527
* Reddit r/Suomi submissions and comments https://www.reddit.com/r/Suomi
* STT Finnish News Agency Archive 1992-2018 http://urn.fi/urn:nbn:fi:lb-2019041501
* Yle Finnish News Archive 2011-2018 http://urn.fi/urn:nbn:fi:lb-2017070501
* Yle Finnish News Archive 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050401
* Yle News Archive Easy-to-read Finnish 2011-2018 http://urn.fi/urn:nbn:fi:lb-2019050901
* Yle News Archive Easy-to-read Finnish 2019-2020 http://urn.fi/urn:nbn:fi:lb-2021050701
* ROOTS TODO 


**Sampling ratios**

| Dataset   | Chars      | Ratio      | Weight  | W.Ratio    |
|-----------|------------|------------|---------|------------|
| Parsebank | 35.0B      | 16.9%      | 1.5     | 22.7%      |
| mC4-Fi    | 46.3B      | 22.4%      | 1.0     | 20.0%      |
| CC-Fi     | 79.6B      | 38.5%      | 1.0     | 34.4%      |
| Fiwiki    | 0.8B       | 0.4%       | 3.0     | 1.0%       |
| Lönnrot   | 0.8B       | 0.4%       | 3.0     | 1.0%       |
| Yle       | 1.6B       | 0.8%       | 2.0     | 1.4%       |
| STT       | 2.2B       | 1.1%       | 2.0     | 1.9%       |
| ePub      | 13.5B      | 6.5%       | 1.0     | 5.8%       |
| Lehdet    | 5.8B       | 2.8%       | 1.0     | 2.5%       |
| Suomi24   | 20.6B      | 9.9%       | 1.0     | 8.9%       |
| Reddit-Fi | 0.7B       | 0.4%       | 1.0     | 0.3%       |
| **TOTAL** | **207.0B** | **100.0%** | **N/A** | **100.0%** |
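
The W.Ratio column appears to be each dataset's character count multiplied by its weight, then normalized over all datasets. The sketch below recomputes it under that assumption from the Chars and Weight columns. The function name `weighted_ratios` is hypothetical, and the inputs are the rounded figures from the table, so results can differ from the original computation in the last digit.

```python
# (chars in billions, sampling weight) per dataset, copied from the table.
datasets = {
    "Parsebank": (35.0, 1.5),
    "mC4-Fi": (46.3, 1.0),
    "CC-Fi": (79.6, 1.0),
    "Fiwiki": (0.8, 3.0),
    "Lönnrot": (0.8, 3.0),
    "Yle": (1.6, 2.0),
    "STT": (2.2, 2.0),
    "ePub": (13.5, 1.0),
    "Lehdet": (5.8, 1.0),
    "Suomi24": (20.6, 1.0),
    "Reddit-Fi": (0.7, 1.0),
}


def weighted_ratios(table):
    """Return each dataset's share (in percent) of the weighted character total."""
    weighted = {name: chars * weight for name, (chars, weight) in table.items()}
    total = sum(weighted.values())
    return {name: 100.0 * value / total for name, value in weighted.items()}


ratios = weighted_ratios(datasets)
for name, pct in ratios.items():
    print(f"{name:<10} {pct:5.1f}%")
```

Upweighting small, curated sources such as Fiwiki and Lönnrot (weight 3.0) raises their share of sampled text at the expense of the large web crawls, which keep a weight of 1.0.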


More documentation and a paper are coming soon.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_TurkuNLP__gpt3-finnish-small)

| Metric                | Value                     |
|-----------------------|---------------------------|
| Avg.                  | 24.25   |
| ARC (25-shot)         | 20.48          |
| HellaSwag (10-shot)   | 28.09    |
| MMLU (5-shot)         | 24.47         |
| TruthfulQA (0-shot)   | 46.47   |
| Winogrande (5-shot)   | 48.22   |
| GSM8K (5-shot)        | 0.0        |
| DROP (3-shot)         | 2.02         |