aapot committed on
Commit
04eaefb
·
1 Parent(s): 7c82228

Create README.md

Files changed (1)
  1. README.md +167 -0
README.md ADDED
@@ -0,0 +1,167 @@
+ ---
+ language:
+ - fi
+ license: apache-2.0
+ tags:
+ - finnish
+ - roberta
+ datasets:
+ - mc4
+ - wikipedia
+ widget:
+ - text: "Moikka olen <mask> kielimalli."
+
+ ---
+
+ # RoBERTa large model trained with the WECHSEL method for Finnish
+
+ A RoBERTa model pretrained on Finnish text using a masked language modeling (MLM) objective and the WECHSEL method. RoBERTa was introduced in
+ [this paper](https://arxiv.org/abs/1907.11692) and first released in
+ [this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). The WECHSEL method (Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models) was introduced in [this paper](https://arxiv.org/abs/2112.06598) and first released in [this repository](https://github.com/CPJKU/wechsel). This model is case-sensitive: it
+ distinguishes between finnish and Finnish.
+
+ ## Model description
+
+ Finnish RoBERTa is a transformers model pretrained on a large corpus of Finnish data in a self-supervised fashion. This means
+ it was pretrained on raw texts only, with no human labeling of any kind (which is why it can use lots of
+ publicly available data), using an automatic process to generate inputs and labels from those texts.
+
+ More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
+ randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
+ the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
+ after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
+ learn a bidirectional representation of the sentence.
+
+ This way, the model learns an inner representation of the Finnish language that can then be used to extract features
+ useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
+ classifier using the features produced by the RoBERTa model as inputs.
+
+ ## WECHSEL method
+
+ Using the WECHSEL method, we first took the pretrained English [roberta-large](https://huggingface.co/roberta-large) model, replaced its tokenizer with our Finnish tokenizer, and initialized the model's token embeddings so that each Finnish token is close to semantically similar English tokens, using multilingual static word embeddings (fastText) covering English and Finnish. We were able to confirm the WECHSEL paper's finding that this method saves pretraining time and thus computing resources. To get an idea of the training time savings, see the table below, which compares MLM evaluation accuracy during pretraining against [Finnish-NLP/roberta-large-finnish-v2](https://huggingface.co/Finnish-NLP/roberta-large-finnish-v2), which was trained from scratch:
+
+ | Model (MLM eval accuracy)                 | 10k train steps | 100k train steps | 200k train steps | 270k train steps |
+ |-------------------------------------------|-----------------|------------------|------------------|------------------|
+ | Finnish-NLP/roberta-large-wechsel-finnish | 37.61           | 58.14            | 61.60            | 62.77            |
+ | Finnish-NLP/roberta-large-finnish-v2      | 13.83           | 55.87            | 58.58            | 59.47            |
+
+ Downstream fine-tuning results for text classification are reported at the end of this document; in those tests, this model trained with the WECHSEL method did not significantly improve downstream performance. However, based on dozens of qualitative fill-mask examples, we noticed that on the fill-mask task this WECHSEL model significantly outperforms our other models trained from scratch.
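
The embedding initialization described in this section can be illustrated with a small sketch. It is a simplified illustration of the general idea (each new token's embedding is a similarity-weighted average of the pretrained embeddings of its nearest source-language tokens), not the WECHSEL implementation. The function names, the toy vectors, and the values of `k` and `temperature` are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def init_target_embedding(tgt_static_vec, src_static, src_model_emb, k=2, temperature=0.1):
    """Build one target-token embedding from the k most similar source tokens.

    tgt_static_vec: static vector of the target (Finnish) token
    src_static:     static vectors of the source (English) tokens, assumed to
                    live in the same aligned space as tgt_static_vec
    src_model_emb:  pretrained model embeddings of the source tokens
    """
    # Similarity of the target token to every source token.
    sims = [cosine(tgt_static_vec, s) for s in src_static]
    # Indices of the k most similar source tokens.
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # A softmax over the kept similarities gives the mixing weights.
    exps = [math.exp(sims[i] / temperature) for i in nearest]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the neighbours' pretrained embeddings.
    dim = len(src_model_emb[0])
    return [sum(w * src_model_emb[i][d] for w, i in zip(weights, nearest)) for d in range(dim)]

# Toy data (hypothetical): three English tokens with 2-d static vectors
# and 3-d pretrained model embeddings.
src_static = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
src_model_emb = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# A Finnish token whose static vector coincides with the first English token
# simply inherits that token's embedding when k=1.
emb = init_target_embedding([1.0, 0.0], src_static, src_model_emb, k=1)
print(emb)
```

Because the new embeddings start near embeddings the English model already knows how to use, the rest of the network can be reused, which is one plausible reading of why the WECHSEL row in the table above starts far ahead at 10k steps.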
+
+ ## Intended uses & limitations
+
+ You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
+ to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
+ generation, you should look at a model like GPT-2.
+
+ ### How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='Finnish-NLP/roberta-large-wechsel-finnish')
+ >>> unmasker("Moikka olen <mask> kielimalli.")
+ [{'sequence': 'Moikka olen hyvä kielimalli.',
+   'score': 0.07757357507944107,
+   'token': 763,
+   'token_str': ' hyvä'},
+  {'sequence': 'Moikka olen suomen kielimalli.',
+   'score': 0.05297883599996567,
+   'token': 3641,
+   'token_str': ' suomen'},
+  {'sequence': 'Moikka olen kuin kielimalli.',
+   'score': 0.03747279942035675,
+   'token': 523,
+   'token_str': ' kuin'},
+  {'sequence': 'Moikka olen suomalainen kielimalli.',
+   'score': 0.031031042337417603,
+   'token': 4966,
+   'token_str': ' suomalainen'},
+  {'sequence': 'Moikka olen myös kielimalli.',
+   'score': 0.026489052921533585,
+   'token': 505,
+   'token_str': ' myös'}]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import RobertaTokenizer, RobertaModel
+
+ # Load the tokenizer and the encoder (without the MLM head).
+ tokenizer = RobertaTokenizer.from_pretrained('Finnish-NLP/roberta-large-wechsel-finnish')
+ model = RobertaModel.from_pretrained('Finnish-NLP/roberta-large-wechsel-finnish')
+ text = "Replace me by any text you'd like."
+ # Tokenize into PyTorch tensors and run a forward pass.
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ # Per-token features: shape (batch, sequence_length, hidden_size).
+ features = output.last_hidden_state
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import RobertaTokenizer, TFRobertaModel
+
+ tokenizer = RobertaTokenizer.from_pretrained('Finnish-NLP/roberta-large-wechsel-finnish')
+ # from_pt=True converts the PyTorch checkpoint to TensorFlow weights on load.
+ model = TFRobertaModel.from_pretrained('Finnish-NLP/roberta-large-wechsel-finnish', from_pt=True)
+ text = "Replace me by any text you'd like."
+ # Tokenize into TensorFlow tensors and run a forward pass.
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
+
+ ### Limitations and bias
+
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from
+ neutral. Therefore, the model can produce biased predictions.
+
+ ## Training data
+
+ This Finnish RoBERTa model was pretrained on a combination of five datasets:
+ - [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned): mC4 is the multilingual, colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of mC4 and cleaned it further with our own text cleaning code (see the dataset repo).
+ - [wikipedia](https://huggingface.co/datasets/wikipedia): we used the Finnish subset of the wikipedia (August 2021) dataset.
+ - [Yle Finnish News Archive](http://urn.fi/urn:nbn:fi:lb-2017070501)
+ - [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)
+ - [The Suomi24 Sentences Corpus](http://urn.fi/urn:nbn:fi:lb-2020021803)
+
+ The raw datasets were cleaned to filter out low-quality and non-Finnish examples. Together, the cleaned datasets amounted to around 84 GB of text.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The texts are tokenized using byte-level Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
+ the model are blocks of 512 contiguous tokens that may span multiple documents. The beginning of a new document is marked
+ with `<s>` and the end of one with `</s>`.
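
A minimal sketch of how documents can be packed into fixed-size blocks that span document boundaries, as described above. This is an illustration under stated assumptions, not the preprocessing code used for this model: the helper name `pack_documents`, the word-level toy tokens, and the choice to drop an incomplete final block are all hypothetical.

```python
def pack_documents(docs, block_size, bos="<s>", eos="</s>"):
    """Concatenate tokenized documents and cut the stream into fixed-size blocks."""
    stream = []
    for tokens in docs:
        # Mark where each document starts and ends.
        stream.append(bos)
        stream.extend(tokens)
        stream.append(eos)
    # Assumption: an incomplete final block is dropped so that every block
    # has exactly block_size tokens.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy example with word-level "tokens" and a tiny block size instead of 512.
docs = [["Hei", "maailma"], ["Tämä", "on", "toinen", "dokumentti"]]
blocks = pack_documents(docs, block_size=5)
for block in blocks:
    print(block)
```

In this toy run the first block ends with the `<s>` of the second document, which shows how a single input can cross a document boundary.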
+
+ The details of the masking procedure for each sentence are the following:
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `<mask>`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the 10% remaining cases, the masked tokens are left as is.
+
+ Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
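
The masking rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code used for this model: the function name `mask_tokens`, the `None` label convention, and the `MASK_ID` value are assumptions for the example.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, rng, mask_prob=0.15):
    """Return (inputs, labels) for one MLM example.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # token not selected (about 85% of positions)
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            # 80% of selected tokens: replace with the mask token.
            inputs[i] = mask_id
        elif roll < 0.9:
            # 10%: replace with a random token that differs from the original.
            candidate = rng.randrange(vocab_size - 1)
            inputs[i] = candidate if candidate < tok else candidate + 1
        # Remaining 10%: keep the original token unchanged.
    return inputs, labels

VOCAB_SIZE = 50265
MASK_ID = 50264  # hypothetical id; look up the real one in the tokenizer

# Dynamic masking: calling the function again draws a fresh mask, so the
# same sentence is masked differently from one epoch to the next.
rng = random.Random(0)
tokens = list(range(100, 120))
for epoch in range(2):
    inputs, labels = mask_tokens(tokens, VOCAB_SIZE, MASK_ID, rng)
    print(epoch, [i for i, label in enumerate(labels) if label is not None])
```

For simplicity the random replacement may land on a special token; a real implementation would typically also exclude special tokens from being selected.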
+
+ ### Pretraining
+
+ The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 270k steps (a bit over 1 epoch) with a sequence length of 128, followed by a further 180k steps with a sequence length of 512. The optimizer used was Adafactor (to save memory). The learning rate was 2e-4, with \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 10^{-6}\\), a learning rate warmup for 2500 steps, and linear decay of the learning rate afterwards.
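
The warmup-then-linear-decay schedule described above can be written as a small function. This is a sketch under assumptions: the text gives the peak learning rate (2e-4) and the warmup length (2500 steps), but not the end point of the decay, so the code assumes the rate decays to zero at the end of the first 270k-step phase. Whether the schedule was restarted for the 512-length phase is not stated, and the sketch does not cover it.

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=2500, total_steps=270_000):
    """Linear warmup to peak_lr, then linear decay (assumed to reach 0 at total_steps)."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: shrink linearly over the remaining steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 1250, 2500, 136_250, 270_000):
    print(step, learning_rate(step))
```

The rate peaks at step 2500 and, under the assumption above, falls back to half the peak at step 136,250, midway through the decay.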
+
+ ## Evaluation results
+
+ Evaluation was done by fine-tuning the model on a downstream text classification task with two different labeled datasets: [Yle News](https://github.com/spyysalo/yle-corpus) and [Eduskunta](https://github.com/aajanki/eduskunta-vkk). Yle News classification fine-tuning was done with two different sequence lengths, 128 and 512, but Eduskunta only with a sequence length of 128.
+ When fine-tuned on those datasets, this model (the first row of the table) achieves the following accuracy results compared to the [FinBERT (Finnish BERT)](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model and to our previous [Finnish-NLP/roberta-large-finnish-v2](https://huggingface.co/Finnish-NLP/roberta-large-finnish-v2) and [Finnish-NLP/roberta-large-finnish](https://huggingface.co/Finnish-NLP/roberta-large-finnish) models:
+
+ | Model                                     | Average   | Yle News 128 length | Yle News 512 length | Eduskunta 128 length |
+ |-------------------------------------------|-----------|---------------------|---------------------|----------------------|
+ | Finnish-NLP/roberta-large-wechsel-finnish | 88.19     | **94.91**           | 95.18               | 74.47                |
+ | Finnish-NLP/roberta-large-finnish-v2      | 88.17     | 94.46               | 95.22               | 74.83                |
+ | Finnish-NLP/roberta-large-finnish         | 88.02     | 94.53               | 95.23               | 74.30                |
+ | TurkuNLP/bert-base-finnish-cased-v1       | **88.82** | 94.90               | **95.49**           | **76.07**            |
+
+ To conclude, this model did not significantly improve on our previous models, which were trained from scratch rather than with the WECHSEL method. It also trails the [FinBERT (Finnish BERT)](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model slightly (~1%).
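
The "Average" column in the results table can be reproduced from the three per-task accuracies. The short check below uses only the numbers from the table; the variable names are illustrative.

```python
# Per-task accuracies from the table:
# (Yle News 128 length, Yle News 512 length, Eduskunta 128 length)
scores = {
    "Finnish-NLP/roberta-large-wechsel-finnish": (94.91, 95.18, 74.47),
    "Finnish-NLP/roberta-large-finnish-v2": (94.46, 95.22, 74.83),
    "Finnish-NLP/roberta-large-finnish": (94.53, 95.23, 74.30),
    "TurkuNLP/bert-base-finnish-cased-v1": (94.90, 95.49, 76.07),
}

# Unweighted mean over the three tasks, rounded to two decimals.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")

# Gap between FinBERT and this model on the averaged score.
gap = averages["TurkuNLP/bert-base-finnish-cased-v1"] - averages["Finnish-NLP/roberta-large-wechsel-finnish"]
print(f"gap to FinBERT: {gap:.2f} points")
```

The averaged gap to FinBERT is well under one point; the largest single-task gap is on Eduskunta (76.07 vs 74.47).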
+
+ ## Team Members
+
+ - Aapo Tanskanen, [Hugging Face profile](https://huggingface.co/aapot), [LinkedIn profile](https://www.linkedin.com/in/aapotanskanen/)
+ - Rasmus Toivanen, [Hugging Face profile](https://huggingface.co/RASMUS), [LinkedIn profile](https://www.linkedin.com/in/rasmustoivanen/)
+
+ Feel free to contact us for more details 🤗