pdelobelle
committed on
Commit
·
6cca16c
1
Parent(s):
f4d9a5f
Create README.md
README.md
ADDED
@@ -0,0 +1,140 @@
|
1 |
+
---
|
2 |
+
language: "nl"
|
3 |
+
thumbnail: "https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo.png"
|
4 |
+
tags:
|
5 |
+
- Dutch
|
6 |
+
- Flemish
|
7 |
+
- RoBERTa
|
8 |
+
- RobBERT
|
9 |
+
license: mit
|
10 |
+
datasets:
|
11 |
+
- oscar
|
12 |
+
- dbrd
|
13 |
+
- lassy-ud
|
14 |
+
- europarl-mono
|
15 |
+
- conll2002
|
16 |
+
widget:
|
17 |
+
- text: "Hallo, ik ben RobBERT-2022, een <mask> taalmodel van de KU Leuven."
|
18 |
+
---
|
19 |
+
|
20 |
+
<p align="center">
|
21 |
+
<img src="https://github.com/iPieter/RobBERT/raw/master/res/robbert_logo_with_name.png" alt="RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use" width="75%">
|
22 |
+
</p>
|
23 |
+
|
24 |
+
# RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use
|
25 |
+
|
26 |
+
More in-depth information about RobBERT can be found in our [blog post](https://pieter.ai/robbert-2022/), [our paper](https://arxiv.org/abs/2001.06286) and [the original RobBERT GitHub repository](https://github.com/iPieter/RobBERT).
|
27 |
+
|
28 |
+
|
29 |
+
## How to use
|
30 |
+
|
31 |
+
RobBERT-2022 and RobBERT both use the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture and pre-training procedure, but with a Dutch tokenizer and Dutch training data. RoBERTa is a robustly optimized variant of the English BERT model that outperforms the original BERT. Because the architecture is the same, RobBERT can be finetuned and used for inference with [code written for RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html) models and with most code written for BERT models, e.g. as provided by the [HuggingFace Transformers](https://huggingface.co/transformers/) library.
|
32 |
+
|
33 |
+
By default, RobBERT-2022 has the masked language model head used in training. This can be used as a zero-shot way to fill masks in sentences. It can be tested out for free on [RobBERT's Hosted Inference API on HuggingFace](https://huggingface.co/pdelobelle/robbert-v2-dutch-base?text=De+hoofdstad+van+Belgi%C3%AB+is+%3Cmask%3E.). You can also create a new prediction head for your own task by using any of HuggingFace's [RoBERTa-runners](https://huggingface.co/transformers/v2.7.0/examples.html#language-model-training) or [their fine-tuning notebooks](https://huggingface.co/transformers/v4.1.1/notebooks.html), changing the model name to `DTAI-KULeuven/robbert-2022-dutch-base`.
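
As a minimal sketch of the mask-filling use described above, the helpers below wrap a Transformers `fill-mask` pipeline and return the top predictions for a sentence containing `<mask>`. The names `load_fill_mask` and `top_predictions` are illustrative and not part of RobBERT or Transformers. Loading the model needs the `transformers` package and a network connection, so it is kept in its own function.

```python
from typing import Callable, List, Tuple

MODEL_NAME = "DTAI-KULeuven/robbert-2022-dutch-base"


def load_fill_mask(model_name: str = MODEL_NAME):
    """Build a fill-mask pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("fill-mask", model=model_name)


def top_predictions(fill_mask: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the single <mask> in `text`."""
    results = fill_mask(text, top_k=k)
    # Each result is a dict that contains, among others, "token_str" and "score".
    return [(r["token_str"].strip(), float(r["score"])) for r in results]
```

Calling `top_predictions(load_fill_mask(), "De hoofdstad van België is <mask>.")` returns a list of `(token, score)` pairs; the actual tokens and scores depend on the model.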
|
34 |
+
|
35 |
+
|
36 |
+
```python
|
37 |
+
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
38 |
+
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")
|
39 |
+
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")  # classification head is newly initialised
|
40 |
+
```
|
41 |
+
|
42 |
+
You can then use most of [HuggingFace's BERT-based notebooks](https://huggingface.co/transformers/v4.1.1/notebooks.html) to finetune RobBERT-2022 on your own Dutch dataset.
|
43 |
+
|
44 |
+
## Technical Details From The Paper
|
45 |
+
|
46 |
+
|
47 |
+
### Our Performance Evaluation Results
|
48 |
+
|
49 |
+
All experiments are described in more detail in our [paper](https://arxiv.org/abs/2001.06286), with the code in [our GitHub repository](https://github.com/iPieter/RobBERT).
|
50 |
+
|
51 |
+
### Sentiment analysis
|
52 |
+
Predicting whether a review is positive or negative using the [Dutch Book Reviews Dataset](https://github.com/benjaminvdb/110kDBRD).
|
53 |
+
|
54 |
+
| Model | Accuracy [%] |
|
55 |
+
|-------------------|--------------------------|
|
56 |
+
| ULMFiT | 93.8 |
|
57 |
+
| BERTje | 93.0 |
|
58 |
+
| RobBERT v2 | **95.1** |
|
59 |
+
|
60 |
+
### Die/Dat (coreference resolution)
|
61 |
+
|
62 |
+
We measured how well the models are able to do coreference resolution by predicting whether "die" or "dat" should be filled into a sentence.
|
63 |
+
For this, we used the [EuroParl corpus](https://www.statmt.org/europarl/).
|
64 |
+
|
65 |
+
#### Finetuning on the whole dataset
|
66 |
+
|
67 |
+
| Model | Accuracy [%] | F1 [%] |
|
68 |
+
|-------------------|--------------------------|--------------|
|
69 |
+
| [Baseline](https://arxiv.org/abs/2001.02943) (LSTM) | | 75.03 |
|
70 |
+
| mBERT | 98.285 | 98.033 |
|
71 |
+
| BERTje | 98.268 | 98.014 |
|
72 |
+
| RobBERT v2 | **99.232** | **99.121** |
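
For readers who want to reproduce the two metrics in the table on their own predictions, here is a small sketch that computes accuracy and binary F1 in percent. This README does not state which class counts as positive or how F1 was averaged, so `positive="dat"` is an arbitrary assumption and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(gold: Sequence[str], pred: Sequence[str],
                    positive: str = "dat") -> Tuple[float, float]:
    """Return (accuracy, F1) in percent for a two-class labelling task."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    correct = sum(g == p for g, p in pairs)
    # Counts for the class treated as positive (an assumption, see above).
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * correct / len(gold), 100 * f1
```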
|
73 |
+
|
74 |
+
#### Finetuning on 10K examples
|
75 |
+
|
76 |
+
We also measured the performance using only 10K training examples.
|
77 |
+
This experiment clearly illustrates that RobBERT outperforms other models when there is little data available.
|
78 |
+
|
79 |
+
| Model | Accuracy [%] | F1 [%] |
|
80 |
+
|-------------------|--------------------------|--------------|
|
81 |
+
| mBERT | 92.157 | 90.898 |
|
82 |
+
| BERTje | 93.096 | 91.279 |
|
83 |
+
| RobBERT v2 | **97.816** | **97.514** |
|
84 |
+
|
85 |
+
#### Using zero-shot word masking task
|
86 |
+
|
87 |
+
Since BERT models are pre-trained using the word masking task, we can use this to predict whether "die" or "dat" is more likely.
|
88 |
+
This experiment shows that RobBERT has internalised more information about Dutch than other models.
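
The idea can be sketched as follows: mask the pronoun, ask the masked language model to score only the two candidates, and keep the higher-scoring one. This assumes `fill_mask` is a Transformers `fill-mask` pipeline, which accepts a `targets` argument; it is not necessarily how the evaluation in the paper was implemented, and `choose_pronoun` is a hypothetical name. Byte-level BPE vocabularies usually mark word-initial tokens with a leading space, so the candidate strings may need adjusting for a real model.

```python
from typing import Callable, Sequence


def choose_pronoun(fill_mask: Callable, masked_sentence: str,
                   candidates: Sequence[str] = ("die", "dat")) -> str:
    """Pick the candidate that the masked language model scores highest."""
    results = fill_mask(masked_sentence, targets=list(candidates))
    # With `targets`, the pipeline returns one scored entry per candidate.
    best = max(results, key=lambda r: r["score"])
    return best["token_str"].strip()
```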
|
89 |
+
|
90 |
+
| Model | Accuracy [%] |
|
91 |
+
|-------------------|--------------------------|
|
92 |
+
| ZeroR | 66.70 |
|
93 |
+
| mBERT | 90.21 |
|
94 |
+
| BERTje | 94.94 |
|
95 |
+
| RobBERT v2 | **98.75** |
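
ZeroR conventionally denotes the baseline that always predicts the most frequent class, which puts the other scores in context. A minimal sketch, with a hypothetical function name and toy labels rather than the EuroParl data:

```python
from collections import Counter
from typing import Sequence


def zero_r_accuracy(train_labels: Sequence[str], test_labels: Sequence[str]) -> float:
    """Accuracy (in percent) of always predicting the most frequent training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority for label in test_labels)
    return 100 * hits / len(test_labels)
```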
|
96 |
+
|
97 |
+
### Part-of-Speech Tagging
|
98 |
+
|
99 |
+
Using the [Lassy UD dataset](https://universaldependencies.org/treebanks/nl_lassysmall/index.html).
|
100 |
+
|
101 |
+
|
102 |
+
| Model | Accuracy [%] |
|
103 |
+
|-------------------|--------------------------|
|
104 |
+
| Frog | 91.7 |
|
105 |
+
| mBERT | **96.5** |
|
106 |
+
| BERTje | 96.3 |
|
107 |
+
| RobBERT v2 | 96.4 |
|
108 |
+
|
109 |
+
|
110 |
+
## Credits and citation
|
111 |
+
|
112 |
+
This project was created by [Pieter Delobelle](https://people.cs.kuleuven.be/~pieter.delobelle), [Thomas Winters](https://thomaswinters.be) and [Bettina Berendt](https://people.cs.kuleuven.be/~bettina.berendt/).
|
113 |
+
If you would like to cite our paper or model, you can use the following BibTeX:
|
114 |
+
|
115 |
+
```
|
116 |
+
@inproceedings{delobelle2022robbert2022,
|
117 |
+
doi = {10.48550/ARXIV.2211.08192},
|
118 |
+
url = {https://arxiv.org/abs/2211.08192},
|
119 |
+
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
|
120 |
+
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
|
121 |
+
title = {RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use},
|
122 |
+
venue = {arXiv},
|
123 |
+
year = {2022},
|
124 |
+
}
|
125 |
+
|
126 |
+
@inproceedings{delobelle2020robbert,
|
127 |
+
title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
|
128 |
+
author = "Delobelle, Pieter and
|
129 |
+
Winters, Thomas and
|
130 |
+
Berendt, Bettina",
|
131 |
+
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
|
132 |
+
month = nov,
|
133 |
+
year = "2020",
|
134 |
+
address = "Online",
|
135 |
+
publisher = "Association for Computational Linguistics",
|
136 |
+
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
|
137 |
+
doi = "10.18653/v1/2020.findings-emnlp.292",
|
138 |
+
pages = "3255--3265"
|
139 |
+
}
|
140 |
+
```
|