language: nl
thumbnail: https://github.com/iPieter/RobBERT/raw/master/res/robbert_2023_logo.png
tags:
  - Dutch
  - Flemish
  - RoBERTa
  - RobBERT
  - BERT
license: mit
datasets:
  - oscar
  - dbrd
  - lassy-ud
  - europarl-mono
  - conll2002
widget:
  - text: >-
      Hallo, mijn naam is RobBERT-2023. Het <mask> taalmodel van UGent en KU
      Leuven.

RobBERT-2023: A Dutch RoBERTa-based Language Model

RobBERT-2023: Keeping Dutch Language Models Up-To-Date

RobBERT-2023 is the 2023 release of the Dutch RobBERT model. It is a new version of the original pdelobelle/robbert-v2-dutch-base model, trained on the 2023 version of the OSCAR dataset. We release a base model, and this time we also release an additional large model with 355M parameters (3x the size of robbert-2022-base). We are particularly proud of the performance of both models, which surpass the robbert-v2-base and robbert-2022-base models by +2.9 and +0.9 points on the DUMB benchmark from GroNLP. In addition, robbert-2023-dutch-large surpasses BERTje by +18.6 points.

The original RobBERT model was released in January 2020. Dutch has evolved a lot since then; for example, the COVID-19 pandemic introduced a wide range of new words that were suddenly used daily. Many world facts that the original model considered true have also changed. To account for this and other changes in usage, we release a new Dutch BERT model trained on data from 2022: RobBERT-2023. More in-depth information about RobBERT-2023 can be found in our blog post, the original RobBERT paper and the RobBERT GitHub repository.

How to use

RobBERT-2023 and RobBERT both use the RoBERTa architecture and pre-training, but with a Dutch tokenizer and Dutch training data. RoBERTa is the robustly optimized English BERT model, making it even more powerful than the original BERT model. Because the architecture is the same, RobBERT can easily be fine-tuned and used for inference with code written for RoBERTa models and with most code written for BERT models, e.g. as provided by the HuggingFace Transformers library.

By default, RobBERT-2023 has the masked language model head used in training. This can be used as a zero-shot way to fill masks in sentences. It can be tested for free on RobBERT's hosted inference API on HuggingFace. You can also create a new prediction head for your own task by using any of HuggingFace's RoBERTa runners or their fine-tuning notebooks and changing the model name to DTAI-KULeuven/robbert-2023-dutch-base.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")

You can then use most of HuggingFace's BERT-based notebooks for fine-tuning RobBERT-2023 on your own Dutch-language dataset.
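The masked language model head mentioned above can also be tried locally. The following is a minimal sketch, assuming the `transformers` library with a backend such as PyTorch is installed and that the model can be downloaded from the HuggingFace Hub. The helper names `fill_mask` and `top_tokens` are hypothetical and not part of the RobBERT repository; the example sentence is the one from this model card's widget.

```python
from typing import Dict, List

# Model ID as used elsewhere in this README.
MODEL_ID = "DTAI-KULeuven/robbert-2023-dutch-base"


def top_tokens(predictions: List[Dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring token strings from fill-mask output.

    Expects the list of dicts that the fill-mask pipeline returns for a
    single masked position (each with a `score` and a `token_str` key).
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def fill_mask(text: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Run the masked language model head on `text`.

    `text` must contain exactly one `<mask>` token. The model is
    downloaded on first use, so this needs network access.
    """
    # Imported lazily so that `top_tokens` can be used without transformers.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    return unmasker(text)
```

Calling `top_tokens(fill_mask("Hallo, mijn naam is RobBERT-2023. Het <mask> taalmodel van UGent en KU Leuven."))` would return the three most likely fillers for the masked position.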

Comparison of Available Dutch BERT models

There is a wide variety of Dutch BERT-based models available for fine-tuning on your tasks. Here's a quick summary to find the one that suits your need:

  • DTAI-KULeuven/robbert-2023-dutch-large: The RobBERT-2023 is the first Dutch large (355M parameters) model. It is trained on OSCAR2023 with a new tokenizer, using our Tik-to-Tok method.
  • (this model) DTAI-KULeuven/robbert-2023-dutch-base: The RobBERT-2023 is a new RobBERT model on the OSCAR2023 dataset with a completely new tokenizer. It is helpful for tasks that rely on words and/or information about more recent events.
  • DTAI-KULeuven/robbert-2022-dutch-base: The RobBERT-2022 is a further pre-trained RobBERT model on the OSCAR2022 dataset. It is helpful for tasks that rely on words and/or information about more recent events.
  • pdelobelle/robbert-v2-dutch-base: The RobBERT model has for years been the best performing BERT-like model for most language tasks. It is trained on a large Dutch webcrawled dataset (OSCAR) and uses the superior RoBERTa architecture, which robustly optimized the original BERT model.
  • DTAI-KULeuven/robbertje-1-gb-merged: The RobBERTje model is a distilled version of RobBERT that is about half the size and four times faster at inference. This can help deploy more scalable language models for your language task.

There's also the GroNLP/bert-base-dutch-cased "BERTje" model. This model uses the outdated basic BERT model, and is trained on a smaller corpus of clean Dutch texts. Thanks to RobBERT's more recent architecture as well as its larger and more real-world-like training corpus, most researchers and practitioners seem to achieve higher performance on their language tasks with the RobBERT model.
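When choosing between these models, it can help to check their sizes yourself. The following is a minimal sketch, assuming `transformers` and PyTorch are installed and the model can be downloaded from the HuggingFace Hub. The function names `count_parameters` and `load_and_count` are hypothetical. The count covers the encoder as loaded by `AutoModel`, so it may differ somewhat from the rounded figures quoted above.

```python
from typing import Any


def count_parameters(model: Any) -> int:
    """Sum the element counts of all parameter tensors of a PyTorch-style model."""
    return sum(p.numel() for p in model.parameters())


def load_and_count(model_id: str) -> int:
    """Download `model_id` from the HuggingFace Hub and count its parameters."""
    # Imported lazily: requires `transformers` and PyTorch, plus network access.
    from transformers import AutoModel

    return count_parameters(AutoModel.from_pretrained(model_id))
```

For example, `load_and_count("DTAI-KULeuven/robbert-2023-dutch-large")` and `load_and_count("DTAI-KULeuven/robbertje-1-gb-merged")` can be compared side by side before deciding which model fits your deployment budget.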

How to Replicate Our Paper Experiments

Replicating our paper experiments is described in detail in the RobBERT repository README. The pretraining depends on the model; for RobBERT-2023 it is based on our Tik-to-Tok method.

Name Origin of RobBERT

Most BERT-like models have the word BERT in their name (e.g. RoBERTa, ALBERT, CamemBERT, and many, many others). As such, we queried our original RobBERT model using its masked language model to name itself <mask>bert using all kinds of prompts, and it consistently called itself RobBERT. We thought it was really quite fitting, given that RobBERT is a very Dutch name (and thus clearly a Dutch language model), and additionally has a high similarity to its root architecture, namely RoBERTa.

Since "rob" is a Dutch word for a seal, we decided to draw a seal and dress it up like Bert from Sesame Street for the RobBERT logo.

Credits and citation

The suite of RobBERT models was created by Pieter Delobelle, Thomas Winters, Bettina Berendt and François Remy. If you would like to cite our paper or model, you can use the following BibTeX:

@misc{delobelle2023robbert2023conversion,
author = {Delobelle, P and Remy, F},
month = {Sep},
organization = {Antwerp, Belgium},
title = {RobBERT-2023: Keeping Dutch Language Models Up-To-Date at a Lower Cost Thanks to Model Conversion},
year = {2023},
startyear = {2023},
startmonth = {Sep},
startday = {22},
finishyear = {2023},
finishmonth = {Sep},
finishday = {22},
venue = {The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)},
day = {22},
publicationstatus = {published},
url= {https://clin33.uantwerpen.be/abstract/robbert-2023-keeping-dutch-language-models-up-to-date-at-a-lower-cost-thanks-to-model-conversion/}
}

@inproceedings{delobelle2022robbert2022,
  doi = {10.48550/ARXIV.2211.08192},
  url = {https://arxiv.org/abs/2211.08192},
  author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use},
  venue = {arXiv},
  year = {2022},
}

@inproceedings{delobelle2020robbert,
    title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
    author = "Delobelle, Pieter  and
      Winters, Thomas  and
      Berendt, Bettina",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
    doi = "10.18653/v1/2020.findings-emnlp.292",
    pages = "3255--3265"
}