Copy from wietsedv/

- README.md +76 -0
- config.json +21 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +11 -0
- vocab.txt +0 -0

README.md
ADDED
@@ -0,0 +1,76 @@
---
language: nl
thumbnail: "https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png"
tags:
- BERTje
- BERT
- Dutch
---

# BERTje: A Dutch BERT model

## Model description

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.

<img src="https://raw.githubusercontent.com/wietsedv/bertje/master/bertje.png" height="250">

For details, check out our paper on [arXiv](https://arxiv.org/abs/1912.09582) and the code on [GitHub](https://github.com/wietsedv/bertje).

The paper and GitHub page mention fine-tuned models that are available [here](https://huggingface.co/wietsedv).

## How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # PyTorch
model = TFAutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")  # TensorFlow
```
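
As a quick sanity check that the checkpoint loads correctly, the masked-LM head it ships with (`BertForMaskedLM`, see `config.json` below) can be exercised through the `fill-mask` pipeline. A minimal sketch; the example sentence is our own:

```python
from transformers import pipeline

# The checkpoint includes a masked-LM head, so fill-mask works out of the box.
unmasker = pipeline("fill-mask", model="GroNLP/bert-base-dutch-cased")

# Example sentence (our own); the model ranks candidates for [MASK].
for prediction in unmasker("Amsterdam is de [MASK] van Nederland."):
    print(prediction["token_str"], round(prediction["score"], 3))
```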

## Benchmarks

The arXiv paper lists benchmarks. Here are a couple of comparisons between BERTje, multilingual BERT, BERT-NL and RobBERT that were done after writing the paper. Unlike some other comparisons, the fine-tuning procedures for these benchmarks are identical for each pre-trained model. You may be able to achieve higher scores for individual models by optimizing fine-tuning procedures.

More experimental results will be added to this page when they are finished. Technical details about how we fine-tuned these models will be published later, as well as downloadable fine-tuned checkpoints.

All of the tested models are *base*-sized (12 layers) with cased tokenization.

Headers in the tables below link to the original data sources. Each score links to the model page for that specific fine-tuned model. These tables will be updated as more fine-tuned models are made available.

### Named Entity Recognition

| Model | [CoNLL-2002](https://www.clips.uantwerpen.be/conll2002/ner/) | [SoNaR-1](https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus) | spaCy UD LassySmall |
| --- | --- | --- | --- |
| **BERTje** | [**90.24**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner) | [**84.93**](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-sonar-ner) | [86.10](https://huggingface.co/wietsedv/bert-base-dutch-cased-finetuned-udlassy-ner) |
| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | [88.61](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-conll2002-ner) | [84.19](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-sonar-ner) | [**86.77**](https://huggingface.co/wietsedv/bert-base-multilingual-cased-finetuned-udlassy-ner) |
| [BERT-NL](http://textdata.nl) | 85.05 | 80.45 | 81.62 |
| [RobBERT](https://github.com/iPieter/RobBERT) | 84.72 | 81.98 | 79.84 |
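
The linked scores are ready-made fine-tuned checkpoints that can be loaded directly. A minimal sketch using the CoNLL-2002 BERTje checkpoint from the first row; the aggregation setting and example sentence are our own choices, not part of the benchmark setup:

```python
from transformers import pipeline

# BERTje fine-tuned on CoNLL-2002 Dutch NER (first row of the table above).
ner = pipeline(
    "token-classification",
    model="wietsedv/bert-base-dutch-cased-finetuned-conll2002-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

# Example sentence (our own).
for entity in ner("Wietse de Vries werkt aan de Rijksuniversiteit Groningen."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```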

### Part-of-speech tagging

| Model | [UDv2.5 LassySmall](https://universaldependencies.org/treebanks/nl_lassysmall/index.html) |
| --- | --- |
| **BERTje** | **96.48** |
| [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) | 96.20 |
| [BERT-NL](http://textdata.nl) | 96.10 |
| [RobBERT](https://github.com/iPieter/RobBERT) | 95.91 |

### BibTeX entry and citation info

```bibtex
@misc{devries2019bertje,
  title = {{BERTje}: {A} {Dutch} {BERT} {Model}},
  shorttitle = {{BERTje}},
  author = {de Vries, Wietse and van Cranenburgh, Andreas and Bisazza, Arianna and Caselli, Tommaso and van Noord, Gertjan and Nissim, Malvina},
  year = {2019},
  month = dec,
  howpublished = {arXiv:1912.09582},
  url = {http://arxiv.org/abs/1912.09582},
}
```
config.json
ADDED
@@ -0,0 +1,21 @@
{
  "_name_or_path": "wietsedv/bert-base-dutch-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 3,
  "type_vocab_size": 2,
  "vocab_size": 30000
}
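
The hyperparameters above can be inspected without downloading any weights by loading the config on its own. A minimal sketch; the attribute names mirror the JSON keys:

```python
from transformers import AutoConfig

# Fetch just the config shown above (no weight download needed).
config = AutoConfig.from_pretrained("GroNLP/bert-base-dutch-cased")

# BERT-base geometry: 12 layers, 12 attention heads, 768-dim hidden states.
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)

# Note the non-default pad_token_id (3) and the 30k Dutch WordPiece vocabulary.
print(config.pad_token_id, config.vocab_size)
```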
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e7bbada9bd1f19adb55f62096564080c4f58f037bfe7aa9084dfd7781d18249c
size 436409143
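
This file is a Git LFS pointer rather than the weights themselves: the `oid` and `size` identify a ~436 MB binary stored out-of-band (the same holds for `tf_model.h5` below). `from_pretrained` resolves this transparently, but the raw file can also be fetched explicitly; a minimal sketch assuming `huggingface_hub` is installed:

```python
from huggingface_hub import hf_hub_download

# Resolve the LFS pointer and download the actual PyTorch weights (~436 MB).
local_path = hf_hub_download(
    repo_id="GroNLP/bert-base-dutch-cased",
    filename="pytorch_model.bin",
)
print(local_path)  # path of the cached weight file
```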
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0a659525e7b8a92c53f9cf0d6e42ee7a15a7aabd2ba63298ab1a3b4c4105e85e
size 436587288
tokenizer_config.json
ADDED
@@ -0,0 +1,11 @@
{
  "do_lower_case": false,
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "tokenize_chinese_chars": true,
  "strip_accents": null,
  "model_max_length": 512
}
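
`do_lower_case: false` is what the *cased* in the model name refers to: input casing is preserved end to end. A minimal sketch; the example words and sentence are our own:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

# do_lower_case=false: casing is preserved, so these are distinct inputs.
print(tokenizer.tokenize("Groningen"))
print(tokenizer.tokenize("groningen"))

# Special tokens from special_tokens_map.json wrap every encoded sequence.
print(tokenizer.decode(tokenizer.encode("Dit is een zin.")))  # [CLS] ... [SEP]
```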
vocab.txt
ADDED
The diff for this file is too large to render.