metadata
license: mit
datasets:
  - oscar
  - DDSC/dagw_reddit_filtered_v1.0.0
  - graelo/wikipedia
language:
  - da
widget:
  - text: Der var engang en [MASK]

What is this?

A pre-trained BERT model (base version, ~110M parameters) for Danish NLP. Rather than being pre-trained from scratch, the model was adapted from the English bert-base-uncased model and paired with a new tokenizer trained on Danish text.

How to use

Test the model using the pipeline from the 🤗 Transformers library:

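from transformers import pipeline

# Build a fill-mask pipeline backed by the Danish model
pipe = pipeline("fill-mask", model="KennethTM/bert-base-uncased-danish")

# Returns candidate tokens for the [MASK] position together with their scores
pipe("Der var engang en [MASK]")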

Or load it using the Auto* classes:

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("KennethTM/bert-base-uncased-danish")
model = AutoModelForMaskedLM.from_pretrained("KennethTM/bert-base-uncased-danish")
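
Once the tokenizer and model are loaded, masked-token prediction can be done by hand. The helper below is a minimal sketch, not part of the original model card: the function name `top_mask_predictions` and its parameters are hypothetical, and it assumes PyTorch is installed and that the input contains at least one mask token.

```python
import torch


def top_mask_predictions(tokenizer, model, text, k=5):
    """Return the k most likely (token, probability) pairs for the first mask in `text`.

    Hypothetical helper for illustration; only the first mask token is scored.
    """
    inputs = tokenizer(text, return_tensors="pt")

    # Locate the mask token(s) in the encoded input
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if len(mask_positions) == 0:
        raise ValueError("text must contain the tokenizer's mask token")

    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        logits = model(**inputs).logits

    # Turn the logits at the first mask position into probabilities
    probs = logits[0, mask_positions[0]].softmax(dim=-1)
    top = probs.topk(k)
    return [
        (tokenizer.convert_ids_to_tokens(int(idx)), float(p))
        for p, idx in zip(top.values, top.indices)
    ]
```

Calling `top_mask_predictions(tokenizer, model, "Der var engang en [MASK]")` with the objects loaded above should return results comparable to the pipeline, although the pipeline applies its own post-processing.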

Model training

The model was trained on multiple Danish datasets with a context length of 512 tokens.

The model weights were initialized from the English bert-base-uncased model, with new word token embeddings created for Danish using WECHSEL.

Training ran in two stages. First, only the word token embeddings were trained, using 1,000,000 samples. Then the whole model was trained for 8 epochs.
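
The two training stages can be illustrated by toggling which parameters receive gradients. This is a sketch under stated assumptions, not the training code used for this model: the helper names are hypothetical, the tiny randomly initialized `BertConfig` is a stand-in so the example runs without downloading weights, and the optimizer, data, and WECHSEL initialization are left out entirely.

```python
from transformers import BertConfig, BertForMaskedLM


def freeze_all_but_word_embeddings(model):
    """Stage 1 (assumed setup): only the word token embedding matrix is trainable."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_input_embeddings().parameters():
        param.requires_grad = True


def unfreeze_all(model):
    """Stage 2 (assumed setup): every parameter is trainable again."""
    for param in model.parameters():
        param.requires_grad = True


def trainable_names(model):
    """List the names of parameters that would be updated by an optimizer."""
    return [name for name, p in model.named_parameters() if p.requires_grad]


# Tiny random model used purely for illustration; the real model is BERT base.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)

freeze_all_but_word_embeddings(model)
stage1 = trainable_names(model)
print("stage 1 trainable parameters:", stage1)

unfreeze_all(model)
stage2 = trainable_names(model)
print("stage 2 trainable parameter tensors:", len(stage2))
```

With BERT's default weight tying, the masked-language-model output projection shares the word embedding matrix, so it is updated in the first stage as well.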

Evaluation

The performance of the pretrained model was evaluated using ScandEval.

| Task | Dataset | Summary |
|---|---|---|
| sentiment-classification | swerec | mcc = 63.02, mcc_se = 2.16, macro_f1 = 62.2, macro_f1_se = 3.61 |
| sentiment-classification | angry-tweets | mcc = 47.21, mcc_se = 0.53, macro_f1 = 64.21, macro_f1_se = 0.53 |
| sentiment-classification | norec | mcc = 42.23, mcc_se = 8.69, macro_f1 = 57.24, macro_f1_se = 7.67 |
| named-entity-recognition | suc3 | micro_f1 = 50.03, micro_f1_se = 4.16, micro_f1_no_misc = 53.55, micro_f1_no_misc_se = 4.57 |
| named-entity-recognition | dane | micro_f1 = 76.44, micro_f1_se = 1.36, micro_f1_no_misc = 80.61, micro_f1_no_misc_se = 1.11 |
| named-entity-recognition | norne-nb | micro_f1 = 68.38, micro_f1_se = 1.72, micro_f1_no_misc = 73.08, micro_f1_no_misc_se = 1.66 |
| named-entity-recognition | norne-nn | micro_f1 = 60.45, micro_f1_se = 1.71, micro_f1_no_misc = 64.39, micro_f1_no_misc_se = 1.8 |
| linguistic-acceptability | scala-sv | mcc = 5.01, mcc_se = 5.41, macro_f1 = 49.46, macro_f1_se = 3.67 |
| linguistic-acceptability | scala-da | mcc = 54.74, mcc_se = 12.22, macro_f1 = 76.25, macro_f1_se = 6.09 |
| linguistic-acceptability | scala-nb | mcc = 19.18, mcc_se = 14.01, macro_f1 = 55.3, macro_f1_se = 8.85 |
| linguistic-acceptability | scala-nn | mcc = 5.72, mcc_se = 5.91, macro_f1 = 49.56, macro_f1_se = 3.73 |
| question-answering | scandiqa-da | em = 26.36, em_se = 1.17, f1 = 32.41, f1_se = 1.1 |
| question-answering | scandiqa-no | em = 26.14, em_se = 1.59, f1 = 32.02, f1_se = 1.59 |
| question-answering | scandiqa-sv | em = 26.38, em_se = 1.1, f1 = 32.33, f1_se = 1.05 |
| speed | speed | speed = 4.55, speed_se = 0.0 |
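
For comparing these results with other models, the summary strings can be turned into numbers. The snippet below is a small sketch that assumes every summary follows the `name = value, name = value` pattern seen above; `parse_summary` is a hypothetical helper and not part of ScandEval.

```python
def parse_summary(summary):
    """Parse 'mcc = 63.02, mcc_se = 2.16' into {'mcc': 63.02, 'mcc_se': 2.16}."""
    metrics = {}
    for item in summary.split(","):
        name, _, value = item.partition("=")
        metrics[name.strip()] = float(value)
    return metrics


# Summary copied from the scala-da row of the evaluation results
scala_da = parse_summary("mcc = 54.74, mcc_se = 12.22, macro_f1 = 76.25, macro_f1_se = 6.09")
print(scala_da["mcc"], scala_da["macro_f1"])
```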