hjb commited on
Commit
f815645
1 Parent(s): 5288059

Initial commit

Browse files
README.md ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: "da"
3
+ tags:
4
+ - ælæctra
5
+ - pytorch
6
+ - danish
7
+ - ELECTRA-Small
8
+ - replaced token detection
9
+ license: "mit"
10
+ datasets:
11
+ - DAGW
12
+ metrics:
13
+ - f1
14
+ ---
15
+
16
+ # Ælæctra - Finetuned for Named Entity Recognition on the [DaNE dataset](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip) (Hvingelby et al., 2020).
17
+ **Ælæctra** is a Danish Transformer-based language model created to enhance the variety of Danish NLP resources with a more efficient model compared to previous state-of-the-art (SOTA) models.
18
+
19
+ Ælæctra was pretrained with the ELECTRA-Small (Clark et al., 2020) pretraining approach by using the Danish Gigaword Corpus (Strømberg-Derczynski et al., 2020) and evaluated on Named Entity Recognition (NER) tasks. Since NER only presents a limited picture of Ælæctra's capabilities I am very interested in further evaluations. Therefore, if you employ it for any task, feel free to hit me up your findings!
20
+
21
+ Ælæctra was, as mentioned, created to enhance the Danish NLP capabilties and please do note how this GitHub still does not support the Danish characters "*Æ, Ø and Å*" as the title of this repository becomes "*-l-ctra*". How ironic.🙂
22
+
23
+ Here is an example on how to load the finetuned Ælæctra-uncased model for Named Entity Recognition in [PyTorch](https://pytorch.org/) using the [🤗Transformers](https://github.com/huggingface/transformers) library:
24
+
25
+ ```python
26
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
27
+
28
+ tokenizer = AutoTokenizer.from_pretrained("Maltehb/-l-ctra_uncased_ner_dane")
29
+ model = AutoModelForTokenClassification.from_pretrained("Maltehb/-l-ctra_uncased_ner_dane")
30
+ ```
31
+
32
+ ### Evaluation of current Danish Language Models
33
+
34
+ Ælæctra, Danish BERT (DaBERT) and multilingual BERT (mBERT) were evaluated:
35
+
36
+ | Model | Layers | Hidden Size | Params | AVG NER micro-f1 (DaNE-testset) | Average Inference Time (Sec/Epoch) | Download |
37
+ | --- | --- | --- | --- | --- | --- | --- |
38
+ | Ælæctra Uncased | 12 | 256 | 13.7M | 78.03 (SD = 1.28) | 10.91 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) |
39
+ | Ælæctra Cased | 12 | 256 | 14.7M | 80.08 (SD = 0.26) | 10.92 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) |
40
+ | DaBERT | 12 | 768 | 110M | 84.89 (SD = 0.64) | 43.03 | [Link for model](https://www.dropbox.com/s/19cjaoqvv2jicq9/danish_bert_uncased_v2.zip?dl=1) |
41
+ | mBERT Uncased | 12 | 768 | 167M | 80.44 (SD = 0.82) | 72.10 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) |
42
+ | mBERT Cased | 12 | 768 | 177M | 83.79 (SD = 0.91) | 70.56 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) |
43
+
44
+
45
+ On [DaNE](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip) (Hvingelby et al., 2020), Ælæctra scores slightly worse than both cased and uncased Multilingual BERT (Devlin et al., 2019) and Danish BERT (Danish BERT, 2019/2020), however, Ælæctra is less than one third the size, and uses significantly fewer computational resources to pretrain and instantiate. For a full description of the evaluation and specification of the model read the thesis: 'Ælæctra - A Step Towards More Efficient Danish Natural Language Processing'.
46
+
47
+ ### Pretraining
48
+ To pretrain Ælæctra it is recommended to build a Docker Container from the [Dockerfile](https://github.com/MalteHB/Ælæctra/tree/master/notebooks/fine-tuning/). Next, simply follow the [pretraining notebooks](https://github.com/MalteHB/Ælæctra/tree/master/infrastructure/Dockerfile/)
49
+
50
+ The pretraining was done by utilizing a single NVIDIA Tesla V100 GPU with 16 GiB, endowed by the Danish data company [KMD](https://www.kmd.dk/). The pretraining took approximately 4 days and 9.5 hours for both the cased and uncased model
51
+
52
+ ### Fine-tuning
53
+ To fine-tune any Ælæctra model follow the [fine-tuning notebooks](https://github.com/MalteHB/Ælæctra/tree/master/notebooks/fine-tuning/)
54
+
55
+ ### References
56
+ Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv:2003.10555 [Cs]. http://arxiv.org/abs/2003.10555
57
+
58
+ Danish BERT. (2020). BotXO. https://github.com/botxo/nordic_bert (Original work published 2019)
59
+
60
+ Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805
61
+
62
+ Hvingelby, R., Pauli, A. B., Barrett, M., Rosted, C., Lidegaard, L. M., & Søgaard, A. (2020). DaNE: A Named Entity Resource for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 4597–4604. https://www.aclweb.org/anthology/2020.lrec-1.565
63
+
64
+ Strømberg-Derczynski, L., Baglini, R., Christiansen, M. H., Ciosici, M. R., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. Å., Petersen, M. L., Rystrøm, J. H., & Varab, D. (2020). The Danish Gigaword Project. ArXiv:2005.03521 [Cs]. http://arxiv.org/abs/2005.03521
65
+
66
+
67
+ #### Acknowledgements
68
+ As the majority of this repository is build upon [the works](https://github.com/google-research/electra) by the team at Google who created ELECTRA, a HUGE thanks to them is in order.
69
+
70
+ A Giga thanks also goes out to the incredible people who collected The Danish Gigaword Corpus (Strømberg-Derczynski et al., 2020).
71
+
72
+ Furthermore, I would like to thank my supervisor [Riccardo Fusaroli](https://github.com/fusaroli) for the support with the thesis, and a special thanks goes out to [Kenneth Enevoldsen](https://github.com/KennethEnevoldsen) for his continuous feedback.
73
+
74
+ Lastly, i would like to thank KMD, my colleagues from KMD, and my peers and co-students from Cognitive Science for encouriging me to keep on working hard and holding my head up high!
75
+
76
+ #### Contact
77
+
78
+ For help or further information feel free to connect with the author Malte Højmark-Bertelsen on [hjb@kmd.dk](mailto:hjb@kmd.dk?subject=[GitHub]%20ÆlæctraUncasedNER) or any of the following platforms:
79
+
80
+ [<img align="left" alt="MalteHB | Twitter" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter]
81
+ [<img align="left" alt="MalteHB | LinkedIn" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin]
82
+ [<img align="left" alt="MalteHB | Instagram" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/instagram.svg" />][instagram]
83
+
84
+ <br />
85
+
86
+ </details>
87
+
88
+ [twitter]: https://twitter.com/malteH_B
89
+ [instagram]: https://www.instagram.com/maltemusen/
90
+ [linkedin]: https://www.linkedin.com/in/malte-h%C3%B8jmark-bertelsen-9a618017b/
config.json ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "ElectraForTokenClassification"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "embedding_size": 128,
7
+ "generator_size": "0.25",
8
+ "hidden_act": "gelu",
9
+ "hidden_dropout_prob": 0.1,
10
+ "hidden_size": 256,
11
+ "id2label": {
12
+ "0": "LABEL_0",
13
+ "1": "LABEL_1",
14
+ "2": "LABEL_2",
15
+ "3": "LABEL_3",
16
+ "4": "LABEL_4",
17
+ "5": "LABEL_5",
18
+ "6": "LABEL_6",
19
+ "7": "LABEL_7",
20
+ "8": "LABEL_8",
21
+ "9": "LABEL_9"
22
+ },
23
+ "initializer_range": 0.02,
24
+ "intermediate_size": 1024,
25
+ "label2id": {
26
+ "LABEL_0": 0,
27
+ "LABEL_1": 1,
28
+ "LABEL_2": 2,
29
+ "LABEL_3": 3,
30
+ "LABEL_4": 4,
31
+ "LABEL_5": 5,
32
+ "LABEL_6": 6,
33
+ "LABEL_7": 7,
34
+ "LABEL_8": 8,
35
+ "LABEL_9": 9
36
+ },
37
+ "layer_norm_eps": 1e-12,
38
+ "max_position_embeddings": 512,
39
+ "model_type": "electra",
40
+ "num_attention_heads": 4,
41
+ "num_hidden_layers": 12,
42
+ "pad_token_id": 0,
43
+ "summary_activation": "gelu",
44
+ "summary_last_dropout": 0.1,
45
+ "summary_type": "first",
46
+ "summary_use_proj": true,
47
+ "type_vocab_size": 2,
48
+ "vocab_size": 32000
49
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c2c38dd1afc7f54b159ab5aceca35912dd11d739b53d243263af9799bb97c16a
3
+ size 54787544
special_tokens_map.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"do_lower_case": true, "strip_accents": false}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff