hjb
Initial commit
52bc23d
|
raw
history blame
13.7 kB

��--- language: "da" tags: - �l�ctra - pytorch - danish - ELECTRA-Small - replaced token detection license: "mit" datasets: - DAGW metrics: - f1 --- # �l�ctra - Finetuned for Named Entity Recognition on the [DaNE dataset](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip) (Hvingelby et al., 2020). �l�ctra is a Danish Transformer-based language model created to enhance the variety of Danish NLP resources with a more efficient model compared to previous state-of-the-art (SOTA) models. Initially a cased and an uncased model are released. It was created as part of a Cognitive Science bachelor's thesis. �l�ctra was pretrained with the ELECTRA-Small (Clark et al., 2020) pretraining approach by using the Danish Gigaword Corpus (Str�mberg-Derczynski et al., 2020) and evaluated on Named Entity Recognition (NER) tasks. Since NER only presents a limited picture of �l�ctra's capabilities I am very interested in further evaluations. Therefore, if you employ it for any task, feel free to hit me up your findings! �l�ctra was, as mentioned, created to enhance the Danish NLP capabilties and please do note how this GitHub still does not support the Danish characters "�, � and �" as the title of this repository becomes "-l-ctra". How ironic.=�B� Here is an example on how to load the finetuned �l�ctra-cased model for Named Entity Recognition in [PyTorch](https://pytorch.org/) using the [>��Transformers](https://github.com/huggingface/transformers) library: python from transformers import AutoTokenizer, AutoModelForTokenClassification tokenizer = AutoTokenizer.from_pretrained("Maltehb/-l-ctra_cased_ner_dane") model = AutoModelForTokenClassification.from_pretrained("Maltehb/-l-ctra_cased_ner_dane") ### Evaluation of current Danish Language Models �l�ctra, Danish BERT (DaBERT) and multilingual BERT (mBERT) were evaluated: | Model | Layers | Hidden Size | Params | AVG NER micro-f1 (DaNE-testset) | Average Inference Time (Sec/Epoch) | Download | | --- | --- | --- | --- | --- | --- | --- | | �l�ctra Uncased | 12 | 256 | 13.7M | 78.03 (SD = 1.28) | 10.91 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) | | �l�ctra Cased | 12 | 256 | 14.7M | 80.08 (SD = 0.26) | 10.92 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) | | DaBERT | 12 | 768 | 110M | 84.89 (SD = 0.64) | 43.03 | [Link for model](https://www.dropbox.com/s/19cjaoqvv2jicq9/danish_bert_uncased_v2.zip?dl=1) | | mBERT Uncased | 12 | 768 | 167M | 80.44 (SD = 0.82) | 72.10 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) | | mBERT Cased | 12 | 768 | 177M | 83.79 (SD = 0.91) | 70.56 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) | On [DaNE](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip) (Hvingelby et al., 2020), �l�ctra scores slightly worse than both cased and uncased Multilingual BERT (Devlin et al., 2019) and Danish BERT (Danish BERT, 2019/2020), however, �l�ctra is less than one third the size, and uses significantly fewer computational resources to pretrain and instantiate. For a full description of the evaluation and specification of the model read the thesis: '�l�ctra - A Step Towards More Efficient Danish Natural Language Processing'. ### Pretraining To pretrain �l�ctra it is recommended to build a Docker Container from the [Dockerfile](https://github.com/MalteHB/�l�ctra/tree/master/notebooks/fine-tuning/). Next, simply follow the [pretraining notebooks](https://github.com/MalteHB/�l�ctra/tree/master/infrastructure/Dockerfile/) The pretraining was done by utilizing a single NVIDIA Tesla V100 GPU with 16 GiB, endowed by the Danish data company [KMD](https://www.kmd.dk/). The pretraining took approximately 4 days and 9.5 hours for both the cased and uncased model ### Fine-tuning To fine-tune any �l�ctra model follow the [fine-tuning notebooks](https://github.com/MalteHB/�l�ctra/tree/master/notebooks/fine-tuning/) ### References Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv:2003.10555 [Cs]. http://arxiv.org/abs/2003.10555 Danish BERT. (2020). BotXO. https://github.com/botxo/nordic_bert (Original work published 2019) Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805 Hvingelby, R., Pauli, A. B., Barrett, M., Rosted, C., Lidegaard, L. M., & S�gaard, A. (2020). DaNE: A Named Entity Resource for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 4597 4604. https://www.aclweb.org/anthology/2020.lrec-1.565 Str�mberg-Derczynski, L., Baglini, R., Christiansen, M. H., Ciosici, M. R., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. �., Petersen, M. L., Rystr�m, J. H., & Varab, D. (2020). The Danish Gigaword Project. ArXiv:2005.03521 [Cs]. http://arxiv.org/abs/2005.03521 #### Acknowledgements As the majority of this repository is build upon [the works](https://github.com/google-research/electra) by the team at Google who created ELECTRA, a HUGE thanks to them is in order. A Giga thanks also goes out to the incredible people who collected The Danish Gigaword Corpus (Str�mberg-Derczynski et al., 2020). Furthermore, I would like to thank my supervisor [Riccardo Fusaroli](https://github.com/fusaroli) for the support with the thesis, and a special thanks goes out to [Kenneth Enevoldsen](https://github.com/KennethEnevoldsen) for his continuous feedback. Lastly, i would like to thank KMD, my colleagues from KMD, and my peers and co-students from Cognitive Science for encouriging me to keep on working hard and holding my head up high! #### Contact For help or further information feel free to connect with the author Malte H�jmark-Bertelsen on [hjb@kmd.dk](mailto:hjb@kmd.dk?subject=[GitHub]%20�l�ctraCasedNER) or any of the following platforms: [<img align="left" alt="MalteHB | Twitter" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter] [<img align="left" alt="MalteHB | LinkedIn" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin] [<img align="left" alt="MalteHB | Instagram" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/instagram.svg" />][instagram] <br /> </details> [twitter]: https://twitter.com/malteH_B [instagram]: https://www.instagram.com/maltemusen/ [linkedin]: https://www.linkedin.com/in/malte-h%C3%B8jmark-bertelsen-9a618017b/