metadata
language: ar
license: apache-2.0
datasets:
- AQMAR
- ANERcorp
thumbnail: >-
https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- Text Classification
- token-classification
- sequence-tagger-model
metrics:
- f1
widget:
- text: >-
اختارها خيري بشارة كممثلة، دون سابقة معرفة أو تجربة تمثيلية، لتقف بجانب
فاتن حمامة في فيلم «يوم مر ويوم حلو» (1988) وهي ما زالت شابة لم تتخطَ
عامها الثاني
Arabic NER Model for AQMAR dataset
Training ran for 86 epochs with a batch size of 48 and a linearly decaying learning rate that started at 0.3 and ended at 2e-05. The model stacks fastText word embeddings with forward and backward Flair embeddings.
Original Dataset:
Results:
- F1-score (micro) 0.9323
- F1-score (macro) 0.9272
Class | True Positives | False Positives | False Negatives | Precision | Recall | class-F1 |
---|---|---|---|---|---|---|
LOC | 164 | 7 | 13 | 0.9591 | 0.9266 | 0.9425 |
MISC | 398 | 22 | 37 | 0.9476 | 0.9149 | 0.9310 |
ORG | 65 | 6 | 9 | 0.9155 | 0.8784 | 0.8966 |
PER | 199 | 13 | 13 | 0.9387 | 0.9387 | 0.9387 |
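The per-class and aggregate scores above follow from the true-positive, false-positive, and false-negative counts in the table. A minimal sketch in plain Python, assuming the standard definitions of precision, recall, and F1 (the helper name `prf` is illustrative and not part of Flair):

```python
# Counts copied from the results table: (TP, FP, FN) per class.
counts = {
    "LOC": (164, 7, 13),
    "MISC": (398, 22, 37),
    "ORG": (65, 6, 9),
    "PER": (199, 13, 13),
}


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro F1: pool the counts over all classes, then score once.
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = prf(tp, fp, fn)[2]

for label, (p, r, f1) in per_class.items():
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
print(f"micro F1={micro_f1:.4f} macro F1={macro_f1:.4f}")
```

Rounded to four decimals, this reproduces the micro and macro F1 values reported above.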
Usage
from flair.data import Sentence
from flair.models import SequenceTagger
import pyarabic.araby as araby  # Arabic text utilities; not used below
from icecream import ic
# load the tagger from the Hugging Face model hub
arTagger = SequenceTagger.load('megantosh/flair-arabic-MSA-aqmar')
sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')
# predict NER tags (both sentences go through the same tagger)
arTagger.predict(sentence)
arTagger.predict(arSentence)
# print each sentence with its predicted tags; to_tagged_string is a method
ic(sentence.to_tagged_string())
ic(arSentence.to_tagged_string())
Example
See an example from a similar NER model in Flair.
Model Configuration
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('ar')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4396, out_features=4396, bias=True)
  (rnn): LSTM(4396, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=14, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
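The layer sizes in this printout can be cross-checked with a little arithmetic. A short sketch, assuming the remainder of the 4396-dimensional input, once the two 2048-unit Flair language models are subtracted, is the size of the fastText `WordEmbeddings('ar')` vectors (that size is not shown in the printout itself):

```python
# Sizes read from the model printout.
flair_hidden = 2048   # hidden size of each Flair LSTM language model
tagger_input = 4396   # in_features of embedding2nn and of the tagger LSTM
tagger_hidden = 256   # hidden size of the bidirectional tagger LSTM

# Assumption: whatever is left after the two Flair embeddings
# is the word-embedding dimension.
word_dim = tagger_input - 2 * flair_hidden

# A bidirectional LSTM concatenates forward and backward states,
# which gives the in_features of the final linear layer.
bilstm_output = 2 * tagger_hidden

print(word_dim)       # 300
print(bilstm_output)  # 512
```

The second value matches the `in_features=512` of the final `Linear` layer, which maps each token to 14 tag scores.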
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Corpus: "Corpus: 3025 train + 336 dev + 373 test sentences"
2021-03-31 22:19:50,654 ----------------------------------------------------------------------------------------------------
2021-03-31 22:19:50,654 Parameters:
2021-03-31 22:19:50,654 - learning_rate: "0.3"
2021-03-31 22:19:50,654 - mini_batch_size: "48"
2021-03-31 22:19:50,654 - patience: "3"
2021-03-31 22:19:50,654 - anneal_factor: "0.5"
2021-03-31 22:19:50,654 - max_epochs: "150"
2021-03-31 22:19:50,654 - shuffle: "True"
2021-03-31 22:19:50,654 - train_with_dev: "False"
2021-03-31 22:19:50,654 - batch_growth_annealing: "False"
2021-03-31 22:19:50,655 ------------------------------------
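The `learning_rate`, `patience`, and `anneal_factor` values point to a plateau-based annealing scheme. A rough sketch of the resulting sequence of rates, assuming the rate is multiplied by `anneal_factor` each time the dev score stops improving for `patience` epochs (the epochs at which that happened are not in this log, and `annealed_rates` is a hypothetical helper, not a Flair function):

```python
def annealed_rates(initial, factor, steps):
    """Return the learning rate after 0..steps annealing events."""
    return [initial * factor ** k for k in range(steps + 1)]


# learning_rate and anneal_factor as reported in the log
rates = annealed_rates(0.3, 0.5, 4)
for k, rate in enumerate(rates):
    print(f"after {k} anneals: {rate:g}")
```

Each annealing event halves the rate, so the step size shrinks geometrically rather than by a fixed amount.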
Due to some formatting errors, your code might appear like this.
Citation
If you use this model in your work, please consider citing:
@unpublished{MMHU21,
author = "M. Megahed",
title = "Sequence Labeling Architectures in Diglossia",
note = "In Review",
}