Amharic BERT and RoBERTa
BERT and RoBERTa transformer encoder models pretrained on 290 million tokens of Amharic text
This model has the same architecture as bert-tiny and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, totaling 290 million tokens. The tokenizer was trained from scratch on the same corpus and has a vocabulary size of 28k.
It achieves the following results on the evaluation set:
Loss: 4.27
Perplexity: 71.52
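The two figures above are linked: perplexity is conventionally reported as the exponential of the mean cross-entropy loss. The following is a minimal check, assuming the reported loss is a mean cross-entropy measured in nats (the card does not state this explicitly, but the numbers are consistent with it).

```python
import math

# Evaluation loss reported above (assumed: mean cross-entropy in nats).
eval_loss = 4.27

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

print(f"{perplexity:.2f}")  # prints 71.52
```

Because the loss is rounded to two decimals, recomputing perplexity from it only reproduces the reported value approximately in general.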
This model has just 4.18M parameters.
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-tiny-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.5629344582557678,
'token': 9617,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
{'score': 0.3049253523349762,
'token': 9345,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
{'score': 0.0681595504283905,
'token': 10898,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
{'score': 0.028840897604823112,
'token': 9913,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
{'score': 0.008974998258054256,
'token': 15098,
'token_str': 'ዘመናት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመናት ተቆጥሯል ።'}]
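The pipeline returns a list of dictionaries sorted by score, where each score is the softmax probability the model assigns to that token at the masked position. As a small sketch of working with such a result, the snippet below uses the scores and tokens copied from the output above (the other fields are omitted for brevity). The 0.05 cutoff is an arbitrary value chosen for illustration and is not part of the model or the library.

```python
# Scores and tokens copied from the example output above.
predictions = [
    {"score": 0.5629344582557678, "token_str": "ዓመታት"},
    {"score": 0.3049253523349762, "token_str": "ዓመት"},
    {"score": 0.0681595504283905, "token_str": "አመታት"},
    {"score": 0.028840897604823112, "token_str": "አመት"},
    {"score": 0.008974998258054256, "token_str": "ዘመናት"},
]

# Highest-probability candidate for the masked position.
best = max(predictions, key=lambda p: p["score"])

# Total probability mass covered by the five returned candidates.
top5_mass = sum(p["score"] for p in predictions)

# Keep only candidates above an illustrative (hypothetical) threshold.
MIN_SCORE = 0.05
confident = [p["token_str"] for p in predictions if p["score"] >= MIN_SCORE]

print(best["token_str"])
print(f"{top5_mass:.3f}")
print(confident)
```

In this example the five candidates cover roughly 97% of the probability mass, and all of them are spelling or number variants of words meaning "year" or "era", so the model is fairly certain about the meaning of the masked word.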
This model was finetuned and evaluated on the following Amharic NLP tasks: sentiment classification and named entity recognition.
The reported F1 scores are macro averages.
Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
---|---|---|---|---|
bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
xlm-roberta-base | 279M | | 0.83 | 0.73 |
am-roberta | 443M | | 0.82 | 0.69 |
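The F1 scores in the table are macro averages, meaning F1 is computed separately for each class and the per-class values are averaged with equal weight, regardless of how frequent each class is. The sketch below illustrates that definition on toy data. The `macro_f1` helper and the labels are hypothetical, and the card does not state which evaluation code was used (for named entity recognition, F1 is often computed over entity spans rather than single labels).

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy sentiment labels, invented for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

score = macro_f1(y_true, y_pred)
print(f"{score:.4f}")  # prints 0.4889
```

Here the rare `neu` class is never predicted correctly, so its F1 of 0 pulls the macro average down as much as a frequent class would. This is why macro F1 is a stricter measure than accuracy on imbalanced datasets.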