---
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
example_title: Example 4
---
# bert-tiny-amharic
This model has the same architecture as [bert-tiny](https://huggingface.co/prajjwal1/bert-tiny) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, on a total of **290 million tokens**. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.
It achieves the following results on the evaluation set:
- `Loss: 4.27`
- `Perplexity: 71.52`
This model has just `4.18M` parameters.
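Perplexity is the exponential of the cross-entropy loss, so the two reported evaluation numbers are consistent with each other:

```python
import math

# Perplexity = exp(cross-entropy loss), so the reported eval loss
# of 4.27 corresponds to the reported perplexity of 71.52.
eval_loss = 4.27
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # 71.52
```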
# How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-tiny-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.5629344582557678,
'token': 9617,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
{'score': 0.3049253523349762,
'token': 9345,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
{'score': 0.0681595504283905,
'token': 10898,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
{'score': 0.028840897604823112,
'token': 9913,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
{'score': 0.008974998258054256,
'token': 15098,
'token_str': 'ዘመናት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመናት ተቆጥሯል ።'}]
```
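Under the hood, the fill-mask pipeline applies a softmax to the model's logits at the `[MASK]` position and returns the highest-scoring tokens. A minimal pure-Python sketch of that ranking step (the logits below are made up for illustration and are not taken from the model):

```python
import math

def top_k_predictions(logits, k=5):
    """Softmax the mask-position logits and return the top-k
    (token, score) pairs, mirroring the pipeline's output order."""
    m = max(logits.values())                              # for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    scores = {tok: e / total for tok, e in exps.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up logits for a few candidate tokens at the [MASK] position
mask_logits = {"ዓመታት": 8.1, "ዓመት": 7.5, "አመታት": 6.0, "አመት": 5.1, "ዘመናት": 3.9}
for token, score in top_k_predictions(mask_logits):
    print(f"{token}: {score:.4f}")
```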
# Finetuning
This model was finetuned and evaluated on the following Amharic NLP tasks:
- **Sentiment Classification**
- Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
- Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
- Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
- Code: https://github.com/rasyosef/amharic-named-entity-recognition
### Finetuned Model Performance
The reported F1 scores are macro averages.
|Model|Size (# params)| Perplexity|Sentiment (F1)| Named Entity Recognition (F1)|
|-----|---------------|-----------|--------------|------------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|**bert-tiny-amharic**|**4.18M**|**71.52**|**0.79**|**0.54**|
|xlm-roberta-base|279M||0.83|0.73|
|am-roberta|443M||0.82|0.69|
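A macro-averaged F1 is the unweighted mean of the per-class F1 scores, so minority classes count as much as majority ones. A minimal pure-Python sketch of the metric (toy labels, not the actual evaluation data):

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for a single class, treating it as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

# Toy example: class 0 F1 = 2/3, class 1 F1 = 4/5, macro F1 = 11/15
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```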