---
library_name: transformers
datasets:
  - oscar
  - mc4
  - rasyosef/amharic-sentences-corpus
language:
  - am
metrics:
  - perplexity
pipeline_tag: fill-mask
widget:
  - text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
    example_title: Example 1
  - text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
    example_title: Example 2
  - text: >-
      ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ
      ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ
      ነው።
    example_title: Example 3
  - text: >-
      ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት
      አንዱ ነው።
    example_title: Example 4
---

# bert-tiny-amharic

This model has the same architecture as bert-tiny and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, a total of 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.
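
As a quick check, the tokenizer can be pulled from the Hub and its vocabulary inspected (a minimal sketch using the standard `transformers` API and the repo id from the usage example below):

```python
from transformers import AutoTokenizer

# Tokenizer trained from scratch on the Amharic pretraining corpus
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-tiny-amharic")

print(tokenizer.vocab_size)  # 28k subword vocabulary
print(tokenizer.tokenize("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።"))
```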

It achieves the following results on the evaluation set:

- Loss: 4.27
- Perplexity: 71.52

This model has just 4.18M parameters.
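
The two evaluation numbers are consistent, since perplexity is the exponential of the cross-entropy loss, and the parameter count can be verified on the loaded checkpoint (a minimal sketch using standard `transformers` helpers):

```python
import math

from transformers import AutoModelForMaskedLM

# Perplexity is exp(loss): exp(4.27) ≈ 71.5
print(math.exp(4.27))

# Total parameter count of the pretrained checkpoint
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-tiny-amharic")
print(f"{model.num_parameters() / 1e6:.2f}M parameters")  # ≈ 4.18M
```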

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-tiny-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.5629344582557678,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.3049253523349762,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.0681595504283905,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.028840897604823112,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.008974998258054256,
  'token': 15098,
  'token_str': 'ዘመናት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመናት ተቆጥሯል ።'}]
```
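
The same top-5 predictions can be reproduced without the pipeline by calling the model directly (a minimal sketch using the standard `transformers` and `torch` APIs):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-tiny-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-tiny-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the 5 highest-probability tokens
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_index].softmax(dim=-1)
top = torch.topk(probs, k=5)

for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{score:.3f}  {tokenizer.decode(token_id)}")
```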

## Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks: sentiment classification and named entity recognition.

### Finetuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|---|---|---|---|---|
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| am-roberta | 443M | | 0.82 | 0.69 |
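
The exact finetuning setups behind these scores are not reproduced here, but attaching a classification head to this checkpoint follows the usual `transformers` pattern (a minimal sketch; the two-example dataset, label scheme, and hyperparameters are illustrative placeholders, not the configuration used for the reported results):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-tiny-amharic")
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/bert-tiny-amharic",
    num_labels=2,  # e.g. negative / positive sentiment
)

# Tiny illustrative dataset ("It is good." / "It is bad.");
# replace with a real Amharic sentiment corpus.
data = Dataset.from_dict({"text": ["ጥሩ ነው።", "መጥፎ ነው።"], "label": [1, 0]})
data = data.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-tiny-amharic-sentiment", num_train_epochs=1),
    train_dataset=data,
)
trainer.train()
```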