Amharic WordPiece Tokenizer

This repo contains a WordPiece tokenizer trained on the Amharic subset of the oscar and mc4 datasets. It's the same as the BERT tokenizer but trained from scratch on an amharic dataset with a vocabulary size of 30522.

How to use

You can load the tokenizer from huggingface hub as follows.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-amharic-tokenizer")
tokenizer.tokenize("የዓለምአቀፉ ነጻ ንግድ መስፋፋት ድህነትን ለማሸነፍ በሚደረገው ትግል አንዱ ጠቃሚ መሣሪያ ሊሆን መቻሉ ብዙ የሚነገርለት ጉዳይ ነው።")

Output:

['የዓለም', '##አቀፉ', 'ነጻ', 'ንግድ', 'መስፋፋት', 'ድህነትን', 'ለማሸነፍ', 'በሚደረገው', 'ትግል', 'አንዱ', 'ጠቃሚ', 'መሣሪያ', 'ሊሆን', 'መቻሉ', 'ብዙ', 'የሚነገርለት', 'ጉዳይ', 'ነው', '።']

rasyosef
/

bert-amharic-tokenizer

Amharic WordPiece Tokenizer

How to use

Datasets used to train rasyosef/bert-amharic-tokenizer