Regional Bengali text to IPA transcription - umt5-base

This is a fine-tuned version of google/umt5-base for generating IPA transcriptions from regional Bengali text. It was fine-tuned on the dataset of the Bengali.AI competition “ভাষামূল: মুখের ভাষার খোঁজে”.

Scores achieved so far (test set):

Word error rate (wer): 0.02390405721962450
Char error rate (cer): 0.01011514943093060
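
For readers unfamiliar with these metrics, the sketch below shows how word and character error rates are conventionally computed: the Levenshtein edit distance between reference and prediction, divided by the reference length. It is only an illustration. The competition's actual scoring script is not shown here, so the function names (`edit_distance`, `wer`, `cer`), the whitespace tokenization, and the per-sentence normalization are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()  # assumption: words are whitespace-separated
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical IPA strings, used only to exercise the functions.
reference = "ami bʱat kʰai"
prediction = "ami bat kʰai"
print(wer(reference, prediction), cer(reference, prediction))
```

A single dropped character counts as one wrong word out of three for WER but only one wrong character out of thirteen for CER, which is why CER is typically the lower of the two figures.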

Supported district tokens:

Kishoreganj
Narail
Narsingdi
Chittagong
Rangpur
Tangail
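
Because the model expects every input to start with a district token, a small helper can build the prefixed string and reject districts outside the list above. This is a minimal sketch: `format_input` and `SUPPORTED_DISTRICTS` are hypothetical names, and it assumes the token is the district name wrapped in angle brackets (for example `<Narail>`), which should be checked against the tokenizer's vocabulary before relying on it.

```python
# Districts listed in this model card.
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
)


def format_input(district, bengali_text):
    """Return the model input in the form '<district> <bengali_text>'."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(f"unsupported district: {district!r}")
    # ASSUMPTION: the district token is the name in angle brackets.
    return f"<{district}> {bengali_text.strip()}"


print(format_input("Narail", "bengali_text_here"))
```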

Loading & using the model

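# Load the model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
# The input text MUST have the format: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors='pt').input_ids
# Use generate() for decoding; a bare forward pass model(text_ids) fails
# because a seq2seq model also needs decoder inputs
output_ids = model.generate(text_ids, max_length=512)
# Convert the generated token ids back into the IPA string
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))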

Using the pipeline

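# Use a pipeline as a high-level helper
import torch
from transformers import pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-umt5base", device=device)
# Each entry of `texts` must have the format: <district> <contents>
texts = ["<district> bengali_text_here"]
batch_size = 8  # example value; adjust to the available memory
outputs = pipe(texts, max_length=512, batch_size=batch_size)
# Each output is a dict; the transcription is under "generated_text"
print([out["generated_text"] for out in outputs])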

Credits

Done by S M Jishanul Islam, Sadia Ahmmed, Sahid Hossain Mustakim

