license: mit
SMALL-100 Model
SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs that achieves competitive results with M2M-100 while being much smaller and faster. It was introduced in this paper (accepted to EMNLP 2022) and initially released in this repository.
The model architecture and config are the same as the M2M-100 implementation, but the tokenizer is modified to adjust the language codes, so for the moment you should load the tokenizer locally from the tokenization_small100.py file.
Demo: https://huggingface.co/spaces/alirezamsh/small100
Note: SMALL100Tokenizer requires sentencepiece, so make sure to install it by running:
pip install sentencepiece
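Because SMALL100Tokenizer is not part of transformers, the tokenization_small100.py file has to be available locally before it can be imported. A minimal sketch of one way to fetch it, assuming the file is hosted in the alirezamsh/small100 repository as for the original model:
# Download the custom tokenizer file next to your script so that
# `from tokenization_small100 import SMALL100Tokenizer` works.
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="alirezamsh/small100", filename="tokenization_small100.py", local_dir=".")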
- Supervised Training
SMaLL-100 is a seq-to-seq model for the translation task. The input to the model is source: [tgt_lang_code] + src_tokens + [EOS], and the target is: tgt_tokens + [EOS] (see the fine-tuning sketch below).
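A minimal fine-tuning sketch of the formatting described above. It assumes SMALL100Tokenizer follows the M2M-100-style API, where setting tgt_lang and passing text_target yields input_ids (with the target-language code prepended to the source) and labels; the example texts are placeholders only:
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")

tokenizer.tgt_lang = "th"  # target-language code is prepended to the source side
batch = tokenizer("Hello world", text_target="สวัสดีชาวโลก", return_tensors="pt")
# batch["input_ids"] -> [tgt_lang_code] + src_tokens + [EOS]
# batch["labels"]    -> tgt_tokens + [EOS]
loss = model(**batch).loss  # standard supervised seq2seq loss
loss.backward()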
small-100-th is the fine-tuned version of SMALL-100 for Thai.
The dataset can be acquired from Vistec.
small-100-th inference
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer  # local file, see the note above
from huggingface_hub import notebook_login

# Log in to the Hugging Face Hub (needed if the checkpoint requires authentication)
notebook_login()
checkpoint = "kimmchii/small-100-th"
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = SMALL100Tokenizer.from_pretrained(checkpoint)
thai_text = "สวัสดี"
# translate Thai to English: the target-language code is set on the tokenizer
# and prepended to the source sequence
tokenizer.tgt_lang = "en"
encoded_th = tokenizer(thai_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_th)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => ["Hello"]
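The opposite direction can be requested by changing tgt_lang. A short sketch, assuming the fine-tuned checkpoint also handles English-to-Thai (it may have been trained on a single direction only):
# translate English to Thai, reusing the model and tokenizer loaded above
tokenizer.tgt_lang = "th"
encoded_en = tokenizer("Hello", return_tensors="pt")
generated_tokens = model.generate(**encoded_en)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))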