language:
  - af
  - am
  - ar
  - az
  - be
  - bg
  - bn
  - ca
  - ceb
  - co
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fil
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - haw
  - he
  - hi
  - hmn
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - iw
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lb
  - lo
  - lt
  - lv
  - mg
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ny
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sd
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - tr
  - uk
  - und
  - ur
  - uz
  - vi
  - xh
  - yi
  - yo
  - zh
  - zu
license: mit
datasets:
  - mc4

MyT5

Model Details

MyT5 (Myte T5) is a multilingual language model based on the T5 architecture. The model uses a morphologically-driven byte (MYTE) representation described in our paper (Limisiewicz et al., 2024).

Model Description

  • Developed by: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
  • Funded by: University of Washington Fellowship, Charles University Grant Agency
  • Model type: T5
  • Language(s) (NLP): Multilingual
  • License: MIT

Model Sizes

Model Sources

How to Get Started with the Model

The snippet below shows basic usage of the model for multilingual language modeling. The custom tokenizer is available in the GitHub repository, in src/myt5/myt5_tokenizer.py. We also plan to release it on Hugging Face in the future.

from transformers import T5ForConditionalGeneration
from src.myt5.myt5_tokenizer import MyT5Tokenizer
import torch

MODEL_SIZE = "large" # small, base, or large

model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/MyT5_{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
            '„Mamy teraz myszy w wieku',
            '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

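# Encode the prefixes as encoder inputs and the continuations as decoder targets.
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")

# Forward pass; outputs.logits has shape (batch, target_length, vocab_size).
outputs = model(**inputs, labels=targets.input_ids)
# Probability distribution over the vocabulary at each target position.
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)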
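
The logits from the snippet above can be turned into one score per sequence. The following is a minimal sketch, not part of the released code: the helper name `target_log_likelihood` is hypothetical, and the padding id of 0 is an assumption (it matches ByT5, but check the MyT5 tokenizer before relying on it). Synthetic tensors are used so the block runs without downloading the model.

```python
import torch

def target_log_likelihood(logits, labels, pad_token_id=0):
    """Sum log p(target token) over each sequence, skipping padding positions.

    pad_token_id=0 is an assumption borrowed from ByT5, not confirmed for MyT5.
    """
    log_probs = torch.log_softmax(logits, dim=-1)                      # (batch, seq, vocab)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # (batch, seq)
    mask = labels.ne(pad_token_id)                                      # True on real tokens
    return (token_lp * mask).sum(dim=-1)                                # (batch,)

# Synthetic check: uniform logits over a 4-symbol vocabulary.
logits = torch.zeros(2, 3, 4)
labels = torch.tensor([[1, 2, 3], [1, 0, 0]])  # second row has two padding positions
scores = target_log_likelihood(logits, labels)
print(scores)
```

With the real model, the call would be `target_log_likelihood(outputs.logits, targets.input_ids)`, which returns natural-log likelihoods of the three continuations.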

Training Details

Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.
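
To make the span-corruption objective concrete, here is a small illustrative sketch. It is not the training pipeline: the function name `corrupt_spans` is hypothetical, the spans are chosen by hand rather than sampled, the `<extra_id_i>` sentinel names are only a convention used for readability, and whole words stand in for the byte-level tokens the model actually sees.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    spans must be sorted and non-overlapping; end is exclusive.
    """
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"        # illustrative sentinel name
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mark where a span was removed
        targets.append(sentinel)             # target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the removed span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, restored = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(restored))
```

The encoder receives the corrupted sequence and the decoder is trained to produce the target sequence, so the model learns to restore the missing spans.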

Preprocessing

Instead of UTF-8 bytes, we used a morphologically-driven byte representation. See the description in our paper for more details.
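
For context, the sketch below shows only the plain UTF-8 baseline, not the MYTE encoding itself. It counts code points and UTF-8 bytes for the word "mice" in three languages, using words taken from the usage example in this card, to illustrate how unevenly UTF-8 sequence length can vary across scripts.

```python
# The word "mice" in three languages, taken from the usage example in this card.
samples = {"English": "mice", "Polish": "myszy", "Tamil": "எலி"}

# Map each language to (number of code points, number of UTF-8 bytes).
byte_counts = {
    lang: (len(word), len(word.encode("utf-8")))
    for lang, word in samples.items()
}

for lang, (chars, n_bytes) in byte_counts.items():
    print(f"{lang}: {chars} code points, {n_bytes} UTF-8 bytes")
```

The Latin-script words take one byte per character, while each Tamil code point takes three bytes, so a byte-level model sees a much longer sequence for the same content.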

Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper. The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

Computational Infrastructure

Models were trained on TPUs available through the TPU Research Cloud (TRC). We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model. Training each model took:

  • Small: 90h
  • Base: 230h
  • Large: 190h
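
The training times above can be put on a common scale as TPU core-hours. This is a rough sketch that assumes the reported hours are wall-clock time on the full slice, and relies on the TPU naming convention in which a v3-8 has 8 cores and a v3-32 has 32 cores.

```python
# (wall-clock hours, TPU cores) per model size; core counts follow the slice names.
runs = {
    "small": (90, 8),    # v3-8
    "base": (230, 8),    # v3-8
    "large": (190, 32),  # v3-32
}

# Core-hours = hours on the slice times the number of cores in the slice.
core_hours = {name: hours * cores for name, (hours, cores) in runs.items()}

for name, total in core_hours.items():
    print(f"{name}: {total} TPU core-hours")
```

Under this assumption, the large model used the most compute despite its shorter wall-clock time, because it ran on a slice four times larger.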

Evaluation

MyT5 models are compared with a reimplementation of ByT5 models trained for 250,000 steps.

Language Modeling

We evaluated language modeling performance on the multi-parallel FLORES 200 corpus. To compare scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).
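
As a rough illustration of the metric, the sketch below computes BPEB as the name suggests: the negative log-likelihood of a sentence in bits, divided by the number of UTF-8 bytes in its English parallel sentence. This is an assumed reading of the metric, and the function name `bits_per_english_byte` is hypothetical; refer to the paper for the exact definition used in the evaluation.

```python
import math

def bits_per_english_byte(token_log_probs, english_reference):
    """Negative log-likelihood in bits, normalized by English UTF-8 byte count.

    token_log_probs: natural-log probabilities of the target tokens of one
    sentence (in any language); english_reference: its English translation.
    """
    total_bits = -sum(token_log_probs) / math.log(2)   # convert nats to bits
    n_english_bytes = len(english_reference.encode("utf-8"))
    return total_bits / n_english_bytes

# Toy example: eight tokens, each predicted with probability 0.5 (1 bit each),
# scored against a 4-byte English reference.
example_bpeb = bits_per_english_byte([math.log(0.5)] * 8, "mice")
print(example_bpeb)
```

Because the denominator is always the English byte count, sentences with the same meaning are normalized identically regardless of how many bytes their own language needs.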

Results

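| Model size | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|------------|-----------|-----------|-------------|-----------|-------------|
| small      | All       | 10.1      | 7.0         | 4.6       | 6.7         |
| small      | Latin     | 4.6       | 5.9         | 4.2       | 6.6         |
| small      | Non Latin | 18.1      | 8.5         | 5.1       | 6.8         |
| base       | All       | 8.2       | 11.5        | 5.8       | 8.9         |
| base       | Latin     | 4.9       | 9.4         | 5.0       | 8.7         |
| base       | Non Latin | 13.0      | 14.6        | 6.9       | 9.1         |
| large      | All       | 13.4      | 31.8        | 4.6       | 26.7        |
| large      | Latin     | 10.1      | 28.1        | 4.0       | 26.6        |
| large      | Non Latin | 18.2      | 37.3        | 5.4       | 27.0        |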

Bit-per-English-Byte (BPEB) and inference times (T, average per FLORES 200 sentence), averaged over three language groupings. Inference was run on an A40 GPU core.
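
The "All" rows of the results can be summarized as relative reductions of MyT5 over ByT5. The sketch below only performs arithmetic on the numbers reported above; the helper name `relative_reduction` is hypothetical.

```python
# "All languages" rows from the results: (BPEB, inference time in ms).
results_all = {
    "small": {"byt5": (10.1, 7.0), "myt5": (4.6, 6.7)},
    "base": {"byt5": (8.2, 11.5), "myt5": (5.8, 8.9)},
    "large": {"byt5": (13.4, 31.8), "myt5": (4.6, 26.7)},
}

def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline

# Per model size: (BPEB reduction in %, inference-time reduction in %).
summary = {
    size: tuple(
        relative_reduction(b, m) for b, m in zip(rows["byt5"], rows["myt5"])
    )
    for size, rows in results_all.items()
}

for size, (bpeb_red, time_red) in summary.items():
    print(f"{size}: BPEB -{bpeb_red:.1f}%, time -{time_red:.1f}%")
```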

Citation

@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, 
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model Card Author

Tomasz Limisiewicz