---
license: mit
---
|
|
|
# umT5 Small |
|
|
|
The UMT5 model was proposed in [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.
|
|
|
The abstract from the paper is the following: |
|
|
|
*Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.*
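
To make the sampling idea concrete, here is a minimal, illustrative sketch (not the paper's reference implementation) of what the abstract describes: spread a pretraining budget as uniformly as possible across languages, while capping each language at a maximum number of epochs over its own corpus so that tail languages are not repeated excessively. All names and numbers below are made up for illustration.

```python
def unimax_budgets(corpus_sizes, total_budget, max_epochs):
    """Toy sketch of UniMax-style budgeting.

    corpus_sizes: characters (or tokens) available per language.
    total_budget: total characters (or tokens) to sample during pretraining.
    max_epochs:   cap on how many passes over any single language's corpus.
    """
    budgets = {}
    remaining_budget = total_budget
    remaining_langs = len(corpus_sizes)
    # Visit languages from smallest corpus to largest so the caps bind first
    # and the leftover budget is redistributed to the head languages.
    for lang in sorted(corpus_sizes, key=corpus_sizes.get):
        uniform_share = remaining_budget / remaining_langs
        cap = corpus_sizes[lang] * max_epochs
        budgets[lang] = min(uniform_share, cap)
        remaining_budget -= budgets[lang]
        remaining_langs -= 1
    # Turn the per-language budgets into a sampling distribution.
    total = sum(budgets.values())
    return {lang: b / total for lang, b in budgets.items()}


# The tail languages ("yo", "sw") are capped at max_epochs passes over their
# corpora, while the head language ("en") absorbs the remaining budget.
print(unimax_budgets({"en": 1_000_000, "sw": 10_000, "yo": 1_000},
                     total_budget=300_000, max_epochs=4))
```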
|
|
|
# Integration into Transformers |
|
|
|
Overview of umT5 model integration: |
|
|
|
* Transformers integration is ongoing; see this awesome [PR](https://github.com/huggingface/transformers/pull/22626) by @agemagician! A hedged usage sketch is shown after this list.
|
* A conversion script (original umT5 T5X checkpoints to Flax) is available [here](https://gist.github.com/stefan-it/5d6a4ec89e7ad97181983881434cb4eb).
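
Once the PR above is merged, loading the converted checkpoint should presumably follow the standard Transformers seq2seq workflow. The sketch below assumes exactly that; the checkpoint name is a placeholder, and the model class the PR ultimately registers may differ from the generic auto classes used here.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder repository name; point this at the converted umT5 Small checkpoint.
checkpoint = "umt5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("A beautiful day in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Keep in mind that, like mT5, the umT5 checkpoints are pretrained only (no supervised mixture), so the model will generally need fine-tuning before it is useful on a downstream task.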