---
license: apache-2.0
language:
- multilingual
- en
- ru
- es
- fr
- de
- it
- pt
- pl
- nl
- vi
- tr
- sv
- id
- ro
- cs
- zh
- hu
- ja
- th
- fi
- fa
- uk
- da
- el
- "no"
- bg
- sk
- ko
- ar
- lt
- ca
- sl
- he
- et
- lv
- hi
- sq
- ms
- az
- sr
- ta
- hr
- kk
- is
- ml
- mr
- te
- af
- gl
- fil
- be
- mk
- eu
- bn
- ka
- mn
- bs
- uz
- ur
- sw
- yue
- ne
- kn
- kaa
- gu
- si
- cy
- eo
- la
- hy
- ky
- tg
- ga
- mt
- my
- km
- tt
- so
- ku
- ps
- pa
- rw
- lo
- ha
- dv
- fy
- lb
- ckb
- mg
- gd
- am
- ug
- ht
- grc
- hmn
- sd
- jv
- mi
- tk
- ceb
- yi
- ba
- fo
- or
- xh
- su
- kl
- ny
- sm
- sn
- co
- zu
- ig
- yo
- pap
- st
- haw
- as
- oc
- cv
- lus
- tet
- gsw
- sah
- br
- rm
- sa
- bo
- om
- se
- ce
- cnh
- ilo
- hil
- udm
- os
- lg
- ti
- vec
- ts
- tyv
- kbd
- ee
- iba
- av
- kha
- to
- tn
- nso
- fj
- zza
- ak
- ada
- otq
- dz
- bua
- cfm
- ln
- chm
- gn
- krc
- wa
- hif
- yua
- srn
- war
- rom
- bik
- pam
- sg
- lu
- ady
- kbp
- syr
- ltg
- myv
- iso
- kac
- bho
- ay
- kum
- qu
- za
- pag
- ngu
- ve
- pck
- zap
- tyz
- hui
- bbc
- tzo
- tiv
- ksd
- gom
- min
- ang
- nhe
- bgp
- nzi
- nnb
- nv
- zxx
- bci
- kv
- new
- mps
- alt
- meu
- bew
- fon
- iu
- abt
- mgh
- mnw
- tvl
- dov
- tlh
- ho
- kw
- mrj
- meo
- crh
- mbt
- emp
- ace
- ium
- mam
- gym
- mai
- crs
- pon
- ubu
- fip
- quc
- gv
- kj
- btx
- ape
- chk
- rcf
- shn
- tzh
- mdf
- ppk
- ss
- gag
- cab
- kri
- seh
- ibb
- tbz
- bru
- enq
- ach
- cuk
- kmb
- wo
- kek
- qub
- tab
- bts
- kos
- rwo
- cak
- tuc
- bum
- cjk
- gil
- stq
- tsg
- quh
- mak
- arn
- ban
- jiv
- sja
- yap
- tcy
- toj
- twu
- xal
- amu
- rmc
- hus
- nia
- kjh
- bm
- guh
- mas
- acf
- dtp
- ksw
- bzj
- din
- zne
- mad
- msi
- mag
- mkn
- kg
- lhu
- ch
- qvi
- mh
- djk
- sus
- mfe
- srm
- dyu
- ctu
- gui
- pau
- inb
- bi
- mni
- guc
- jam
- wal
- jac
- bas
- gor
- skr
- nyu
- noa
- sda
- gub
- nog
- cni
- teo
- tdx
- sxn
- rki
- nr
- frp
- alz
- taj
- lrc
- cce
- rn
- jvn
- hvn
- nij
- dwr
- izz
- msm
- bus
- ktu
- chr
- maz
- tzj
- suz
- knj
- bim
- gvl
- bqc
- tca
- pis
- prk
- laj
- mel
- qxr
- niq
- ahk
- shp
- hne
- spp
- koi
- krj
- quf
- luz
- agr
- tsc
- mqy
- gof
- gbm
- miq
- dje
- awa
- bjj
- qvz
- sjp
- tll
- raj
- kjg
- bgz
- quy
- cbk
- akb
- oj
- ify
- mey
- ks
- cac
- brx
- qup
- syl
- jax
- ff
- ber
- tks
- trp
- mrw
- adh
- smt
- srr
- ffm
- qvc
- mtr
- ann
- kaa
- aa
- noe
- nut
- gyn
- kwi
- xmm
- msb
library_name: ctranslate2
tags:
- text2text-generation
- text-generation-inference
datasets:
- allenai/MADLAD-400
pipeline_tag: translation
widget:
- text: "<2en> Como vai, amigo?"
  example_title: "Translation to English"
- text: "<2de> Do you speak German?"
  example_title: "Translation to German"
---

# MADLAD-400-7B-MT-BT (int8 quantized using CTranslate2)

This repository contains the weights of [jbochi/madlad400-7b-mt-bt](https://huggingface.co/jbochi/madlad400-7b-mt-bt) converted to the CTranslate2 format with int8 quantization, using the following command:

```
ct2-transformers-converter --model ./madlad400-7b-mt-bt --quantization int8 --output_dir madlad400-7b-mt-bt-ct2-8bit --copy_files added_tokens.json generation_config.json model.safetensors.index.json special_tokens_map.json spiece.model tokenizer.json tokenizer_config.json
```
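
To run the converted model with CTranslate2, a minimal sketch (assuming the `madlad400-7b-mt-bt-ct2-8bit` output directory produced by the command above, and following the generic CTranslate2 usage pattern for converted T5-style models) might look like this:

```python
# pip install ctranslate2 transformers sentencepiece
import ctranslate2
import transformers

model_dir = "madlad400-7b-mt-bt-ct2-8bit"  # output_dir from the conversion command above
translator = ctranslate2.Translator(model_dir, device="cpu")  # or device="cuda"
# The tokenizer files were copied into the CTranslate2 directory by --copy_files.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)

# The first token selects the target language, e.g. <2pt> for Portuguese.
text = "<2pt> I love pizza!"
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

results = translator.translate_batch([input_tokens], beam_size=4)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens), skip_special_tokens=True))
```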

---

The original model card follows below.

---

# Model Card for MADLAD-400-7B-MT

# Table of Contents

0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

# TL;DR

MADLAD-400-7B-MT-BT is a multilingual machine translation model based on the T5 architecture. It was
trained on 250 billion tokens covering over 450 languages using publicly available data, and it is
competitive with models that are significantly larger.

It is a version of the 7.2B-parameter MADLAD-400-7B-MT model finetuned on backtranslated data. The authors state in the [paper](https://arxiv.org/pdf/2309.04662.pdf) that:

> While this setup is very likely sub-optimal, we see that back-translation
> greatly improves en2xx translation (by 3.0 chrf, in the case of Flores-200) in most cases.

**Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
the original weights and wrote the contents of this model card based on the original paper and Flan-T5.

# Model Details

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** Multilingual (400+ languages)
- **License:** Apache 2.0
- **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
- **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/abs/2309.04662)
  - [GitHub Repo](https://github.com/google-research/t5x)
  - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)

# Usage

Below are some example scripts showing how to use the model.

## Using the PyTorch model with `transformers`

### Running the model on a CPU or GPU

<details>
<summary> Click to expand </summary>

First, install the required Python packages:

`pip install transformers accelerate sentencepiece`

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt-bt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# The <2pt> prefix token selects Portuguese as the target language.
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu adoro pizza!
```
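
To translate into other languages, swap the target-language prefix; if translations come back truncated, set the generation length explicitly. A small, hedged variation on the snippet above (the outputs are illustrative, not verified):

```python
# Translate the same sentence into several target languages.
for prefix in ["<2de>", "<2fr>", "<2ja>"]:
    ids = tokenizer(f"{prefix} I love pizza!", return_tensors="pt").input_ids.to(model.device)
    out = model.generate(input_ids=ids, max_new_tokens=64)
    print(prefix, tokenizer.decode(out[0], skip_special_tokens=True))
```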

</details>

## Running the model with Candle

<details>
<summary> Click to expand </summary>

Usage with [candle](https://github.com/huggingface/candle):

```bash
$ cargo run --example t5 --release -- \
  --model-id "jbochi/madlad400-7b-mt-bt" \
  --prompt "<2de> How are you, my friend?" \
  --decode --temperature 0
```

</details>

# Uses

## Direct Use and Downstream Use

> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
> Primary intended users: Research community.

## Out-of-Scope Use

> These models are trained on general domain data and are therefore not meant to
> work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
> for production usecases.

# Bias, Risks, and Limitations

> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
> usecase.

## Ethical considerations and risks

> We trained these models with MADLAD-400 and publicly available data to create baseline models that
> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
> output for certain domains. Moreover, large models are dual use technologies that have specific risks
> associated with their use and development. We point the reader to surveys such as those written by
> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
> et al. for a thorough discussion of the risks of machine translation systems.

## Known Limitations

More information needed

## Sensitive Use

More information needed

# Training Details

> We train models of various sizes: a 3B, 32-layer parameter model,
> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
> We share all parameters of the model across language pairs,
> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
> side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
> language.

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
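
As a quick illustration of the target-language convention described above, the `<2xx>` tag is simply prepended to the source text and tokenized as its own piece. A small sketch (the exact token pieces shown are indicative, not verified output):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jbochi/madlad400-7b-mt-bt")

# The target-language tag leads the sequence, followed by the source text pieces.
tokens = tokenizer.convert_ids_to_tokens(tokenizer("<2pt> I love pizza!").input_ids)
print(tokens)  # e.g. ['<2pt>', '▁I', '▁love', '▁pizza', '!', '</s>'] (exact pieces may differ)
```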

## Training Data

> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
> model, a combination of parallel datasources covering 157 languages is also used. Further details are
> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

## Training Procedure

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Evaluation

## Testing Data, Factors & Metrics

> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
> domain, though we have not assessed this.

## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Environmental Impact

More information needed

# Citation

**BibTeX:**

```bibtex
@misc{kudugunta2023madlad400,
  title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
  author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
  year={2023},
  eprint={2309.04662},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```