kyteinsky's picture
update readme
22c864a unverified
|
raw
history blame
10 kB
---
license: apache-2.0
language:
- multilingual
- en
- ru
- es
- fr
- de
- it
- pt
- pl
- nl
- vi
- tr
- sv
- id
- ro
- cs
- zh
- hu
- ja
- th
- fi
- fa
- uk
- da
- el
- "no"
- bg
- sk
- ko
- ar
- lt
- ca
- sl
- he
- et
- lv
- hi
- sq
- ms
- az
- sr
- ta
- hr
- kk
- is
- ml
- mr
- te
- af
- gl
- fil
- be
- mk
- eu
- bn
- ka
- mn
- bs
- uz
- ur
- sw
- yue
- ne
- kn
- kaa
- gu
- si
- cy
- eo
- la
- hy
- ky
- tg
- ga
- mt
- my
- km
- tt
- so
- ku
- ps
- pa
- rw
- lo
- ha
- dv
- fy
- lb
- ckb
- mg
- gd
- am
- ug
- ht
- grc
- hmn
- sd
- jv
- mi
- tk
- ceb
- yi
- ba
- fo
- or
- xh
- su
- kl
- ny
- sm
- sn
- co
- zu
- ig
- yo
- pap
- st
- haw
- as
- oc
- cv
- lus
- tet
- gsw
- sah
- br
- rm
- sa
- bo
- om
- se
- ce
- cnh
- ilo
- hil
- udm
- os
- lg
- ti
- vec
- ts
- tyv
- kbd
- ee
- iba
- av
- kha
- to
- tn
- nso
- fj
- zza
- ak
- ada
- otq
- dz
- bua
- cfm
- ln
- chm
- gn
- krc
- wa
- hif
- yua
- srn
- war
- rom
- bik
- pam
- sg
- lu
- ady
- kbp
- syr
- ltg
- myv
- iso
- kac
- bho
- ay
- kum
- qu
- za
- pag
- ngu
- ve
- pck
- zap
- tyz
- hui
- bbc
- tzo
- tiv
- ksd
- gom
- min
- ang
- nhe
- bgp
- nzi
- nnb
- nv
- zxx
- bci
- kv
- new
- mps
- alt
- meu
- bew
- fon
- iu
- abt
- mgh
- mnw
- tvl
- dov
- tlh
- ho
- kw
- mrj
- meo
- crh
- mbt
- emp
- ace
- ium
- mam
- gym
- mai
- crs
- pon
- ubu
- fip
- quc
- gv
- kj
- btx
- ape
- chk
- rcf
- shn
- tzh
- mdf
- ppk
- ss
- gag
- cab
- kri
- seh
- ibb
- tbz
- bru
- enq
- ach
- cuk
- kmb
- wo
- kek
- qub
- tab
- bts
- kos
- rwo
- cak
- tuc
- bum
- cjk
- gil
- stq
- tsg
- quh
- mak
- arn
- ban
- jiv
- sja
- yap
- tcy
- toj
- twu
- xal
- amu
- rmc
- hus
- nia
- kjh
- bm
- guh
- mas
- acf
- dtp
- ksw
- bzj
- din
- zne
- mad
- msi
- mag
- mkn
- kg
- lhu
- ch
- qvi
- mh
- djk
- sus
- mfe
- srm
- dyu
- ctu
- gui
- pau
- inb
- bi
- mni
- guc
- jam
- wal
- jac
- bas
- gor
- skr
- nyu
- noa
- sda
- gub
- nog
- cni
- teo
- tdx
- sxn
- rki
- nr
- frp
- alz
- taj
- lrc
- cce
- rn
- jvn
- hvn
- nij
- dwr
- izz
- msm
- bus
- ktu
- chr
- maz
- tzj
- suz
- knj
- bim
- gvl
- bqc
- tca
- pis
- prk
- laj
- mel
- qxr
- niq
- ahk
- shp
- hne
- spp
- koi
- krj
- quf
- luz
- agr
- tsc
- mqy
- gof
- gbm
- miq
- dje
- awa
- bjj
- qvz
- sjp
- tll
- raj
- kjg
- bgz
- quy
- cbk
- akb
- oj
- ify
- mey
- ks
- cac
- brx
- qup
- syl
- jax
- ff
- ber
- tks
- trp
- mrw
- adh
- smt
- srr
- ffm
- qvc
- mtr
- ann
- kaa
- aa
- noe
- nut
- gyn
- kwi
- xmm
- msb
library_name: ctranslate2
tags:
- text2text-generation
- text-generation-inference
datasets:
- allenai/MADLAD-400
pipeline_tag: translation
widget:
- text: "<2en> Como vai, amigo?"
example_title: "Translation to English"
- text: "<2de> Do you speak German?"
example_title: "Translation to German"
---
# MADLAD-400-3B-MT (int8 quantized using CTranslate2)
```
ct2-transformers-converter --model ./madlad400-3b-mt --quantization int8 --output_dir ctranslate-madlad400-3b-mt-8bit --copy_files added_tokens.json generation_config.json special_tokens_map.json spiece.model tokenizer.json tokenizer_config.json
```
---
Original model card below
---
# Model Card for MADLAD-400-3B-MT
# Table of Contents
0. [TL;DR](#TL;DR)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)
# TL;DR
MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was
trained on 1 trillion tokens covering over 450 languages using publicly available data.
It is competitive with models that are significantly larger.
**Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
the original weights and wrote the contents of this model card based on the original paper and Flan-T5.
# Model Details
## Model Description
- **Model type:** Language model
- **Language(s) (NLP):** Multilingual (400+ languages)
- **License:** Apache 2.0
- **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
- **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
- **Resources for more information:**
- [Research paper](https://arxiv.org/abs/2309.04662)
- [GitHub Repo](https://github.com/google-research/t5x)
- [Hugging Face MADLAD-400 Docs (Similar to T5) ](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)
# Usage
Find below some example scripts on how to use the model:
## Using the Pytorch model with `transformers`
### Running the model on a CPU or GPU
<details>
<summary> Click to expand </summary>
First, install the Python packages that are required:
`pip install transformers accelerate sentencepiece`
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = 'jbochi/madlad400-3b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)
tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu adoro pizza!
```
</details>
## Running the model with Candle
<details>
<summary> Click to expand </summary>
Usage with [candle](https://github.com/huggingface/candle):
```bash
$ cargo run --example t5 --release -- \
--model-id "jbochi/madlad400-3b-mt" \
--prompt "<2de> How are you, my friend?" \
--decode --temperature 0
```
We also provide a quantized model (1.65 GB vs the original 11.8 GB file):
```
cargo run --example quantized-t5 --release -- \
--model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
--prompt "<2de> How are you, my friend?" \
--temperature 0
...
Wie geht es dir, mein Freund?
```
</details>
# Uses
## Direct Use and Downstream Use
> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
> Primary intended users: Research community.
## Out-of-Scope Use
> These models are trained on general domain data and are therefore not meant to
> work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
> for production usecases.
# Bias, Risks, and Limitations
> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
> usecase.
## Ethical considerations and risks
> We trained these models with MADLAD-400 and publicly available data to create baseline models that
> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
> output for certain domains. Moreover, large models are dual use technologies that have specific risks
> associated with their use and development. We point the reader to surveys such as those written by
> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
> et al. for a thorough discussion of the risks of machine translation systems.
## Known Limitations
More information needed
## Sensitive Use:
More information needed
# Training Details
> We train models of various sizes: a 3B, 32-layer parameter model,
> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
> We share all parameters of the model across language pairs,
> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
> side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
> language.
See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
## Training Data
> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
> model, a combination of parallel datasources covering 157 languages is also used. Further details are
> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
## Training Procedure
See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
# Evaluation
## Testing Data, Factors & Metrics
> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
> domain, though we have not assessed this.
## Results
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)
See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
# Environmental Impact
More information needed
# Citation
**BibTeX:**
```bibtex
@misc{kudugunta2023madlad400,
title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
year={2023},
eprint={2309.04662},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```