kyteinsky committed on
Commit
22c864a
1 Parent(s): 9a308ea

update readme

Signed-off-by: Anupam Kumar <kyteinsky@gmail.com>

Files changed (1)
  1. README.md +647 -3
README.md CHANGED
@@ -1,3 +1,647 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - multilingual
5
+ - en
6
+ - ru
7
+ - es
8
+ - fr
9
+ - de
10
+ - it
11
+ - pt
12
+ - pl
13
+ - nl
14
+ - vi
15
+ - tr
16
+ - sv
17
+ - id
18
+ - ro
19
+ - cs
20
+ - zh
21
+ - hu
22
+ - ja
23
+ - th
24
+ - fi
25
+ - fa
26
+ - uk
27
+ - da
28
+ - el
29
+ - "no"
30
+ - bg
31
+ - sk
32
+ - ko
33
+ - ar
34
+ - lt
35
+ - ca
36
+ - sl
37
+ - he
38
+ - et
39
+ - lv
40
+ - hi
41
+ - sq
42
+ - ms
43
+ - az
44
+ - sr
45
+ - ta
46
+ - hr
47
+ - kk
48
+ - is
49
+ - ml
50
+ - mr
51
+ - te
52
+ - af
53
+ - gl
54
+ - fil
55
+ - be
56
+ - mk
57
+ - eu
58
+ - bn
59
+ - ka
60
+ - mn
61
+ - bs
62
+ - uz
63
+ - ur
64
+ - sw
65
+ - yue
66
+ - ne
67
+ - kn
68
+ - kaa
69
+ - gu
70
+ - si
71
+ - cy
72
+ - eo
73
+ - la
74
+ - hy
75
+ - ky
76
+ - tg
77
+ - ga
78
+ - mt
79
+ - my
80
+ - km
81
+ - tt
82
+ - so
83
+ - ku
84
+ - ps
85
+ - pa
86
+ - rw
87
+ - lo
88
+ - ha
89
+ - dv
90
+ - fy
91
+ - lb
92
+ - ckb
93
+ - mg
94
+ - gd
95
+ - am
96
+ - ug
97
+ - ht
98
+ - grc
99
+ - hmn
100
+ - sd
101
+ - jv
102
+ - mi
103
+ - tk
104
+ - ceb
105
+ - yi
106
+ - ba
107
+ - fo
108
+ - or
109
+ - xh
110
+ - su
111
+ - kl
112
+ - ny
113
+ - sm
114
+ - sn
115
+ - co
116
+ - zu
117
+ - ig
118
+ - yo
119
+ - pap
120
+ - st
121
+ - haw
122
+ - as
123
+ - oc
124
+ - cv
125
+ - lus
126
+ - tet
127
+ - gsw
128
+ - sah
129
+ - br
130
+ - rm
131
+ - sa
132
+ - bo
133
+ - om
134
+ - se
135
+ - ce
136
+ - cnh
137
+ - ilo
138
+ - hil
139
+ - udm
140
+ - os
141
+ - lg
142
+ - ti
143
+ - vec
144
+ - ts
145
+ - tyv
146
+ - kbd
147
+ - ee
148
+ - iba
149
+ - av
150
+ - kha
151
+ - to
152
+ - tn
153
+ - nso
154
+ - fj
155
+ - zza
156
+ - ak
157
+ - ada
158
+ - otq
159
+ - dz
160
+ - bua
161
+ - cfm
162
+ - ln
163
+ - chm
164
+ - gn
165
+ - krc
166
+ - wa
167
+ - hif
168
+ - yua
169
+ - srn
170
+ - war
171
+ - rom
172
+ - bik
173
+ - pam
174
+ - sg
175
+ - lu
176
+ - ady
177
+ - kbp
178
+ - syr
179
+ - ltg
180
+ - myv
181
+ - iso
182
+ - kac
183
+ - bho
184
+ - ay
185
+ - kum
186
+ - qu
187
+ - za
188
+ - pag
189
+ - ngu
190
+ - ve
191
+ - pck
192
+ - zap
193
+ - tyz
194
+ - hui
195
+ - bbc
196
+ - tzo
197
+ - tiv
198
+ - ksd
199
+ - gom
200
+ - min
201
+ - ang
202
+ - nhe
203
+ - bgp
204
+ - nzi
205
+ - nnb
206
+ - nv
207
+ - zxx
208
+ - bci
209
+ - kv
210
+ - new
211
+ - mps
212
+ - alt
213
+ - meu
214
+ - bew
215
+ - fon
216
+ - iu
217
+ - abt
218
+ - mgh
219
+ - mnw
220
+ - tvl
221
+ - dov
222
+ - tlh
223
+ - ho
224
+ - kw
225
+ - mrj
226
+ - meo
227
+ - crh
228
+ - mbt
229
+ - emp
230
+ - ace
231
+ - ium
232
+ - mam
233
+ - gym
234
+ - mai
235
+ - crs
236
+ - pon
237
+ - ubu
238
+ - fip
239
+ - quc
240
+ - gv
241
+ - kj
242
+ - btx
243
+ - ape
244
+ - chk
245
+ - rcf
246
+ - shn
247
+ - tzh
248
+ - mdf
249
+ - ppk
250
+ - ss
251
+ - gag
252
+ - cab
253
+ - kri
254
+ - seh
255
+ - ibb
256
+ - tbz
257
+ - bru
258
+ - enq
259
+ - ach
260
+ - cuk
261
+ - kmb
262
+ - wo
263
+ - kek
264
+ - qub
265
+ - tab
266
+ - bts
267
+ - kos
268
+ - rwo
269
+ - cak
270
+ - tuc
271
+ - bum
272
+ - cjk
273
+ - gil
274
+ - stq
275
+ - tsg
276
+ - quh
277
+ - mak
278
+ - arn
279
+ - ban
280
+ - jiv
281
+ - sja
282
+ - yap
283
+ - tcy
284
+ - toj
285
+ - twu
286
+ - xal
287
+ - amu
288
+ - rmc
289
+ - hus
290
+ - nia
291
+ - kjh
292
+ - bm
293
+ - guh
294
+ - mas
295
+ - acf
296
+ - dtp
297
+ - ksw
298
+ - bzj
299
+ - din
300
+ - zne
301
+ - mad
302
+ - msi
303
+ - mag
304
+ - mkn
305
+ - kg
306
+ - lhu
307
+ - ch
308
+ - qvi
309
+ - mh
310
+ - djk
311
+ - sus
312
+ - mfe
313
+ - srm
314
+ - dyu
315
+ - ctu
316
+ - gui
317
+ - pau
318
+ - inb
319
+ - bi
320
+ - mni
321
+ - guc
322
+ - jam
323
+ - wal
324
+ - jac
325
+ - bas
326
+ - gor
327
+ - skr
328
+ - nyu
329
+ - noa
330
+ - sda
331
+ - gub
332
+ - nog
333
+ - cni
334
+ - teo
335
+ - tdx
336
+ - sxn
337
+ - rki
338
+ - nr
339
+ - frp
340
+ - alz
341
+ - taj
342
+ - lrc
343
+ - cce
344
+ - rn
345
+ - jvn
346
+ - hvn
347
+ - nij
348
+ - dwr
349
+ - izz
350
+ - msm
351
+ - bus
352
+ - ktu
353
+ - chr
354
+ - maz
355
+ - tzj
356
+ - suz
357
+ - knj
358
+ - bim
359
+ - gvl
360
+ - bqc
361
+ - tca
362
+ - pis
363
+ - prk
364
+ - laj
365
+ - mel
366
+ - qxr
367
+ - niq
368
+ - ahk
369
+ - shp
370
+ - hne
371
+ - spp
372
+ - koi
373
+ - krj
374
+ - quf
375
+ - luz
376
+ - agr
377
+ - tsc
378
+ - mqy
379
+ - gof
380
+ - gbm
381
+ - miq
382
+ - dje
383
+ - awa
384
+ - bjj
385
+ - qvz
386
+ - sjp
387
+ - tll
388
+ - raj
389
+ - kjg
390
+ - bgz
391
+ - quy
392
+ - cbk
393
+ - akb
394
+ - oj
395
+ - ify
396
+ - mey
397
+ - ks
398
+ - cac
399
+ - brx
400
+ - qup
401
+ - syl
402
+ - jax
403
+ - ff
404
+ - ber
405
+ - tks
406
+ - trp
407
+ - mrw
408
+ - adh
409
+ - smt
410
+ - srr
411
+ - ffm
412
+ - qvc
413
+ - mtr
414
+ - ann
415
+ - kaa
416
+ - aa
417
+ - noe
418
+ - nut
419
+ - gyn
420
+ - kwi
421
+ - xmm
422
+ - msb
423
+ library_name: ctranslate2
424
+ tags:
425
+ - text2text-generation
426
+ - text-generation-inference
427
+ datasets:
428
+ - allenai/MADLAD-400
429
+ pipeline_tag: translation
430
+
431
+ widget:
432
+ - text: "<2en> Como vai, amigo?"
433
+ example_title: "Translation to English"
434
+ - text: "<2de> Do you speak German?"
435
+ example_title: "Translation to German"
436
+
437
+ ---
438
+
439
+ # MADLAD-400-3B-MT (int8 quantized using CTranslate2)
440
+
441
+ ```
442
+ ct2-transformers-converter --model ./madlad400-3b-mt --quantization int8 --output_dir ctranslate-madlad400-3b-mt-8bit --copy_files added_tokens.json generation_config.json special_tokens_map.json spiece.model tokenizer.json tokenizer_config.json
443
+ ```
444
+
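The converted directory produced by the command above can be loaded with the CTranslate2 Python API. The following is a minimal sketch and not part of the original card. It assumes that `ctranslate2`, `transformers` and `sentencepiece` are installed, that the directory name matches the `--output_dir` value used above, and that the copied tokenizer files load with `T5Tokenizer`. The `translate` helper is a hypothetical name introduced for illustration.

```python
from typing import List


def translate(
    texts: List[str],
    model_dir: str = "ctranslate-madlad400-3b-mt-8bit",  # assumed: the --output_dir used above
    device: str = "cpu",
) -> List[str]:
    """Translate prompts that already start with a <2xx> target-language token."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import ctranslate2
    from transformers import T5Tokenizer

    translator = ctranslate2.Translator(model_dir, device=device)
    tokenizer = T5Tokenizer.from_pretrained(model_dir)

    # CTranslate2 consumes token strings rather than token ids.
    batch = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    results = translator.translate_batch(batch)

    outputs = []
    for result in results:
        # Take the best hypothesis and map its tokens back to text.
        ids = tokenizer.convert_tokens_to_ids(result.hypotheses[0])
        outputs.append(tokenizer.decode(ids, skip_special_tokens=True))
    return outputs
```

Calling `translate(["<2en> Como vai, amigo?"])` should return a list with one English translation. The exact wording depends on the model and on the decoding settings.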
445
+ ---
446
+
447
+ Original model card below
448
+
449
+ ---
450
+
451
+
452
+ # Model Card for MADLAD-400-3B-MT
453
+
454
+ # Table of Contents
455
+
456
+ 0. [TL;DR](#tldr)
457
+ 1. [Model Details](#model-details)
458
+ 2. [Usage](#usage)
459
+ 3. [Uses](#uses)
460
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
461
+ 5. [Training Details](#training-details)
462
+ 6. [Evaluation](#evaluation)
463
+ 7. [Environmental Impact](#environmental-impact)
464
+ 8. [Citation](#citation)
465
+
466
+
467
+ # TL;DR
468
+
469
+ MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was
470
+ trained on 1 trillion tokens covering over 450 languages using publicly available data.
471
+ It is competitive with models that are significantly larger.
472
+
473
+ **Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
474
+ the original weights and wrote the contents of this model card based on the original paper and Flan-T5.
475
+
476
+ # Model Details
477
+
478
+ ## Model Description
479
+
480
+ - **Model type:** Language model
481
+ - **Language(s) (NLP):** Multilingual (400+ languages)
482
+ - **License:** Apache 2.0
483
+ - **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
484
+ - **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
485
+ - **Resources for more information:**
486
+ - [Research paper](https://arxiv.org/abs/2309.04662)
487
+ - [GitHub Repo](https://github.com/google-research/t5x)
488
+ - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)
489
+
490
+ # Usage
491
+
492
+ The following example scripts show how to use the model:
493
+
494
+ ## Using the PyTorch model with `transformers`
495
+
496
+ ### Running the model on a CPU or GPU
497
+
498
+ <details>
499
+ <summary> Click to expand </summary>
500
+
501
+ First, install the Python packages that are required:
502
+
503
+ `pip install transformers accelerate sentencepiece`
504
+
505
+ ```python
506
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
507
+
508
+ model_name = 'jbochi/madlad400-3b-mt'
509
+ model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
510
+ tokenizer = T5Tokenizer.from_pretrained(model_name)
511
+
512
+ text = "<2pt> I love pizza!"
513
+ input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
514
+ outputs = model.generate(input_ids=input_ids)
515
+
516
+ tokenizer.decode(outputs[0], skip_special_tokens=True)
517
+ # Eu adoro pizza!
518
+ ```
519
+
520
+ </details>
521
+
522
+ ## Running the model with Candle
523
+
524
+ <details>
525
+ <summary> Click to expand </summary>
526
+
527
+ Usage with [candle](https://github.com/huggingface/candle):
528
+
529
+ ```bash
530
+ $ cargo run --example t5 --release -- \
531
+ --model-id "jbochi/madlad400-3b-mt" \
532
+ --prompt "<2de> How are you, my friend?" \
533
+ --decode --temperature 0
534
+ ```
535
+
536
+ We also provide a quantized model (1.65 GB vs the original 11.8 GB file):
537
+
538
+ ```
539
+ cargo run --example quantized-t5 --release -- \
540
+ --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
541
+ --prompt "<2de> How are you, my friend?" \
542
+ --temperature 0
543
+ ...
544
+ Wie geht es dir, mein Freund?
545
+ ```
546
+
547
+ </details>
548
+
549
+
550
+ # Uses
551
+
552
+ ## Direct Use and Downstream Use
553
+
554
+ > Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
555
+ > Primary intended users: Research community.
556
+
557
+ ## Out-of-Scope Use
558
+
559
+ > These models are trained on general domain data and are therefore not meant to
560
+ > work on domain-specific models out-of-the-box. Moreover, these research models have not been assessed
560
+ > for production use cases.
562
+
563
+ # Bias, Risks, and Limitations
564
+
565
+ > We note that we evaluate on only 204 of the languages supported by these models and on machine translation
566
+ > and few-shot machine translation tasks. Users must consider use of this model carefully for their own
567
+ > use case.
568
+
569
+ ## Ethical considerations and risks
570
+
571
+ > We trained these models with MADLAD-400 and publicly available data to create baseline models that
572
+ > support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
573
+ > Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
574
+ > otherwise low-quality content despite extensive preprocessing, it is still possible that these issues with the
575
+ > underlying training data may cause differences in model performance and toxic (or otherwise problematic)
576
+ > output for certain domains. Moreover, large models are dual use technologies that have specific risks
577
+ > associated with their use and development. We point the reader to surveys such as those written by
578
+ > Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
579
+ > et al. for a thorough discussion of the risks of machine translation systems.
580
+
581
+ ## Known Limitations
582
+
583
+ More information needed
584
+
585
+ ## Sensitive Use
586
+
587
+ More information needed
588
+
589
+ # Training Details
590
+
591
+ > We train models of various sizes: a 3B-parameter, 32-layer model,
591
+ > a 7.2B-parameter, 48-layer model and a 10.7B-parameter, 32-layer model.
593
+ > We share all parameters of the model across language pairs,
594
+ > and use a SentencePiece model with 256k tokens shared on both the encoder and decoder
595
+ > side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
596
+ > language.
597
+
598
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
599
+
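The `<2xx>` convention described above can be applied with a small helper before tokenization. This is an illustrative sketch only: `with_target_tag` is a hypothetical name, and the check that the code is two or three lowercase letters is an assumption based on the language codes listed in this card's metadata, not a rule taken from the paper.

```python
import re
from typing import Iterable, List

# Assumption: target codes look like the entries in the card's language list
# (for example "pt", "de", "fil"), i.e. two or three lowercase letters.
_LANG_CODE = re.compile(r"^[a-z]{2,3}$")


def with_target_tag(sentences: Iterable[str], target_lang: str) -> List[str]:
    """Prepend the <2xx> target-language token to each source sentence."""
    if not _LANG_CODE.match(target_lang):
        raise ValueError(f"unexpected language code: {target_lang!r}")
    tag = f"<2{target_lang}>"
    # Strip surrounding whitespace so exactly one space follows the tag.
    return [f"{tag} {sentence.strip()}" for sentence in sentences]
```

For example, `with_target_tag(["I love pizza!"], "pt")` yields `["<2pt> I love pizza!"]`, which matches the prompt format used in the usage examples in this card.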
600
+ ## Training Data
601
+
602
+ > For both the machine translation and language model, MADLAD-400 is used. For the machine translation
603
+ > model, a combination of parallel datasources covering 157 languages is also used. Further details are
604
+ > described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
605
+
606
+ ## Training Procedure
607
+
608
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
609
+
610
+ # Evaluation
611
+
612
+ ## Testing Data, Factors & Metrics
613
+
614
+ > For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
615
+
616
+ > The translation quality of this model varies based on language, as seen in the paper, and likely varies on
617
+ > domain, though we have not assessed this.
618
+
619
+ ## Results
620
+
621
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)
622
+
623
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)
624
+
625
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)
626
+
627
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
628
+
629
+ # Environmental Impact
630
+
631
+ More information needed
632
+
633
+ # Citation
634
+
635
+ **BibTeX:**
636
+
637
+ ```bibtex
638
+ @misc{kudugunta2023madlad400,
639
+ title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
640
+ author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
641
+ year={2023},
642
+ eprint={2309.04662},
643
+ archivePrefix={arXiv},
644
+ primaryClass={cs.CL}
645
+ }
646
+ ```
647
+