---
base_model: google/madlad400-10b-mt
license: apache-2.0
language:
- multilingual
- en
- ru
- es
- fr
- de
- it
- pt
- pl
- nl
- vi
- tr
- sv
- id
- ro
- cs
- zh
- hu
- ja
- th
- fi
- fa
- uk
- da
- el
- "no"
- bg
- sk
- ko
- ar
- lt
- ca
- sl
- he
- et
- lv
- hi
- sq
- ms
- az
- sr
- ta
- hr
- kk
- is
- ml
- mr
- te
- af
- gl
- fil
- be
- mk
- eu
- bn
- ka
- mn
- bs
- uz
- ur
- sw
- yue
- ne
- kn
- kaa
- gu
- si
- cy
- eo
- la
- hy
- ky
- tg
- ga
- mt
- my
- km
- tt
- so
- ku
- ps
- pa
- rw
- lo
- ha
- dv
- fy
- lb
- ckb
- mg
- gd
- am
- ug
- ht
- grc
- hmn
- sd
- jv
- mi
- tk
- ceb
- yi
- ba
- fo
- or
- xh
- su
- kl
- ny
- sm
- sn
- co
- zu
- ig
- yo
- pap
- st
- haw
- as
- oc
- cv
- lus
- tet
- gsw
- sah
- br
- rm
- sa
- bo
- om
- se
- ce
- cnh
- ilo
- hil
- udm
- os
- lg
- ti
- vec
- ts
- tyv
- kbd
- ee
- iba
- av
- kha
- to
- tn
- nso
- fj
- zza
- ak
- ada
- otq
- dz
- bua
- cfm
- ln
- chm
- gn
- krc
- wa
- hif
- yua
- srn
- war
- rom
- bik
- pam
- sg
- lu
- ady
- kbp
- syr
- ltg
- myv
- iso
- kac
- bho
- ay
- kum
- qu
- za
- pag
- ngu
- ve
- pck
- zap
- tyz
- hui
- bbc
- tzo
- tiv
- ksd
- gom
- min
- ang
- nhe
- bgp
- nzi
- nnb
- nv
- zxx
- bci
- kv
- new
- mps
- alt
- meu
- bew
- fon
- iu
- abt
- mgh
- mnw
- tvl
- dov
- tlh
- ho
- kw
- mrj
- meo
- crh
- mbt
- emp
- ace
- ium
- mam
- gym
- mai
- crs
- pon
- ubu
- fip
- quc
- gv
- kj
- btx
- ape
- chk
- rcf
- shn
- tzh
- mdf
- ppk
- ss
- gag
- cab
- kri
- seh
- ibb
- tbz
- bru
- enq
- ach
- cuk
- kmb
- wo
- kek
- qub
- tab
- bts
- kos
- rwo
- cak
- tuc
- bum
- cjk
- gil
- stq
- tsg
- quh
- mak
- arn
- ban
- jiv
- sja
- yap
- tcy
- toj
- twu
- xal
- amu
- rmc
- hus
- nia
- kjh
- bm
- guh
- mas
- acf
- dtp
- ksw
- bzj
- din
- zne
- mad
- msi
- mag
- mkn
- kg
- lhu
- ch
- qvi
- mh
- djk
- sus
- mfe
- srm
- dyu
- ctu
- gui
- pau
- inb
- bi
- mni
- guc
- jam
- wal
- jac
- bas
- gor
- skr
- nyu
- noa
- sda
- gub
- nog
- cni
- teo
- tdx
- sxn
- rki
- nr
- frp
- alz
- taj
- lrc
- cce
- rn
- jvn
- hvn
- nij
- dwr
- izz
- msm
- bus
- ktu
- chr
- maz
- tzj
- suz
- knj
- bim
- gvl
- bqc
- tca
- pis
- prk
- laj
- mel
- qxr
- niq
- ahk
- shp
- hne
- spp
- koi
- krj
- quf
- luz
- agr
- tsc
- mqy
- gof
- gbm
- miq
- dje
- awa
- bjj
- qvz
- sjp
- tll
- raj
- kjg
- bgz
- quy
- cbk
- akb
- oj
- ify
- mey
- ks
- cac
- brx
- qup
- syl
- jax
- ff
- ber
- tks
- trp
- mrw
- adh
- smt
- srr
- ffm
- qvc
- mtr
- ann
- kaa
- aa
- noe
- nut
- gyn
- kwi
- xmm
- msb
library_name: transformers
tags:
- text2text-generation
- text-generation-inference
datasets:
- allenai/MADLAD-400
pipeline_tag: translation
widget:
- text: "<2en> Como vai, amigo?"
  example_title: "Translation to English"
- text: "<2de> Do you speak German?"
  example_title: "Translation to German"
---

# Model Card for MADLAD-400-10B-MT

# Table of Contents

0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)
 
454
+ # TL;DR
455
 
456
+ MADLAD-400-10B-MT is a multilingual machine translation model based on the T5 architecture that was
457
+ trained on 250 billion tokens covering over 450 languages using publicly available data.
458
+ It is competitive with models that are significantly larger.
459
 
460
+ **Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
461
+ the original weights and wrote the contents of this model card based on the original paper and Flan-T5.
462
 
463
+ # Model Details
464
 
465
+ ## Model Description
466
 
467
+ - **Model type:** Language model
468
+ - **Language(s) (NLP):** Multilingual (400+ languages)
469
+ - **License:** Apache 2.0
470
+ - **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
471
+ - **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
472
+ - **Resources for more information:**
473
+ - [Research paper](https://arxiv.org/abs/2309.04662)
474
+ - [GitHub Repo](https://github.com/google-research/t5x)
475
+ - [Hugging Face MADLAD-400 Docs (Similar to T5) ](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)
476
 
# Usage

Below are example scripts showing how to use the model.

## Using the PyTorch model with `transformers`

### Running the model on a CPU or GPU

<details>
<summary> Click to expand </summary>

First, install the required Python packages:

`pip install transformers accelerate sentencepiece`

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'google/madlad400-10b-mt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu adoro pizza!
```

</details>
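
The target language is selected by the `<2xx>` tag prepended to the input text. As a minimal sketch (assuming the same checkpoint as above), the model can also be driven through the `text2text-generation` pipeline:

```python
# Sketch: the text2text-generation pipeline with the same checkpoint;
# the <2xx> prefix still selects the target language.
from transformers import pipeline

translator = pipeline("text2text-generation",
                      model="google/madlad400-10b-mt", device_map="auto")
print(translator("<2fr> I love pizza!")[0]["generated_text"])
# Illustrative output: J'adore la pizza !
```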

## Running the model with Candle

<details>
<summary> Click to expand </summary>

Usage with [candle](https://github.com/huggingface/candle):

```bash
$ cargo run --example t5 --release -- \
  --model-id "google/madlad400-10b-mt" \
  --prompt "<2de> How are you, my friend?" \
  --decode --temperature 0
```

</details>
 
 
# Uses

## Direct Use and Downstream Use

> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
> Primary intended users: Research community.

## Out-of-Scope Use

> These models are trained on general-domain data and are therefore not meant to
> work on domain-specific tasks out of the box. Moreover, these research models have not been assessed
> for production use cases.
 
# Bias, Risks, and Limitations

> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
> use case.

## Ethical considerations and risks

> We trained these models with MADLAD-400 and publicly available data to create baseline models that
> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues in the
> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
> output for certain domains. Moreover, large models are dual use technologies that have specific risks
> associated with their use and development. We point the reader to surveys such as those written by
> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
> et al. for a thorough discussion of the risks of machine translation systems.

## Known Limitations

More information needed

## Sensitive Use

More information needed

# Training Details

> We train models of various sizes: a 3B-parameter, 32-layer model,
> a 7.2B-parameter, 48-layer model and a 10.7B-parameter, 32-layer model.
> We share all parameters of the model across language pairs,
> and use a SentencePiece model with 256k tokens shared on both the encoder and decoder
> side. Each input sentence has a `<2xx>` token prepended to the source sentence to indicate the target
> language.

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
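
As a small illustration of this scheme, here is a sketch (reusing `model` and `tokenizer` as loaded in the Usage section) where swapping the prepended tag switches the target language:

```python
# Sketch: the prepended <2xx> tag selects the target language.
# Assumes `model` and `tokenizer` are loaded as in the Usage section.
for lang in ["pt", "de", "ja"]:
    inputs = tokenizer(f"<2{lang}> I love pizza!", return_tensors="pt")
    input_ids = inputs.input_ids.to(model.device)
    outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
    print(lang, tokenizer.decode(outputs[0], skip_special_tokens=True))
```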

## Training Data

> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
> model, a combination of parallel data sources covering 157 languages is also used. Further details are
> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

## Training Procedure

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Evaluation

## Testing Data, Factors & Metrics

> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

> The translation quality of this model varies based on language, as seen in the paper, and likely varies by
> domain, though we have not assessed this.

## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
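
The paper reports automatic translation metrics such as chrF. As a minimal sketch (an assumption about tooling, not the paper's exact evaluation harness), a comparable corpus-level score can be computed with `sacrebleu`:

```python
# Sketch: corpus-level chrF scoring with sacrebleu; hypothetical
# hypotheses/references, not the paper's evaluation pipeline.
from sacrebleu.metrics import CHRF

hypotheses = ["Eu adoro pizza!"]   # model outputs
references = [["Eu amo pizza!"]]   # one reference stream, aligned with hypotheses
print(CHRF().corpus_score(hypotheses, references))
```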

# Environmental Impact

More information needed

# Citation

**BibTeX:**

```bibtex
@misc{kudugunta2023madlad400,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```