Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +309 -0
  3. aya-fig1.png +3 -0
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  *.json filter=lfs diff=lfs merge=lfs -text
37
+ aya-fig1.png filter=lfs diff=lfs merge=lfs -text
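The added rule routes `aya-fig1.png` through Git LFS, alongside the existing `*.zst`, `*tfevents*`, and `*.json` patterns. As a rough illustration of which file names these rules cover, the Python sketch below matches a path's base name against each pattern. It uses `fnmatch` as an approximation of gitattributes matching for slash-free patterns, which is an assumption and ignores directory-specific rules; `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names, not part of the repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Slash-free patterns copied from the .gitattributes hunk in this diff.
LFS_PATTERNS = ["*.zst", "*tfevents*", "*.json", "aya-fig1.png"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: match the file's base name against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ["aya-fig1.png", "README.md", "config.json"]:
        print(candidate, is_lfs_tracked(candidate))
```

Because the new rule names one file rather than `*.png`, other PNG files would still be stored as regular Git objects under this sketch.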
README.md CHANGED
@@ -1,3 +1,312 @@
1
  ---
2
  license: apache-2.0
3
+ datasets:
4
+ - CohereForAI/xP3x
5
+ - CohereForAI/aya_dataset
6
+ - CohereForAI/aya_collection
7
+ - DataProvenanceInitiative/Commercially-Verified-Licenses
8
+ - CohereForAI/aya_evaluation_suite
9
+ language:
10
+ - afr
11
+ - amh
12
+ - ara
13
+ - aze
14
+ - bel
15
+ - ben
16
+ - bul
17
+ - cat
18
+ - ceb
19
+ - ces
20
+ - cym
21
+ - dan
22
+ - deu
23
+ - ell
24
+ - eng
25
+ - epo
26
+ - est
27
+ - eus
28
+ - fin
29
+ - fil
30
+ - fra
31
+ - fry
32
+ - gla
33
+ - gle
34
+ - glg
35
+ - guj
36
+ - hat
37
+ - hau
38
+ - heb
39
+ - hin
40
+ - hun
41
+ - hye
42
+ - ibo
43
+ - ind
44
+ - isl
45
+ - ita
46
+ - jav
47
+ - jpn
48
+ - kan
49
+ - kat
50
+ - kaz
51
+ - khm
52
+ - kir
53
+ - kor
54
+ - kur
55
+ - lao
56
+ - lav
57
+ - lat
58
+ - lit
59
+ - ltz
60
+ - mal
61
+ - mar
62
+ - mkd
63
+ - mlg
64
+ - mlt
65
+ - mon
66
+ - mri
67
+ - msa
68
+ - mya
69
+ - nep
70
+ - nld
71
+ - nor
72
+ - nso
73
+ - nya
74
+ - ory
75
+ - pan
76
+ - pes
77
+ - pol
78
+ - por
79
+ - pus
80
+ - ron
81
+ - rus
82
+ - sin
83
+ - slk
84
+ - slv
85
+ - smo
86
+ - sna
87
+ - snd
88
+ - som
89
+ - sot
90
+ - spa
91
+ - sqi
92
+ - srp
93
+ - sun
94
+ - swa
95
+ - swe
96
+ - tam
97
+ - tel
98
+ - tgk
99
+ - tha
100
+ - tur
101
+ - twi
102
+ - ukr
103
+ - urd
104
+ - uzb
105
+ - vie
106
+ - xho
107
+ - yid
108
+ - yor
109
+ - zho
110
+ - zul
111
+ metrics:
112
+ - accuracy
113
+ - bleu
114
  ---
115
+
116
+ <img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
117
+
118
+ # Model Card for Aya Model
119
+
120
+ ## Model Summary
121
+
122
+ > The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
123
+ > Aya outperforms [mT0](https://huggingface.co/bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
124
+ > The Aya model is trained using [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
125
+ > We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
126
+ > multilingual world.
127
+
128
+ - **Developed by:** Cohere For AI
129
+ - **Model type:** a Transformer-style autoregressive massively multilingual language model.
130
+ - **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com)
131
+ - **Point of Contact**: [Ahmet Ustun](mailto:ahmet@cohere.com)
132
+ - **Languages**: Refer to the list of languages in the `language` section of this model card.
133
+ - **License**: Apache-2.0
134
+ - **Model**: [Aya](https://huggingface.co/CohereForAI/aya)
135
+ - **Model Size**: 13 billion parameters
136
+ - **Datasets**: [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
137
+
138
+ ## Use
139
+
140
+ ```python
141
+ # pip install -q transformers
142
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
143
+
144
+ checkpoint = "CohereForAI/aya_model"
145
+
146
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
147
+ aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
148
+
149
+ inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt")
150
+ outputs = aya_model.generate(inputs)
151
+ print(tokenizer.decode(outputs[0]))
152
+ ```
153
+
154
+ ## Model Details
155
+
156
+ ### Training
157
+
158
+ - Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
159
+ - Number of Finetuning Samples: 25M
160
+ - Batch size: 256
161
+ - Hardware: TPUv4-128
162
+ - Software: T5X, Jax
163
+
164
+ ### Data Sources
165
+
166
+ The Aya model is trained on the following datasets:
167
+
168
+ - [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x)
169
+ - [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)
170
+ - [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection)
171
+ - [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
172
+ - ShareGPT-Command
173
+
174
+ All datasets are restricted to the 101 languages supported by [mT5](https://huggingface.co/google/mt5-xxl). See the [paper](arxiv.com) for details about filtering and pruning.
175
+
176
+ ## Evaluation
177
+
178
+ See Section 5 of our paper for multilingual evaluation across 99 languages, including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance.
179
+
180
+ ## Bias, Risks, and Limitations
181
+
182
+
183
+ For a detailed overview of our efforts at safety mitigation and benchmarking toxicity and bias across multiple languages, we refer to Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
184
+
185
+ We hope that the release of the Aya model will make community-based red-teaming efforts possible by exposing an open-source, massively multilingual model for community research.
186
+
187
+ ## Citation
188
+
189
+ **BibTeX:**
190
+
191
+ ```
192
+ @article{,
193
+ title={},
194
+ author={},
195
+ journal={Preprint},
196
+ year={2024}
197
+ }
198
+ ```
199
+
200
+ **APA:**
201
+
202
+ ## Languages Covered
203
+
204
+ Below is the list of languages used in finetuning the Aya Model. We group languages into higher-, mid-, and lower-resourcedness based on a language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, refer to our [paper]().
205
+
206
+ | ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
207
+ | :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
208
+ | afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
209
+ | amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
210
+ | ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
211
+ | aze | Azerbaijani | Arabic/Latin | Turkic | Common Turkic | Low |
212
+ | bel | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Mid |
213
+ | ben | Bengali | Bengali | Indo-European | Indo-Aryan | Mid |
214
+ | bul | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | Mid |
215
+ | cat | Catalan | Latin | Indo-European | Italic | High |
216
+ | ceb | Cebuano | Latin | Austronesian | Malayo-Polynesian | Mid |
217
+ | ces | Czech | Latin | Indo-European | Balto-Slavic | High |
218
+ | cym | Welsh | Latin | Indo-European | Celtic | Low |
219
+ | dan | Danish | Latin | Indo-European | Germanic | Mid |
220
+ | deu | German | Latin | Indo-European | Germanic | High |
221
+ | ell | Greek | Greek | Indo-European | Graeco-Phrygian | Mid |
222
+ | eng | English | Latin | Indo-European | Germanic | High |
223
+ | epo | Esperanto | Latin | Constructed | Esperantic | Low |
224
+ | est | Estonian | Latin | Uralic | Finnic | Mid |
225
+ | eus | Basque | Latin | Basque | - | High |
226
+ | fin | Finnish | Latin | Uralic | Finnic | High |
227
+ | fil | Tagalog | Latin | Austronesian | Malayo-Polynesian | Mid |
228
+ | fra | French | Latin | Indo-European | Italic | High |
229
+ | fry | Western Frisian | Latin | Indo-European | Germanic | Low |
230
+ | gla | Scottish Gaelic | Latin | Indo-European | Celtic | Low |
231
+ | gle | Irish | Latin | Indo-European | Celtic | Low |
232
+ | glg | Galician | Latin | Indo-European | Italic | Mid |
233
+ | guj | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low |
234
+ | hat | Haitian Creole | Latin | Indo-European | Italic | Low |
235
+ | hau | Hausa | Latin | Afro-Asiatic | Chadic | Low |
236
+ | heb | Hebrew | Hebrew | Afro-Asiatic | Semitic | Mid |
237
+ | hin | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
238
+ | hun | Hungarian | Latin | Uralic | - | High |
239
+ | hye | Armenian | Armenian | Indo-European | Armenic | Low |
240
+ | ibo | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low |
241
+ | ind | Indonesian | Latin | Austronesian | Malayo-Polynesian | Mid |
242
+ | isl | Icelandic | Latin | Indo-European | Germanic | Low |
243
+ | ita | Italian | Latin | Indo-European | Italic | High |
244
+ | jav | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
245
+ | jpn | Japanese | Japanese | Japonic | Japanesic | High |
246
+ | kan | Kannada | Kannada | Dravidian | South Dravidian | Low |
247
+ | kat | Georgian | Georgian | Kartvelian | Georgian-Zan | Mid |
248
+ | kaz | Kazakh | Cyrillic | Turkic | Common Turkic | Mid |
249
+ | khm | Khmer | Khmer | Austroasiatic | Khmeric | Low |
250
+ | kir | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low |
251
+ | kor | Korean | Hangul | Koreanic | Korean | High |
252
+ | kur | Kurdish | Latin | Indo-European | Iranian | Low |
253
+ | lao | Lao | Lao | Tai-Kadai | Kam-Tai | Low |
254
+ | lav | Latvian | Latin | Indo-European | Balto-Slavic | Mid |
255
+ | lat | Latin | Latin | Indo-European | Italic | Mid |
256
+ | lit | Lithuanian | Latin | Indo-European | Balto-Slavic | Mid |
257
+ | ltz | Luxembourgish | Latin | Indo-European | Germanic | Low |
258
+ | mal | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
259
+ | mar | Marathi | Devanagari | Indo-European | Indo-Aryan | Low |
260
+ | mkd | Macedonian | Cyrillic | Indo-European | Balto-Slavic | Low |
261
+ | mlg | Malagasy | Latin | Austronesian | Malayo-Polynesian | Low |
262
+ | mlt | Maltese | Latin | Afro-Asiatic | Semitic | Low |
263
+ | mon | Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
264
+ | mri | Maori | Latin | Austronesian | Malayo-Polynesian | Low |
265
+ | msa | Malay | Latin | Austronesian | Malayo-Polynesian | Mid |
266
+ | mya | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low |
267
+ | nep | Nepali | Devanagari | Indo-European | Indo-Aryan | Low |
268
+ | nld | Dutch | Latin | Indo-European | Germanic | High |
269
+ | nor | Norwegian | Latin | Indo-European | Germanic | Low |
270
+ | nso | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
271
+ | nya | Chichewa | Latin | Atlantic-Congo | Benue-Congo | Low |
272
+ | ory | Oriya | Oriya | Indo-European | Indo-Aryan | Low |
273
+ | pan | Punjabi | Gurmukhi | Indo-European | Indo-Aryan | Low |
274
+ | pes | Persian | Arabic | Indo-European | Iranian | High |
275
+ | pol | Polish | Latin | Indo-European | Balto-Slavic | High |
276
+ | por | Portuguese | Latin | Indo-European | Italic | High |
277
+ | pus | Pashto | Arabic | Indo-European | Iranian | Low |
278
+ | ron | Romanian | Latin | Indo-European | Italic | Mid |
279
+ | rus | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
280
+ | sin | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
281
+ | slk | Slovak | Latin | Indo-European | Balto-Slavic | Mid |
282
+ | slv | Slovenian | Latin | Indo-European | Balto-Slavic | Mid |
283
+ | smo | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
284
+ | sna | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
285
+ | snd | Sindhi | Arabic | Indo-European | Indo-Aryan | Low |
286
+ | som | Somali | Latin | Afro-Asiatic | Cushitic | Low |
287
+ | sot | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
288
+ | spa | Spanish | Latin | Indo-European | Italic | High |
289
+ | sqi | Albanian | Latin | Indo-European | Albanian | Low |
290
+ | srp | Serbian | Cyrillic | Indo-European | Balto-Slavic | High |
291
+ | sun | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
292
+ | swa | Swahili | Latin | Atlantic-Congo | Benue-Congo | Low |
293
+ | swe | Swedish | Latin | Indo-European | Germanic | High |
294
+ | tam | Tamil | Tamil | Dravidian | South Dravidian | Mid |
295
+ | tel | Telugu | Telugu | Dravidian | South Dravidian | Low |
296
+ | tgk | Tajik | Cyrillic | Indo-European | Iranian | Low |
297
+ | tha | Thai | Thai | Tai-Kadai | Kam-Tai | Mid |
298
+ | tur | Turkish | Latin | Turkic | Common Turkic | High |
299
+ | twi | Twi | Latin | Atlantic-Congo | Niger-Congo | Low |
300
+ | ukr | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | Mid |
301
+ | urd | Urdu | Arabic | Indo-European | Indo-Aryan | Mid |
302
+ | uzb | Uzbek | Latin | Turkic | Common Turkic | Mid |
303
+ | vie | Vietnamese | Latin | Austroasiatic | Vietic | High |
304
+ | xho | Xhosa | Latin | Atlantic-Congo | Benue-Congo | Low |
305
+ | yid | Yiddish | Hebrew | Indo-European | Germanic | Low |
306
+ | yor | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low |
307
+ | zho | Chinese | Han | Sino-Tibetan | Sinitic | High |
308
+ | zul | Zulu | Latin | Atlantic-Congo | Benue-Congo | Low |
309
+
310
+ ## Model Card Contact
311
+
312
+ For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.
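The `Use` snippet in the README diff above calls `generate` with no length argument and decodes with special tokens included. A minimal sketch of a wrapper that sets an explicit generation budget and strips special tokens might look like the following. The helper name `generate_text` and the default of 128 new tokens are assumptions for illustration, not values from the model card; `max_new_tokens` and `skip_special_tokens` are standard `transformers` arguments.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=128):
    """Tokenize a prompt, generate, and decode without special tokens."""
    # Calling the tokenizer returns input_ids and attention_mask together.
    inputs = tokenizer(prompt, return_tensors="pt")
    # An explicit budget avoids relying on the default generation length.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # skip_special_tokens drops padding and end-of-sequence markers.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Usage, assuming `transformers` is installed and the checkpoint name used
# in the README resolves on the Hub:
#   from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya_model")
#   model = AutoModelForSeq2SeqLM.from_pretrained("CohereForAI/aya_model")
#   print(generate_text(model, tokenizer, "Translate to English: Je t'aime."))
```

Passing the model and tokenizer in as arguments keeps the helper independent of how the 13-billion-parameter checkpoint is loaded, for example with a reduced-precision dtype.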
aya-fig1.png ADDED

Git LFS Details

  • SHA256: 7edfbbb3281eaad6e86565f6f4dbed56a63132faf08fee437599fbe13930fe4f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.23 MB