Commit
•
2d84ea7
1
Parent(s):
3739b7d
Add Model Card (#1)
- Upload aya-fig1.png (3abb7c602e62cdbf30475a287902e8c2f9c0181f)
- Add Model Card details (b1595698cb965a84ad366eef03f56c3d44565cba)
- Correct some stuff (aec01f8635f5fd3ad48335530ce72fa2ff9e0b45)
Co-authored-by: Viraat Aryabumi <viraat@users.noreply.huggingface.co>
- .gitattributes +1 -0
- README.md +309 -0
- aya-fig1.png +3 -0
.gitattributes
CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
*.json filter=lfs diff=lfs merge=lfs -text
|
37 |
+
aya-fig1.png filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
@@ -1,3 +1,312 @@
1 |
---
|
2 |
license: apache-2.0
|
3 |
+
datasets:
|
4 |
+
- CohereForAI/xP3x
|
5 |
+
- CohereForAI/aya_dataset
|
6 |
+
- CohereForAI/aya_collection
|
7 |
+
- DataProvenanceInitiative/Commercially-Verified-Licenses
|
8 |
+
- CohereForAI/aya_evaluation_suite
|
9 |
+
language:
|
10 |
+
- afr
|
11 |
+
- amh
|
12 |
+
- ara
|
13 |
+
- aze
|
14 |
+
- bel
|
15 |
+
- ben
|
16 |
+
- bul
|
17 |
+
- cat
|
18 |
+
- ceb
|
19 |
+
- ces
|
20 |
+
- cym
|
21 |
+
- dan
|
22 |
+
- deu
|
23 |
+
- ell
|
24 |
+
- eng
|
25 |
+
- epo
|
26 |
+
- est
|
27 |
+
- eus
|
28 |
+
- fin
|
29 |
+
- fil
|
30 |
+
- fra
|
31 |
+
- fry
|
32 |
+
- gla
|
33 |
+
- gle
|
34 |
+
- glg
|
35 |
+
- guj
|
36 |
+
- hat
|
37 |
+
- hau
|
38 |
+
- heb
|
39 |
+
- hin
|
40 |
+
- hun
|
41 |
+
- hye
|
42 |
+
- ibo
|
43 |
+
- ind
|
44 |
+
- isl
|
45 |
+
- ita
|
46 |
+
- jav
|
47 |
+
- jpn
|
48 |
+
- kan
|
49 |
+
- kat
|
50 |
+
- kaz
|
51 |
+
- khm
|
52 |
+
- kir
|
53 |
+
- kor
|
54 |
+
- kur
|
55 |
+
- lao
|
56 |
+
- lav
|
57 |
+
- lat
|
58 |
+
- lit
|
59 |
+
- ltz
|
60 |
+
- mal
|
61 |
+
- mar
|
62 |
+
- mkd
|
63 |
+
- mlg
|
64 |
+
- mlt
|
65 |
+
- mon
|
66 |
+
- mri
|
67 |
+
- msa
|
68 |
+
- mya
|
69 |
+
- nep
|
70 |
+
- nld
|
71 |
+
- nor
|
72 |
+
- nso
|
73 |
+
- nya
|
74 |
+
- ory
|
75 |
+
- pan
|
76 |
+
- pes
|
77 |
+
- pol
|
78 |
+
- por
|
79 |
+
- pus
|
80 |
+
- ron
|
81 |
+
- rus
|
82 |
+
- sin
|
83 |
+
- slk
|
84 |
+
- slv
|
85 |
+
- smo
|
86 |
+
- sna
|
87 |
+
- snd
|
88 |
+
- som
|
89 |
+
- sot
|
90 |
+
- spa
|
91 |
+
- sqi
|
92 |
+
- srp
|
93 |
+
- sun
|
94 |
+
- swa
|
95 |
+
- swe
|
96 |
+
- tam
|
97 |
+
- tel
|
98 |
+
- tgk
|
99 |
+
- tha
|
100 |
+
- tur
|
101 |
+
- twi
|
102 |
+
- ukr
|
103 |
+
- urd
|
104 |
+
- uzb
|
105 |
+
- vie
|
106 |
+
- xho
|
107 |
+
- yid
|
108 |
+
- yor
|
109 |
+
- zho
|
110 |
+
- zul
|
111 |
+
metrics:
|
112 |
+
- accuracy
|
113 |
+
- bleu
|
114 |
---
|
115 |
+
|
116 |
+
<img src="aya-fig1.png" alt="Aya model summary image" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
|
117 |
+
|
118 |
+
# Model Card for Aya Model
|
119 |
+
|
120 |
+
## Model Summary
|
121 |
+
|
122 |
+
> The Aya model is a massively multilingual generative language model that follows instructions in 101 languages.
|
123 |
+
> Aya outperforms [mT0](https://huggingface.co/bigscience/mt0-xxl) and [BLOOMZ](https://huggingface.co/bigscience/bloomz) on a wide variety of automatic and human evaluations despite covering double the number of languages.
|
124 |
+
> The Aya model is trained using [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), a subset of [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses) and ShareGPT-Command.
|
125 |
+
> We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies empowering a
|
126 |
+
> multilingual world.
|
127 |
+
|
128 |
+
- **Developed by:** Cohere For AI
|
129 |
+
- **Model type:** a Transformer-style, autoregressive, massively multilingual language model.
|
130 |
+
- **Paper**: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com)
|
131 |
+
- **Point of Contact**: [Ahmet Ustun](mailto:ahmet@cohere.com)
|
132 |
+
- **Languages**: Refer to the list of languages in the `language` section of this model card.
|
133 |
+
- **License**: Apache-2.0
|
134 |
+
- **Model**: [Aya](https://huggingface.co/CohereForAI/aya)
|
135 |
+
- **Model Size**: 13 billion parameters
|
136 |
+
- **Datasets**: [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x), [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset), [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection), [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses), ShareGPT-Command.
|
137 |
+
|
138 |
+
## Use
|
139 |
+
|
140 |
+
```python
|
141 |
+
# pip install -q transformers
|
142 |
+
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
143 |
+
|
144 |
+
checkpoint = "CohereForAI/aya_model"
|
145 |
+
|
146 |
+
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
|
147 |
+
aya_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
|
148 |
+
|
149 |
+
inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt")
|
150 |
+
outputs = aya_model.generate(inputs)
|
151 |
+
print(tokenizer.decode(outputs[0]))
|
152 |
+
```
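
The snippet above calls `generate` with its default settings, which can cut longer answers short, and it prints special tokens along with the text. A minimal sketch of a wrapper that sets an explicit output budget and strips special tokens is shown below. The helper name `generate_text` and the default of 128 new tokens are illustrative assumptions, not part of the Aya release; `max_new_tokens` and `skip_special_tokens` are standard `transformers` arguments.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=128):
    """Encode `prompt`, generate a reply, and decode it without special tokens.

    `model` and `tokenizer` are expected to behave like the objects returned by
    AutoModelForSeq2SeqLM.from_pretrained and AutoTokenizer.from_pretrained.
    """
    # Turn the prompt into a batch of token ids (PyTorch tensors).
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Cap the number of generated tokens explicitly instead of using the default.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Drop padding and end-of-sequence markers from the decoded string.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with the objects created in the snippet above:
# print(generate_text(aya_model, tokenizer, "Translate to English: Je t'aime."))
```

Passing the model and tokenizer in as arguments keeps the helper independent of how the checkpoint was loaded.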
|
153 |
+
|
154 |
+
## Model Details
|
155 |
+
|
156 |
+
### Training
|
157 |
+
|
158 |
+
- Architecture: Same as [mt5-xxl](https://huggingface.co/google/mt5-xxl)
|
159 |
+
- Number of Finetuning Samples: 25M
|
160 |
+
- Batch size: 256
|
161 |
+
- Hardware: TPUv4-128
|
162 |
+
- Software: T5X, JAX
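
As a back-of-the-envelope check, the sample count and batch size listed above imply a rough number of optimizer steps. The sketch below assumes that "25M" counts individual examples seen during finetuning and that every step consumes one full batch; the card confirms neither, so treat the result as an estimate only.

```python
import math

# Values taken from the training list above.
finetuning_samples = 25_000_000  # "25M", read here as examples seen (assumption)
batch_size = 256

# One optimizer step per batch; round up to cover a final partial batch.
estimated_steps = math.ceil(finetuning_samples / batch_size)
print(f"Estimated finetuning steps: {estimated_steps:,}")
```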
|
163 |
+
|
164 |
+
### Data Sources
|
165 |
+
|
166 |
+
The Aya model is trained on the following datasets:
|
167 |
+
|
168 |
+
- [xP3x](https://huggingface.co/datasets/CohereForAI/xP3x)
|
169 |
+
- [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset)
|
170 |
+
- [Aya Collection](https://huggingface.co/datasets/CohereForAI/aya_collection)
|
171 |
+
- [DataProvenance collection](https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses)
|
172 |
+
- ShareGPT-Command
|
173 |
+
|
174 |
+
All datasets are restricted to the 101 languages supported by [mT5](https://huggingface.co/google/mt5-xxl). See the [paper](arxiv.com) for details about filtering and pruning.
|
175 |
+
|
176 |
+
## Evaluation
|
177 |
+
|
178 |
+
See Section 5 of our paper for multilingual evaluation across 99 languages, including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance.
|
179 |
+
|
180 |
+
## Bias, Risks, and Limitations
|
181 |
+
|
182 |
+
|
183 |
+
For a detailed overview of our efforts at safety mitigation and at benchmarking toxicity and bias across multiple languages, see Sections 6 and 7 of our paper: [Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model](arxiv.com).
|
184 |
+
|
185 |
+
We hope that the release of the Aya model will make community-based red-teaming efforts possible by exposing an open-source, massively multilingual model for community research.
|
186 |
+
|
187 |
+
## Citation
|
188 |
+
|
189 |
+
**BibTeX:**
|
190 |
+
|
191 |
+
```
|
192 |
+
@article{,
|
193 |
+
title={},
|
194 |
+
author={},
|
195 |
+
journal={Preprint},
|
196 |
+
year={2024}
|
197 |
+
}
|
198 |
+
```
|
199 |
+
|
200 |
+
**APA:**
|
201 |
+
|
202 |
+
## Languages Covered
|
203 |
+
|
204 |
+
Below is the list of languages used in finetuning the Aya Model. We group languages into higher-, mid-, and lower-resourced categories based on the language classification by [Joshi et al., 2020](https://microsoft.github.io/linguisticdiversity/). For further details, refer to our [paper]().
|
205 |
+
|
206 |
+
| ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
|
207 |
+
| :------- | :-------------- | :----------: | :-------------: | :---------------: | :-----------: |
|
208 |
+
| afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
|
209 |
+
| amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
|
210 |
+
| ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
|
211 |
+
| aze | Azerbaijani | Arabic/Latin | Turkic | Common Turkic | Low |
|
212 |
+
| bel | Belarusian | Cyrillic | Indo-European | Balto-Slavic | Mid |
|
213 |
+
| ben | Bengali | Bengali | Indo-European | Indo-Aryan | Mid |
|
214 |
+
| bul | Bulgarian | Cyrillic | Indo-European | Balto-Slavic | Mid |
|
215 |
+
| cat | Catalan | Latin | Indo-European | Italic | High |
|
216 |
+
| ceb | Cebuano | Latin | Austronesian | Malayo-Polynesian | Mid |
|
217 |
+
| ces | Czech | Latin | Indo-European | Balto-Slavic | High |
|
218 |
+
| cym | Welsh | Latin | Indo-European | Celtic | Low |
|
219 |
+
| dan | Danish | Latin | Indo-European | Germanic | Mid |
|
220 |
+
| deu | German | Latin | Indo-European | Germanic | High |
|
221 |
+
| ell | Greek | Greek | Indo-European | Graeco-Phrygian | Mid |
|
222 |
+
| eng | English | Latin | Indo-European | Germanic | High |
|
223 |
+
| epo | Esperanto | Latin | Constructed | Esperantic | Low |
|
224 |
+
| est | Estonian | Latin | Uralic | Finnic | Mid |
|
225 |
+
| eus | Basque | Latin | Basque | - | High |
|
226 |
+
| fin | Finnish | Latin | Uralic | Finnic | High |
|
227 |
+
| fil | Tagalog | Latin | Austronesian | Malayo-Polynesian | Mid |
|
228 |
+
| fra | French | Latin | Indo-European | Italic | High |
|
229 |
+
| fry | Western Frisian | Latin | Indo-European | Germanic | Low |
|
230 |
+
| gla | Scottish Gaelic | Latin | Indo-European | Celtic | Low |
|
231 |
+
| gle | Irish | Latin | Indo-European | Celtic | Low |
|
232 |
+
| glg | Galician | Latin | Indo-European | Italic | Mid |
|
233 |
+
| guj | Gujarati | Gujarati | Indo-European | Indo-Aryan | Low |
|
234 |
+
| hat | Haitian Creole | Latin | Indo-European | Italic | Low |
|
235 |
+
| hau | Hausa | Latin | Afro-Asiatic | Chadic | Low |
|
236 |
+
| heb | Hebrew | Hebrew | Afro-Asiatic | Semitic | Mid |
|
237 |
+
| hin | Hindi | Devanagari | Indo-European | Indo-Aryan | High |
|
238 |
+
| hun | Hungarian | Latin | Uralic | - | High |
|
239 |
+
| hye | Armenian | Armenian | Indo-European | Armenic | Low |
|
240 |
+
| ibo | Igbo | Latin | Atlantic-Congo | Benue-Congo | Low |
|
241 |
+
| ind | Indonesian | Latin | Austronesian | Malayo-Polynesian | Mid |
|
242 |
+
| isl | Icelandic | Latin | Indo-European | Germanic | Low |
|
243 |
+
| ita | Italian | Latin | Indo-European | Italic | High |
|
244 |
+
| jav | Javanese | Latin | Austronesian | Malayo-Polynesian | Low |
|
245 |
+
| jpn | Japanese | Japanese | Japonic | Japanesic | High |
|
246 |
+
| kan | Kannada | Kannada | Dravidian | South Dravidian | Low |
|
247 |
+
| kat | Georgian | Georgian | Kartvelian | Georgian-Zan | Mid |
|
248 |
+
| kaz | Kazakh | Cyrillic | Turkic | Common Turkic | Mid |
|
249 |
+
| khm | Khmer | Khmer | Austroasiatic | Khmeric | Low |
|
250 |
+
| kir | Kyrgyz | Cyrillic | Turkic | Common Turkic | Low |
|
251 |
+
| kor | Korean | Hangul | Koreanic | Korean | High |
|
252 |
+
| kur | Kurdish | Latin | Indo-European | Iranian | Low |
|
253 |
+
| lao | Lao | Lao | Tai-Kadai | Kam-Tai | Low |
|
254 |
+
| lav | Latvian | Latin | Indo-European | Balto-Slavic | Mid |
|
255 |
+
| lat | Latin | Latin | Indo-European | Italic | Mid |
|
256 |
+
| lit | Lithuanian | Latin | Indo-European | Balto-Slavic | Mid |
|
257 |
+
| ltz | Luxembourgish | Latin | Indo-European | Germanic | Low |
|
258 |
+
| mal | Malayalam | Malayalam | Dravidian | South Dravidian | Low |
|
259 |
+
| mar | Marathi | Devanagari | Indo-European | Indo-Aryan | Low |
|
260 |
+
| mkd | Macedonian | Cyrillic | Indo-European | Balto-Slavic | Low |
|
261 |
+
| mlg | Malagasy | Latin | Austronesian | Malayo-Polynesian | Low |
|
262 |
+
| mlt | Maltese | Latin | Afro-Asiatic | Semitic | Low |
|
263 |
+
| mon | Mongolian | Cyrillic | Mongolic-Khitan | Mongolic | Low |
|
264 |
+
| mri | Maori | Latin | Austronesian | Malayo-Polynesian | Low |
|
265 |
+
| msa | Malay | Latin | Austronesian | Malayo-Polynesian | Mid |
|
266 |
+
| mya | Burmese | Myanmar | Sino-Tibetan | Burmo-Qiangic | Low |
|
267 |
+
| nep | Nepali | Devanagari | Indo-European | Indo-Aryan | Low |
|
268 |
+
| nld | Dutch | Latin | Indo-European | Germanic | High |
|
269 |
+
| nor | Norwegian | Latin | Indo-European | Germanic | Low |
|
270 |
+
| nso | Northern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
|
271 |
+
| nya | Chichewa | Latin | Atlantic-Congo | Benue-Congo | Low |
|
272 |
+
| ory | Oriya | Oriya | Indo-European | Indo-Aryan | Low |
|
273 |
+
| pan | Punjabi | Gurmukhi | Indo-European | Indo-Aryan | Low |
|
274 |
+
| pes | Persian | Arabic | Indo-European | Iranian | High |
|
275 |
+
| pol | Polish | Latin | Indo-European | Balto-Slavic | High |
|
276 |
+
| por | Portuguese | Latin | Indo-European | Italic | High |
|
277 |
+
| pus | Pashto | Arabic | Indo-European | Iranian | Low |
|
278 |
+
| ron | Romanian | Latin | Indo-European | Italic | Mid |
|
279 |
+
| rus | Russian | Cyrillic | Indo-European | Balto-Slavic | High |
|
280 |
+
| sin | Sinhala | Sinhala | Indo-European | Indo-Aryan | Low |
|
281 |
+
| slk | Slovak | Latin | Indo-European | Balto-Slavic | Mid |
|
282 |
+
| slv | Slovenian | Latin | Indo-European | Balto-Slavic | Mid |
|
283 |
+
| smo | Samoan | Latin | Austronesian | Malayo-Polynesian | Low |
|
284 |
+
| sna | Shona | Latin | Atlantic-Congo | Benue-Congo | Low |
|
285 |
+
| snd | Sindhi | Arabic | Indo-European | Indo-Aryan | Low |
|
286 |
+
| som | Somali | Latin | Afro-Asiatic | Cushitic | Low |
|
287 |
+
| sot | Southern Sotho | Latin | Atlantic-Congo | Benue-Congo | Low |
|
288 |
+
| spa | Spanish | Latin | Indo-European | Italic | High |
|
289 |
+
| sqi | Albanian | Latin | Indo-European | Albanian | Low |
|
290 |
+
| srp | Serbian | Cyrillic | Indo-European | Balto-Slavic | High |
|
291 |
+
| sun | Sundanese | Latin | Austronesian | Malayo-Polynesian | Low |
|
292 |
+
| swa | Swahili | Latin | Atlantic-Congo | Benue-Congo | Low |
|
293 |
+
| swe | Swedish | Latin | Indo-European | Germanic | High |
|
294 |
+
| tam | Tamil | Tamil | Dravidian | South Dravidian | Mid |
|
295 |
+
| tel | Telugu | Telugu | Dravidian | South Dravidian | Low |
|
296 |
+
| tgk | Tajik | Cyrillic | Indo-European | Iranian | Low |
|
297 |
+
| tha | Thai | Thai | Tai-Kadai | Kam-Tai | Mid |
|
298 |
+
| tur | Turkish | Latin | Turkic | Common Turkic | High |
|
299 |
+
| twi | Twi | Latin | Atlantic-Congo | Niger-Congo | Low |
|
300 |
+
| ukr | Ukrainian | Cyrillic | Indo-European | Balto-Slavic | Mid |
|
301 |
+
| urd | Urdu | Arabic | Indo-European | Indo-Aryan | Mid |
|
302 |
+
| uzb | Uzbek | Latin | Turkic | Common Turkic | Mid |
|
303 |
+
| vie | Vietnamese | Latin | Austroasiatic | Vietic | High |
|
304 |
+
| xho | Xhosa | Latin | Atlantic-Congo | Benue-Congo | Low |
|
305 |
+
| yid | Yiddish | Hebrew | Indo-European | Germanic | Low |
|
306 |
+
| yor | Yoruba | Latin | Atlantic-Congo | Benue-Congo | Low |
|
307 |
+
| zho | Chinese | Han | Sino-Tibetan | Sinitic | High |
|
308 |
+
| zul | Zulu | Latin | Atlantic-Congo | Benue-Congo | Low |
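
To see how the languages split across the resourcedness groups, the table rows can be parsed and tallied. The following is a minimal sketch: `parse_rows` and `count_by_resourcedness` are hypothetical helper names, and the excerpt holds only four rows copied from the table above, so paste in the full table to count every language.

```python
from collections import Counter

# A few rows copied from the table above; replace with the full table as needed.
TABLE_EXCERPT = """
| ISO Code | Language Name | Script | Family | Subgrouping | Resourcedness |
| :------- | :------------ | :----: | :----: | :---------: | :-----------: |
| afr | Afrikaans | Latin | Indo-European | Germanic | Mid |
| amh | Amharic | Ge'ez | Afro-Asiatic | Semitic | Low |
| ara | Arabic | Arabic | Afro-Asiatic | Semitic | High |
| cym | Welsh | Latin | Indo-European | Celtic | Low |
"""


def parse_rows(markdown_table):
    """Return the data rows of a six-column markdown table as lists of cells."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip malformed lines, the header row, and the alignment row (":---").
        if len(cells) != 6 or cells[0] == "ISO Code" or set(cells[0]) <= set(":-"):
            continue
        rows.append(cells)
    return rows


def count_by_resourcedness(rows):
    """Count languages per value in the last column (High / Mid / Low)."""
    return Counter(row[5] for row in rows)


counts = count_by_resourcedness(parse_rows(TABLE_EXCERPT))
for level in ("High", "Mid", "Low"):
    print(level, counts[level])
```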
|
309 |
+
|
310 |
+
## Model Card Contact
|
311 |
+
|
312 |
+
For errors in this model card, contact Ahmet or Viraat, `{ahmet, viraat} at cohere dot com`.
|
aya-fig1.png
ADDED