---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
tags:
- T5
---
# afriteva_large
## Model description
AfriTeVa large is a sequence-to-sequence model pretrained on 10 African languages.
## Languages
Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
### More information on the model and dataset
### The model
- 745M-parameter encoder-decoder architecture (T5-like)
- 12 layers, 12 attention heads, and a 512-token sequence length
### The dataset
- Multilingual: 10 African languages listed above
- 143 million tokens (1 GB of text data)
- Tokenizer Vocabulary Size: 70,000 tokens
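The architecture and vocabulary figures above can be inspected directly from the published checkpoint. A minimal sketch (the attribute names assume the T5-style configuration described above):

```python
>>> from transformers import AutoConfig, AutoTokenizer
>>> config = AutoConfig.from_pretrained("castorini/afriteva_large")
>>> config.num_layers, config.num_heads, config.d_model  # encoder depth, attention heads, hidden size
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_large")
>>> len(tokenizer)  # tokenizer vocabulary size
```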
## Intended uses & limitations
`afriteva_large` is a pre-trained model primarily aimed at being fine-tuned on multilingual sequence-to-sequence tasks.
```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_large")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_large")
>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"
>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids
>>> model(**model_inputs, labels=labels) # forward pass
```
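For inference, the same objects can be reused with `generate`; a minimal sketch continuing the session above (`max_length` is an arbitrary illustrative value):

```python
>>> generated = model.generate(**model_inputs, max_length=64)  # autoregressive decoding
>>> tokenizer.batch_decode(generated, skip_special_tokens=True)
```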
## Training Procedure
For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).
## BibTeX entry and citation info
coming soon ...