---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
tags:
- T5
---
# afriteva_base

## Model description

AfriTeVa base is a sequence-to-sequence model pretrained on 10 African languages.
## Languages

Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
### More information on the model and dataset

### The model

- 229M-parameter encoder-decoder architecture (T5-like)
- 12 layers, 12 attention heads, and a 512-token sequence length (see the check sketched below)
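These architecture details can be cross-checked against the released checkpoint. A minimal sketch, assuming the checkpoint is published as `castorini/afriteva_base` (consistent with the loading example further down) and exposes a T5-style config:

```python
>>> from transformers import AutoConfig, AutoModelForSeq2SeqLM

>>> config = AutoConfig.from_pretrained("castorini/afriteva_base")
>>> config.num_layers, config.num_heads  # T5-style config; expected to report 12 and 12

>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")
>>> sum(p.numel() for p in model.parameters())  # total parameter count, expected around 229M
```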
### The dataset

- Multilingual: the 10 African languages listed above
- 143 million tokens (1 GB of text data)
- Tokenizer vocabulary size: 70,000 tokens (see the check sketched below)
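The tokenizer statistics can be checked in the same way; a short sketch, again assuming the `castorini/afriteva_base` checkpoint name:

```python
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> len(tokenizer)  # vocabulary size, expected to be about 70,000
```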
## Intended uses & limitations

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"

>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids

>>> model(**model_inputs, labels=labels)  # forward pass; returns the seq2seq LM loss and logits
```
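The forward pass above returns the standard sequence-to-sequence output, whose loss can be backpropagated when fine-tuning, and the same model/tokenizer pair supports generation. A brief, illustrative continuation of the snippet (optimizer choice, learning rate, and decoding settings are assumptions, not the authors' configuration):

```python
>>> import torch

>>> # Fine-tuning step: backpropagate the loss returned by the forward pass.
>>> outputs = model(**model_inputs, labels=labels)
>>> optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative settings
>>> outputs.loss.backward()
>>> optimizer.step()
>>> optimizer.zero_grad()

>>> # Inference: greedy generation (most useful after task-specific fine-tuning).
>>> generated_ids = model.generate(**model_inputs, max_length=64)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```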
## Training Procedure

For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).
## BibTeX entry and citation info

Coming soon ...
|