---
language:
  - ar
datasets:
  - mc4
  - oscar
  - arabic_billion_words
---

# arabic-t5-small

This is a T5 v1.1 (small) model pretrained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and OSCAR datasets. Due to time limitations, the model was trained on only about `10%` of the full dataset.

## Training parameters

|       Parameter       |     Value     |
| :-------------------: | :-----------: |
|         Steps         |   `22,000`    |
|  Training batch size  |     `384`     |
| Evaluation batch size |     `768`     |
|     Learning rate     |    `1e-2`     |
|         dtype         | `jnp.float32` |

## Results

|       Metric        |     Value     |
| :-----------------: | :-----------: |
| Evaluation accuracy |   `56.84%`    |
|   Evaluation loss   |    `2.423`    |
|    Training loss    |    `2.392`    |
|    Training time    | `22h 23m 51s` |
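
As a rough sanity check, the figures in the two tables above can be combined into an estimate of how much data the run consumed and how fast it went. This is only a back-of-the-envelope sketch: it assumes every step processed one full training batch, that the reported training time covers exactly those steps (it may also include evaluation), and that "about `10%`" is exactly 0.10. The variable names are illustrative.

```python
# Figures copied from the training parameters and results tables.
STEPS = 22_000
TRAIN_BATCH_SIZE = 384
TRAINING_TIME_S = 22 * 3600 + 23 * 60 + 51  # 22h 23m 51s in seconds

# Assumption: every step consumed one full training batch.
sequences_seen = STEPS * TRAIN_BATCH_SIZE

# Assumption: the reported training time covers exactly these steps.
steps_per_second = STEPS / TRAINING_TIME_S
sequences_per_second = sequences_seen / TRAINING_TIME_S

# Assumption: "about 10%" of the dataset is treated as exactly 0.10.
FRACTION_SEEN = 0.10
steps_for_full_pass = round(STEPS / FRACTION_SEEN)

print(f"sequences seen:        {sequences_seen:,}")
print(f"steps per second:      {steps_per_second:.3f}")
print(f"sequences per second:  {sequences_per_second:.1f}")
print(f"steps for a full pass: {steps_for_full_pass:,}")
```

Under these assumptions the run saw roughly 8.4 million training sequences, and one full pass over the data would take on the order of 220,000 steps at the same batch size.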

## Note for finetuning

This model was pretrained with dropout turned off, so the default `dropout_rate` in the model config is `0`.
To finetune the model, dropout should be turned back on, like this:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```

or, using the auto class:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)
```