---
language:
- ar
datasets:
- mc4
- oscar
- arabic_billion_words
---
# arabic-t5-small
This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and OSCAR datasets. Due to time limitations, the model was trained on only about 10% of the full dataset.
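The checkpoint can be loaded with the standard `transformers` API. A minimal usage sketch; the Arabic input string here is made up for illustration, and since this is a pretrained (not finetuned) model, meaningful generations generally require finetuning on a downstream task first:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("flax-community/arabic-t5-small")
model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small")

# Hypothetical input, for illustration only.
inputs = tokenizer("مرحبا بالعالم", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```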
## Training parameters
| Parameter | Value |
| --- | --- |
| Training steps | 22'000 |
| Training batch size | 384 |
| Evaluation batch size | 768 |
| Learning rate | 1e-2 |
| dtype | `jnp.float32` |
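For reference, the hyperparameters above can be written down as a Flax/optax setup. This is only a sketch, not the actual training script; in particular, the optimizer choice (Adafactor, common for T5 pretraining) is an assumption the card does not confirm:

```python
import jax.numpy as jnp
import optax

# Values taken from the table above.
TRAIN_STEPS = 22_000
TRAIN_BATCH_SIZE = 384
EVAL_BATCH_SIZE = 768
LEARNING_RATE = 1e-2
DTYPE = jnp.float32  # parameter/computation dtype

# Assumption: Adafactor, as commonly used for T5 pretraining;
# the card does not state which optimizer was actually used.
optimizer = optax.adafactor(learning_rate=LEARNING_RATE)
```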
## Note for finetuning
This model was pretrained with dropout turned off, so the default `dropout_rate` in the model config is `0`. To finetune the model, dropout should be turned back on, like this:
```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "flax-community/arabic-t5-small", dropout_rate=0.1
)
```

or:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "flax-community/arabic-t5-small", dropout_rate=0.1
)
```
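Either call overrides the value stored in the checkpoint's config, so `model.config.dropout_rate` should now read `0.1`.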