metadata
language:
  - ar
datasets:
  - mc4
  - oscar
  - arabic_billion_words

arabic-t5-small

This is a T5v1.1 (small) model trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.

Due to time limitations, the model could only be trained on about 10% of the whole dataset. This is equivalent to 22'000 steps, or about 4.3 billion tokens.
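As a rough sanity check on these figures, the token count follows from steps × batch size × sequence length. The batch size of 384 is the training batch size reported in this card; the sequence length of 512 tokens is an assumption (it is not stated here), picked because it reproduces the reported total of about 4.3 billion tokens.

```python
# Back-of-the-envelope check of the reported token count.
steps = 22_000       # training steps reported in this card
batch_size = 384     # training batch size reported in this card
seq_len = 512        # ASSUMPTION: sequence length is not stated in this card

tokens_seen = steps * batch_size * seq_len
print(f"{tokens_seen / 1e9:.2f} billion tokens")
```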

Training parameters

| Parameter | Value |
| --- | --- |
| Training batch size | 384 |
| Evaluation batch size | 768 |
| Learning rate | 1e-2 |
| dtype | jnp.float32 |

Preprocessing and the tokenizer

We tried to keep the preprocessing to the bare minimum. We only replaced URLs, emails and social media user mentions with fixed tokens.
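A minimal sketch of this kind of replacement is shown here. The regular expressions and the placeholder strings `[URL]`, `[EMAIL]` and `[USER]` are hypothetical, chosen only for illustration; the actual patterns and tokens are the ones defined in the tokenizer code.

```python
import re

# ASSUMPTION: these patterns and placeholder tokens are illustrative only;
# they are not taken from the project's tokenizer code.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # "@name" not preceded by a word character


def preprocess(text: str) -> str:
    """Replace URLs, emails and user mentions with fixed tokens."""
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)    # emails before mentions: both contain "@"
    text = MENTION_RE.sub("[USER]", text)
    return text


sample = "mail me@example.com or ping @salti, see https://example.com/page"
cleaned = preprocess(sample)
print(cleaned)
```

Everything else in the text, including diacritics and punctuation, passes through unchanged in this sketch.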

Contrary to other pretrained Arabic LMs, we decided to not strip the Arabic diacritics and to keep them in the vocabulary.
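To make the difference concrete, the sketch below shows the diacritic-stripping step that this model's preprocessing skips. It assumes the common tashkeel range U+064B to U+0652 (fathatan through sukun); the helper name `strip_diacritics` is hypothetical, and other pipelines may strip a wider or narrower set of marks.

```python
import re

# Common Arabic diacritics (tashkeel): FATHATAN U+064B through SUKUN U+0652.
DIACRITICS_RE = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove tashkeel marks. Shown for contrast: this model does NOT do this."""
    return DIACRITICS_RE.sub("", text)


# The word "kataba" (he wrote): three consonants, each followed by a fatha.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)

print(len(vocalized), len(bare))
```

Because the marks are kept, vocalized and unvocalized spellings of the same word reach the tokenizer as different strings.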

The tokenizer was trained on 5% of the training set, with a vocabulary size of 64'000.

For more details about preprocessing, check the tokenizer code

Data

The model was trained on the concatenation of the Arabic Billion Words corpus and the Arabic subsets of the mC4 and Oscar datasets.

A random 0.1% subset of the data was reserved for evaluation and the rest for training.
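One simple way to draw such a holdout is sketched below. The function name, the seed, and the choice to shuffle indices are assumptions for illustration; this card does not describe how the split was actually implemented.

```python
import random


def split_indices(n_examples: int, eval_fraction: float = 0.001, seed: int = 0):
    """Return (train_indices, eval_indices) for a random holdout split."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_eval = max(1, round(n_examples * eval_fraction))
    return indices[n_eval:], indices[:n_eval]


train_idx, eval_idx = split_indices(100_000)
print(len(train_idx), len(eval_idx))
```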

Results

| Metric | Value |
| --- | --- |
| Evaluation accuracy | 56.84% |
| Evaluation loss | 2.423 |
| Training loss | 2.392 |
| Training time | 22h 23m 51s |
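These numbers can be put on a more familiar scale with a little arithmetic. The sketch below assumes the losses are mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`, and that the reported training time covers the 22'000 steps mentioned earlier; neither assumption is confirmed in this card. Under them, the perplexities come out at roughly 11.3 (evaluation) and 10.9 (training), at about 3.7 seconds per step.

```python
import math

eval_loss = 2.423    # from the results above
train_loss = 2.392   # from the results above

# ASSUMPTION: losses are mean per-token cross-entropy measured in nats.
eval_ppl = math.exp(eval_loss)
train_ppl = math.exp(train_loss)

# ASSUMPTION: the reported training time corresponds to the 22'000 steps.
steps = 22_000
train_seconds = 22 * 3600 + 23 * 60 + 51
sec_per_step = train_seconds / steps

print(f"eval perplexity  ~ {eval_ppl:.1f}")
print(f"train perplexity ~ {train_ppl:.1f}")
print(f"seconds per step ~ {sec_per_step:.2f}")
```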

Note for finetuning

This model was pretrained with dropout turned off, so the default dropout_rate in the model config is 0. To finetune the model, dropout should be turned back on, like this:

from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)

or,

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/arabic-t5-small", dropout_rate=0.1)