xlm-roberta-base-focus-arabic
XLM-R adapted to Arabic using "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models".
Code: https://github.com/konstantinjdobler/focus
Paper: https://arxiv.org/abs/2305.14481
Usage
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("konstantindobler/xlm-roberta-base-focus-arabic")
model = AutoModelForMaskedLM.from_pretrained("konstantindobler/xlm-roberta-base-focus-arabic")
# Use model and tokenizer as usual
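As a quick check that the loaded checkpoint works, the following is a minimal sketch of masked-token prediction with the `transformers` fill-mask pipeline. The `[MASK]` placeholder and the helper names `insert_mask` and `top_predictions` are illustrative assumptions and are not part of the FOCUS code base. The mask token is read from the tokenizer instead of being hard-coded.

```python
from typing import List, Tuple

PLACEHOLDER = "[MASK]"  # hypothetical placeholder, used only in this sketch


def insert_mask(template: str, mask_token: str) -> str:
    """Swap the single placeholder for the tokenizer's real mask token."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(PLACEHOLDER, mask_token)


def top_predictions(template: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the masked position."""
    # Imported lazily so insert_mask works without transformers installed.
    from transformers import pipeline

    fill = pipeline(
        "fill-mask",
        model="konstantindobler/xlm-roberta-base-focus-arabic",
    )
    text = insert_mask(template, fill.tokenizer.mask_token)
    # For a single mask, the pipeline returns a list of dicts that
    # include the predicted token string and its score.
    return [(r["token_str"], r["score"]) for r in fill(text, top_k=k)]
```

Calling `top_predictions("عاصمة فرنسا هي [MASK].")` downloads the model on first use and returns `(token, score)` pairs.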
Details
The model is based on xlm-roberta-base and was adapted to Arabic. The original multilingual tokenizer was replaced by a language-specific Arabic tokenizer with a vocabulary of 50k tokens. The new embeddings were initialized with FOCUS. The model was then trained on data from CC100 for 390k optimizer steps. More details and hyperparameters can be found in the paper.
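To see the effect of the tokenizer replacement described above, one option is to count how many tokens each tokenizer produces for the same Arabic text. This is only a sketch: the helper names `token_counts` and `load_tokenize_fns` are hypothetical, and it makes no claim about what the numbers will be.

```python
from typing import Callable, Dict, List

Tokenize = Callable[[str], List[str]]


def token_counts(text: str, tokenizers: Dict[str, Tokenize]) -> Dict[str, int]:
    """Count the tokens each tokenize function produces for the same text."""
    return {label: len(tokenize(text)) for label, tokenize in tokenizers.items()}


def load_tokenize_fns() -> Dict[str, Tokenize]:
    """Load the original and the adapted tokenizer from the Hugging Face Hub."""
    # Imported lazily so token_counts works without transformers installed.
    from transformers import AutoTokenizer

    names = {
        "original": "xlm-roberta-base",
        "arabic": "konstantindobler/xlm-roberta-base-focus-arabic",
    }
    return {
        label: AutoTokenizer.from_pretrained(name).tokenize
        for label, name in names.items()
    }
```

For example, `token_counts("مرحبا بالعالم", load_tokenize_fns())` returns one count per tokenizer. Calling `len()` on a loaded tokenizer reports its vocabulary size.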
Disclaimer
The web-scale dataset used for pretraining and tokenizer training (CC100) might contain personal and sensitive information, which the model may reproduce. This needs to be assessed carefully before any real-world deployment of the model.
The tokenizer was also trained with a sentencepiece character_coverage of 100%. As a result, the vocabulary contains characters that are not usually used in Arabic.
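A rough way to inspect this is to list the vocabulary entries that contain letters from other scripts. The sketch below assumes a simple heuristic: a letter counts as Arabic if its Unicode character name starts with "ARABIC". The helper names are hypothetical, and special tokens such as `<s>` will also be flagged because they contain Latin letters.

```python
import unicodedata
from typing import Iterable, List


def is_non_arabic_letter(ch: str) -> bool:
    """True for alphabetic characters whose Unicode name is not ARABIC ..."""
    return ch.isalpha() and not unicodedata.name(ch, "").startswith("ARABIC")


def tokens_with_non_arabic_letters(vocab: Iterable[str]) -> List[str]:
    """Return, sorted, every token containing at least one non-Arabic letter."""
    return sorted(
        token for token in vocab
        if any(is_non_arabic_letter(ch) for ch in token)
    )


def load_vocab() -> List[str]:
    """Fetch the token strings of the adapted tokenizer."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "konstantindobler/xlm-roberta-base-focus-arabic"
    )
    # get_vocab() maps token strings to ids; only the strings are needed.
    return list(tokenizer.get_vocab())
```

Running `tokens_with_non_arabic_letters(load_vocab())` lists the candidates. Digits, punctuation, and the sentencepiece word-boundary marker are ignored because they are not alphabetic.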
Citation
Please cite FOCUS as follows:
@misc{dobler-demelo-2023-focus,
title={FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models},
author={Konstantin Dobler and Gerard de Melo},
year={2023},
eprint={2305.14481},
archivePrefix={arXiv},
primaryClass={cs.CL}
}