---
language:
- ar
- en
- multilingual
license: mit
tags:
- bert
- roberta
- exbert
datasets:
- arabic_billion_words
- cc100
- gigaword
- oscar
- wikipedia
---
# An English-Arabic Bilingual Encoder

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
```
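
As a quick check, the encoder can also be used through the `fill-mask` pipeline. This is a minimal sketch; the example sentence is illustrative and not from the model card:

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of the bilingual encoder.
fill_mask = pipeline(
    "fill-mask",
    model="jhu-clsp/roberta-large-eng-ara-128k",
    tokenizer="jhu-clsp/roberta-large-eng-ara-128k",
)

# Use the tokenizer's own mask token; both English and Arabic inputs work.
masked = f"The capital of France is {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```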
`roberta-large-eng-ara-128k` is an English-Arabic bilingual encoder: a 24-layer Transformer (d\_model = 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. We additionally use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B Arabic words / 23.6B English words) or GigaBERT v4 (Lan et al., 2020) (4.3B Arabic words / 6.1B English words). We build an English-Arabic joint vocabulary of size 128K using SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.
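
For illustration, a joint vocabulary like this could be trained with the `sentencepiece` Python package roughly as follows. The corpus file name and the non-vocabulary-size options below are assumptions for the sketch, not the exact configuration used for this model:

```python
import sentencepiece as spm

# Train a joint English-Arabic SentencePiece model on mixed-language text.
# "eng_ara_corpus.txt" is an illustrative placeholder (one sentence per line).
spm.SentencePieceTrainer.train(
    input="eng_ara_corpus.txt",
    model_prefix="eng_ara_128k",
    vocab_size=128_000,          # 128K joint vocabulary, as in the model card
    model_type="unigram",
    character_coverage=0.9999,   # high coverage to retain (normalized) Arabic characters
)

# Load the trained model and segment a mixed English/Arabic sentence.
sp = spm.SentencePieceProcessor(model_file="eng_ara_128k.model")
print(sp.encode("هذا مثال and this is an example", out_type=str))
```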
## Pretraining Detail
We pretrain each encoder from scratch with a batch size of 2048 sequences and a sequence length of 512 for 250K steps, roughly 1/24 the pretraining compute of XLM-R. Training takes roughly three weeks on 8 RTX 6000 GPUs. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R: we omit the next sentence prediction task and use the Adam optimizer with a learning rate of 2e-4, a linear warmup of 10K steps followed by linear decay to 0, a multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
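
The multilingual sampling and learning-rate schedule above can be made concrete with a small sketch. The word counts come from the data description; the rest is a plain re-implementation of the stated settings, not the actual fairseq configuration:

```python
# Exponentially smoothed multilingual sampling (alpha = 0.3): each language is
# sampled with probability proportional to (its share of the corpus) ** alpha.
ALPHA = 0.3
words = {"en": 26.8e9, "ar": 9.2e9}  # word counts from the data description above

total = sum(words.values())
weights = {lang: (n / total) ** ALPHA for lang, n in words.items()}
probs = {lang: w / sum(weights.values()) for lang, w in weights.items()}
print(probs)  # Arabic is upsampled relative to its raw share

# Learning rate: linear warmup to 2e-4 over 10K steps, then linear decay to 0 at 250K steps.
PEAK_LR, WARMUP, TOTAL = 2e-4, 10_000, 250_000

def lr_at(step: int) -> float:
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    return PEAK_LR * max(0.0, (TOTAL - step) / (TOTAL - WARMUP))

for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(step, lr_at(step))
```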
## Citation

Please cite this paper for reference:

```bibtex
@inproceedings{yarmohammadi-etal-2021-everything,
    title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
    author = "Yarmohammadi, Mahsa and
      Wu, Shijie and
      Marone, Marc and
      Xu, Haoran and
      Ebner, Seth and
      Qin, Guanghui and
      Chen, Yunmo and
      Guo, Jialiang and
      Harman, Craig and
      Murray, Kenton and
      White, Aaron Steven and
      Dredze, Mark and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}
```