Shijie Wu committed · Commit e3d2acc · Parent(s): f7685c1
Create README.md

README.md (added, 49 lines):
---
language:
- ar
- en
tags:
- bert
- roberta
- exbert
license: mit
datasets:
- arabic_billion_words
- cc100
- gigaword
- oscar
- wikipedia
---

# An English-Arabic Bilingual Encoder

`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoder: a 24-layer Transformer with `d_model = 1024`, the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining, and additionally English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train on 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B Arabic / 23.6B English words) or GigaBERT v4 (Lan et al., 2020; 4.3B / 6.1B words). We build a joint English–Arabic vocabulary of 128K tokens with SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.
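
A minimal usage sketch with the Transformers library is shown below. The repository identifier is an assumption taken from the model name in this card; adjust it (for example, prepend the hosting organization's namespace) if the checkpoint is published under a different path.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed repository id, taken from the model name above; prepend the
# organization namespace if the checkpoint is hosted under one.
model_id = "roberta-large-eng-ara-128k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The encoder is pretrained with masked language modeling, so fill-mask
# works for both English and Arabic input.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Paris is the capital of {tokenizer.mask_token}."))
```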

## Pretraining Details

We pretrain the encoder from scratch with a batch size of 2048 sequences and a sequence length of 512 for 250K steps, roughly 1/24 of the pretraining compute of XLM-R. Training takes roughly three weeks on 8 RTX6000 GPUs. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R: we omit the next sentence prediction task and use a learning rate of 2e-4, the Adam optimizer, a linear warmup of 10K steps followed by linear decay to 0, a multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
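
For reference, the sketch below illustrates how a multilingual sampling alpha of 0.3 upweights the lower-resource language, using the Arabic/English word counts reported above. It is only an illustration of the exponentiated-sampling scheme popularized by XLM-R, not the fairseq implementation used for pretraining.

```python
# Minimal sketch of exponentiated multilingual sampling (alpha = 0.3).
# Word counts are the totals reported in this card (9.2B Arabic / 26.8B
# English words); illustrative only, not the fairseq implementation.
def sampling_probs(word_counts, alpha=0.3):
    total = sum(word_counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in word_counts.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}

print(sampling_probs({"ar": 9.2e9, "en": 26.8e9}))
# Arabic's share of sampled data rises from ~0.26 (raw) to ~0.42.
```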

## Citation

Please cite this paper for reference:

```bibtex
@inproceedings{yarmohammadi-etal-2021-everything,
    title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
    author = "Yarmohammadi, Mahsa and
      Wu, Shijie and
      Marone, Marc and
      Xu, Haoran and
      Ebner, Seth and
      Qin, Guanghui and
      Chen, Yunmo and
      Guo, Jialiang and
      Harman, Craig and
      Murray, Kenton and
      White, Aaron Steven and
      Dredze, Mark and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}
```