Shijie Wu committed · Commit e3d2acc · Parent(s): f7685c1
Create README.md

README.md (added, 49 lines):
---
language:
- ar
- en
tags:
- bert
- roberta
- exbert
license: mit
datasets:
- arabic_billion_words
- cc100
- gigaword
- oscar
- wikipedia
---

# An English-Arabic Bilingual Encoder

`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoder: a 24-layer Transformer with `d_model = 1024`, the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining, and additionally English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train on 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B Arabic / 23.6B English words) or GigaBERT v4 (Lan et al., 2020; 4.3B / 6.1B words). We build a joint English–Arabic vocabulary of 128K tokens with SentencePiece (Kudo and Richardson, 2018), and additionally enforce coverage of all Arabic characters after normalization.
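
A minimal usage sketch with the Transformers library is shown below. The repository identifier is an assumption taken from the model name in this card; adjust it (for example, prepend the hosting organization's namespace) if the checkpoint is published under a different path.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed repository id, taken from the model name above; prepend the
# organization namespace if the checkpoint is hosted under one.
model_id = "roberta-large-eng-ara-128k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The encoder is pretrained with masked language modeling, so fill-mask
# works for both English and Arabic input.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Paris is the capital of {tokenizer.mask_token}."))
```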

## Pretraining Details

We pretrain the encoder from scratch with a batch size of 2048 sequences and a sequence length of 512 for 250K steps, roughly 1/24 of the pretraining compute of XLM-R. Training takes roughly three weeks on 8 RTX6000 GPUs. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R: we omit the next sentence prediction task and use a learning rate of 2e-4, the Adam optimizer, a linear warmup of 10K steps followed by linear decay to 0, a multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
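
For reference, the sketch below illustrates how a multilingual sampling alpha of 0.3 upweights the lower-resource language, using the Arabic/English word counts reported above. It is only an illustration of the exponentiated-sampling scheme popularized by XLM-R, not the fairseq implementation used for pretraining.

```python
# Minimal sketch of exponentiated multilingual sampling (alpha = 0.3).
# Word counts are the totals reported in this card (9.2B Arabic / 26.8B
# English words); illustrative only, not the fairseq implementation.
def sampling_probs(word_counts, alpha=0.3):
    total = sum(word_counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in word_counts.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}

print(sampling_probs({"ar": 9.2e9, "en": 26.8e9}))
# Arabic's share of sampled data rises from ~0.26 (raw) to ~0.42.
```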

## Citation

Please cite this paper for reference:

```bibtex
@inproceedings{yarmohammadi-etal-2021-everything,
    title = "Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction",
    author = "Yarmohammadi, Mahsa and
      Wu, Shijie and
      Marone, Marc and
      Xu, Haoran and
      Ebner, Seth and
      Qin, Guanghui and
      Chen, Yunmo and
      Guo, Jialiang and
      Harman, Craig and
      Murray, Kenton and
      White, Aaron Steven and
      Dredze, Mark and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    year = "2021",
}
```