Shijie Wu committed on
Commit
25fb6b2
1 Parent(s): e3d2acc
Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -17,11 +17,11 @@ datasets:
17
 
18
  # An English-Arabic Bilingual Encoder
19
 
20
- `roberta-large-eng-ara-128k` is an English–Arabic bilingual encoders of 24-layer Transformers (`d_model= 1024`), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. Additionally, we also use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al.,2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words/23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words/6.1B words). We build an English–Arabic joint vocabularies using SentencePiece (Kudo and Richardson, 2018) with size of 128K. We additionally enforce coverage of all Arabic characters after normalization.
21
 
22
  ## Pretraining Detail
23
 
24
- We pretrain each encoder with a batch size of 2048 sequences and 512 sequence length for 250K steps from scratch roughly 1/24 the amount of pretraining compute of XLM-R. Training takes 8 RTX6000 GPUs roughly three weeks. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R. We omit the next sentence prediction task and use a learning rate of 2e-4, Adam optimizer, and linear warmup of 10K steps then decaylinearly to 0, multilingual sampling alpha of 0.3,and the fairseq (Ott et al., 2019) implementation.
25
 
26
  ## Citation
27
 
 
17
 
18
  # An English-Arabic Bilingual Encoder
19
 
20
+ `roberta-large-eng-ara-128k` is an English–Arabic bilingual encoders of 24-layer Transformers (d\_model= 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. Additionally, we also use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al.,2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words/23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words/6.1B words). We build an English–Arabic joint vocabulary using SentencePiece (Kudo and Richardson, 2018) with size of 128K. We additionally enforce coverage of all Arabic characters after normalization.
21
 
22
  ## Pretraining Detail
23
 
24
+ We pretrain each encoder with a batch size of 2048 sequences and 512 sequence length for 250K steps from scratch roughly 1/24 the amount of pretraining compute of XLM-R. Training takes 8 RTX6000 GPUs roughly three weeks. We follow the pretraining recipe of RoBERTa (Liu et al., 2019) and XLM-R. We omit the next sentence prediction task and use a learning rate of 2e-4, Adam optimizer, and linear warmup of 10K steps then decay linearly to 0, multilingual sampling alpha of 0.3, and the fairseq (Ott et al., 2019) implementation.
25
 
26
  ## Citation
27
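
The "Pretraining Detail" paragraph in this diff gives a peak learning rate of 2e-4, a linear warmup of 10K steps followed by linear decay to 0, 250K total steps, and a multilingual sampling alpha of 0.3. The following is a minimal plain-Python sketch of what those numbers imply, not the fairseq implementation the README refers to. The function names are hypothetical, and using the quoted word counts (9.2B Arabic, 26.8B English) as the corpus sizes for sampling is an assumption, since the README does not say which unit the sampling was computed over.

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=10_000, total_steps=250_000):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps.

    Hypothetical helper; defaults are the values quoted in the README.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


def sampling_probs(sizes, alpha=0.3):
    """Exponentially smoothed sampling: q_i proportional to (n_i / N) ** alpha.

    An alpha below 1 upsamples the smaller language relative to its raw share.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Assumption: corpus sizes in billions of words, as quoted in the README.
corpus_words = {"ara": 9.2, "eng": 26.8}
probs = sampling_probs(corpus_words)

for step in (0, 5_000, 10_000, 130_000, 250_000):
    print(step, learning_rate(step))
print(probs)
```

Under these assumptions, Arabic is sampled more often than its raw share of the corpus (about a quarter of the words), which is the intended effect of an alpha below 1.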