shijie-wu and lbourdois committed
Commit 71aa017
1 parent: 8557e84

Add multilingual to the language tag (#1)


- Add multilingual to the language tag (275a230a736c69cc9131e652f0e5ee7f07925125)


Co-authored-by: Loïck BOURDOIS <lbourdois@users.noreply.huggingface.co>

Files changed (1)
  1. README.md (+3 -2)
README.md CHANGED
@@ -2,11 +2,12 @@
 language:
 - ar
 - en
+- multilingual
+license: mit
 tags:
 - bert
 - roberta
 - exbert
-license: mit
 datasets:
 - arabic_billion_words
 - cc100
@@ -23,7 +24,7 @@ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
 model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
 ```
 
-`roberta-large-eng-ara-128k` is an English-Arabic bilingual encoder: a 24-layer Transformer (d\_model = 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. We additionally use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words / 23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words / 6.1B words). We build an English-Arabic joint vocabulary using SentencePiece (Kudo and Richardson, 2018) with a size of 128K, and we additionally enforce coverage of all Arabic characters after normalization.
+`roberta-large-eng-ara-128k` is an English-Arabic bilingual encoder: a 24-layer Transformer (d\_model = 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. We additionally use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words / 23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words / 6.1B words). We build an English-Arabic joint vocabulary using SentencePiece (Kudo and Richardson, 2018) with a size of 128K, and we additionally enforce coverage of all Arabic characters after normalization.
 
 ## Pretraining Detail
 
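
For context on the checkpoint this README describes, below is a minimal sketch of how the loading snippet shown in the diff might be extended to a fill-mask prediction. It is not part of the commit: only the repository id and the `AutoTokenizer`/`AutoModelForMaskedLM` calls come from the README above; the example sentence and the top-1 decoding step are illustrative assumptions.

```python
# Minimal sketch (not part of this commit): masked-token prediction with the
# checkpoint referenced in the README diff above. The example sentence and the
# top-1 decoding are illustrative assumptions, not part of the model card.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")

# The joint 128K English-Arabic SentencePiece vocabulary means the same call
# works for either English or Arabic input text.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring token for it.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```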