Add multilingual to the language tag (#1)
Add multilingual to the language tag (275a230a736c69cc9131e652f0e5ee7f07925125)
Co-authored-by: Loïck BOURDOIS <lbourdois@users.noreply.huggingface.co>
README.md CHANGED

@@ -2,11 +2,12 @@
 language:
 - ar
 - en
+- multilingual
+license: mit
 tags:
 - bert
 - roberta
 - exbert
-license: mit
 datasets:
 - arabic_billion_words
 - cc100
@@ -23,7 +24,7 @@ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
 model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/roberta-large-eng-ara-128k")
 ```
 
-`roberta-large-eng-ara-128k` is an English
+`roberta-large-eng-ara-128k` is an English–Arabic bilingual encoder: a 24-layer Transformer (d\_model = 1024), the same size as XLM-R large. We use the same Common Crawl corpus as XLM-R for pretraining. Additionally, we also use English and Arabic Wikipedia, Arabic Gigaword (Parker et al., 2011), Arabic OSCAR (Ortiz Suárez et al., 2020), Arabic News Corpus (El-Khair, 2016), and Arabic OSIAN (Zeroual et al., 2019). In total, we train with 9.2B words of Arabic text and 26.8B words of English text, more than either XLM-R (2.9B words / 23.6B words) or GigaBERT v4 (Lan et al., 2020) (4.3B words / 6.1B words). We build an English–Arabic joint vocabulary using SentencePiece (Kudo and Richardson, 2018) with a size of 128K. We additionally enforce coverage of all Arabic characters after normalization.
 
 ## Pretraining Detail
 
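For quick reference, below is a minimal usage sketch building on the README loading snippet shown in the diff above. It assumes the standard `transformers` fill-mask pipeline; the model ID comes from the snippet, while the prompts and printed fields are illustrative assumptions, not taken from the model card.

```python
# Minimal sketch (assumption: standard `transformers` fill-mask pipeline).
# The model ID is the one from the README snippet; the prompts below are
# illustrative and not part of the model card.
from transformers import pipeline

model_id = "jhu-clsp/roberta-large-eng-ara-128k"
fill = pipeline("fill-mask", model=model_id, tokenizer=model_id)

# English and Arabic text share the same 128K joint SentencePiece vocabulary,
# so the same mask token works for both languages.
mask = fill.tokenizer.mask_token
prompts = [
    f"The capital of France is {mask}.",  # English prompt
    f"عاصمة فرنسا هي {mask}.",            # Arabic prompt ("The capital of France is ...")
]

for prompt in prompts:
    print(prompt)
    for pred in fill(prompt, top_k=3):  # top three candidate fills
        print(f"  {pred['token_str']!r}  score={pred['score']:.3f}")
```

Loading the checkpoint directly with AutoTokenizer and AutoModelForMaskedLM, as in the README, is equivalent; the pipeline simply wraps the masked-LM head and the decoding of the top candidates.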