wissamantoun committed
Commit: 85bafc6 • Parent(s): 73ec04c
Update README.md
README.md CHANGED
@@ -14,12 +14,12 @@ widget:
 
 <img src="https://raw.githubusercontent.com/aub-mind/arabert/master/arabert_logo.png" width="100" align="left"/>
 
-**AraBERT** is an Arabic pretrained
+**AraBERT** is an Arabic pretrained language model based on [Google's BERT architecture](https://github.com/google-research/bert). AraBERT uses the same BERT-Base config. More details are available in the [AraBERT Paper](https://arxiv.org/abs/2003.00104) and in the [AraBERT Meetup](https://github.com/WissamAntoun/pydata_khobar_meetup).
 
-There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were
+There are two versions of the model, AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses pre-segmented text where prefixes and suffixes were split using the [Farasa Segmenter](http://alt.qcri.org/farasa/segmenter.html).
 
 
-We
+We evaluate AraBERT models on different downstream tasks and compare them to [mBERT](https://github.com/google-research/bert/blob/master/multilingual.md) and other state-of-the-art models (*to the extent of our knowledge*). The tasks were Sentiment Analysis on 6 different datasets ([HARD](https://github.com/elnagara/HARD-Arabic-Dataset), [ASTD-Balanced](https://www.aclweb.org/anthology/D15-1299), [ArsenTD-Lev](https://staff.aub.edu.lb/~we07/Publications/ArSentD-LEV_Sentiment_Corpus.pdf), [LABR](https://github.com/mohamedadaly/LABR)), Named Entity Recognition with the [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp), and Arabic Question Answering on [Arabic-SQuAD and ARCD](https://github.com/husseinmozannar/SOQAL).
 
 # AraBERTv2
 
@@ -46,9 +46,9 @@ All models are available in the `HuggingFace` model page under the [aubmindlab](
 
 We identified an issue with AraBERTv1's wordpiece vocabulary. The issue came from punctuation and numbers that were still attached to words when the wordpiece vocab was learned. We now insert a space between numbers and characters and around punctuation characters.
 
-The new vocabulary was
+The new vocabulary was learned using the `BertWordpieceTokenizer` from the `tokenizers` library, and should now support the Fast tokenizer implementation from the `transformers` library.
 
-**P.S.**: All the old BERT codes should work with the new BERT, just change the model name and check the new preprocessing
+**P.S.**: All the old BERT code should work with the new BERT; just change the model name and check the new preprocessing function.
 **Please read the section on how to use the [preprocessing function](#Preprocessing)**
 
 ## Bigger Dataset and More Compute
@@ -125,7 +125,7 @@ Google Scholar has our Bibtex wrong (missing name), use this instead
 }
 ```
 # Acknowledgments
-Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs, couldn't have done it without this program, and to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) Members for the
+Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the [AUB MIND Lab](https://sites.aub.edu.lb/mindlab/) members for the continuous support, to [Yakshof](https://www.yakshof.com/#/) and Assafir for data and storage access, and to [Habib Rahal](https://www.behance.net/rahalhabib) for putting a face to AraBERT.
 
 # Contacts
 **Wissam Antoun**: [Linkedin](https://www.linkedin.com/in/wissam-antoun-622142b4/) | [Twitter](https://twitter.com/wissam_antoun) | [Github](https://github.com/WissamAntoun) | <wfa07@mail.aub.edu> | <wissam.antoun@gmail.com>
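
As a point of reference for the expanded intro line (AraBERT follows the BERT-Base config), a minimal loading sketch with the `transformers` library could look like the following; the checkpoint id is an assumption here and should be taken from the aubmindlab model page.

```python
# Minimal sketch (not from the README): load an AraBERT checkpoint with transformers.
# The checkpoint id below is an assumption; pick the exact model name from the Hub.
from transformers import AutoModel, AutoTokenizer

model_name = "aubmindlab/bert-base-arabertv2"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("مرحبا بالعالم", return_tensors="pt")  # requires PyTorch
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # BERT-Base config -> hidden size 768
```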
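The note that the new vocabulary was learned with `BertWordpieceTokenizer` and should support the Fast tokenizer implementation can be exercised with a short sketch like this one, again assuming the same checkpoint id:

```python
# Sketch of the fast-tokenizer support mentioned in the updated README. With the v2
# vocabulary, numbers and punctuation are expected to come out separated from words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2", use_fast=True)
print(tokenizer.is_fast)  # True when the Rust-backed (fast) implementation loads

print(tokenizer.tokenize("السعر 100 دولار!"))
```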
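The P.S. defers to a dedicated preprocessing function documented in the README's Preprocessing section, which is not part of this diff. As a hedged sketch only, the `ArabertPreprocessor` class from the `arabert` package (an assumption, not confirmed by this excerpt) would be applied to raw text before tokenization:

```python
# Hypothetical usage of the dedicated preprocessing step referenced in the P.S.
# ArabertPreprocessor and its import path are assumptions based on the arabert
# repository; check the README's Preprocessing section for the exact call.
from arabert.preprocess import ArabertPreprocessor

model_name = "aubmindlab/bert-base-arabertv2"  # assumed checkpoint id
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب أساسي في زمننا هذا"
print(arabert_prep.preprocess(text))  # cleaned/segmented text ready for the tokenizer
```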