Ahmed Abdelali committed
Commit 38c6922 • 1 Parent(s): 17270fd

push farasa base model

Browse files:
- .gitattributes +2 -0
- README.md +83 -0
- config.json +19 -0
- model.ckpt.data-00000-of-00001 +3 -0
- model.ckpt.index +0 -0
- model.ckpt.meta +0 -0
- pytorch_model.bin +3 -0
- vocab.txt +0 -0
.gitattributes
CHANGED
@@ -14,3 +14,5 @@
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+model.ckpt.data-00000-of-00001 filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,83 @@
---
language: ar
tags:
- pytorch
- tf
- QARiB
- qarib
datasets:
- arabic_billion_words
- open_subtitles
- twitter
- Farasa
metrics:
- f1
widget:
- text: "و+قام ال+مدير [MASK]"
---
# QARiB: QCRI Arabic and Dialectal BERT

## About QARiB Farasa

The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected through the Twitter API using the language filter `lang:ar`. The text data is a combination of
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
QARiB is the Arabic word for "boat".
## Model and Parameters:

- Data size: 14B tokens
- Vocabulary: 64k
- Iterations: 10M
- Number of Layers: 12
## Training QARiB

See details in [Training QARiB](https://github.com/qcri/QARIB/Training_QARiB.md).
## Using QARiB

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](https://github.com/qcri/QARIB/Using_QARiB.md).

This model expects the data to be segmented. You may use the [Farasa Segmenter](https://farasa-api.qcri.org/segmentation/) API; a sketch of calling it is shown below.
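As a minimal sketch (not part of the model card), the segmentation endpoint linked above might be called as follows; the payload and response fields here are assumptions, so check the Farasa API documentation for the exact format and any API-key requirement:

```python
# Hedged sketch: POST raw Arabic text to the Farasa segmentation endpoint
# referenced in the model card. The "text" payload/response fields are
# assumptions, not a documented contract.
import requests

def farasa_segment(text: str) -> str:
    resp = requests.post(
        "https://farasa-api.qcri.org/segmentation/",  # URL from the card
        json={"text": text},                          # assumed payload field
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")                # assumed response field

# Expected segmented form: "و+قام ال+مدير ب+ال+عمل"
print(farasa_segment("وقام المدير بالعمل"))
```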
### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib_far")
>>> fill_mask("و+قام ال+مدير [MASK]")
[
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("و+قام+ت ال+مدير+ة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
 {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
 {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
 {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
 {'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
 {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
```
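Since the card notes the model is mostly intended to be fine-tuned, here is a minimal sketch (not from the model card) of attaching a classification head; the two-label task, the label, and the example text are hypothetical, and inputs should be Farasa-segmented as noted above:

```python
# Hypothetical fine-tuning sketch: QARiB with a fresh classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib_far")
model = AutoModelForSequenceClassification.from_pretrained(
    "qarib/bert-base-qarib_far", num_labels=2  # assumed binary task
)

texts = ["و+قام ال+مدير ب+ال+عمل"]  # Farasa-segmented input
labels = torch.tensor([1])          # hypothetical label

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```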
## Evaluations:

|**Experiment**|**mBERT**|**AraBERT0.1**|**AraBERT1.0**|**ArabicBERT**|**QARiB**|
|---------------|---------|--------------|--------------|--------------|---------|
|Dialect Identification|6.06%|59.92%|59.85%|61.70%|**65.21%**|
|Emotion Detection|27.90%|43.89%|42.37%|41.65%|**44.35%**|
|Named-Entity Recognition (NER)|49.38%|64.97%|**66.63%**|64.04%|61.62%|
|Offensive Language Detection|83.14%|88.07%|88.97%|88.19%|**91.94%**|
|Sentiment Analysis|86.61%|90.80%|**93.58%**|83.27%|93.31%|
## Model Weights and Vocab Download

From the Hugging Face hub: https://huggingface.co/qarib/bert-base-qarib_far

## Contacts

Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
## Reference

```
@article{abdelali2021pretraining,
  title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
  author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
  year={2021},
  eprint={2102.10684},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
config.json
ADDED
@@ -0,0 +1,19 @@
{
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 64000
}
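This is a standard BERT-base configuration (12 layers, 768 hidden units, 64k vocabulary). As a minimal sketch (not part of the commit), it can be instantiated through `transformers`; note the file predates the `model_type` key, so loading it through a BERT-specific class is the safe route:

```python
# Sketch only: build the architecture described by config.json.
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("qarib/bert-base-qarib_far")
model = BertModel(config)  # randomly initialized; use from_pretrained for weights
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)  # 12 768 64000
```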
model.ckpt.data-00000-of-00001
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:695426cff626845b9ed3f5eaf78afad0f82484655259d29120c4bd161eb44dc3
size 1630212128
model.ckpt.index
ADDED
Binary file (9.37 kB)
model.ckpt.meta
ADDED
Binary file (4.71 MB)
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2dbeb83de5300a0fc9f903e04ae5676bb8aa0dd9cccbb1fa424cc8f0dd646abd
size 543488365
vocab.txt
ADDED
The diff for this file is too large to render.