Update readme
README.md
CHANGED
@@ -1,6 +1,7 @@
# Arabic Dialect Classifier
This project classifies Arabic dialects at the country level:
given some Arabic text, the goal is to predict the country of the text's dialect.
You can call the "/classify" endpoint with a POST request whose JSON body has the form '{"text": "Your Arabic text"}':
```
curl -X POST -H "Content-Type: application/json" -d '{"text": "Your Arabic text"}' http://localhost:8080/classify
@@ -29,4 +30,20 @@ The response should be a json of the form:
{
  "class": "country_name"
}
-
```
# Arabic Dialect Classifier
This project classifies Arabic dialects at the country level:
given some Arabic text, the goal is to predict the country of the text's dialect.
+
You can call the "/classify" endpoint with a POST request whose JSON body has the form '{"text": "Your Arabic text"}':
```
curl -X POST -H "Content-Type: application/json" -d '{"text": "Your Arabic text"}' http://localhost:8080/classify

{
  "class": "country_name"
}
+
```
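As a rough Python counterpart to the curl call above, the sketch below builds the same POST request and reads the `class` field from the reply. It assumes the service is reachable at `http://localhost:8080`, as in the curl example. The helper names (`build_classify_request`, `parse_classify_response`, `classify`) are hypothetical and not part of the project.

```python
import json
import urllib.request


def build_classify_request(text, base_url="http://localhost:8080"):
    """Build the POST request for /classify (base_url is an assumption)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_classify_response(body):
    """Return the country name from a body of the form {"class": "..."}."""
    return json.loads(body)["class"]


def classify(text, base_url="http://localhost:8080"):
    """Send the text to a running service and return the predicted country."""
    request = build_classify_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_classify_response(response.read().decode("utf-8"))
```

With the service running locally, `classify("Your Arabic text")` should return the country name that the response carries under `"class"`.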
+
+ ## How I built this project
+ The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).
+ It is a corpus of tweets collected through Twitter's API and labeled with a country and region based on each user's location.
+
+ I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. These embeddings are the input to a multinomial logistic regression, which classifies the text into one of 21 dialects (countries).
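The classification step described above (embeddings fed to a multinomial logistic regression) can be illustrated with a small, dependency-free sketch. This is a toy under stated assumptions, not the project's code. The three labels stand in for the 21 countries, and the 2-dimensional token vectors stand in for AraBART's last-hidden-layer output. Mean pooling over tokens is assumed as the way to get one vector per text, and the classifier is trained with plain stochastic gradient descent.

```python
import math

# Hypothetical subset of labels; the project distinguishes 21 countries.
LABELS = ["Egypt", "Morocco", "Iraq"]


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size text vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, features):
    """Multinomial logistic regression: one linear score per class, then softmax."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def train(samples, targets, n_classes, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on the cross-entropy loss."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. score k: p_k - 1[k == y]
                grad = probs[k] - (1.0 if k == y else 0.0)
                for i in range(dim):
                    weights[k][i] -= lr * grad * x[i]
                biases[k] -= lr * grad
    return weights, biases


def predict(weights, biases, features):
    """Return the label with the highest predicted probability."""
    probs = predict_proba(weights, biases, features)
    return LABELS[probs.index(max(probs))]


# Made-up token vectors for three texts, one per label (stand-ins for model output).
TOY_TOKENS = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.2, 0.8]],
    [[-1.0, -1.0], [-0.8, -1.2]],
]
TOY_TARGETS = [0, 1, 2]

features = [mean_pool(tokens) for tokens in TOY_TOKENS]
weights, biases = train(features, TOY_TARGETS, n_classes=len(LABELS))
```

In a real setup the feature vectors would come from the language model, and a library implementation of logistic regression would likely replace the hand-written training loop.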
+
+ For more detail, please refer to the docs directory.
+
+ ## References
+ - <a name="cite-mageed-2021"></a>
+   [Abdul-Mageed et al., 2021](https://arxiv.org/abs/2103.08466)
+   *Title:* NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
+   *Authors:* Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar
+   *Year:* 2021
+   *Conference/Book Title:* Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)