zaidmehdi committed on
Commit
b230754
1 Parent(s): 7584983

Update readme

Files changed (1)
  1. README.md +18 -1
README.md CHANGED
@@ -1,6 +1,7 @@
1
  # Arabic Dialect Classifier
2
  This project is a classifier of arabic dialects at a country level:
3
  Given some arabic text, the goal is to predict the country of the text's dialect.
 
4
  You can use the "/classify" endpoint through a POST request with a json input of the form: '{"text": "Your arabic text"}'
5
  ```
6
  curl -X POST -H "Content-Type: application/json" -d '{"text": "Your Arabic text"}' http://localhost:8080/classify
@@ -29,4 +30,20 @@ The response should be a json of the form:
29
  {
30
  "class": "country_name"
31
  }
32
- ```
1
  # Arabic Dialect Classifier
2
  This project is a classifier of arabic dialects at a country level:
3
  Given some arabic text, the goal is to predict the country of the text's dialect.
4
+
5
  You can use the "/classify" endpoint through a POST request with a json input of the form: '{"text": "Your arabic text"}'
6
  ```
7
  curl -X POST -H "Content-Type: application/json" -d '{"text": "Your Arabic text"}' http://localhost:8080/classify
 
30
  {
31
  "class": "country_name"
32
  }
33
+ ```
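For readers who prefer Python to curl, the same request can be sketched with the standard library. This is a minimal sketch and not part of the commit: the helper names `build_classify_request` and `classify` are hypothetical, and the base URL is simply the `localhost:8080` address from the curl example.

```python
import json
import urllib.request


def build_classify_request(text, base_url="http://localhost:8080"):
    """Build the POST request for the /classify endpoint (hypothetical helper)."""
    # The service expects a JSON body of the form {"text": "..."}.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, base_url="http://localhost:8080", timeout=10):
    """Send the text to a running service and return the predicted country."""
    request = build_classify_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The README documents the response as {"class": "country_name"}.
    return body["class"]


# Usage, assuming the service is running locally:
#   print(classify("Your Arabic text"))
```

Splitting request construction from sending keeps the payload format easy to inspect without a running server.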
34
+
35
+ ## How I built this project
36
+ The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).
37
+ It is a corpus of tweets collected through Twitter's API and labeled with country and region based on each user's location.
38
+
39
+ I used the language model [AraBART](https://huggingface.co/moussaKam/AraBART) to extract features from the input text by taking the output of its last hidden layer. These embeddings are the input to a multinomial logistic regression that classifies the text into one of the 21 dialects (countries).
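The two stages described above can be illustrated with a toy, pure-Python sketch. It is not the project's code: the small hand-written vectors stand in for AraBART's last hidden layer, the weights and the two class labels are made up, and mean pooling over tokens is an assumption, since the README does not say how the per-token outputs are combined into one feature vector.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size text vector (assumed pooling)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def softmax(scores):
    """Turn raw class scores into probabilities, shifted for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, weights, biases, labels):
    """Multinomial logistic regression: one linear score per class, then softmax."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities


# Toy stand-in for the last hidden layer: 3 tokens, 2 dimensions each.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
features = mean_pool(hidden_states)  # [2.0, 2.0]

# Made-up parameters for two placeholder classes (the real classifier has 21).
labels = ["country_a", "country_b"]
weights = [[1.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]

label, probabilities = predict(features, weights, biases, labels)
print(label, probabilities)
```

In the real pipeline the feature vector would come from the language model and the weights from training on the NADI 2021 data; only the shape of the computation is the same.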
40
+
41
+ For more detail, please refer to the docs directory.
42
+
43
+ ## References
44
+ - <a name="cite-mageed-2021"></a>
45
+ [Abdul-Mageed et al., 2021](https://arxiv.org/abs/2103.08466)
46
+ *Title:* NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
47
+ *Authors:* Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar
48
+ *Year:* 2021
49
+ *Conference/Book Title:* Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)