# Arabic Dialect Classifier
This project classifies Arabic dialects at the country level:  
given some Arabic text, the goal is to predict the country whose dialect the text is written in.  
  
![Demo App](docs/images/gradio_app.png "Demo App")
## Run the app locally with Docker:
1. Clone the repository with Git:  
```
git clone https://github.com/zaidmehdi/arabic-dialect-classifier.git
```
2. Build the Docker image:  
```
sudo docker build -t adc .
```
3. Run the Docker container:  
```
sudo docker run -p 8080:8080 adc
```
  
Now you can access the demo locally at:
```
http://localhost:8080
```

## How I built this project:
The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).  
It is a corpus of tweets collected through Twitter's API and labeled with a country and region based on each user's location.  

In the current version, I use the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. These vector embeddings are the input to a multinomial logistic regression, which classifies the text into one of the 21 dialects (countries).
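
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: it assumes the per-token vectors from the last hidden layer are reduced to one embedding by mean pooling (the pooling method is not specified here), and the token vectors, weights, and the three-country label subset are hypothetical toy values.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def predict_proba(embedding, weights, biases):
    """Multinomial logistic regression: one linear score per class, then softmax."""
    scores = [
        sum(w * x for w, x in zip(row, embedding)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical toy values: 3 of the 21 countries and 2-dimensional "hidden states".
LABELS = ["Egypt", "Morocco", "Iraq"]
token_vectors = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]  # last row stands for padding
attention_mask = [1, 1, 0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row of weights per class
biases = [0.0, 0.0, 0.0]

embedding = mean_pool(token_vectors, attention_mask)
probabilities = predict_proba(embedding, weights, biases)
predicted = LABELS[probabilities.index(max(probabilities))]
print(predicted, [round(p, 3) for p in probabilities])
```

In a real pipeline the token vectors would come from the language model and the weights would be learned from the training data, for example with scikit-learn's `LogisticRegression`, rather than written by hand.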

For more details, see the `docs` directory.

## Releases
### v0.0.1
In the first release, I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. These vector embeddings were the input to a multinomial logistic regression, which classified the text into one of the 21 dialects (countries).
### v0.0.2

## References:
- <a name="cite-mageed-2021"></a>
[Abdul-Mageed et al., 2021](https://arxiv.org/abs/2103.08466)  
*Title:* NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task  
*Authors:* Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar  
*Year:* 2021  
*Conference/Book Title:* Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)