Arabic Dialect Classifier

This project classifies Arabic dialects at the country level:
given some Arabic text, the goal is to predict the country whose dialect the text is written in.

Demo App

Run the app locally with Docker:

  1. Clone the repository with Git:
     git clone https://github.com/zaidmehdi/arabic-dialect-classifier.git
  2. Build the Docker image:
     sudo docker build -t adc .
  3. Run the Docker container:
     sudo docker run -p 8080:8080 adc

Now you can access the demo locally at:

http://localhost:8080

How I built this project:

The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification (Abdul-Mageed et al., 2021).
It is a corpus of tweets collected through Twitter's API and labeled with a country and region based on the users' locations.

In the current version, I used the language model AraBART (https://huggingface.co/moussaKam/AraBART) to extract features from the input text by taking the output of its last hidden layer. These vector embeddings are the input to a multinomial logistic regression, which classifies the text into one of the 21 dialects (countries).
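The classification step can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: random toy vectors stand in for AraBART's last-hidden-layer outputs, mean pooling over tokens is an assumption (the README does not say how the hidden states are reduced to one vector), only 3 classes are used instead of 21, and every function name below is hypothetical. The real project may well rely on a library implementation of logistic regression.

```python
import math
import random


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size embedding (assumed pooling)."""
    dim = len(token_vectors[0])
    return [sum(tok[i] for tok in token_vectors) / len(token_vectors) for i in range(dim)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Class probabilities for a single embedding x."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)


def train(embeddings, labels, n_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression with plain stochastic gradient descent."""
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class c
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, biases, x)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy stand-in for the language model: each "dialect" has its own center,
# and each text is a few noisy token vectors around that center.
random.seed(0)
CENTERS = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]


def fake_hidden_states(label, n_tokens=3):
    return [[c + random.gauss(0, 0.1) for c in CENTERS[label]] for _ in range(n_tokens)]


train_labels = [i % 3 for i in range(30)]
train_embeddings = [mean_pool(fake_hidden_states(y)) for y in train_labels]
weights, biases = train(train_embeddings, train_labels, n_classes=3)
```

With real data, `fake_hidden_states` would be replaced by the hidden states the language model produces for each tweet, and `n_classes` would be 21.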

For more details, you can refer to the docs directory.

Releases

v0.0.1

In the first release, I used the language model AraBART (https://huggingface.co/moussaKam/AraBART) to extract features from the input text by taking the output of its last hidden layer. These vector embeddings are the input to a multinomial logistic regression, which classifies the text into one of the 21 dialects (countries).

v0.0.2

References:

  • Abdul-Mageed et al., 2021
    Title: NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
    Authors: Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar
    Year: 2021
    Conference/Book Title: Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)