Spaces:

zaidmehdi
/

arabic-dialect-classifier

Sleeping

App Files Files Community

arabic-dialect-classifier / README.md

zaidmehdi

Update README.md

916fa06 unverified 9 months ago

preview code

raw

history blame

2.07 kB

	# Arabic Dialect Classifier
	This project is a classifier of arabic dialects at a country level:
	Given some arabic text, the goal is to predict the country of the text's dialect.

	You can use the "/classify" endpoint through a POST request with a json input of the form: '{"text": "Your arabic text"}'
	```
	curl -X POST -H "Content-Type: application/json" -d '{"text": "Your arabic text"}' http://localhost:8080/classify
	```

	## Run the app locally with Docker:
	1. Clone the repository with Git:
	```
	git clone https://github.com/zaidmehdi/arabic-dialect-classifier.git
	```
	2. Build the Docker image:
	```
	docker build -t adc .
	```
	3. Run the Docker Container:
	```
	docker run -p 8080:80 adc
	```

	Now you can try sending a POST request:
	```
	curl -X POST -H "Content-Type: application/json" -d '{"text": "Your Arabic text"}' http://localhost:8080/classify
	```
	The response should be a json with the following fields:
	```
	{
	"class": "country_name"
	}
	```

	## How I built this project:
	The data used to train the classifier comes from the NADI 2021 dataset for Arabic Dialect Identification [(Abdul-Mageed et al., 2021)](#cite-mageed-2021).
	It is a corpus of tweets collected using Twitter's API and labeled thanks to the users' locations with the country and region.

	I used the language model `https://huggingface.co/moussaKam/AraBART` to extract features from the input text by taking the output of its last hidden layer. I used these vector embeddings as the input for a Multinomial Logistic Regression to classify the input text into one of the 21 dialects (Countries).

	For more details, you can refer to the docs directory.

	## References:
	- <a name="cite-mageed-2021"></a>
	[Abdul-Mageed et al., 2021](https://arxiv.org/abs/2103.08466)
	Title: NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
	Authors: Abdul-Mageed, Muhammad; Zhang, Chiyu; Elmadany, AbdelRahim; Bouamor, Houda; Habash, Nizar
	Year: 2021
	Conference/Book Title: Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP 2021)