Papers
arxiv:2103.09687

Moroccan Dialect -Darija- Open Dataset

Published on Feb 28, 2021
Authors:
,

Abstract

Darija Open Dataset (DODa) is an open-source project for the Moroccan dialect. With more than 10,000 entries DODa is arguably the largest open-source collaborative project for Darija-English translation built for Natural Language Processing purposes. In fact, besides semantic categorization, DODa also adopts a syntactic one, presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugation of hundreds of verbs in different tenses, and many other subsets to help researchers better understand and study Moroccan dialect. This data paper presents a description of DODa, its features, how it was collected, as well as a first application in Image Classification using ImageNet labels translated to Darija. This collaborative project is hosted on GitHub platform under MIT's Open-Source license and aims to be a standard resource for researchers, students, and anyone who is interested in Moroccan Dialect

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2103.09687 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2103.09687 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.