arxiv:2103.09687

Moroccan Dialect -Darija- Open Dataset

Published on Feb 28, 2021

Authors:

Abstract

Darija Open Dataset (DODa) is an open-source project for the Moroccan dialect. With more than 10,000 entries, DODa is arguably the largest open-source collaborative project for Darija-English translation built for Natural Language Processing purposes. Besides semantic categorization, DODa also adopts a syntactic one, presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugations of hundreds of verbs in different tenses, and includes many other subsets to help researchers better understand and study the Moroccan dialect. This data paper describes DODa, its features, and how it was collected, and presents a first application in image classification using ImageNet labels translated to Darija. This collaborative project is hosted on GitHub under the MIT open-source license and aims to be a standard resource for researchers, students, and anyone interested in the Moroccan dialect.

