language:
- bn
- gu
- hi
- mr
- ne
- or
- pa
- sa
- ur
library_name: transformers
pipeline_tag: fill-mask
IA-Original
IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on monolingual corpora of these languages and subsequently evaluated on a diverse set of tasks.
The 11 languages covered by IA-Original are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.
The code can be found here. For more information, check out our paper.
Pretraining Corpus
We pre-trained IA-Original on publicly available monolingual corpora. The combined corpus has the following distribution across languages:
Language | # Sentences | # Tokens (Total) | # Tokens (Unique) |
---|---|---|---|
Hindi (hi) | 1552.89 | 20,098.73 | 25.01 |
Bengali (bn) | 353.44 | 4,021.30 | 6.5 |
Sanskrit (sa) | 165.35 | 1,381.04 | 11.13 |
Urdu (ur) | 153.27 | 2,465.48 | 4.61 |
Marathi (mr) | 132.93 | 1,752.43 | 4.92 |
Gujarati (gu) | 131.22 | 1,565.08 | 4.73 |
Nepali (ne) | 84.21 | 1,139.54 | 3.43 |
Punjabi (pa) | 68.02 | 945.68 | 2.00 |
Oriya (or) | 17.88 | 274.99 | 1.10 |
Bhojpuri (bh) | 10.25 | 134.37 | 1.13 |
Magahi (mag) | 0.36 | 3.47 | 0.15 |
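To make the imbalance in the table easier to see, the sentence counts can be converted into per-language shares. The snippet below is only an illustrative sketch: the numbers are copied from the table as given (the card does not state their unit, and the shares do not depend on it), and the variable names are our own.

```python
# Sentence counts copied from the table above (unit as given there).
sentences = {
    "hi": 1552.89, "bn": 353.44, "sa": 165.35, "ur": 153.27,
    "mr": 132.93, "gu": 131.22, "ne": 84.21, "pa": 68.02,
    "or": 17.88, "bh": 10.25, "mag": 0.36,
}

# Total across all 11 languages, then each language's fraction of it.
total = sum(sentences.values())
shares = {lang: count / total for lang, count in sentences.items()}

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {share:.2%}")
```

By this calculation Hindi accounts for roughly 58% of all sentences, while Magahi contributes well under 0.1%.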
Evaluation Results
IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the paper.
Downloads
You can also download the model from Hugging Face.
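Once downloaded, the model can be queried through the transformers fill-mask pipeline, which is the library and pipeline tag listed in this card's metadata. The following is a minimal sketch, not an official usage recipe: the helper names `load_fill_mask` and `top_predictions` and the `[MASK]` placeholder convention are assumptions made for illustration, and the model identifier must be taken from the Hugging Face page because it is not given here.

```python
def load_fill_mask(model_id):
    """Build a fill-mask pipeline for the given Hub repository name.

    transformers is imported lazily so the helpers below can be used
    (or tested) without the library being loaded up front.
    """
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, template, k=3):
    """Return the k best (token, score) pairs for the masked position.

    `template` marks the blank with the placeholder "[MASK]" (our own
    convention); it is swapped for the tokenizer's real mask token so
    the caller does not need to know which token the model expects.
    """
    text = template.replace("[MASK]", fill_mask.tokenizer.mask_token)
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `top_predictions(load_fill_mask("<model-repo-id>"), "मैं [MASK] जा रहा हूँ।")` would return the three most likely fillers for the blank in the Hindi sentence, where `<model-repo-id>` is a placeholder for the actual repository name.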
Citing
If you use any of these resources, please cite the following paper:
@inproceedings{dhamecha-etal-2021-role,
title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
author = "Dhamecha, Tejas and
Murthy, Rudra and
Bharadwaj, Samarth and
Sankaranarayanan, Karthik and
Bhattacharyya, Pushpak",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.675",
doi = "10.18653/v1/2021.emnlp-main.675",
pages = "8584--8595",
}
Contributors
- Tejas Dhamecha
- Rudra Murthy
- Samarth Bharadwaj
- Karthik Sankaranarayanan
- Pushpak Bhattacharyya
Contact
- Rudra Murthy (rmurthyv@in.ibm.com)