---
language:
- bn
- gu
- hi
- mr
- ne
- or
- pa
- sa
- ur
library_name: transformers
pipeline_tag: fill-mask
---
# IA-Original
IA-Original is a multilingual RoBERTa model pre-trained exclusively on the monolingual corpora of 11 Indian languages from the Indo-Aryan language family, and subsequently evaluated on a diverse set of downstream tasks.
The 11 languages covered by IA-Original are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.
The code can be found [here](https://github.com/IBM/NL-FM-Toolkit). For more information, check out our [paper](https://aclanthology.org/2021.emnlp-main.675/).
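Since IA-Original is a standard RoBERTa checkpoint, it can be loaded with the usual `transformers` auto classes. A minimal sketch (the model id comes from the download link below; the Hindi sentence is illustrative, and you would swap `AutoModelForMaskedLM` for a task-specific head when fine-tuning):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Model id taken from the Hugging Face link in the Downloads section.
model_name = "ibm/ia-multilingual-original-script-roberta"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode an example Hindi sentence and run a forward pass.
inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```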
## Pretraining Corpus
We pre-trained IA-Original on publicly available monolingual corpora with the following distribution across languages:
| **Language** | **# Sentences** | **# Tokens (Total)** | **# Tokens (Unique)** |
| :------------ | --------------: | -------------------: | --------------------: |
| Hindi (hi) | 1,552.89 | 20,098.73 | 25.01 |
| Bengali (bn) | 353.44 | 4,021.30 | 6.50 |
| Sanskrit (sa) | 165.35 | 1,381.04 | 11.13 |
| Urdu (ur) | 153.27 | 2,465.48 | 4.61 |
| Marathi (mr) | 132.93 | 1,752.43 | 4.92 |
| Gujarati (gu) | 131.22 | 1,565.08 | 4.73 |
| Nepali (ne) | 84.21 | 1,139.54 | 3.43 |
| Punjabi (pa) | 68.02 | 945.68 | 2.00 |
| Oriya (or) | 17.88 | 274.99 | 1.10 |
| Bhojpuri (bh) | 10.25 | 134.37 | 1.13 |
| Magahi (mag) | 0.36 | 3.47 | 0.15 |
## Evaluation Results
IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).
## Downloads
You can also download the model from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-original-script-roberta).
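As a quick check after downloading, the model can be exercised through the `fill-mask` pipeline declared in the metadata above. A small sketch (the Hindi example sentence is illustrative; the mask token is queried from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

# Build a fill-mask pipeline around IA-Original.
fill_mask = pipeline("fill-mask", model="ibm/ia-multilingual-original-script-roberta")

# Use the tokenizer's own mask token (typically "<mask>" for RoBERTa models).
mask = fill_mask.tokenizer.mask_token

# Print the top predictions and their scores for the masked position.
for pred in fill_mask(f"मैं अपने दोस्त से {mask} करता हूँ।"):
    print(pred["token_str"], pred["score"])
```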
## Citing
If you use any of these resources, please cite the following paper:
```
@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas and
      Murthy, Rudra and
      Bharadwaj, Samarth and
      Sankaranarayanan, Karthik and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
```
## Contributors
- Tejas Dhamecha
- Rudra Murthy
- Samarth Bharadwaj
- Karthik Sankaranarayanan
- Pushpak Bhattacharyya
## Contact
- Rudra Murthy ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))