
	---
	language:
	- bn
	- gu
	- hi
	- mr
	- ne
	- or
	- pa
	- sa
	- ur

	library_name: transformers
	pipeline_tag: fill-mask
	---

	# IA-Original

	IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on the monolingual corpora of these languages and subsequently evaluated on a set of diverse tasks.

	The 11 languages covered by IA-Original are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.

	The code is available in the [NL-FM-Toolkit repository](https://github.com/IBM/NL-FM-Toolkit). For more information, see our [paper](https://aclanthology.org/2021.emnlp-main.675/).


	## Pretraining Corpus

	We pre-trained IA-Original on publicly available monolingual corpora. The pre-training corpus has the following distribution across languages:


	| Language | # Sentences | # Tokens (Total) | # Tokens (Unique) |
	| :------------ | ---------------: | ------------: | ------------: |
	| Hindi (hi) | 1,552.89 | 20,098.73 | 25.01 |
	| Bengali (bn) | 353.44 | 4,021.30 | 6.5 |
	| Sanskrit (sa) | 165.35 | 1,381.04 | 11.13 |
	| Urdu (ur) | 153.27 | 2,465.48 | 4.61 |
	| Marathi (mr) | 132.93 | 1,752.43 | 4.92 |
	| Gujarati (gu) | 131.22 | 1,565.08 | 4.73 |
	| Nepali (ne) | 84.21 | 1,139.54 | 3.43 |
	| Punjabi (pa) | 68.02 | 945.68 | 2.00 |
	| Oriya (or) | 17.88 | 274.99 | 1.10 |
	| Bhojpuri (bh) | 10.25 | 134.37 | 1.13 |
	| Magahi (mag) | 0.36 | 3.47 | 0.15 |
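
To make the imbalance in the table above easier to see, the short sketch below computes each language's share of the total sentence count. The figures are copied from the table as reported; the dictionary name `SENTENCES` and the helper `sentence_shares` are illustrative names, not part of the released code.

```python
# Sentence counts copied from the corpus table (same units as reported there).
SENTENCES = {
    "hi": 1552.89,
    "bn": 353.44,
    "sa": 165.35,
    "ur": 153.27,
    "mr": 132.93,
    "gu": 131.22,
    "ne": 84.21,
    "pa": 68.02,
    "or": 17.88,
    "bh": 10.25,
    "mag": 0.36,
}


def sentence_shares(counts):
    """Return each language's fraction of the total sentence count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = sentence_shares(SENTENCES)

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}  {share:6.2%}")
```

By this measure Hindi alone accounts for a little over half of the sentences, while Bhojpuri and Magahi together contribute well under one percent.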



	## Evaluation Results

	IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).



	## Downloads

	The model is available for download from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-original-script-roberta).
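
Since the card metadata lists `transformers` as the library and `fill-mask` as the pipeline tag, a minimal usage sketch might look like the following. It assumes the checkpoint loads with the standard `pipeline` auto classes, which this card does not confirm. The `[MASK]` placeholder convention and the helper names `build_masked_sentence` and `top_predictions` are hypothetical and introduced only for illustration.

```python
MODEL_ID = "ibm/ia-multilingual-original-script-roberta"


def build_masked_sentence(template, mask_token, placeholder="[MASK]"):
    """Swap a neutral placeholder for the tokenizer's actual mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


def top_predictions(template, top_k=5):
    """Return (token, score) pairs for the masked position.

    Assumption: the checkpoint is loadable through the generic
    fill-mask pipeline. The model is downloaded on first call.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    masked = build_masked_sentence(template, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=top_k)]
```

For example, `top_predictions("भारत की राजधानी [MASK] है।")` would ask the model to fill in the masked word of a Hindi sentence. Reading the mask token from the tokenizer avoids hard-coding a token string that may differ for this checkpoint.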



	## Citing

	If you use any of these resources, please cite the following paper:

	```
	@inproceedings{dhamecha-etal-2021-role,
	title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
	author = "Dhamecha, Tejas and
	Murthy, Rudra and
	Bharadwaj, Samarth and
	Sankaranarayanan, Karthik and
	Bhattacharyya, Pushpak",
	booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
	month = nov,
	year = "2021",
	address = "Online and Punta Cana, Dominican Republic",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.emnlp-main.675",
	doi = "10.18653/v1/2021.emnlp-main.675",
	pages = "8584--8595",
	}
	```

	## Contributors

	- Tejas Dhamecha
	- Rudra Murthy
	- Samarth Bharadwaj
	- Karthik Sankaranarayanan
	- Pushpak Bhattacharyya


	## Contact

	- Rudra Murthy ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))