Roksana/hindi_wic_muril

This model has been pushed to the Hub using the PyTorchModelHubMixin integration.

Model

This model is based on google/muril-base-cased, fine-tuned on the Hindi WiC dataset of Dubossarsky and Dairkee (2024) using WordTransformer, the Siamese network architecture from pierluigic/xl-lexeme (Cassotti et al., 2023).

Usage (WordTransformer)

To recreate our setup, first install the WordTransformer architecture from pierluigic/xl-lexeme:

git clone git@github.com:pierluigic/xl-lexeme.git
cd xl-lexeme
pip3 install .

Then add PyTorchModelHubMixin as a base class in the WordTransformer class definition in xl-lexeme/WordTransformer/WordTransformer:

from huggingface_hub import PyTorchModelHubMixin
class WordTransformer(nn.Sequential, PyTorchModelHubMixin):

To load the model:

from WordTransformer import WordTransformer, InputExample
model = WordTransformer.from_pretrained("Roksana/hindi_wic_muril")
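
Once loaded, the model can embed a target word in its sentence context. The sketch below assumes the `InputExample(texts=..., positions=[start, end])` and `model.encode(...)` interface shown in the xl-lexeme README, where `positions` are character offsets of the target word, and that `encode` returns a 1-D vector for a single example. `target_span`, `cosine`, and `same_sense` are hypothetical helpers, and the 0.5 threshold is an illustrative placeholder, not a value from the paper.

```python
import math


def target_span(sentence, word):
    """Return [start, end) character offsets of the first occurrence of word."""
    start = sentence.find(word)
    if start < 0:
        raise ValueError(f"{word!r} not found in sentence")
    return [start, start + len(word)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def same_sense(model, sent_a, sent_b, word, threshold=0.5):
    """Hypothetical WiC decision: True if the word seems to share a sense.

    The threshold is an assumption and should be tuned on held-out data.
    """
    # Imported lazily so the helpers above work without xl-lexeme installed.
    from WordTransformer import InputExample

    emb_a = model.encode(InputExample(texts=sent_a, positions=target_span(sent_a, word)))
    emb_b = model.encode(InputExample(texts=sent_b, positions=target_span(sent_b, word)))
    return cosine(emb_a, emb_b) >= threshold
```

With the model loaded as above, `same_sense(model, sentence_1, sentence_2, target_word)` would compare the two usages. Note that `target_span` only finds the first exact surface match, so inflected forms need their offsets supplied by hand.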

Alternatively, load it as a plain embedding model:

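from transformers import BertTokenizer, BertModel

# Load pre-trained model and tokenizer
model_name = "Roksana/hindi_wic_muril"  # assumed: this model's Hub ID
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)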
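
Loaded this way, the model returns one vector per subword token, so a word-level embedding has to be pooled from the tokens that cover the target word. The following is a minimal sketch assuming plain mean pooling over those tokens. This is an illustrative choice and not necessarily how WordTransformer builds its target-word embedding. The helper names and toy data are hypothetical. In practice the offsets could come from a fast tokenizer called with `return_offsets_mapping=True`, and the vectors from the model's `last_hidden_state`.

```python
def tokens_in_span(offsets, start, end):
    """Indices of tokens whose character range overlaps [start, end).

    Tokens with an empty range, such as special tokens reported as (0, 0),
    are skipped.
    """
    return [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_end > tok_start and tok_start < end and tok_end > start
    ]


def mean_pool(vectors, indices):
    """Average the selected token vectors dimension by dimension."""
    if not indices:
        raise ValueError("no tokens overlap the target span")
    dim = len(vectors[0])
    return [sum(vectors[i][d] for i in indices) / len(indices) for d in range(dim)]


# Toy data for "the quick fox": [CLS] the quick fo ##x [SEP]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (12, 13), (0, 0)]
vectors = [[0.0, 0.0], [5.0, 5.0], [7.0, 7.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]

fox_tokens = tokens_in_span(offsets, 10, 13)  # tokens covering "fox"
fox_vector = mean_pool(vectors, fox_tokens)
```

Two such vectors, one per sentence, can then be compared with cosine similarity to judge whether the word is used in the same sense.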

Citations and Acknowledgements

@inproceedings{dubossarsky-dairkee-2024-strengthening-wic,
    title = "Strengthening the {W}i{C}: New Polysemy Dataset in {H}indi and Lack of Cross Lingual Transfer",
    author = "Dubossarsky, Haim  and
      Dairkee, Farheen",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1332",
    pages = "15341--15349",
    abstract = "This study addresses the critical issue of Natural Language Processing in low-resource languages such as Hindi, which, despite having substantial number of speakers, is limited in linguistic resources. The paper focuses on Word Sense Disambiguation, a fundamental NLP task that deals with polysemous words. It introduces a novel Hindi WSD dataset in the modern WiC format, enabling the training and testing of contextualized models. The primary contributions of this work lie in testing the efficacy of multilingual models to transfer across languages and hence to handle polysemy in low-resource languages, and in providing insights into the minimum training data required for a viable solution. Experiments compare different contextualized models on the WiC task via transfer learning from English to Hindi. Models purely transferred from English yield poor 55{\%} accuracy, while fine-tuning on Hindi dramatically improves performance to 90{\%} accuracy. This demonstrates the need for language-specific tuning and resources like the introduced Hindi WiC dataset to drive advances in Hindi NLP. The findings offer valuable insights into addressing the NLP needs of widely spoken yet low-resourced languages, shedding light on the problem of transfer learning in these contexts.",
}