metadata

license: apache-2.0
language:
  - en
  - gu
  - mr
  - hi

Model Card for Model ID

Model Details

Part-of-speech tagging (POS tagging, or POST) is the task of assigning each word in a phrase its appropriate POS tag. POS tagging algorithms are broadly either rule-based or stochastic and, from a modelling standpoint, either monolingual or multilingual. POS tags provide grammatical context for a sentence, which can be used in NLP tasks such as NER, NLU, and QnA systems. Researchers have already proposed a range of approaches, tag sets, and models for this task, including the Weightless Artificial Neural Network (WANN), several forms of CRF, Bi-LSTM CRF, and transformers, as well as techniques that combine language tags with POS tags to handle mixed-language text. This body of work has improved or established benchmarks for both popular and low-resource languages, in monolingual and multilingual settings. This model aims for state-of-the-art POS tagging in the Indian language context, in both native and Romanised scripts.

Model Description

The model has been trained on English, Hindi, Gujarati, and Marathi in their native scripts, as well as on the Romanised forms of Hindi, Gujarati, and Marathi, i.e. (en, gu, mr, hi, gu_romanised, mr_romanised, hi_romanised). To use this model, you have to import the following class:

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from torchcrf import CRF  # provided by the pytorch-crf package
from transformers import (
    BertModel,
    BertPreTrainedModel,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.trainer_utils import IntervalStrategy

class BertCRF(BertPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
            1]``.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

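        loss = None
        # Viterbi-decode the best tag sequence for every input (a list of lists of tag ids).
        # No mask is passed, so padded positions are scored and decoded as well.
        tags = self.crf.decode(logits)
        if labels is not None:
            # The CRF returns the log-likelihood of the gold tags; negate it to get a loss
            loss = -self.crf(logits, labels)
        # decode() returns Python lists; convert them to an integer tensor on the model's device
        tags = torch.tensor(tags, dtype=torch.long, device=logits.device)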

        if not return_dict:
            output = (tags,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return loss, tags
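
For readers unfamiliar with the CRF layer: `self.crf.decode(logits)` picks, for each sentence, the tag sequence with the highest total score, combining the per-token emission scores (`logits`) with learned start, end, and tag-to-tag transition scores. The following is a minimal pure-Python sketch of that Viterbi search for a single sentence. It is not the pytorch-crf implementation; the function name `viterbi_decode` and the toy scores are made up for illustration.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    start: Sequence[float],
    end: Sequence[float],
) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t (the logits)
    transitions[i][j] : score of moving from tag i to tag j
    start[j], end[j]  : score of starting / ending the sequence in tag j
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current position
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag i for reaching tag j at position t
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Add the end scores, then follow the back-pointers from the best final tag
    final = [score[j] + end[j] for j in range(num_tags)]
    best = max(range(num_tags), key=lambda j: final[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy scores (hypothetical): 3 tokens, 2 tags, with a penalty for moving from tag 0 to tag 1
emissions = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
transitions = [[0.0, -4.0], [0.0, 0.0]]
start = [0.0, 0.0]
end = [0.0, 0.0]
print(viterbi_decode(emissions, transitions, start, end))  # [0, 0, 0]
```

In this toy example, choosing the best tag for each token independently would give [0, 1, 1]; the transition penalty makes the sequence-level search prefer [0, 0, 0], which is the kind of consistency a CRF layer adds on top of a token classifier.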

Some sample output from the model

This model uses a combined labelling scheme: each predicted label carries a language prefix and a POS tag, so the model detects the language of each word as well as its part of speech in that language.

Types	Output
English	[{'words': ['my', 'name', 'is', 'swagat'], 'labels': ['en-DET', 'en-NN', 'en-VB', 'en-NN']}]
Hindi	[{'words': ['मेरा', 'नाम', 'स्वागत', 'है'], 'labels': ['hi-PRP', 'hi-NN', 'hi-NNP', 'hi-VM']}]
Hindi Romanised	[{'words': ['mera', 'naam', 'swagat', 'hai'], 'labels': ['hi_rom-PRP', 'hi_rom-NN', 'hi_rom-NNP', 'hi_rom-VM']}]
Gujarati	[{'words': ['મારું', 'નામ', 'સ્વગત', 'છે'], 'labels': ['gu-PRP', 'gu-NN', 'gu-NNP', 'gu-VAUX']}]
Gujarati Romanised	[{'words': ['maru', 'naam', 'swagat', 'che'], 'labels': ['gu_rom-PRP', 'gu_rom-NN', 'gu_rom-NNP', 'gu_rom-VAUX']}]
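
Because each label combines a language prefix and a POS tag, downstream code will usually want to separate the two. The sketch below assumes the `<language>-<POS>` label form seen in the samples and splits at the first hyphen; the helper names `split_label` and `summarise` are hypothetical and not part of the model's code.

```python
from collections import Counter
from typing import Dict, List, Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split a combined label such as 'hi_rom-NNP' into ('hi_rom', 'NNP')."""
    language, _, pos = label.partition("-")
    return language, pos


def summarise(prediction: Dict[str, List[str]]) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the most frequent language prefix and the (word, POS) pairs."""
    pairs = [split_label(label) for label in prediction["labels"]]
    language = Counter(lang for lang, _ in pairs).most_common(1)[0][0]
    tagged = [(word, pos) for word, (_, pos) in zip(prediction["words"], pairs)]
    return language, tagged


# One entry in the output format shown in the table (the Hindi row)
sample = {
    "words": ["मेरा", "नाम", "स्वागत", "है"],
    "labels": ["hi-PRP", "hi-NN", "hi-NNP", "hi-VM"],
}
language, tagged = summarise(sample)
print(language)  # hi
print(tagged)    # [('मेरा', 'PRP'), ('नाम', 'NN'), ('स्वागत', 'NNP'), ('है', 'VM')]
```

Taking the most frequent prefix is only one way to get a sentence-level language guess; for code-mixed text the per-word prefixes are likely more useful on their own.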

Developed by: Swagat Panda
Finetuned from model: google/muril-base-cased

Model Sources

Paper: https://www.academia.edu/87916386/MULTILINGUAL_APPROACH_TOWARDS_THE_NATIVE_AND_ROMANISED_SCRIPTS_FOR_INDIAN_LANGUGE_CONTEXT_ON_POS_TAGGING?source=swp_share