miniLM-go_Emotions / README.md
mkpvishnu@gmail.com
license: mit
datasets:
  - google-research-datasets/go_emotions
language:
  - en
library_name: transformers
tags:
  - sentiment

Fine-Tuned MiniLM for GoEmotions Sentiment Analysis

This repository contains a fine-tuned version of Microsoft's MiniLM-v2 model, optimized for sentiment analysis on the GoEmotions dataset. The model is only about 90MB, which makes it well suited to memory-constrained environments.

The model classifies text into the following emotional/sentiment categories:

  • anger
  • approval
  • confusion
  • disappointment
  • disapproval
  • gratitude
  • joy
  • sadness
  • neutral

These nine categories cover most of the sentiments a sentence is likely to express, which also makes the model useful for validating other sentiment analysis models.

Label mapping when using inference:

{
  "LABEL_0": "anger",
  "LABEL_1": "approval",
  "LABEL_2": "confusion",
  "LABEL_3": "disappointment",
  "LABEL_4": "disapproval",
  "LABEL_5": "gratitude",
  "LABEL_6": "joy",
  "LABEL_7": "sadness",
  "LABEL_8": "neutral"
}
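If the inference output carries the generic `LABEL_n` names, a small helper can translate them back to sentiment names. The following is a minimal sketch: the `decode_label` function and the `raw_results` sample are hypothetical, and the sample only imitates the list-of-dicts shape (`label`, `score`) that a text-classification pipeline typically returns.

```python
# Order taken from the label table above: LABEL_0 ... LABEL_8.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def decode_label(raw_label: str) -> str:
    """Turn a generic label such as 'LABEL_3' into its sentiment name."""
    prefix, _, index = raw_label.partition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"unexpected label format: {raw_label!r}")
    return SENTIMENTS[int(index)]


# Hypothetical inference output; the score is made up for illustration.
raw_results = [{"label": "LABEL_4", "score": 0.91}]
decoded = [
    {"label": decode_label(r["label"]), "score": r["score"]}
    for r in raw_results
]
print(decoded)  # [{'label': 'disapproval', 'score': 0.91}]
```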

Why MiniLM?

MiniLM is a distilled version of larger language models like BERT and RoBERTa. It strikes a remarkable balance between performance and efficiency:

  • Reduced Size: MiniLM is significantly smaller than its parent models, making it faster to load and deploy, especially in resource-constrained environments.
  • Comparable Performance: Despite its compact size, MiniLM maintains surprisingly high accuracy on various natural language processing (NLP) tasks, including sentiment analysis.
  • Distillation Power: MiniLM's distillation technique ensures that it captures the essential knowledge of larger models, making it a potent tool for real-world applications.

GoEmotions Dataset

google-research-datasets/go_emotions

The GoEmotions dataset is a valuable resource for sentiment analysis. It consists of roughly 58,000 Reddit comments annotated with 27 emotion categories plus neutral. This model is trained on the nine emotional/sentiment classes listed above, which are a subset of those labels. The dataset's wide range of emotional expression makes it a good choice for training a versatile sentiment analysis model.

Training Procedure

  1. Data Preprocessing: The GoEmotions dataset was preprocessed to ensure consistency and remove noise.
  2. Tokenizer: The MiniLM-v2 tokenizer was used to convert text into numerical representations suitable for the model.
  3. Fine-Tuning: The MiniLM-v2 model was fine-tuned on the GoEmotions dataset using a standard training loop. The model's parameters were adjusted to optimize its performance on sentiment classification.
  4. Evaluation: The fine-tuned model was evaluated on a held-out test set to measure its accuracy and generalization capabilities.
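The preprocessing step above is not described in detail, so the following is only one plausible sketch, not the procedure actually used for this model. It assumes that rows are kept only when they carry exactly one emotion from the nine classes, and that whitespace is normalized. The `to_training_example` helper and the sample `rows` are hypothetical.

```python
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]
LABEL_TO_ID = {name: idx for idx, name in enumerate(SENTIMENTS)}


def to_training_example(text, emotions):
    """Return (clean_text, label_id), or None if the row should be dropped."""
    clean = " ".join(text.split())  # collapse repeated whitespace
    # Assumption: keep only single-label rows whose emotion is one of the nine.
    if not clean or len(emotions) != 1 or emotions[0] not in LABEL_TO_ID:
        return None
    return clean, LABEL_TO_ID[emotions[0]]


# Made-up rows in a (text, emotion names) shape, for illustration only.
rows = [
    ("Thanks  so much!", ["gratitude"]),
    ("I love this", ["love"]),                 # outside the nine classes
    ("Ugh, why?", ["anger", "confusion"]),     # multi-label row
]
examples = [
    ex for ex in (to_training_example(t, e) for t, e in rows) if ex is not None
]
print(examples)  # [('Thanks so much!', 5)]
```

Dropping multi-label rows keeps the task single-label, which matches the argmax-based prediction in the usage code, but it is a design choice assumed here rather than one confirmed by the training description.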

How to Use This Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Class names in label-index order (LABEL_0 ... LABEL_8)
required_sentiments = ['anger', 'approval', 'confusion', 'disappointment', 'disapproval', 'gratitude', 'joy', 'sadness', 'neutral']

# Load the model and tokenizer from a local copy of this repository's files
model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
tokenizer = AutoTokenizer.from_pretrained('./saved_model')

text = "How can you be so careless"

# Tokenize the input, truncating or padding it to 128 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=128)

# Run inference with dropout disabled and without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class
predicted_idx = torch.argmax(outputs.logits, dim=-1).item()

# Map the class index to its sentiment name
predicted_sentiment = required_sentiments[predicted_idx]

print(f'Text: {text}')
print(f'Predicted Sentiment: {predicted_sentiment}')
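The example above reports only the top class. When a confidence score or the runner-up classes are useful, the logits can be converted to probabilities with a softmax. The sketch below uses plain Python so it runs without the model; the `example_logits` values are made up, and `softmax` and `top_k` are hypothetical helper names. With real model output, `torch.softmax(outputs.logits, dim=-1)` performs the same conversion on the tensor.

```python
import math

SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most probable (sentiment, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(SENTIMENTS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits standing in for one row of outputs.logits.
example_logits = [2.1, -0.3, 0.4, 1.2, 3.0, -1.5, -2.0, 0.1, 0.8]
for name, prob in top_k(example_logits):
    print(f"{name}: {prob:.3f}")
```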