metadata

license: mit
datasets:
  - google-research-datasets/go_emotions
language:
  - en
library_name: transformers
tags:
  - sentiment

Fine-Tuned MiniLM for GoEmotions Sentiment Analysis

This repository contains a fine-tuned version of Microsoft's MiniLM-v2 model for sentiment analysis, trained on the GoEmotions dataset. At only 90 MB, the model is well suited to memory-constrained environments. It classifies text into the following emotion/sentiment categories:

anger
approval
confusion
disappointment
disapproval
gratitude
joy
sadness
neutral

Together, these categories cover most of the sentiments commonly expressed in a sentence. This also makes the model useful for validating other sentiment analysis models.

Label mapping when using inference:

{
  "LABEL_0": "anger",
  "LABEL_1": "approval",
  "LABEL_2": "confusion",
  "LABEL_3": "disappointment",
  "LABEL_4": "disapproval",
  "LABEL_5": "gratitude",
  "LABEL_6": "joy",
  "LABEL_7": "sadness",
  "LABEL_8": "neutral"
}
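
As a small illustration of the mapping above, the following sketch converts a generic `LABEL_n` string, as returned by inference, into its sentiment name. The helper name `readable_label` is hypothetical and not part of this repository, and it assumes labels always follow the `LABEL_<index>` pattern shown above.

```python
# Sentiment names in label-index order, taken from the mapping above.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def readable_label(raw_label: str) -> str:
    """Convert a generic label such as 'LABEL_4' into its sentiment name.

    Hypothetical helper for illustration; assumes the 'LABEL_<index>' pattern.
    """
    prefix, _, index_text = raw_label.partition("_")
    if prefix != "LABEL" or not index_text.isdigit():
        raise ValueError(f"Unexpected label format: {raw_label!r}")
    index = int(index_text)
    if index >= len(SENTIMENTS):
        raise ValueError(f"Label index out of range: {raw_label!r}")
    return SENTIMENTS[index]


print(readable_label("LABEL_4"))  # disapproval
```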

Why MiniLM?

MiniLM is a family of compact models distilled from larger Transformer language models such as BERT and RoBERTa. It balances performance and efficiency:

Reduced Size: MiniLM is significantly smaller than its parent models, making it faster to load and deploy, especially in resource-constrained environments.
Comparable Performance: Despite its compact size, MiniLM maintains surprisingly high accuracy on various natural language processing (NLP) tasks, including sentiment analysis.
Distillation Power: MiniLM is trained to mimic the self-attention behavior of a larger teacher model, so it retains much of the teacher's knowledge and remains practical for real-world applications.

GoEmotions Dataset

google-research-datasets/go_emotions

GoEmotions is a dataset of roughly 58,000 Reddit comments annotated with 27 emotion categories plus neutral. This model is trained on the nine classes listed above. The dataset's wide range of emotional expression makes it a good basis for training a general-purpose sentiment analysis model.

Training Procedure

Data Preprocessing: The GoEmotions dataset was preprocessed to ensure consistency and remove noise.
Tokenizer: The MiniLM-v2 tokenizer was used to convert text into numerical representations suitable for the model.
Fine-Tuning: The MiniLM-v2 model was fine-tuned on the GoEmotions dataset using a standard training loop. The model's parameters were adjusted to optimize its performance on sentiment classification.
Evaluation: The fine-tuned model was evaluated on a held-out test set to measure its accuracy and generalization capabilities.
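
The steps above do not describe the preprocessing in detail. As one plausible illustration only, the sketch below reduces multi-label annotations to the nine classes by keeping comments that carry exactly one of the target labels. The function name `to_single_label`, the `(text, labels)` input format, the filtering rule, and the toy data are all assumptions, not the procedure actually used for this model.

```python
from typing import Iterable, List, Tuple

# The nine target classes, in label-index order.
TARGET = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]
LABEL_TO_ID = {name: idx for idx, name in enumerate(TARGET)}


def to_single_label(
    examples: Iterable[Tuple[str, List[str]]]
) -> List[Tuple[str, int]]:
    """Keep examples with exactly one label that is a target class.

    Assumed rule for illustration: multi-label, off-target, and empty
    comments are dropped; whitespace is collapsed.
    """
    kept = []
    for text, labels in examples:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if not cleaned:
            continue  # skip empty comments
        if len(labels) == 1 and labels[0] in LABEL_TO_ID:
            kept.append((cleaned, LABEL_TO_ID[labels[0]]))
    return kept


# Toy data, invented for this example.
toy = [
    ("Thanks  a lot!", ["gratitude"]),
    ("Wow, did not expect that", ["surprise"]),          # not a target class
    ("I love and admire this", ["love", "admiration"]),  # multi-label
    ("   ", ["neutral"]),                                # empty after cleaning
]
print(to_single_label(toy))  # [('Thanks a lot!', 5)]
```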

How to Use This Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sentiment names in label-index order (LABEL_0 ... LABEL_8)
required_sentiments = ['anger', 'approval', 'confusion', 'disappointment', 'disapproval', 'gratitude', 'joy', 'sadness', 'neutral']

# Load the fine-tuned model and tokenizer from a local directory
model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
tokenizer = AutoTokenizer.from_pretrained('./saved_model')

text = "How can you be so careless"

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=128)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    
# Index of the highest-scoring class
predicted_idx = torch.argmax(outputs.logits, dim=-1).item()

# Map the class index to its sentiment name
predicted_sentiment = required_sentiments[predicted_idx]

print(f'Text: {text}')
print(f'Predicted Sentiment: {predicted_sentiment}')
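
The example above reports only the top class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logit values so that it runs without the model. `rank_sentiments` is a hypothetical helper. With real model output, you would pass something like `outputs.logits[0].tolist()` instead.

```python
import math
from typing import List, Tuple

# Sentiment names in label-index order.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def rank_sentiments(logits: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (sentiment, probability) pairs, highest first."""
    if len(logits) != len(SENTIMENTS):
        raise ValueError("Expected one logit per sentiment class")
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(SENTIMENTS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Illustrative logits only; these are not real model output.
example_logits = [2.1, -0.5, 0.3, 1.2, 1.8, -1.0, -0.8, 0.1, 0.4]
for name, prob in rank_sentiments(example_logits):
    print(f"{name}: {prob:.3f}")
```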