---
license: mit
datasets:
- google-research-datasets/go_emotions
language:
- en
library_name: transformers
tags:
- sentiment
---

# Fine-Tuned MiniLM for GoEmotions Sentiment Analysis

This repository contains a fine-tuned version of Microsoft's MiniLM-v2 model, optimized for sentiment analysis on the GoEmotions dataset. At just **90MB**, it is well suited to memory-constrained environments.

The model classifies text into the following emotional/sentiment categories:

* anger
* approval
* confusion
* disappointment
* disapproval
* gratitude
* joy
* sadness
* neutral

These categories cover most of the sentiments a sentence is likely to express, which makes the model useful for validating other sentiment analysis models.

Label mapping when using the model for inference:

```
{
  "LABEL_0": "anger",
  "LABEL_1": "approval",
  "LABEL_2": "confusion",
  "LABEL_3": "disappointment",
  "LABEL_4": "disapproval",
  "LABEL_5": "gratitude",
  "LABEL_6": "joy",
  "LABEL_7": "sadness",
  "LABEL_8": "neutral"
}
```
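
Predictions that come back with generic `LABEL_n` names can be translated with a small helper. The following is a minimal sketch, assuming each prediction is a dict with `label` and `score` keys (the shape returned by the `transformers` text-classification pipeline). The helper name `to_sentiment` and the sample prediction are illustrative and not part of this repository.

```python
# Order matches the LABEL_n indices listed above.
SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]


def to_sentiment(label: str) -> str:
    """Translate a generic label such as 'LABEL_3' into its sentiment name."""
    prefix, _, index = label.partition("_")
    if prefix != "LABEL" or not index.isdigit() or int(index) >= len(SENTIMENTS):
        raise ValueError(f"Unexpected label: {label!r}")
    return SENTIMENTS[int(index)]


# Hypothetical prediction, made up for illustration.
prediction = {"label": "LABEL_4", "score": 0.91}
print(to_sentiment(prediction["label"]))  # disapproval
```

Raising on unknown labels, rather than falling back to a default, makes a mismatch between the model's configuration and this list visible immediately.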

## Why MiniLM?

MiniLM is a family of models distilled from larger language models such as BERT and RoBERTa. It balances performance and efficiency well:

* **Reduced Size:** MiniLM is significantly smaller than its parent models, making it faster to load and deploy, especially in resource-constrained environments.
* **Comparable Performance:** Despite its compact size, MiniLM maintains high accuracy on a variety of natural language processing (NLP) tasks, including sentiment analysis.
* **Distillation Power:** MiniLM's distillation technique is designed to capture the essential knowledge of larger models, making it a practical tool for real-world applications.

## GoEmotions Dataset

Dataset: `google-research-datasets/go_emotions`

The GoEmotions dataset is a valuable resource for sentiment analysis. It consists of roughly 58,000 Reddit comments labeled with 27 emotion categories plus neutral; this model uses the nine classes listed above. The dataset's wide range of emotional expression makes it a good choice for training a versatile sentiment analysis model.

## Training Procedure

1. **Data Preprocessing:** The GoEmotions dataset was preprocessed to ensure consistency and remove noise.
2. **Tokenizer:** The MiniLM-v2 tokenizer was used to convert text into numerical representations suitable for the model.
3. **Fine-Tuning:** The MiniLM-v2 model was fine-tuned on the GoEmotions dataset using a standard training loop. The model's parameters were adjusted to optimize its performance on sentiment classification.
4. **Evaluation:** The fine-tuned model was evaluated on a held-out test set to measure its accuracy and generalization capabilities.
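
The preprocessing in step 1 is not spelled out, so the sketch below is only one plausible reading and not the confirmed method. It keeps comments annotated with exactly one emotion, drops those outside the nine target classes, and remaps the rest to ids 0 to 8. The function name `filter_and_remap`, the `text`/`labels` record layout with labels given as names, and the sample records are all assumptions made for illustration.

```python
REQUIRED_SENTIMENTS = [
    "anger", "approval", "confusion", "disappointment", "disapproval",
    "gratitude", "joy", "sadness", "neutral",
]
LABEL_TO_ID = {name: idx for idx, name in enumerate(REQUIRED_SENTIMENTS)}


def filter_and_remap(examples):
    """Keep single-label examples in the target classes and assign ids 0-8."""
    kept = []
    for example in examples:
        labels = example["labels"]
        if len(labels) != 1:
            continue  # assumption: multi-label comments are discarded
        name = labels[0]
        if name not in LABEL_TO_ID:
            continue  # emotion outside the nine target classes
        kept.append({"text": example["text"].strip(), "label": LABEL_TO_ID[name]})
    return kept


# Made-up records in the assumed layout.
sample = [
    {"text": " Thanks a lot! ", "labels": ["gratitude"]},
    {"text": "Wow, what?", "labels": ["surprise"]},
    {"text": "Ugh, this again.", "labels": ["annoyance", "disappointment"]},
]
print(filter_and_remap(sample))
# [{'text': 'Thanks a lot!', 'label': 5}]
```

The resulting ids follow the same order as the label mapping used at inference time, so `5` corresponds to `gratitude`.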

## How to Use This Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Class names in the order of the model's output indices (LABEL_0 ... LABEL_8).
required_sentiments = ['anger', 'approval', 'confusion', 'disappointment', 'disapproval', 'gratitude', 'joy', 'sadness', 'neutral']

# Load the fine-tuned model and its tokenizer from the saved model directory.
model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
tokenizer = AutoTokenizer.from_pretrained('./saved_model')

text = "How can you be so careless"

# Tokenize the input, padding or truncating it to 128 tokens.
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding='max_length', max_length=128)

# Run inference without tracking gradients.
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class.
predictions = torch.argmax(outputs.logits, dim=-1).item()

# Map the label to sentiment
label_mapping = {idx: sentiment for idx, sentiment in enumerate(required_sentiments)}
predicted_sentiment = label_mapping[predictions]

print(f'Text: {text}')
print(f'Predicted Sentiment: {predicted_sentiment}')
```