---
datasets:
- go_emotions
language:
- en
- zh
metrics:
- accuracy
- bertscore
pipeline_tag: text-classification
tags:
- sentiment-analysis
- multilingual
widget:
- example_title: Example 1
  text: I love you, I like you.
- example_title: Example 2
  text: 我喜欢你。
license: apache-2.0
---
# Multilingual (English and Chinese) GoEmotions Classification Model
This repository hosts a BERT model fine-tuned for cross-lingual emotion classification on the GoEmotions dataset. The model was trained on a multilingual dataset comprising English and Chinese texts, and it classifies input text into one of 28 emotion categories.
The 28 emotion categories, according to the GoEmotions taxonomy, are: 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', and 'neutral'.
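Since the label names ship with the model configuration, they can be inspected programmatically. A minimal sketch, assuming the repository's config carries the standard `id2label` mapping:

```python
from transformers import AutoConfig

# Load the model's configuration and inspect its label set
config = AutoConfig.from_pretrained("SchuylerH/bert-multilingual-go-emtions")
print(config.num_labels)  # expected: 28
print(config.id2label)    # e.g. {0: 'admiration', 1: 'amusement', ...}
```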
## Model Performance
The model achieves the following scores on the validation set:

- Accuracy: 85.95%
- Precision: 91.99%
- Recall: 89.56%
- F1 score: 90.17%

These results indicate that the model predicts the correct emotion category with high accuracy and precision, regardless of whether the input is English or Chinese.
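For reference, here is a hedged sketch of how such validation metrics can be computed with scikit-learn. The card does not state the averaging method for precision, recall, and F1, so `average="weighted"` below is an assumption:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Compute accuracy and averaged precision/recall/F1 from label lists."""
    accuracy = accuracy_score(y_true, y_pred)
    # average="weighted" is an assumption; the card does not specify it
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```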
## Training data
The training data combines the original English GoEmotions dataset with a machine-translated Chinese version of the same dataset.
The dataset is split into two parts:
- Labeled data: used for the initial training stage. It includes both English and machine-translated Chinese samples and is further split into a training set (80%) and a validation set (20%), as sketched after this list.
- Unlabeled data: used for self-training. The model makes predictions on these samples, and the most confidently predicted ones are added back to the training data. It likewise includes both English and machine-translated Chinese samples.
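An illustrative sketch of assembling the combined dataset and performing the 80/20 split with the `datasets` library. The English portion is the public `go_emotions` dataset; `zh_go_emotions.json` is a hypothetical file standing in for the machine-translated Chinese data and is assumed to share the same schema:

```python
from datasets import load_dataset, concatenate_datasets

# English portion: the public GoEmotions dataset
en = load_dataset("go_emotions", "simplified", split="train")
# Chinese portion: hypothetical machine-translated file with the same schema
zh = load_dataset("json", data_files="zh_go_emotions.json", split="train")

# Combine both languages, shuffle, and split 80/20 into train/validation
combined = concatenate_datasets([en, zh]).shuffle(seed=42)
split = combined.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
```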
## Training
The model is trained in two stages:
- Initial training on the labeled data.
- Predictions are made on the unlabeled data, and the most confidently predicted samples are added to the training data. The model is then retrained on this updated labeled data.
The model is trained for a total of 20 epochs (10 epochs for each stage). Precision, recall, and F1 score are logged during training.
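The second stage is a form of self-training (pseudo-labeling). A minimal sketch of the confidence-filtering step follows; the 0.9 threshold and per-example processing are assumptions, as the card does not specify them:

```python
import torch

def pseudo_label(model, tokenizer, unlabeled_texts, threshold=0.9):
    """Keep only unlabeled examples whose top predicted class is confident."""
    model.eval()
    kept_texts, kept_labels = [], []
    with torch.no_grad():
        for text in unlabeled_texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            probs = torch.softmax(model(**inputs).logits, dim=-1)
            confidence, label = probs.max(dim=-1)
            # threshold=0.9 is an assumption; the card does not state it
            if confidence.item() >= threshold:
                kept_texts.append(text)
                kept_labels.append(label.item())
    return kept_texts, kept_labels

# The returned (text, label) pairs would then be appended to the labeled
# training set before the second 10-epoch training stage.
```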
## Usage
Here is a code snippet showing how to use this model:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("SchuylerH/bert-multilingual-go-emtions")
model = AutoModelForSequenceClassification.from_pretrained("SchuylerH/bert-multilingual-go-emtions")

# Build a classification pipeline and run it on an example
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

text = "I love you."
result = nlp(text)
print(result)
```
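The pipeline returns the top predicted emotion and its score as a list of dictionaries, of the form `[{'label': 'love', 'score': ...}]`. In recent versions of transformers, passing `top_k=None` in the pipeline call returns scores for all 28 emotions instead of just the top one.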