
BertForSequenceClassification Fine-tuned for Sentiment Analysis

This model is a fine-tuned version of BertForSequenceClassification for sentiment analysis. It was trained and tested on a labeled Kaggle dataset of texts covering six emotions: anger, fear, joy, love, sadness, and surprise.

GitHub link: https://github.com/hennypurwadi/Bert_FineTune_Sentiment_Analysis

The labeled dataset I used to fine-tune and train the model can be found at: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt

Model Training Details

  • Pretrained model: bert-base-uncased ("uncased" means the model was trained on lowercased text)
  • Number of labels: 6
  • Label_0: anger
  • Label_1: fear
  • Label_2: joy
  • Label_3: love
  • Label_4: sadness
  • Label_5: surprise
  • Learning rate: 2e-5
  • Epsilon: 1e-8
  • Epochs: 10
  • Warmup steps: 0
  • Optimizer: AdamW with correct_bias=False (see the sketch below for the full setup)
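
For reference, here is a minimal sketch of how this configuration maps onto code. It assumes the standard Hugging Face Transformers and PyTorch APIs; steps_per_epoch is a placeholder, and the original training loop may differ.

import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Hyperparameters from the list above
NUM_LABELS = 6
LEARNING_RATE = 2e-5
EPSILON = 1e-8
EPOCHS = 10
WARMUP_STEPS = 0

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=NUM_LABELS)

# The card specifies AdamW with correct_bias=False (the legacy transformers.AdamW);
# torch.optim.AdamW is used here as a stand-in with the same learning rate and epsilon.
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

# steps_per_epoch depends on the training DataLoader; 250 is only a placeholder value
steps_per_epoch = 250
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=WARMUP_STEPS,
    num_training_steps=steps_per_epoch * EPOCHS,
)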

Dataset

The model was trained and tested on the labeled emotions dataset from Kaggle linked above.

To predict sentiments on unlabeled datasets, use the predict_sentiments function provided in this repository.

The unlabeled dataset to be predicted should have a single column named "text".
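
For example, a minimal unlabeled CSV with the required "text" column could be created like this (the file name and sentences are only illustrative):

import pandas as pd

# Illustrative unlabeled dataset: a single "text" column, no labels
df_unlabeled = pd.DataFrame({"text": ["I can't stop smiling today!", "That noise in the dark scared me."]})
df_unlabeled.to_csv("unlabeled_example.csv", index=False)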

Predict an unlabeled dataset collected from Twitter (dc_America.csv):

predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

To load and use the model and tokenizer, use the following code:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import pandas as pd

def predict_sentiments(model_name, tokenizer_name, input_file):
    """Predict emotion labels for a CSV file that has a single "text" column."""
    # Load the fine-tuned model and its tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # Read the unlabeled dataset
    df = pd.read_csv(input_file)

    # Tokenize input text
    test_inputs = tokenizer(list(df['text']), padding=True, truncation=True, max_length=128, return_tensors='pt')

    # Make predictions
    with torch.no_grad():
        model.eval()
        outputs = model(test_inputs['input_ids'], token_type_ids=None, attention_mask=test_inputs['attention_mask'])
        logits = outputs[0].detach().cpu().numpy()
        predictions = logits.argmax(axis=-1)

    # Map the predicted labels back to their original names
    int2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}
    predicted_labels = [int2label[p] for p in predictions]

    # Add the predicted labels to the test dataframe
    df['label'] = predicted_labels

    # Save the predictions to a file
    output_file = input_file.replace(".csv", "_predicted.csv")
    df.to_csv(output_file, index=False)

model_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"

# Predict the unlabeled data
predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

# Load predicted data
df_Am = pd.read_csv('/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv')
df_Am.head()

from transformers import AutoTokenizer
import matplotlib.pyplot as plt

# Load tokenizer
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, do_lower_case=True)

# Load dataset
input_file = '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv'
df_Am = pd.read_csv(input_file)

# Examine distribution of data based on labels
sentences = df_Am.text.values
print("Distribution of data based on labels: ", df_Am.label.value_counts())

MAX_LEN = 512

# Plot the label distribution as a pie chart
label_count = df_Am['label'].value_counts()
plot_users = label_count.plot.pie(autopct='%1.1f%%', figsize=(4, 4))
plt.rc('axes', unicode_minus=False)
plt.show()