Spam-Bert-Uncased / README.md
metadata
license: creativeml-openrail-m
datasets:
  - prithivMLmods/Spam-Text-Detect-Analysis
language:
  - en
base_model:
  - google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers

📊 Spam Detection with BERT using Hugging Face & Weights & Biases

Overview

This project implements a spam detection model using the BERT (Bidirectional Encoder Representations from Transformers) architecture and leverages Weights & Biases (wandb) for experiment tracking. The model is trained and evaluated using the prithivMLmods/Spam-Text-Detect-Analysis dataset from Hugging Face.


πŸ› οΈ Requirements

  • Python 3.x
  • PyTorch
  • Transformers
  • Datasets
  • Weights & Biases
  • Scikit-learn

Install Dependencies

You can install the required dependencies with the following:

pip install torch transformers datasets wandb scikit-learn

📈 Model Training

Model Architecture

The model uses BERT for sequence classification:

  • Pre-trained Model: bert-base-uncased
  • Task: Binary classification (Spam / Ham)
  • Loss function: Cross-entropy
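
As a rough illustration of the cross-entropy loss listed above, the sketch below computes it for a two-class output in plain Python. The label order (0 = ham, 1 = spam) and the example logits are assumptions made for illustration; during training the equivalent computation happens inside PyTorch.

```python
import math

# Assumed label order, for illustration only: index 0 = ham, index 1 = spam.
LABELS = ("ham", "spam")


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


# Hypothetical logits for one message whose true label is spam (index 1).
confident_right = cross_entropy([-2.0, 3.0], target=1)  # small loss
confident_wrong = cross_entropy([3.0, -2.0], target=1)  # large loss
```

The loss is small when the model puts most of its probability on the correct class and grows quickly when it is confidently wrong.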

Training Arguments

  • Learning rate: 2e-5
  • Batch size: 16
  • Epochs: 3
  • Evaluation: Run at the end of each epoch
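
The settings above could be passed to the Hugging Face `Trainer` roughly as follows. This is a sketch, not the project's actual training script: the `results` output directory, the per-device reading of "batch size", and the evaluation batch size are assumptions. The name of the evaluation option differs between transformers releases, so the helper checks which one is available.

```python
import inspect

# Hyperparameters stated in this README.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumes "batch size" means per device
    "per_device_eval_batch_size": 16,   # assumption: same size for evaluation
    "num_train_epochs": 3,
}


def build_training_arguments(output_dir="results"):
    """Build TrainingArguments; requires transformers and PyTorch installed."""
    from transformers import TrainingArguments  # imported lazily

    # Newer transformers releases call this option `eval_strategy`; older
    # ones call it `evaluation_strategy`. Use whichever this install accepts.
    params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"

    return TrainingArguments(
        output_dir=output_dir,
        report_to="wandb",  # send metrics to Weights & Biases
        **{eval_key: "epoch"},
        **HYPERPARAMS,
    )
```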

🔗 Dataset

The model is trained on the Spam Text Detection dataset, published on Hugging Face Datasets as prithivMLmods/Spam-Text-Detect-Analysis.
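
A minimal sketch of loading the dataset with the `datasets` library is shown below. The helper name `load_spam_dataset` is hypothetical, and no split or column names are assumed here; print the returned object to see what the dataset actually provides.

```python
DATASET_ID = "prithivMLmods/Spam-Text-Detect-Analysis"


def load_spam_dataset():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # imported lazily

    return load_dataset(DATASET_ID)


# Usage (not run here because it downloads data):
#   dataset = load_spam_dataset()
#   print(dataset)  # shows the available splits and column names
```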


🖥️ Instructions

Clone and Set Up

Clone the repository, if applicable:

git clone <repository-url>
cd <project-directory>

Ensure dependencies are installed with:

pip install -r requirements.txt

Train the Model

After installing dependencies, you can train the model using:

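from train import main  # assumes training is implemented in a local `train.py` that exposes `main()`
main()  # importing alone does not start training, so call the entry point

Replace `train` and `main` with your own script's module and entry-point names.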


✨ Weights & Biases Integration

We use Weights & Biases for:

  • Real-time logging of training and evaluation metrics.
  • Tracking experiments.
  • Monitoring evaluation loss, precision, recall, and accuracy.

Initialize wandb at the start of the training script:

import wandb
wandb.init(project="spam-detection")
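
Once a run is initialized, evaluation metrics can be sent to it with `run.log`. The helpers below are a hypothetical sketch: the `eval/` prefix and the function names are illustrative choices, not something this project is known to use.

```python
def start_run(project="spam-detection", mode="online"):
    """Start a wandb run; mode="offline" keeps the run on the local disk."""
    import wandb  # imported lazily

    return wandb.init(project=project, mode=mode)


def log_eval_metrics(run, metrics, epoch):
    """Log evaluation metrics for one epoch under an `eval/` prefix."""
    payload = {f"eval/{name}": float(value) for name, value in metrics.items()}
    payload["epoch"] = epoch
    run.log(payload)  # one call records all metrics for this epoch together
    return payload


# Usage (requires wandb to be installed):
#   run = start_run(mode="offline")
#   log_eval_metrics(run, {"accuracy": 0.99, "loss": 0.03}, epoch=1)
#   run.finish()
```

When training goes through the Hugging Face `Trainer` with `report_to="wandb"`, this logging happens automatically and manual calls like these are only needed for extra metrics.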

📊 Metrics

The following metrics were logged:

  • Accuracy: Fraction of validation examples classified correctly, reported after the final epoch.
  • Precision: Fraction of predicted positive cases that were truly positive.
  • Recall: Fraction of actual positive cases that were correctly predicted as positive.
  • F1 Score: Harmonic mean of precision and recall.
  • Evaluation Loss: Loss computed on the evaluation split during validation.
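
The definitions above can be sketched in plain Python as follows. The labels and predictions are made-up illustrative values, and treating spam as the positive class (label 1) is an assumption. In practice `sklearn.metrics` offers equivalent functions such as `accuracy_score` and `precision_recall_fscore_support`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam (positive class), 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = binary_metrics(y_true, y_pred)
```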

🚀 Results

Using BERT with the provided dataset:

  • Validation Accuracy: 0.9937
  • Precision: 0.9931
  • Recall: 0.9597
  • F1 Score: 0.9761
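
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall using the harmonic-mean definition:

```python
precision = 0.9931
recall = 0.9597

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # → 0.9761
```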

πŸ“ Files and Directories

  • model/: Contains trained model checkpoints.
  • data/: Scripts for processing datasets.
  • wandb/: All logged artifacts from Weights & Biases runs.
  • results/: Training and evaluation results are saved here.

📜 Acknowledgements

Dataset Source: Spam-Text-Detect-Analysis on Hugging Face
Model: BERT for sequence classification from Hugging Face Transformers.