license: creativeml-openrail-m
datasets:
- prithivMLmods/Spam-Text-Detect-Analysis
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
Spam Detection with BERT using Hugging Face & Weights & Biases
Overview
This project fine-tunes BERT (Bidirectional Encoder Representations from Transformers) for spam detection and uses Weights & Biases (wandb) for experiment tracking. The model is trained and evaluated on the prithivMLmods/Spam-Text-Detect-Analysis dataset from Hugging Face.
Requirements
- Python 3.x
- PyTorch
- Transformers
- Datasets
- Weights & Biases
- Scikit-learn
Install Dependencies
You can install the required dependencies with the following:
pip install torch transformers datasets wandb scikit-learn
Model Training
Model Architecture
The model uses BERT for sequence classification:
- Pre-trained Model: bert-base-uncased
- Task: Binary classification (Spam / Ham)
- Loss function: Cross-entropy
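To make the objective concrete, the sketch below computes the cross-entropy loss for a single example from two class logits, which is the quantity a two-label classification head is trained to minimize. It is a framework-free illustration rather than the project's training code; the function name `cross_entropy` and the label convention (0 = ham, 1 = spam) are assumptions made for this example.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy loss for one example, given raw class scores (logits).

    Assumed label convention for this sketch: 0 = ham, 1 = spam.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Equivalent to -log(softmax(logits)[label]).
    return log_sum_exp - logits[label]

# An undecided model (equal logits) pays ln(2) regardless of the true label.
print(cross_entropy([0.0, 0.0], 1))
# A confident, correct prediction yields a loss close to zero.
print(cross_entropy([-4.0, 4.0], 1))
# The same confident prediction is heavily penalized when it is wrong.
print(cross_entropy([-4.0, 4.0], 0))
```

During fine-tuning this per-example loss is averaged over each batch.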
Training Arguments
- Learning rate: 2e-5
- Batch size: 16
- Epochs: 3
- Evaluation: Run at the end of each epoch.
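The settings listed above map onto keyword arguments of `transformers.TrainingArguments` roughly as follows. This is a sketch of the configuration only, kept as a plain dictionary so that it runs without the library installed. The `output_dir` value is a placeholder chosen for this example, and applying the batch size to evaluation as well as training is an assumption, since only one batch size is listed.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "./results",          # placeholder path (assumption)
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,   # assumption: same size for evaluation
    "num_train_epochs": 3,
    # Evaluate at the end of every epoch. Older transformers releases name
    # this parameter `evaluation_strategy` instead of `eval_strategy`.
    "eval_strategy": "epoch",
}

# With transformers installed, these could be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```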
Dataset
The model uses the Spam Text Detection Dataset, published on Hugging Face Datasets as prithivMLmods/Spam-Text-Detect-Analysis.
Instructions
Clone and Set Up
Clone the repository, if applicable:
git clone <repository-url>
cd <project-directory>
Ensure dependencies are installed with:
pip install -r requirements.txt
Train the Model
After installing dependencies, you can train the model using:
from train import main  # Assumes training is implemented in `train.py`
main()
Replace train.py with your script's entry point.
Weights & Biases Integration
We use Weights & Biases for:
- Real-time logging of training and evaluation metrics.
- Tracking experiments.
- Monitoring evaluation loss, precision, recall, and accuracy.
Set up wandb by initializing a run at the start of the training script:
import wandb
wandb.init(project="spam-detection")
Metrics
The following metrics were logged:
- Accuracy: Final validation accuracy.
- Precision: Fraction of predicted positive cases that were truly positive.
- Recall: Fraction of actual positive cases that were correctly predicted as positive.
- F1 Score: Harmonic mean of precision and recall.
- Evaluation Loss: Loss computed on the evaluation split.
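The definitions above can be written out directly. The following is a minimal sketch in plain Python, assuming labels are encoded as 1 for spam (the positive class) and 0 for ham; the function name `binary_metrics` and the toy labels are illustrative and not taken from the project. scikit-learn's `accuracy_score` and `precision_recall_fscore_support` compute the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 3 spam and 5 ham messages, one missed spam and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```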
Results
Using BERT with the provided dataset:
- Validation Accuracy: 0.9937
- Precision: 0.9931
- Recall: 0.9597
- F1 Score: 0.9761
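As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall with the harmonic-mean formula. The snippet uses only the figures listed above.

```python
precision = 0.9931
recall = 0.9597

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9761
```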
Files and Directories
- model/: Contains trained model checkpoints.
- data/: Scripts for processing datasets.
- wandb/: All logged artifacts from Weights & Biases runs.
- results/: Training and evaluation results are saved here.
Acknowledgements
Dataset Source: Spam-Text-Detect-Analysis on Hugging Face
Model: BERT for sequence classification from Hugging Face Transformers.