---
license: creativeml-openrail-m
datasets:
  - prithivMLmods/Spam-Text-Detect-Analysis
language:
  - en
base_model:
  - google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---

# SPAM DETECTION UNCASED [ SPAM / HAM ]

This model fine-tunes BERT (Bidirectional Encoder Representations from Transformers) for binary sequence classification (Spam / Ham). It is trained on the prithivMLmods/Spam-Text-Detect-Analysis dataset and integrates Weights & Biases (wandb) for comprehensive experiment tracking.
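For quick inference, the fine-tuned model can be loaded with the `pipeline` API. A minimal sketch, assuming the model is published on the Hub under `prithivMLmods/Spam-Bert-Uncased` (adjust the ID if it differs):

```python
from transformers import pipeline

# Assumed Hub model ID; replace with the actual repository if it differs.
classifier = pipeline("text-classification", model="prithivMLmods/Spam-Bert-Uncased")

print(classifier("Congratulations! You've won a free cruise. Call now to claim."))
# -> e.g. [{'label': 'SPAM', 'score': 0.99}]  (label names depend on the model config)
```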


## 🛠️ Overview

Core details:

- **Model:** BERT for sequence classification
- **Pre-trained model:** `bert-base-uncased`
- **Task:** Spam detection, a binary classification task (Spam vs. Ham)
- **Metrics tracked** (see the metric sketch below):
  - Accuracy
  - Precision
  - Recall
  - F1 score
  - Evaluation loss
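
A minimal sketch of how these metrics might be computed with scikit-learn and passed to the Hugging Face `Trainer` (the function name and binary averaging are assumptions; evaluation loss is reported by the `Trainer` automatically):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall, and F1 for Trainer evaluation."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```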

## 📊 Key Results

Validation results for the fine-tuned model on the dataset above:

- **Accuracy:** 0.9937
- **Precision:** 0.9931
- **Recall:** 0.9597
- **F1 score:** 0.9761

## 📈 Model Training Details

### Model Architecture

The model uses `bert-base-uncased` as the pre-trained backbone and is fine-tuned for the sequence classification task.

### Training Parameters

The run used the following hyperparameters (see the sketch below):

- **Learning rate:** 2e-5
- **Batch size:** 16
- **Epochs:** 3
- **Loss:** cross-entropy
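
These hyperparameters map directly onto `TrainingArguments`; a sketch (`output_dir` is an assumption, and cross-entropy is the default loss of `BertForSequenceClassification`, so it needs no explicit setting):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",            # assumed output directory
    learning_rate=2e-5,              # learning rate from the list above
    per_device_train_batch_size=16,  # batch size from the list above
    per_device_eval_batch_size=16,
    num_train_epochs=3,              # epochs from the list above
)
```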

## 🚀 How to Train the Model

1. **Clone the repository:**

   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   or manually:

   ```bash
   pip install transformers datasets wandb scikit-learn
   ```

3. **Train the model.** Assuming you have a script like `train.py` that exposes a `main()` entry point, run:

   ```python
   from train import main

   main()
   ```
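
A minimal sketch of what such a `train.py` might contain, assuming the dataset exposes `Message` and `Category` columns (the column names, split handling, and label mapping are assumptions; check the dataset card for the actual schema):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def main():
    # Assumed schema: a single "train" split with "Message" and "Category" columns.
    dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    label2id = {"ham": 0, "spam": 1}

    def preprocess(batch):
        encoded = tokenizer(batch["Message"], truncation=True, padding="max_length")
        encoded["labels"] = [label2id[c.lower()] for c in batch["Category"]]
        return encoded

    tokenized = dataset["train"].map(preprocess, batched=True)
    split = tokenized.train_test_split(test_size=0.2)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="results",
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            num_train_epochs=3,
        ),
        train_dataset=split["train"],
        eval_dataset=split["test"],
        # Pass compute_metrics (see the Overview sketch) to log the metrics above.
    )
    trainer.train()

if __name__ == "__main__":
    main()
```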
    

## ✨ Weights & Biases Integration

### Why use wandb?

- Monitor experiments in real time through visualizations.
- Log metrics such as loss, accuracy, precision, recall, and F1 score.
- Keep a history of past runs for comparison.

### Initialize Weights & Biases

Include this snippet in your training script:

```python
import wandb

wandb.init(project="spam-detection")
```
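
When training with the `Trainer`, you can also route its logs (loss and evaluation metrics) to the same wandb project by setting `report_to` in `TrainingArguments` (the run name below is hypothetical):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",
    report_to="wandb",               # send Trainer logs to Weights & Biases
    run_name="bert-spam-detection",  # hypothetical run name
)
```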

## 📁 Directory Structure

The project is organized for scalability and a clear separation of components:

```
project-directory/
│
├── data/               # Dataset processing scripts
├── wandb/              # Logged artifacts from wandb runs
├── results/            # Training and evaluation results
├── model/              # Trained model checkpoints
├── requirements.txt    # List of dependencies
└── train.py            # Main script for training the model
```

## 🔗 Dataset Information

The training data comes from Spam-Text-Detect-Analysis, available on the Hugging Face Hub:

- **Dataset:** prithivMLmods/Spam-Text-Detect-Analysis
- **Size:** 5.57k entries
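
The dataset can be pulled directly from the Hub with the `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
print(dataset)  # inspect the available splits and columns
```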

Questions about setting up the training pipeline, optimizing metrics, visualizing runs with wandb, or deploying the fine-tuned model are welcome in the repository's discussions. 🚀