---
license: creativeml-openrail-m
datasets:
- prithivMLmods/Spam-Text-Detect-Analysis
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---

### **SPAM DETECTION UNCASED [ SPAM / HAM ]**

This implementation fine-tunes **BERT (Bidirectional Encoder Representations from Transformers)** for binary sequence classification (Spam / Ham). The model is trained on the **`prithivMLmods/Spam-Text-Detect-Analysis`** dataset and integrates **Weights & Biases (wandb)** for experiment tracking.

---

## **🛠️ Overview**

### **Core Details:**

- **Model:** BERT for sequence classification
- **Pre-trained Backbone:** `bert-base-uncased`
- **Task:** Spam detection, a binary classification task (Spam vs. Ham)
- **Metrics Tracked:**
  - Accuracy
  - Precision
  - Recall
  - F1 score
  - Evaluation loss

---

## **📊 Key Results**

Results obtained with BERT on the provided training dataset:

- **Validation Accuracy:** **0.9937**
- **Precision:** **0.9931**
- **Recall:** **0.9597**
- **F1 Score:** **0.9761**

---

## **📈 Model Training Details**

### **Model Architecture:**

The model uses `bert-base-uncased` as the pre-trained backbone and is fine-tuned for the sequence classification task.

### **Training Parameters:**

- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Loss:** Cross-entropy

---

## **🚀 How to Train the Model**

1. **Clone the repository:**

   ```bash
   git clone
   cd
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   or manually:

   ```bash
   pip install transformers datasets wandb scikit-learn
   ```

3. **Train the model:** assuming the entry point is a `train.py` script exposing a `main` function, run:

   ```python
   from train import main

   main()
   ```

---

## **✨ Weights & Biases Integration**

### Why use wandb?

- **Monitor experiments in real time** via live visualizations.
- Log metrics such as loss, accuracy, precision, recall, and F1 score.
- Keep a history of past runs for side-by-side comparison.

### Initialize Weights & Biases

Include this snippet in your training script:

```python
import wandb

wandb.init(project="spam-detection")
```

---

## 📁 **Directory Structure**

The directory is organized for scalability and a clear separation of components:

```
project-directory/
│
├── data/             # Dataset processing scripts
├── wandb/            # Logged artifacts from wandb runs
├── results/          # Training and evaluation results
├── model/            # Trained model checkpoints
├── requirements.txt  # List of dependencies
└── train.py          # Main script for training the model
```

---

## 🔗 Dataset Information

The training data comes from the **Spam-Text-Detect-Analysis** dataset on Hugging Face:

- **Dataset Link:** [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- **Size:** 5.57k entries
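
---

## 🧪 Example: Preparing a Stratified Split

The training script itself is not shown in this card, so as a minimal sketch, here is how the dataset's ~5.57k messages could be split into train and validation sets before fine-tuning. The in-memory sample below is a toy stand-in for the real rows, and the 1 = spam / 0 = ham encoding is an assumption about the label mapping:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the Spam-Text-Detect-Analysis rows (the real set has ~5.57k).
messages = [
    "WINNER!! Claim your free prize now",
    "Are we still on for lunch today?",
    "URGENT: your account has been suspended",
    "Can you send me the report?",
    "Congratulations, you won a $1000 gift card",
    "See you at the meeting tomorrow",
]
labels = [1, 0, 1, 0, 1, 0]  # assumed mapping: 1 = spam, 0 = ham

# Stratify so both splits keep the same spam/ham ratio.
train_x, val_x, train_y, val_y = train_test_split(
    messages, labels, test_size=1 / 3, stratify=labels, random_state=42
)
print(len(train_x), len(val_x))  # 4 2
```

Stratification matters here because spam is typically the minority class; an unstratified split on a small validation set can skew precision and recall.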
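
---

## 📐 Example: Computing the Tracked Metrics

The accuracy, precision, recall, and F1 numbers reported above can be produced by a metrics callback. A minimal sketch, assuming a Hugging Face `Trainer`-style training loop (the `compute_metrics` name and `(logits, labels)` input shape follow that convention; the original script is not shown):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Turn raw model logits into the metrics tracked in wandb."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # highest-scoring class per example
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"  # spam treated as the positive class
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passed as `Trainer(..., compute_metrics=compute_metrics)`, this logs all four values at every evaluation step alongside the evaluation loss.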