prithivMLmods
committed on
Commit • 4af0eb2
1 Parent(s): 3eecabc
Update README.md
README.md
CHANGED
@@ -11,97 +11,83 @@ library_name: transformers
---
### **SPAM DETECTION UNCASED [ SPAM / HAM ]**

This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture and leverages **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated using the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face.

---

You can install the required dependencies with the following:

---

## **Model Training**

### **Model Architecture**
The model uses:
- Pre-trained Model: `bert-base-uncased`
- Task: Binary classification (Spam / Ham)
- Optimization: Cross-entropy loss

---

### **Training Parameters**
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3

---

### Train the Model
After installing dependencies, you can train the model using:

```python
from train import main  # Assuming training is implemented in a `train.py`
```

Replace `train.py` with your script's entry point.

---

## **✨ Weights & Biases Integration**

Set up wandb by initializing this in the script:

```python
import wandb
wandb.init(project="spam-detection")
```
@@ -109,41 +95,30 @@ wandb.init(project="spam-detection")

---

The following metrics were logged:

- **Accuracy:** Final validation accuracy.
- **Precision:** Fraction of predicted positive cases that were truly positive.
- **Recall:** Fraction of actual positive cases predicted.
- **F1 Score:** Harmonic mean of precision and recall.
- **Evaluation Loss:** Loss during validation on evaluation splits.

---

## **Results**

---

- `wandb/`: All logged artifacts from Weights & Biases runs.
- `results/`: Training and evaluation results are saved here.

---

Dataset Source: [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)

Model: **BERT for sequence classification** from Hugging Face Transformers.

---
### **SPAM DETECTION UNCASED [ SPAM / HAM ]**

This implementation leverages **BERT (Bidirectional Encoder Representations from Transformers)** for binary classification (Spam / Ham) as a sequence classification task. The model uses the **`prithivMLmods/Spam-Text-Detect-Analysis`** dataset and integrates **Weights & Biases (wandb)** for comprehensive experiment tracking.

---

## **Overview**

### **Core Details:**
- **Model:** BERT for sequence classification (pre-trained checkpoint: `bert-base-uncased`)
- **Task:** Spam detection, a binary classification task (Spam vs. Ham)
- **Metrics Tracked:**
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - Evaluation loss

---

## **Key Results**

Results were obtained using BERT and the provided training dataset:

- **Validation Accuracy:** **0.9937**
- **Precision:** **0.9931**
- **Recall:** **0.9597**
- **F1 Score:** **0.9761**

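These metrics can be reproduced from model predictions with scikit-learn. Below is a small sketch of a `compute_metrics` helper in the style expected by the Hugging Face `Trainer`; the function itself is illustrative and not taken from the repository.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall, and F1 from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}
```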
---

## **Model Training Details**

### **Model Architecture:**

The model uses `bert-base-uncased` as the pre-trained backbone and is fine-tuned for the sequence classification task.
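As a sketch, the backbone can be loaded with the Transformers auto classes; the two-label head and the HAM/SPAM label names below are assumptions for readability, not taken from the repository.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# bert-base-uncased with a freshly initialized 2-way classification head.
# The HAM/SPAM label mapping is an assumed convention.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    id2label={0: "HAM", 1: "SPAM"},
    label2id={"HAM": 0, "SPAM": 1},
)
```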
|
### **Training Parameters:**
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Loss:** Cross-Entropy

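A minimal fine-tuning sketch using these hyperparameters with the Hugging Face `Trainer` follows; the dataset column names (`Message`, `Category`) and the train/test split are assumptions and may need to be adapted to the actual dataset schema.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed column names: "Message" (text) and "Category" ("ham"/"spam").
raw = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis", split="train")
raw = raw.map(lambda ex: {"label": 0 if ex["Category"] == "ham" else 1})
splits = raw.train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["Message"], truncation=True, padding="max_length", max_length=128)

splits = splits.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hyperparameters from the list above; the classification head uses
# cross-entropy loss by default.
args = TrainingArguments(
    output_dir="results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    report_to="wandb",  # optional: stream metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
trainer.evaluate()
```

The evaluation metrics listed earlier can be produced by the same run by passing `compute_metrics=compute_metrics` (see the sketch in the Key Results section) to the `Trainer`.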
---

## **How to Train the Model**

1. **Clone Repository:**
   ```bash
   git clone <repository-url>
   cd <project-directory>
   ```

2. **Install Dependencies:**
   Install all necessary dependencies:
   ```bash
   pip install -r requirements.txt
   ```
   or manually:
   ```bash
   pip install transformers datasets wandb scikit-learn
   ```

3. **Train the Model:**
   Assuming you have a script like `train.py`, run:
   ```python
   from train import main

   main()  # call the training entry point defined in train.py
   ```

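The repository's `train.py` is not reproduced in this README; the following is only a rough skeleton, under the assumption that it wires wandb and a Hugging Face `Trainer` together behind a `main()` function.

```python
# Hypothetical skeleton for train.py; the actual script may differ.
import wandb

def main():
    # Start a wandb run so metrics from training/evaluation are tracked.
    wandb.init(project="spam-detection")
    # Build the tokenizer, model, and Trainer here (see the training sketch above),
    # then run training and evaluation:
    # trainer.train()
    # trainer.evaluate()
    wandb.finish()

if __name__ == "__main__":
    main()
```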
---

## **✨ Weights & Biases Integration**

### Why Use wandb?
- **Monitor experiments in real time** via visualization.
- Log metrics such as loss, accuracy, precision, recall, and F1 score.
- Keep a history of past runs and compare them easily.

### Initialize Weights & Biases
Include this snippet in your training script:

```python
import wandb
wandb.init(project="spam-detection")
```

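During or after evaluation, the tracked metrics can be sent to the same run with `log`; a small sketch is shown below (the metric values are placeholders, not the reported results).

```python
import wandb

run = wandb.init(project="spam-detection")

# Placeholder values only; in practice these come from the evaluation step.
run.log({
    "eval/accuracy": 0.99,
    "eval/precision": 0.99,
    "eval/recall": 0.96,
    "eval/f1": 0.98,
    "eval/loss": 0.05,
})
run.finish()
```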
---

## **Directory Structure**

The directory is organized to ensure scalability and clear separation of components:

```
project-directory/
│
├── data/              # Dataset processing scripts
├── wandb/             # Logged artifacts from wandb runs
├── results/           # Save training and evaluation results
├── model/             # Trained model checkpoints
├── requirements.txt   # List of dependencies
└── train.py           # Main script for training the model
```

---

## **Dataset Information**

The training dataset comes from **Spam-Text-Detect-Analysis**, available on Hugging Face:

- **Dataset Link:** [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- **Dataset size:** 5.57k entries

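A quick way to pull the dataset and inspect a few rows with the `datasets` library; the exact column names are as published on the dataset card.

```python
from datasets import load_dataset

# Download the spam/ham dataset from the Hugging Face Hub.
ds = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis", split="train")

print(ds)      # row count and column names
print(ds[0])   # one example record
```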
---

Let me know if you need assistance setting up the training pipeline, optimizing metrics, visualizing with wandb, or deploying this fine-tuned model.
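For quick local testing before deployment, the sketch below loads a fine-tuned checkpoint with the `pipeline` API; the `model/` path is an assumption taken from the directory structure above.

```python
from transformers import pipeline

# "model/" is assumed to hold the fine-tuned checkpoint (see Directory Structure).
spam_classifier = pipeline("text-classification", model="model/", tokenizer="model/")

print(spam_classifier("Congratulations! You have won a free prize. Click here to claim."))
```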