---
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---

# 📧 **Spam Detection with BERT using Hugging Face & Weights & Biases**

## **Overview**

This project implements a spam detection model using the **BERT (Bidirectional Encoder Representations from Transformers)** architecture, with **Weights & Biases (wandb)** for experiment tracking. The model is trained and evaluated on the [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis) dataset from Hugging Face.

---

## **🛠️ Requirements**

- Python 3.x
- PyTorch
- Transformers
- Datasets
- Weights & Biases
- Scikit-learn

---

### **Install Dependencies**

You can install the required dependencies with the following (note that `torch` is included, since Transformers does not install PyTorch itself):

```bash
pip install torch transformers datasets wandb scikit-learn
```

---

## **🚀 Model Training**

### **Model Architecture**

The model uses **BERT for sequence classification**:

- Pre-trained model: `bert-base-uncased`
- Task: binary classification (spam / ham)
- Loss: cross-entropy
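
In code, this corresponds to loading the stock sequence-classification head from Transformers. A minimal sketch (the label convention in the comment is an assumption, not taken from the project):

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Load the tokenizer and the pre-trained backbone with a 2-class head.
# BertForSequenceClassification computes cross-entropy loss internally
# whenever `labels` are passed to forward().
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # assumed convention: 0 = ham, 1 = spam
)
```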
---

### **Training Arguments**

- **Learning rate:** `2e-5`
- **Batch size:** 16
- **Epochs:** 3
- **Evaluation:** run once per epoch
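
These map onto `TrainingArguments` roughly as sketched below; `output_dir` and the reporting option are illustrative assumptions, not settings taken from this repo:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="results",            # assumed output location
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",           # `evaluation_strategy` on older releases
    report_to="wandb",               # stream metrics to Weights & Biases
)
```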
---

## **📊 Dataset**

The model is trained on the **Spam Text Detect Analysis** dataset, available on the Hugging Face Hub: [prithivMLmods/Spam-Text-Detect-Analysis](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis).
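
It can be loaded directly with the `datasets` library (split and column names are dataset-specific, so check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Pull the dataset from the Hugging Face Hub and inspect its splits.
dataset = load_dataset("prithivMLmods/Spam-Text-Detect-Analysis")
print(dataset)
```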
---

## **🖥️ Instructions**

### Clone and Set Up

Clone the repository, if applicable:

```bash
git clone <repository-url>
cd <project-directory>
```

Ensure dependencies are installed with:

```bash
pip install -r requirements.txt
```
---

### Train the Model

After installing dependencies, you can launch training from your entry point:

```python
# Assuming the training loop is implemented in a local `train.py`
# that exposes a `main()` function.
from train import main

main()
```

Replace `train.py` with your script's actual entry point.
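
If no such script exists yet, the core of `main()` would be a standard `Trainer` loop along these lines. This is a sketch only: the `"Message"` text column and the split handling are assumptions about the dataset schema, and it reuses `tokenizer`, `model`, `training_args`, and `compute_metrics` from the other sketches in this README:

```python
from transformers import Trainer

def tokenize(batch):
    # "Message" is an assumed column name; adjust to the real schema.
    return tokenizer(batch["Message"], truncation=True, padding="max_length")

# Tokenize and carve out a held-out split for evaluation.
tokenized = dataset.map(tokenize, batched=True)
splits = tokenized["train"].train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,  # see the Metrics section below
)
trainer.train()
```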
---

## **✨ Weights & Biases Integration**

We use **Weights & Biases** for:

- Real-time logging of training and evaluation metrics.
- Tracking and comparing experiments.
- Monitoring evaluation loss, precision, recall, and accuracy.

Set up wandb by initializing a run at the top of the training script:

```python
import wandb

wandb.init(project="spam-detection")
```
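
With `report_to="wandb"` set in `TrainingArguments`, the `Trainer` streams its metrics to the run automatically; anything else can be logged by hand (the metric name and value below are purely illustrative):

```python
# Log an extra custom value to the active wandb run.
wandb.log({"custom_metric": 0.5})
```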
---

## **📈 Metrics**

The following metrics were logged:

- **Accuracy:** final validation accuracy.
- **Precision:** fraction of predicted positives that were truly positive.
- **Recall:** fraction of actual positives that were correctly predicted.
- **F1 score:** harmonic mean of precision and recall.
- **Evaluation loss:** loss on the validation split.
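
These can be produced with a standard scikit-learn metrics hook passed to the `Trainer` via `compute_metrics` (a sketch, assuming the positive/spam class is label 1):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Turn raw logits into the metrics listed above."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"  # label 1 is the positive class
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```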
---

## **🏆 Results**

Using BERT with the provided dataset:

- **Validation Accuracy:** `0.9937`
- **Precision:** `0.9931`
- **Recall:** `0.9597`
- **F1 Score:** `0.9761`
---

## **📁 Files and Directories**

- `model/`: trained model checkpoints.
- `data/`: scripts for processing datasets.
- `wandb/`: artifacts logged from Weights & Biases runs.
- `results/`: training and evaluation results.
---

## **🙏 Acknowledgements**

- Dataset: [Spam-Text-Detect-Analysis on Hugging Face](https://huggingface.co/datasets/prithivMLmods/Spam-Text-Detect-Analysis)
- Model: **BERT for sequence classification** from Hugging Face Transformers.

---