# PhilBERT: Phishing Detection with DistilBERT
PhilBERT is a **fine-tuned DistilBERT model** optimized for detecting phishing threats across multiple communication channels, including **emails, SMS, URLs, and websites**. It is trained on a diverse dataset sourced from **Kaggle, Mendeley, Phishing.Database, and Bancolombia**, ensuring high adaptability and real-world applicability.
---
## Key Features
- **Multi-Channel Detection** – Analyzes text, URLs, and web content to detect phishing patterns.
- **Fine-Tuned on Real-World Data** – Trained on the most recent **three months of financial institution data (Bancolombia)**.
- **Lightweight & Efficient** – Based on **DistilBERT**, providing high performance with reduced computational costs.
- **High Accuracy** – Achieves **85.22% precision, 93.81% recall, and 88.77% accuracy** on unseen data.
- **Self-Adaptive Learning** – Continuously evolves using real-time phishing simulations generated with **GPT-4o**.
- **Scalability** – Designed to support **7,000–25,000 simultaneous users** in production environments.
---
## Model Architecture
PhilBERT leverages **DistilBERT**, a distilled version of **BERT**, maintaining the same architecture but with **40% fewer parameters**, making it lightweight while preserving high accuracy. The **final model** includes:
- **Tokenizer**: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
- **Custom Classifier**: A **fully connected dense layer** added for binary classification (phishing vs. benign).
- **Risk Scoring Mechanism**: A **weighted confidence score** applied to enhance detection reliability.
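
The risk-scoring step can be sketched as a small post-processing function. This is an illustrative assumption, not the model's actual mechanism: the `url_weight` value and the URL-based boost are hypothetical, and the real weighting scheme is not documented here.

```python
import math

def weighted_risk_score(logits, url_weight=1.2, has_url=False):
    """Turn the classifier's two logits (benign, phishing) into a
    weighted confidence score in [0, 1].

    `url_weight` and the URL-based boost are hypothetical parameters
    used only to illustrate the idea of a weighted confidence score.
    """
    exps = [math.exp(x) for x in logits]
    phishing_prob = exps[1] / sum(exps)  # softmax probability of "phishing"
    weight = url_weight if has_url else 1.0
    return min(phishing_prob * weight, 1.0)
```

Equal logits give a neutral 0.5 score; the boost matters mainly near the decision boundary, since the score is clamped at 1.0.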
---
## Data Preprocessing
Before fine-tuning, the dataset underwent **extensive preprocessing** to ensure balance and quality:
- **Duplicate Removal & Balancing**: Maintained a **near 50-50 phishing-to-benign ratio** to prevent model bias.
- **Feature Extraction**: Applied to **URLs, HTML, email bodies, and SMS content** to enrich input representations.
- **Dataset Split**: Final dataset included:
  - **427,028 benign URLs** & **381,014 phishing URLs**
  - **17,536 unique email samples**
  - **5,949 SMS samples**
  - **Web entries filtered for efficiency** (removing entries >100KB).
- **Export Format**: Data transformed and stored in **JSON for efficient training**.
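
The deduplication and balancing steps above can be sketched as follows; the `text`/`label` field names and the downsampling strategy are assumptions for illustration, not the actual pipeline.

```python
import random

def dedupe_and_balance(samples, seed=0):
    """Drop duplicate texts, then downsample the majority class so
    phishing (label 1) and benign (label 0) sit at a near 50-50 ratio.

    Field names `text` and `label` are assumed for illustration.
    """
    seen, unique = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            unique.append(s)
    phishing = [s for s in unique if s["label"] == 1]
    benign = [s for s in unique if s["label"] == 0]
    n = min(len(phishing), len(benign))
    rng = random.Random(seed)
    balanced = rng.sample(phishing, n) + rng.sample(benign, n)
    rng.shuffle(balanced)
    return balanced
```

The balanced list can then be serialized with `json.dump` for training, matching the JSON export step above.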
---
## Training & Evaluation
PhilBERT was fine-tuned on **multi-modal phishing datasets** using **transfer learning**, achieving:
| **Metric**         | **Value**   |
|--------------------|-------------|
| Accuracy           | **88.77%**  |
| Precision          | **85.22%**  |
| Recall             | **93.81%**  |
| F1-Score           | **89.31%**  |
| Evaluation Runtime | **130.46s** |
| Samples/sec        | **58.701**  |
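
The reported F1-Score is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# 2 * 0.8522 * 0.9381 / (0.8522 + 0.9381) ≈ 0.8931, matching the table.
```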
- **False Positive Reduction**: Multi-layered filtering minimized false positives while maintaining **high recall**.
- **Scalability**: Successfully stress-tested for **up to 25,000 simultaneous users**.
- **Compliance**: Meets **ISO 27001** and **GDPR standards** for security and privacy.
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "your_username/PhilBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Click this link to update your bank details: http://fakebank.com"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(f"Phishing probability: {predictions[0][1].item():.4f}")
```
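
To turn the probability into a decision, a cutoff can be applied; the 0.5 threshold below is an assumption for illustration, not a documented default of the model.

```python
def classify(phishing_prob, threshold=0.5):
    """Map a phishing probability to a label.

    The 0.5 threshold is an assumption; tune it to trade precision
    against recall for your deployment.
    """
    return "phishing" if phishing_prob >= threshold else "benign"
```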
---
## License
This model is **proprietary** and protected under a **custom license**. Please refer to the **[LICENSE](LICENSE)** file for terms of use.
---