# PhilBERT: Phishing Detection with DistilBERT
PhilBERT is a **fine-tuned DistilBERT model** optimized for detecting phishing threats across multiple communication channels, including **emails, SMS, URLs, and websites**. It is trained on a diverse dataset sourced from **Kaggle, Mendeley, Phishing.Database, and Bancolombia**, ensuring high adaptability and real-world applicability.
---
## Key Features
- **Multi-Channel Detection** – Analyzes text, URLs, and web content to detect phishing patterns.
- **Fine-Tuned on Real-World Data** – Incorporates **three months of recent financial-institution data (Bancolombia)**.
- **Lightweight & Efficient** – Based on **DistilBERT**, providing high performance at reduced computational cost.
- **High Accuracy** – Achieves **85.22% precision, 93.81% recall, and 88.77% accuracy** on unseen data.
- **Self-Adaptive Learning** – Continuously evolves using real-time phishing simulations generated with **GPT-4o**.
- **Scalability** – Designed to support **7,000–25,000 simultaneous users** in production environments.
---
## Model Architecture
PhilBERT leverages **DistilBERT**, a distilled version of **BERT** that retains the same general architecture with **40% fewer parameters**, making it lightweight while preserving high accuracy. The **final model** includes:
- **Tokenizer**: Trained to handle phishing-specific patterns (URLs, obfuscation, domain misspellings).
- **Custom Classifier**: A **fully connected dense layer** added for binary classification (phishing vs. benign).
- **Risk Scoring Mechanism**: A **weighted confidence score** applied to improve detection reliability (a minimal sketch follows this list).
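The published weights bundle these components; for illustration only, here is a minimal sketch of what the classification head and risk score could look like, assuming a standard `DistilBertForSequenceClassification` head. The `risk_score` function and its `channel_weight` parameter are hypothetical, since the actual weighting scheme is not documented here.
```python
import torch
from transformers import DistilBertForSequenceClassification

# Standard DistilBERT encoder with a fully connected head for
# binary phishing-vs-benign classification, as described above.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def risk_score(logits: torch.Tensor, channel_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical weighted confidence score: scales the phishing-class
    probability by a per-channel weight (1.0 is a placeholder)."""
    probs = torch.nn.functional.softmax(logits, dim=-1)
    return channel_weight * probs[..., 1]
```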
---
## Data Preprocessing
Before fine-tuning, the dataset underwent **extensive preprocessing** to ensure balance and quality (a sketch of these steps follows the list):
- **Duplicate Removal & Balancing**: Maintained a **near 50-50 phishing-to-benign ratio** to prevent model bias.
- **Feature Extraction**: Applied to **URLs, HTML, email bodies, and SMS content** to enrich input representations.
- **Dataset Split**: Final dataset included:
- **427,028 benign URLs** & **381,014 phishing URLs**
- **17,536 unique email samples**
- **5,949 SMS samples**
- **Web entries filtered for efficiency** (removing entries >100KB).
- **Export Format**: Data transformed and stored in **JSON for efficient training**.
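The preprocessing pipeline itself is not published here. The sketch below illustrates the deduplication, balancing, and JSON-export steps under the assumption of a pandas DataFrame with `text` and `label` columns; file names and column names are illustrative only.
```python
import pandas as pd

# Assumed input: a DataFrame with "text" and "label" columns
# (1 = phishing, 0 = benign); names and paths are hypothetical.
df = pd.read_csv("combined_dataset.csv")

# Duplicate removal
df = df.drop_duplicates(subset="text")

# Downsample the majority class toward a near 50-50 ratio
n = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n, random_state=42))
)

# Export to JSON for training (one record per line)
balanced.to_json("train.json", orient="records", lines=True)
```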
---
## Training & Evaluation
PhilBERT was fine-tuned on **multi-modal phishing datasets** using **transfer learning** (a minimal training sketch follows the metrics below), achieving:
| **Metric** | **Value** |
|---------------------|------------|
| Accuracy | **88.77%** |
| Precision | **85.22%** |
| Recall | **93.81%** |
| F1-Score | **89.31%** |
| Evaluation Runtime | **130.46s** |
| Samples/sec | **58.701** |
- **False Positive Reduction**: Multi-layered filtering minimized false positives while maintaining **high recall**.
- **Scalability**: Successfully stress-tested for **up to 25,000 simultaneous users**.
- **Compliance**: Meets **ISO 27001** and **GDPR standards** for security and privacy.
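The training scripts are not part of this card. Assuming the JSON files produced by the preprocessing step above feed a standard Hugging Face `Trainer`, a minimal transfer-learning setup could look like the following; file names and hyperparameters are illustrative, not the values actually used.
```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical JSON splits produced by the preprocessing step
data = load_dataset("json", data_files={"train": "train.json",
                                        "test": "test.json"})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    # "text" is an assumed column name
    return tokenizer(batch["text"], truncation=True, padding="max_length")

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="philbert",          # illustrative values only
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
```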
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace with the actual model repository ID
model_name = "your_username/PhilBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input: a suspicious message containing a link
text = "Click this link to update your bank details: http://fakebank.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities (index 1 = phishing)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Phishing probability: {predictions[0][1].item():.4f}")
```
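Because PhilBERT targets multiple channels, heterogeneous inputs (email text, SMS, raw URLs) can be scored in a single batch. Continuing from the example above (the sample messages are invented):
```python
# Batch inference across channels: email snippet, SMS, raw URL
samples = [
    "Your account has been suspended. Verify at http://secure-login.example",
    "URGENT: reply with your PIN to unlock your card",
    "http://paypa1-security.example/login",
]
batch = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    probs = torch.nn.functional.softmax(model(**batch).logits, dim=-1)
for text, p in zip(samples, probs[:, 1]):
    print(f"{p.item():.3f}  {text[:60]}")
```
Padding to the longest sequence in the batch keeps the call efficient for mixed-length inputs.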
---
## License
This model is **proprietary** and protected under a **custom license**. Please refer to the **[LICENSE](LICENSE)** file for terms of use.
---