PhilBERT: Phishing Detection with DistilBERT
PhilBERT is a fine-tuned DistilBERT model optimized for detecting phishing threats across multiple communication channels, including emails, SMS, URLs, and websites. It is trained on a diverse dataset sourced from Kaggle, Mendeley, Phishing.Database, and Bancolombia, chosen to improve adaptability and real-world applicability.
Key Features
- Multi-Channel Detection: Analyzes text, URLs, and web content to detect phishing patterns.
- Fine-Tuned on Real-World Data: Includes the most recent three months of financial-institution data from Bancolombia.
- Lightweight & Efficient: Based on DistilBERT, providing high performance at reduced computational cost.
- High Accuracy: Achieves 85.22% precision, 93.81% recall, and 88.77% accuracy on unseen data.
- Self-Adaptive Learning: Continuously evolves using real-time phishing simulations generated with GPT-4o.
- Scalability: Designed to support 7,000 to 25,000 simultaneous users in production environments.
Model Architecture
PhilBERT builds on DistilBERT, a distilled version of BERT that keeps the same general Transformer architecture with half the layers and about 40% fewer parameters, making it lightweight while preserving most of BERT's accuracy. The final model includes:
- Tokenizer: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
- Custom Classifier: A fully connected dense layer added for binary classification (phishing vs. benign).
- Risk Scoring Mechanism: A weighted confidence score applied to enhance detection reliability.
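The card does not say how the weighted confidence score is computed. As one possible reading, the sketch below converts each channel's two classifier logits into a phishing probability with softmax and then takes a weighted average across channels. The channel weights, the `risk_score` and `risk_band` helpers, and the band thresholds are all hypothetical and only illustrate the idea.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-channel weights; the real values are not published.
CHANNEL_WEIGHTS = {"url": 0.5, "email": 0.3, "sms": 0.2}

def risk_score(channel_logits, weights=CHANNEL_WEIGHTS):
    """Weighted average of per-channel phishing probabilities.

    channel_logits maps a channel name to [benign_logit, phishing_logit].
    Only channels present in the input contribute; their weights are
    renormalized so the result stays in [0, 1].
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for channel, logits in channel_logits.items():
        p_phishing = softmax(logits)[1]
        weight = weights[channel]
        weighted_sum += weight * p_phishing
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no channels to score")
    return weighted_sum / weight_total

def risk_band(score, review_at=0.5, block_at=0.9):
    """Map a score to a coarse action; the thresholds are assumptions."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "allow"

# Example: a message whose URL looks suspicious but whose body is ambiguous.
example = {"url": [-2.0, 3.0], "email": [0.5, 0.1]}
score = risk_score(example)
print(f"risk={score:.3f} -> {risk_band(score)}")
```

Renormalizing over the channels actually present keeps the score comparable between, say, a bare URL and a full email containing links.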
Data Preprocessing
Before fine-tuning, the dataset underwent extensive preprocessing to ensure balance and quality:
- Duplicate Removal & Balancing: Maintained a near 50-50 phishing-to-benign ratio to prevent model bias.
- Feature Extraction: Applied to URLs, HTML, email bodies, and SMS content to enrich input representations.
- Dataset Composition: The final dataset included:
  - 427,028 benign URLs and 381,014 phishing URLs
  - 17,536 unique email samples
  - 5,949 SMS samples
  - Web entries, filtered for efficiency by removing entries larger than 100 KB
- Export Format: Data transformed and stored in JSON for efficient training.
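The preprocessing steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the pipeline actually used: the record fields (`text`, `label`, `channel`), the helper names, and the lowercase-and-strip duplicate key are hypothetical, and the sketch downsamples to an exact 50-50 ratio whereas the card reports only a near 50-50 ratio.

```python
import json
import random

# The card's ">100KB" cutoff, read here as 100 KiB (an assumption).
MAX_WEB_BYTES = 100 * 1024

def deduplicate(samples):
    """Keep the first record for each normalized text, drop later repeats."""
    seen = set()
    unique = []
    for sample in samples:
        key = sample["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

def drop_large_web_entries(samples, max_bytes=MAX_WEB_BYTES):
    """Remove web records whose UTF-8 size exceeds max_bytes."""
    return [
        s for s in samples
        if not (s.get("channel") == "web"
                and len(s["text"].encode("utf-8")) > max_bytes)
    ]

def balance(samples, seed=0):
    """Downsample the majority class so both labels have equal counts."""
    rng = random.Random(seed)
    phishing = [s for s in samples if s["label"] == 1]
    benign = [s for s in samples if s["label"] == 0]
    n = min(len(phishing), len(benign))
    balanced = rng.sample(phishing, n) + rng.sample(benign, n)
    rng.shuffle(balanced)
    return balanced

def preprocess(samples, seed=0):
    """Deduplicate, filter oversized web entries, then balance classes."""
    return balance(drop_large_web_entries(deduplicate(samples)), seed=seed)

raw = [
    {"text": "Verify your account at http://bad.example", "label": 1, "channel": "sms"},
    # Duplicate of the first record after normalization.
    {"text": "verify your account at http://bad.example ", "label": 1, "channel": "sms"},
    {"text": "Lunch at noon?", "label": 0, "channel": "sms"},
    {"text": "Your invoice is attached.", "label": 0, "channel": "email"},
    # Oversized web page that the size filter removes.
    {"text": "x" * (MAX_WEB_BYTES + 1), "label": 0, "channel": "web"},
]

clean = preprocess(raw)
payload = json.dumps(clean, ensure_ascii=False)  # JSON export for training
print(len(clean), "records exported")
```

Seeding the sampler makes the balanced subset reproducible across runs, which helps when comparing fine-tuning experiments.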
Training & Evaluation
PhilBERT was fine-tuned with transfer learning on the multi-channel phishing datasets described above, achieving the following results on unseen data:
Metric | Value |
---|---|
Accuracy | 88.77% |
Precision | 85.22% |
Recall | 93.81% |
F1-Score | 89.31% |
Evaluation Runtime | 130.46s |
Samples/sec | 58.701 |
- False Positive Reduction: Multi-layered filtering minimized false positives while maintaining high recall.
- Scalability: Successfully stress-tested for up to 25,000 simultaneous users.
- Compliance: Meets ISO 27001 and GDPR standards for security and privacy.
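As a quick consistency check on the table above, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet also multiplies the reported runtime by the reported throughput to estimate the evaluation-set size. The card does not state that size, so the figure is only an inference from the two numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported in the table.
precision, recall = 0.8522, 0.9381
f1 = f1_score(precision, recall)
print(f"F1: {f1:.2%}")  # prints "F1: 89.31%"

# Runtime (s) times throughput (samples/s) approximates the sample count.
runtime_s, samples_per_s = 130.46, 58.701
approx_samples = round(runtime_s * samples_per_s)
print(f"Approximate evaluation samples: {approx_samples}")  # 7658
```

Recall being higher than precision means the model leans toward flagging messages: it misses few phishing attempts at the cost of more false alarms.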
Usage
Installation
pip install transformers torch
Inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "your_username/PhilBERT"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Click this link to update your bank details: http://fakebank.com"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)  # DistilBERT accepts at most 512 tokens

with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Phishing probability: {predictions[0][1].item():.4f}")  # assumes index 1 is the phishing class; check model.config.id2label
License
This model is proprietary and protected under a custom license. Please refer to the LICENSE file for terms of use.