PhilBERT: Phishing Detection with DistilBERT

PhilBERT is a fine-tuned DistilBERT model for detecting phishing across multiple communication channels, including emails, SMS, URLs, and websites. It is trained on a diverse dataset drawn from Kaggle, Mendeley, Phishing.Database, and Bancolombia, which is intended to make it adaptable to real-world traffic.


Key Features

  • Multi-Channel Detection – Analyzes text, URLs, and web content to detect phishing patterns.
  • Fine-Tuned on Real-World Data – Includes the most recent three months of data from a financial institution (Bancolombia).
  • Lightweight & Efficient – Based on DistilBERT, providing high performance with reduced computational costs.
  • High Accuracy – Achieves 85.22% precision, 93.81% recall, and 88.77% accuracy on unseen data.
  • Self-Adaptive Learning – Continuously evolves using real-time phishing simulations generated with GPT-4o.
  • Scalability – Designed to support 7,000–25,000 simultaneous users in production environments.

Model Architecture

PhilBERT builds on DistilBERT, a distilled version of BERT that keeps the same general architecture with half the transformer layers and about 40% fewer parameters, making it lightweight while preserving most of BERT's accuracy. The final model includes:

  • Tokenizer: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
  • Custom Classifier: A fully connected dense layer added for binary classification (phishing vs. benign).
  • Risk Scoring Mechanism: A weighted confidence score applied to enhance detection reliability.
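
As an illustration of the last two components, the sketch below shows what a dense classification head followed by a softmax computes, and one possible form a weighted confidence score could take. It is not PhilBERT's actual implementation: the toy 4-dimensional embedding, the weight values, the function names, and the per-channel weighting scheme in `risk_score` are all assumptions made for the example.

```python
import math


def dense(features, weights, bias):
    """Fully connected layer: produces one logit per output class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def risk_score(channel_probs, channel_weights):
    """Weighted average of per-channel phishing probabilities.

    Hypothetical scheme: the README does not specify how the score is weighted.
    """
    total_weight = sum(channel_weights[c] for c in channel_probs)
    weighted = sum(channel_probs[c] * channel_weights[c] for c in channel_probs)
    return weighted / total_weight


# Toy 4-dimensional sentence embedding (DistilBERT base uses 768 dimensions).
features = [0.2, -1.0, 0.5, 0.3]
weights = [[0.1, 0.4, -0.2, 0.0],    # row 0: benign logit
           [-0.3, -0.5, 0.6, 0.2]]   # row 1: phishing logit
bias = [0.0, 0.1]

probs = softmax(dense(features, weights, bias))

# Combine scores from two channels of the same message (made-up weights).
score = risk_score({"url": 0.9, "email": 0.5}, {"url": 2.0, "email": 1.0})
```

Here `probs[1]` plays the role of the phishing probability, and `score` gives the URL signal twice the weight of the email-body signal.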

Data Preprocessing

Before fine-tuning, the dataset underwent extensive preprocessing to ensure balance and quality:

  • Duplicate Removal & Balancing: Maintained a near 50-50 phishing-to-benign ratio to prevent model bias.
  • Feature Extraction: Applied to URLs, HTML, email bodies, and SMS content to enrich input representations.
  • Dataset Composition: The final dataset included:
    • 427,028 benign URLs & 381,014 phishing URLs
    • 17,536 unique email samples
    • 5,949 SMS samples
    • Web entries, with entries larger than 100KB removed for efficiency.
  • Export Format: Data transformed and stored in JSON for efficient training.
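
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the pipeline actually used: the record keys (`text`, `channel`, `label`), the reading of 100KB as 100 × 1024 bytes, and balancing by exact downsampling (the real dataset is only near 50-50) are all choices made for the example.

```python
import json
import random

MAX_WEB_BYTES = 100 * 1024  # assumption: "100KB" read as 100 * 1024 bytes


def preprocess(records, seed=0):
    """Deduplicate, drop oversized web entries, and balance the two classes.

    Each record is a dict with hypothetical keys:
    "text", "channel" ("url", "email", "sms" or "web"), "label" (0 benign, 1 phishing).
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["channel"], rec["text"])
        if key in seen:
            continue  # exact duplicate within the same channel
        seen.add(key)
        size = len(rec["text"].encode("utf-8"))
        if rec["channel"] == "web" and size > MAX_WEB_BYTES:
            continue  # oversized web entry
        kept.append(rec)

    benign = [r for r in kept if r["label"] == 0]
    phishing = [r for r in kept if r["label"] == 1]
    n = min(len(benign), len(phishing))

    # Downsample the larger class so both classes have n examples.
    rng = random.Random(seed)
    balanced = rng.sample(benign, n) + rng.sample(phishing, n)
    rng.shuffle(balanced)
    return balanced


records = [
    {"text": "http://example.com/login", "channel": "url", "label": 0},
    {"text": "http://example.com/login", "channel": "url", "label": 0},  # duplicate
    {"text": "http://examp1e.com/verify", "channel": "url", "label": 1},
    {"text": "Your parcel is waiting, pay the fee here", "channel": "sms", "label": 1},
    {"text": "x" * (MAX_WEB_BYTES + 1), "channel": "web", "label": 0},   # too large
    {"text": "Meeting moved to 3 pm", "channel": "email", "label": 0},
]

balanced = preprocess(records)
exported = json.dumps(balanced)  # JSON export for training
```

In this toy run the duplicate URL and the oversized web page are dropped, leaving two benign and two phishing records.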

Training & Evaluation

PhilBERT was fine-tuned on multi-channel phishing datasets using transfer learning, achieving:

| Metric | Value |
| --- | --- |
| Accuracy | 88.77% |
| Precision | 85.22% |
| Recall | 93.81% |
| F1-Score | 89.31% |
| Evaluation Runtime | 130.46 s |
| Samples/sec | 58.701 |

  • False Positive Reduction: Multi-layered filtering minimized false positives while maintaining high recall.
  • Scalability: Successfully stress-tested for up to 25,000 simultaneous users.
  • Compliance: Meets ISO 27001 and GDPR standards for security and privacy.
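
For readers who want to see how the reported figures relate to each other, the sketch below recomputes the F1-score from the reported precision and recall and estimates the evaluation set size from the runtime and throughput. The confusion-matrix counts in the second half are made up for illustration; the README does not publish the actual counts.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1_from(precision, recall),
    }


# Reported precision and recall reproduce the reported F1-score (89.31%).
reported_f1 = f1_from(0.8522, 0.9381)
print(f"F1 from reported precision/recall: {reported_f1:.4f}")

# Runtime x throughput gives the approximate number of evaluated samples.
eval_samples = 130.46 * 58.701
print(f"Approximate evaluation samples: {eval_samples:.0f}")

# Hypothetical confusion matrix, for illustration only.
example = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

The recomputed F1 rounds to 0.8931, matching the table, and the runtime and throughput imply an evaluation set of roughly 7,658 samples.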

Usage

Installation

pip install transformers torch

Inference

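from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Placeholder repository ID; replace it with the model's actual Hub path.
model_name = "your_username/PhilBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Click this link to update your bank details: http://fakebank.com"
# Truncate to DistilBERT's 512-token limit so long emails or pages do not raise an error.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Disable gradient tracking for inference.
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Index 1 is assumed to be the phishing class; check model.config.id2label to confirm.
print(f"Phishing probability: {predictions[0][1].item():.4f}")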
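
In practice the phishing probability usually has to be turned into an action. The sketch below shows one way to do that with two cut-offs. The function name `triage`, the three outcomes, and the threshold values are assumptions chosen for the example, not part of PhilBERT; suitable thresholds depend on how a deployment trades false positives against missed phishing.

```python
def triage(phishing_probability, block_at=0.9, review_at=0.5):
    """Map a phishing probability to an action using two hypothetical thresholds."""
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if phishing_probability >= block_at:
        return "block"    # high confidence: reject outright
    if phishing_probability >= review_at:
        return "review"   # uncertain: send to an analyst or a secondary check
    return "allow"


decisions = [triage(p) for p in (0.97, 0.62, 0.08)]
print(decisions)
```

Lowering `review_at` catches more phishing at the cost of more messages to review; raising `block_at` reduces the chance of blocking legitimate messages.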

License

This model is proprietary and protected under a custom license. Please refer to the LICENSE file for terms of use.