PhilBERT: Phishing Detection with DistilBERT

PhilBERT is a DistilBERT model fine-tuned to detect phishing across multiple communication channels: emails, SMS, URLs, and websites. Its training data combines sources from Kaggle, Mendeley, Phishing.Database, and Bancolombia, so it covers both public corpora and real-world financial-sector traffic.


Key Features

  • Multi-Channel Detection – Analyzes text, URLs, and web content to detect phishing patterns.
  • Fine-Tuned on Real-World Data – Includes the most recent three months of data from a financial institution (Bancolombia).
  • Lightweight & Efficient – Based on DistilBERT, providing high performance with reduced computational costs.
  • High Accuracy – Achieves 85.22% precision, 93.81% recall, and 88.77% accuracy on unseen data.
  • Self-Adaptive Learning – Continuously evolves using real-time phishing simulations generated with GPT-4o.
  • Scalability – Designed to support 7,000–25,000 simultaneous users in production environments.

Model Architecture

PhilBERT builds on DistilBERT, a distilled version of BERT that keeps the same general architecture with half the transformer layers and about 40% fewer parameters, making it lightweight while retaining most of BERT's accuracy. The final model includes:

  • Tokenizer: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
  • Custom Classifier: A fully connected dense layer added for binary classification (phishing vs. benign).
  • Risk Scoring Mechanism: A weighted confidence score applied to enhance detection reliability.
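
The card does not say how the risk score is weighted. As a minimal sketch only, assuming the score is a weighted average of per-channel phishing probabilities, it could look like the following. The channel names, the weights, and the risk_score function are hypothetical and are not taken from the PhilBERT implementation.

```python
# Hypothetical per-channel weights; the model card does not publish the real ones.
CHANNEL_WEIGHTS = {"url": 0.5, "email": 0.3, "sms": 0.2}


def risk_score(probabilities, weights=CHANNEL_WEIGHTS):
    """Weighted average of phishing probabilities over the channels that were scored.

    probabilities maps a channel name to the classifier's phishing probability.
    Channels without a probability are ignored and the remaining weights are
    renormalised so the result stays in [0, 1].
    """
    used = {ch: w for ch, w in weights.items() if ch in probabilities}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no scored channel has a weight")
    return sum(probabilities[ch] * w for ch, w in used.items()) / total
```

Renormalising over the channels that are present keeps a message with only a URL and an email body comparable to one that was scored on all channels.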

Data Preprocessing

Before fine-tuning, the dataset underwent extensive preprocessing to ensure balance and quality:

  • Duplicate Removal & Balancing: Maintained a near 50-50 phishing-to-benign ratio to prevent model bias.
  • Feature Extraction: Applied to URLs, HTML, email bodies, and SMS content to enrich input representations.
  • Dataset Composition: The final dataset included:
    • 427,028 benign URLs & 381,014 phishing URLs
    • 17,536 unique email samples
    • 5,949 SMS samples
    • Web entries filtered for efficiency (removing entries >100KB).
  • Export Format: Data transformed and stored in JSON for efficient training.
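
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the record layout ('text' and 'label' keys), the function names, the reading of "100KB" as 100 × 1024 bytes, and the choice to balance by downsampling the majority class are all assumptions. The card reports only a near 50-50 ratio, whereas this sketch balances exactly.

```python
import json
import random

MAX_BYTES = 100 * 1024  # assumed reading of the card's ">100KB" cut-off


def preprocess(records, seed=0):
    """Deduplicate, drop oversized entries, and balance the two classes.

    records: iterable of dicts with 'text' (str) and 'label' (1 = phishing, 0 = benign).
    """
    # 1. Drop exact duplicates and entries larger than the size cut-off.
    seen, kept = set(), []
    for rec in records:
        text = rec["text"]
        if text in seen or len(text.encode("utf-8")) > MAX_BYTES:
            continue
        seen.add(text)
        kept.append(rec)

    # 2. Downsample the majority class to the size of the minority class.
    rng = random.Random(seed)
    phishing = [r for r in kept if r["label"] == 1]
    benign = [r for r in kept if r["label"] == 0]
    n = min(len(phishing), len(benign))
    balanced = rng.sample(phishing, n) + rng.sample(benign, n)
    rng.shuffle(balanced)
    return balanced


def export_json(records, path):
    """Write the processed records to a single JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False)
```

For reference, the URL counts listed above (427,028 benign and 381,014 phishing) correspond to roughly a 53/47 split before any further balancing.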

Training & Evaluation

PhilBERT was fine-tuned on multi-channel phishing datasets using transfer learning. On unseen data it achieved:

Metric               Value
Accuracy             88.77%
Precision            85.22%
Recall               93.81%
F1-Score             89.31%
Evaluation Runtime   130.46 s
Samples/sec          58.701

  • False Positive Reduction: Multi-layered filtering minimized false positives while maintaining high recall.
  • Scalability: Successfully stress-tested for up to 25,000 simultaneous users.
  • Compliance: Meets ISO 27001 and GDPR standards for security and privacy.
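
As a quick consistency check on the reported metrics, the F1-score can be recomputed from precision and recall as their harmonic mean. The evaluation-set size is not stated in the card; the estimate below is an inference that assumes the runtime and throughput figures describe the same evaluation run.

```python
# Reported values from the evaluation table.
precision = 0.8522
recall = 0.9381

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # F1 = 89.31%

# Inferred, not reported: runtime multiplied by throughput approximates the
# number of evaluated samples.
runtime_s = 130.46
samples_per_s = 58.701
estimated_samples = runtime_s * samples_per_s
print(f"Estimated evaluation samples = {estimated_samples:.0f}")
```

The recomputed F1 agrees with the reported 89.31%, so the three headline metrics are mutually consistent.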

Usage

Installation

pip install transformers torch

Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

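# Placeholder repository ID; replace it with the actual model path.
model_name = "your_username/PhilBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Click this link to update your bank details: http://fakebank.com"
# Truncate to DistilBERT's 512-token limit so long emails or pages do not raise an error.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Disable gradient tracking for inference.
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Index 1 is the phishing class; index 0 is benign.
print(f"Phishing probability: {predictions[0][1].item():.4f}")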
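
To make the last step concrete, here is a dependency-free sketch of what the softmax does to the two logits and of how the resulting probability might be turned into a label. The classify helper and the 0.5 threshold are illustrative assumptions; the card does not specify a decision threshold.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold=0.5):
    """Return (label, phishing_probability) for a [benign, phishing] logit pair.

    threshold is a hypothetical cut-off; raising it trades recall for precision.
    """
    p_phishing = softmax(logits)[1]
    label = "phishing" if p_phishing >= threshold else "benign"
    return label, p_phishing
```

Given the reported balance of higher recall than precision, a deployment that is sensitive to false positives could experiment with a threshold above 0.5.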

License

This model is proprietary and protected under a custom license. Please refer to the LICENSE file for terms of use.

