# PhilBERT: Phishing Detection with DistilBERT
PhilBERT is a **fine-tuned DistilBERT model** optimized for detecting phishing threats across multiple communication channels, including **emails, SMS, URLs, and websites**. It is trained on a diverse dataset sourced from **Kaggle, Mendeley, Phishing.Database, and Bancolombia**, ensuring high adaptability and real-world applicability.
---
## Key Features
- **Multi-Channel Detection** – Analyzes text, URLs, and web content to detect phishing patterns.
- **Fine-Tuned on Real-World Data** – Trained on the most recent **three months of financial institution data (Bancolombia)**.
- **Lightweight & Efficient** – Based on **DistilBERT**, providing high performance with reduced computational costs.
- **High Accuracy** – Achieves **85.22% precision, 93.81% recall, and 88.77% accuracy** on unseen data.
- **Self-Adaptive Learning** – Continuously evolves using real-time phishing simulations generated with **GPT-4o**.
- **Scalability** – Designed to support **7,000–25,000 simultaneous users** in production environments.
---
## Model Architecture
PhilBERT leverages **DistilBERT**, a distilled version of **BERT**, maintaining the same architecture but with **40% fewer parameters**, making it lightweight while preserving high accuracy. The **final model** includes:
- **Tokenizer**: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
- **Custom Classifier**: A **fully connected dense layer** added for binary classification (phishing vs. benign).
- **Risk Scoring Mechanism**: A **weighted confidence score** applied to enhance detection reliability.
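
The risk-scoring step can be sketched as a small post-processing function. This is an illustrative assumption, not the model's actual mechanism: the `url_weight` value and the URL-based boost are hypothetical, and the real weighting scheme is not documented here.

```python
import math

def weighted_risk_score(logits, url_weight=1.2, has_url=False):
    """Turn the classifier's two logits (benign, phishing) into a
    weighted confidence score in [0, 1].

    `url_weight` and the URL-based boost are hypothetical parameters
    used only to illustrate the idea of a weighted confidence score.
    """
    exps = [math.exp(x) for x in logits]
    phishing_prob = exps[1] / sum(exps)  # softmax probability of "phishing"
    weight = url_weight if has_url else 1.0
    return min(phishing_prob * weight, 1.0)
```

Equal logits give a neutral 0.5 score; the boost matters mainly near the decision boundary, since the score is clamped at 1.0.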
---
## Data Preprocessing
Before fine-tuning, the dataset underwent **extensive preprocessing** to ensure balance and quality:
- **Duplicate Removal & Balancing**: Maintained a **near 50-50 phishing-to-benign ratio** to prevent model bias.
- **Feature Extraction**: Applied to **URLs, HTML, email bodies, and SMS content** to enrich input representations.
- **Dataset Split**: Final dataset included:
  - **427,028 benign URLs** & **381,014 phishing URLs**
  - **17,536 unique email samples**
  - **5,949 SMS samples**
  - **Web entries filtered for efficiency** (removing entries >100KB).
- **Export Format**: Data transformed and stored in **JSON for efficient training**.
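
The deduplication and balancing steps above can be sketched as follows; the `text`/`label` field names and the downsampling strategy are assumptions for illustration, not the actual pipeline.

```python
import random

def dedupe_and_balance(samples, seed=0):
    """Drop duplicate texts, then downsample the majority class so
    phishing (label 1) and benign (label 0) sit at a near 50-50 ratio.

    Field names `text` and `label` are assumed for illustration.
    """
    seen, unique = set(), []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            unique.append(s)
    phishing = [s for s in unique if s["label"] == 1]
    benign = [s for s in unique if s["label"] == 0]
    n = min(len(phishing), len(benign))
    rng = random.Random(seed)
    balanced = rng.sample(phishing, n) + rng.sample(benign, n)
    rng.shuffle(balanced)
    return balanced
```

The balanced list can then be serialized with `json.dump` for training, matching the JSON export step above.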
---
## Training & Evaluation
PhilBERT was fine-tuned on **multi-modal phishing datasets** using **transfer learning**, achieving:
| **Metric**         | **Value**   |
|--------------------|-------------|
| Accuracy           | **88.77%**  |
| Precision          | **85.22%**  |
| Recall             | **93.81%**  |
| F1-Score           | **89.31%**  |
| Evaluation Runtime | **130.46s** |
| Samples/sec        | **58.701**  |
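
The reported F1-Score is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# 2 * 0.8522 * 0.9381 / (0.8522 + 0.9381) ≈ 0.8931, matching the table.
```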
- **False Positive Reduction**: Multi-layered filtering minimized false positives while maintaining **high recall**.
- **Scalability**: Successfully stress-tested for **up to 25,000 simultaneous users**.
- **Compliance**: Meets **ISO 27001** and **GDPR standards** for security and privacy.
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "your_username/PhilBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Click this link to update your bank details: http://fakebank.com"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(f"Phishing probability: {predictions[0][1].item():.4f}")
```
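
To turn the probability into a decision, a cutoff can be applied; the 0.5 threshold below is an assumption for illustration, not a documented default of the model.

```python
def classify(phishing_prob, threshold=0.5):
    """Map a phishing probability to a label.

    The 0.5 threshold is an assumption; tune it to trade precision
    against recall for your deployment.
    """
    return "phishing" if phishing_prob >= threshold else "benign"
```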
---
## License
This model is **proprietary** and protected under a **custom license**. Please refer to the **[LICENSE](LICENSE)** file for terms of use.
---