Alekla0126 committed
Commit 841c9c1 · verified · 1 Parent(s): caaebbd

Update README.md

Files changed (1)
  1. README.md +95 -3
README.md CHANGED
@@ -1,5 +1,97 @@
  ---
- license: other
- license_name: proprietarylicense
- license_link: LICENSE
  ---
+ # PhilBERT: Phishing Detection with DistilBERT
+
+ PhilBERT is a **fine-tuned DistilBERT model** optimized for detecting phishing threats across multiple communication channels, including **emails, SMS, URLs, and websites**. It is trained on a diverse dataset sourced from **Kaggle, Mendeley, Phishing.Database, and Bancolombia**, ensuring high adaptability and real-world applicability.
+
  ---
+
+ ## Key Features
+
+ - **Multi-Channel Detection** – Analyzes text, URLs, and web content to detect phishing patterns (see the sketch after this list).
+ - **Fine-Tuned on Real-World Data** – Includes **three months of recent financial institution data (Bancolombia)**.
+ - **Lightweight & Efficient** – Based on **DistilBERT**, providing high performance at reduced computational cost.
+ - **High Accuracy** – Achieves **85.22% precision, 93.81% recall, and 88.77% accuracy** on unseen data.
+ - **Self-Adaptive Learning** – Continuously evolves using real-time phishing simulations generated with **GPT-4o**.
+ - **Scalability** – Designed to support **7,000–25,000 simultaneous users** in production environments.
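+
+ The multi-channel claim simply means the same classifier scores text from any channel. A minimal sketch of batch scoring an email, an SMS, and a bare URL, assuming the placeholder repository id from the Usage section below and that label index 1 is the phishing class:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_name = "your_username/PhilBERT"  # placeholder repo id, as in the Usage section
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # One sample per channel: email body, SMS text, and a bare URL.
+ samples = [
+     "Your account has been suspended. Verify your credentials at http://secure-login.example",
+     "URGENT: your parcel is on hold. Pay the release fee here: http://bit.ly/xyz",
+     "http://paypa1-security-update.example/login",
+ ]
+
+ # Tokenize as one padded batch and score all channels in a single pass.
+ inputs = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
+ with torch.no_grad():
+     probs = torch.softmax(model(**inputs).logits, dim=-1)
+
+ for text, p in zip(samples, probs):
+     print(f"{p[1].item():.3f}  {text[:60]}")  # assumes index 1 = phishing
+ ```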
+
+ ---
+
+ ## Model Architecture
+
+ PhilBERT leverages **DistilBERT**, a distilled version of **BERT** with roughly **40% fewer parameters**, making it lightweight while preserving high accuracy. The **final model** includes:
+
+ - **Tokenizer**: Trained to recognize phishing-specific patterns (URLs, obfuscation, domain misspellings).
+ - **Custom Classifier**: A **fully connected dense layer** added for binary classification (phishing vs. benign).
+ - **Risk Scoring Mechanism**: A **weighted confidence score** applied to enhance detection reliability (see the sketch after this list).
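+
+ The exact weighting scheme is not documented in this README; the sketch below only illustrates how a weighted confidence score could sit on top of the classifier output, with hypothetical per-channel weights and the assumption that label index 1 is the phishing class:
+
+ ```python
+ import torch
+
+ # Hypothetical channel weights; the weighting actually used by PhilBERT is not
+ # documented here, so these values are illustrative only.
+ CHANNEL_WEIGHTS = {"email": 1.0, "sms": 0.9, "url": 1.1}
+
+ def risk_score(logits: torch.Tensor, channel: str) -> float:
+     """Map classifier logits to a weighted risk score in [0, 1]."""
+     phishing_prob = torch.softmax(logits, dim=-1)[0, 1].item()  # assumes index 1 = phishing
+     weighted = phishing_prob * CHANNEL_WEIGHTS.get(channel, 1.0)
+     return min(weighted, 1.0)  # clamp so the score stays in [0, 1]
+
+ # Example with logits taken straight from `model(**inputs).logits` for an SMS input.
+ example_logits = torch.tensor([[-1.2, 2.3]])
+ print(risk_score(example_logits, "sms"))  # ≈ 0.87 with these made-up logits
+ ```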
+
+ ---
+
+ ## Data Preprocessing
+
+ Before fine-tuning, the dataset underwent **extensive preprocessing** to ensure balance and quality (a sketch of the pipeline follows this list):
+
+ - **Duplicate Removal & Balancing**: Maintained a **near 50-50 phishing-to-benign ratio** to prevent model bias.
+ - **Feature Extraction**: Applied to **URLs, HTML, email bodies, and SMS content** to enrich input representations.
+ - **Dataset Split**: The final dataset included:
+   - **427,028 benign URLs** and **381,014 phishing URLs**
+   - **17,536 unique email samples**
+   - **5,949 SMS samples**
+ - **Web Filtering**: Web entries larger than **100 KB** were removed for efficiency.
+ - **Export Format**: Data transformed and stored as **JSON** for efficient training.
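+
+ The steps above map onto a simple offline pipeline. A minimal sketch, assuming hypothetical records with `text`, `label`, and `source` fields (the real field names and file layout are not documented in this README):
+
+ ```python
+ import json
+ import random
+
+ MAX_WEB_BYTES = 100 * 1024  # web entries larger than ~100 KB are dropped
+
+ def preprocess(records):
+     """Deduplicate, filter oversized web entries, and balance classes ~50-50."""
+     seen, unique = set(), []
+     for rec in records:                      # duplicate removal keyed on the raw text
+         if rec["text"] not in seen:
+             seen.add(rec["text"])
+             unique.append(rec)
+
+     unique = [r for r in unique              # drop oversized web entries for efficiency
+               if r.get("source") != "web"
+               or len(r["text"].encode("utf-8")) <= MAX_WEB_BYTES]
+
+     phishing = [r for r in unique if r["label"] == 1]
+     benign = [r for r in unique if r["label"] == 0]
+     n = min(len(phishing), len(benign))      # downsample the majority class to ~50-50
+     balanced = random.sample(phishing, n) + random.sample(benign, n)
+     random.shuffle(balanced)
+     return balanced
+
+ records = [
+     {"text": "Verify your account now", "label": 1, "source": "email"},
+     {"text": "Verify your account now", "label": 1, "source": "email"},  # duplicate
+     {"text": "Team lunch at noon", "label": 0, "source": "sms"},
+     {"text": "Quarterly report attached", "label": 0, "source": "email"},
+ ]
+
+ with open("train.json", "w") as f:           # export as JSON, as described above
+     json.dump(preprocess(records), f)
+ ```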
+
+ ---
+
+ ## Training & Evaluation
+
+ PhilBERT was fine-tuned on **multi-modal phishing datasets** using **transfer learning**, achieving the following results (a training sketch follows the table):
+
+ | **Metric**         | **Value**    |
+ |--------------------|--------------|
+ | Accuracy           | **88.77%**   |
+ | Precision          | **85.22%**   |
+ | Recall             | **93.81%**   |
+ | F1-Score           | **89.31%**   |
+ | Evaluation runtime | **130.46 s** |
+ | Samples/second     | **58.701**   |
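+
+ A minimal sketch of how such a transfer-learning run could be set up with the Hugging Face `Trainer`, starting from the DistilBERT base checkpoint; the hyperparameters and dataset variables are illustrative rather than the ones actually used for PhilBERT, and `scikit-learn` is needed in addition to the packages listed under Installation:
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+ from transformers import (AutoModelForSequenceClassification, Trainer,
+                           TrainingArguments)
+
+ # Transfer learning: start from the pretrained DistilBERT weights and add a
+ # two-class classification head.
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "distilbert-base-uncased", num_labels=2)
+
+ def compute_metrics(eval_pred):
+     """Report the same metrics as the table above."""
+     logits, labels = eval_pred
+     preds = np.argmax(logits, axis=-1)
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         labels, preds, average="binary")
+     return {"accuracy": accuracy_score(labels, preds),
+             "precision": precision, "recall": recall, "f1": f1}
+
+ args = TrainingArguments(output_dir="philbert", num_train_epochs=3,
+                          per_device_train_batch_size=32)
+
+ # `train_ds` and `eval_ds` would be tokenized datasets built from the
+ # preprocessed JSON described above.
+ # trainer = Trainer(model=model, args=args, train_dataset=train_ds,
+ #                   eval_dataset=eval_ds, compute_metrics=compute_metrics)
+ # trainer.train(); trainer.evaluate()
+ ```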
+
+ - **False Positive Reduction**: Multi-layered filtering minimized false positives while maintaining **high recall**.
+ - **Scalability**: Successfully stress-tested with **up to 25,000 simultaneous users**.
+ - **Compliance**: Meets **ISO 27001** and **GDPR** requirements for security and privacy.
+
+ ---
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch
+ ```
+
+ ### Inference
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Replace with the actual model repository id.
+ model_name = "your_username/PhilBERT"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ text = "Click this link to update your bank details: http://fakebank.com"
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Run the classifier and convert the logits to class probabilities.
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+
+ # Index 1 corresponds to the phishing class.
+ print(f"Phishing probability: {predictions[0][1].item():.4f}")
+ ```
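+
+ For quick experiments the same checkpoint can also be loaded through the `pipeline` API. The label strings it returns depend on the `id2label` mapping stored in the model config, so the output shown in the comment is only an assumption:
+
+ ```python
+ from transformers import pipeline
+
+ # "your_username/PhilBERT" is the same placeholder repository id used above.
+ classifier = pipeline("text-classification", model="your_username/PhilBERT")
+
+ result = classifier("Your mailbox is full, log in here to keep receiving mail: http://mail-fix.example")
+ print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.97}] if no custom label names are set
+ ```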
+
+ ---
+
+ ## License
+
+ This model is **proprietary** and distributed under a **custom license**. Please refer to the **[LICENSE](LICENSE)** file for the terms of use.
+
  ---