Typosquat CE detector
Model Details
Model Description
This model is a cross-encoder fine-tuned for binary classification to detect typosquatting domain names, leveraging the CANINE-c transformer model. The model can be used to classify whether a domain name is a typographical variant (typosquat) of another domain.
- Developed by: Anvilogic
- Model type: Cross-encoder binary classification
- Maximum Sequence Length: 512 tokens
- Language(s) (NLP): Multilingual
- License: MIT
- Finetuned from model: google/canine-c
Usage
Direct Usage (Sentence Transformers)
This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by analyzing a domain name's similarity to a legitimate one.
The following code loads the model and scores a pair of domain names:
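from sentence_transformers import CrossEncoder
# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
# Score one domain pair: here a legitimate domain and a look-alike ("rn" imitating "m")
result = model.predict([("example.com", "exarnple.com")])
print(result)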
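The scores returned by `predict` still need to be turned into a decision. The sketch below shows one way to do that, assuming one score per pair in the range 0 to 1 and a cut-off of 0.5; `flag_typosquats` is a hypothetical helper, and both the threshold and the example scores are illustrative values rather than model outputs.

```python
from typing import List, Sequence, Tuple


def flag_typosquats(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off; tune it on your own validation data
) -> List[Tuple[str, str, float]]:
    """Return (first domain, second domain, score) for pairs scoring at or above the threshold."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return [
        (first, second, float(score))
        for (first, second), score in zip(pairs, scores)
        if score >= threshold
    ]


# Illustrative scores only; in practice pass the output of model.predict(pairs).
pairs = [("example.com", "exarnple.com"), ("example.com", "wikipedia.org")]
flagged = flag_typosquats(pairs, [0.97, 0.02])
print(flagged)
```

A higher threshold trades recall for precision, which may suit alerting pipelines where false positives are costly.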
Downstream Usage
This model can be used with an embedding model to enhance typosquatting detection. First, an embedding model retrieves similar domains from a legitimate database. Then, the cross-encoder labels these pairs, confirming if a domain is a typosquat and identifying its original source.
For embedding, consider using: Anvilogic/Embedder-typosquat-detect
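The retrieve-then-classify flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve_candidates`, `detect_typosquat`, and the `toy_*` functions are hypothetical names, and a `difflib` string-similarity ratio stands in for both the embedding model and the cross-encoder so that the example runs without downloading any model. The (legitimate, suspect) pair order is also an assumption.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence, Tuple

Similarity = Callable[[str, str], float]
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def retrieve_candidates(
    query: str, legit_domains: Sequence[str], similarity: Similarity, top_k: int = 3
) -> List[str]:
    """Stage 1: return the top_k legitimate domains most similar to the query."""
    ranked = sorted(legit_domains, key=lambda d: similarity(query, d), reverse=True)
    return ranked[:top_k]


def detect_typosquat(
    query: str,
    legit_domains: Sequence[str],
    similarity: Similarity,
    pair_scorer: PairScorer,
    top_k: int = 3,
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
) -> Optional[Tuple[str, float]]:
    """Stage 2: score (legitimate, query) pairs; return the best match or None."""
    candidates = [
        d for d in retrieve_candidates(query, legit_domains, similarity, top_k)
        if d != query  # an exact match is the legitimate domain itself
    ]
    if not candidates:
        return None
    scores = pair_scorer([(d, query) for d in candidates])
    best_score, best_domain = max(zip(scores, candidates), key=lambda sc: sc[0])
    return (best_domain, float(best_score)) if best_score >= threshold else None


# Stand-ins so the sketch is self-contained; replace with real models in practice.
def toy_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def toy_pair_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    return [toy_similarity(a, b) for a, b in pairs]


legit = ["example.com", "google.com", "anvilogic.com"]
hit = detect_typosquat("exarnple.com", legit, toy_similarity, toy_pair_scorer, threshold=0.8)
miss = detect_typosquat("wikipedia.org", legit, toy_similarity, toy_pair_scorer, threshold=0.8)
print(hit, miss)
```

In a real deployment, `toy_similarity` would be replaced by a similarity search over embeddings of the legitimate domains, and `toy_pair_scorer` by the cross-encoder's `predict` method. Keeping both as injected callables makes the pipeline easy to test without loading either model.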
Bias, Risks, and Limitations
Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.
Training Details
Framework Versions
- Python: 3.10.14
- Sentence Transformers: 3.2.1
- Transformers: 4.46.2
- PyTorch: 2.2.2
- Tokenizers: 0.20.3
Training Data
The model was fine-tuned using Anvilogic/CE-Typosquat-Training-Dataset, which contains pairs of domain names and their similarity labels. The dataset was filtered and converted to the parquet format for efficient processing.
Training Procedure
The model was optimized using the binary cross-entropy loss function with logits, nn.BCEWithLogitsLoss().
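For reference, `nn.BCEWithLogitsLoss` applies a sigmoid to the raw model output (the logit) and computes binary cross-entropy in a single, numerically stable step. The pure-Python sketch below illustrates the per-example formula only; `bce_with_logits` is a hypothetical helper and is not part of the training code.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Per-example binary cross-entropy on a raw logit.

    Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))],
    rewritten as max(x, 0) - x * y + log(1 + exp(-|x|)) to avoid overflow.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# A logit of 0 means the model is undecided, so the loss is log(2) for either label.
print(bce_with_logits(0.0, 1.0))
# A confident correct prediction gives a small loss, a confident wrong one a large loss.
print(bce_with_logits(10.0, 1.0), bce_with_logits(-10.0, 1.0))
```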
Training Hyperparameters
- Model Architecture: Cross-encoder fine-tuned from canine-c
- Batch Size: 64
- Epochs: 3
- Learning Rate: 2e-5
- Warmup Steps: 100
Evaluation
In the final evaluation after training, the model achieved the following metrics on the test set:
CE Binary Classification Evaluator
- Accuracy: 0.9740
- F1 Score: 0.9737
- Precision: 0.9836
- Recall: 0.964
- Average Precision: 0.9969
With a precision of 0.9836 and a recall of 0.964 on the test set, the model rarely flags non-typosquat pairs and catches most typosquats, which makes it well-suited for cybersecurity applications.
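As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is hypothetical and used only for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation results reported above.
f1 = f1_from_precision_recall(0.9836, 0.964)
print(round(f1, 4))  # 0.9737
```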