Typosquat CE detector

Model Details

Model Description

This model is a cross-encoder, fine-tuned from the character-level CANINE-c transformer, that performs binary classification on pairs of domain names. Given two domains, it predicts whether one is a typographical variant (typosquat) of the other.

Developed by: Anvilogic
Model type: Cross-encoder binary classification
Maximum Sequence Length: 512 tokens
Language(s) (NLP): Multilingual
License: MIT
Finetuned from model: google/canine-c

Usage

Direct Usage (Sentence Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by scoring a domain name's similarity to a legitimate one.

The following code loads the model and scores a single domain pair:

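from sentence_transformers import CrossEncoder

# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")
# Score one domain pair; predict returns one score per input pair
result = model.predict([("example.com", "exarnple.com")])
print(result)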
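
The scores returned by `predict` still have to be turned into a decision. The sketch below shows one way to do that, assuming each score lies in [0, 1] and that a higher score means a more likely typosquat. The `flag_typosquats` helper, the 0.5 threshold, and the example scores are illustrative assumptions and do not come from the model card.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def flag_typosquats(
    pairs: Sequence[Pair],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off; tune it on your own validation data
) -> List[Tuple[Pair, float]]:
    """Keep the pairs whose score reaches the threshold (hypothetical helper)."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return [(pair, float(s)) for pair, s in zip(pairs, scores) if s >= threshold]


pairs = [("example.com", "exarnple.com"), ("example.com", "weather.org")]
scores = [0.98, 0.02]  # hypothetical scores, not real model output
flagged = flag_typosquats(pairs, scores)
print(flagged)
```

With the real model, `scores` would come from `model.predict(pairs)` instead of the hard-coded list.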

Downstream Usage

This model can be combined with an embedding model to build a two-stage typosquatting detector. First, the embedding model retrieves the most similar domains from a database of legitimate domains. Then, the cross-encoder scores each retrieved pair, confirming whether the domain is a typosquat and identifying the legitimate domain it imitates.

For embedding, consider using: Anvilogic/Embedder-typosquat-detect
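
The retrieve-then-rerank flow described above can be sketched as follows. This is a minimal, pure-Python illustration and not the authors' pipeline: `embed` and `score_pairs` are hypothetical callables standing in for the embedding model and for the cross-encoder's `predict`, and the `top_k`, `threshold`, and pair ordering are assumptions.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Vector, index: Dict[str, Vector], top_k: int) -> List[str]:
    """Stage 1: return the top_k legitimate domains closest to the query embedding."""
    ranked = sorted(index, key=lambda d: cosine(query, index[d]), reverse=True)
    return ranked[:top_k]


def detect(
    domain: str,
    embed: Callable[[str], Vector],
    index: Dict[str, Vector],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 3,
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
) -> Optional[Tuple[str, float]]:
    """Stage 2: score (legitimate, candidate) pairs; return the best match above threshold."""
    candidates = retrieve(embed(domain), index, top_k)
    if not candidates:
        return None
    pairs = [(legit, domain) for legit in candidates]
    scores = [float(s) for s in score_pairs(pairs)]
    best = max(range(len(pairs)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None
    return candidates[best], scores[best]
```

In practice, `embed` would wrap the embedding model, `index` would hold precomputed embeddings of the legitimate domains, and `score_pairs` would be the cross-encoder's `predict` method.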

Bias, Risks, and Limitations

Users are advised to use this model as a supportive tool rather than a sole indicator for domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

Training Details

Framework Versions

Python: 3.10.14
Sentence Transformers: 3.2.1
Transformers: 4.46.2
PyTorch: 2.2.2
Tokenizers: 0.20.3

Training Data

The model was fine-tuned using Anvilogic/CE-Typosquat-Training-Dataset, which contains pairs of domain names and their similarity labels. The dataset was filtered and converted to the parquet format for efficient processing.

Training Procedure

The model was optimized using the binary cross-entropy loss function with logits, nn.BCEWithLogitsLoss().
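
To make the loss concrete, here is a small pure-Python sketch of binary cross-entropy computed directly on a logit, using the standard numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)). It illustrates the per-example quantity that `nn.BCEWithLogitsLoss()` averages over a batch by default; the function names are chosen for illustration only.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(logit: float) -> float:
    """Map a logit to a probability in (0, 1) without overflowing for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# A confident, correct prediction gives a small loss; a confident, wrong one a large loss.
print(bce_with_logits(3.0, 1.0))
print(bce_with_logits(3.0, 0.0))
```

Working on logits rather than on probabilities avoids taking the logarithm of a value that has been rounded to 0 or 1 by the sigmoid.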

Training Hyperparameters

Model Architecture: Cross-encoder fine-tuned from canine-c
Batch Size: 64
Epochs: 3
Learning Rate: 2e-5
Warmup Steps: 100
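
The card lists the warmup steps and peak learning rate but does not name the scheduler. As a rough illustration only, the sketch below assumes linear warmup followed by linear decay to zero. The `lr_at_step` helper and the `total_steps` value are hypothetical, since the total number of training steps depends on the dataset size, which is not given here.

```python
def lr_at_step(
    step: int,
    base_lr: float = 2e-5,    # learning rate from the list above
    warmup_steps: int = 100,  # warmup steps from the list above
    total_steps: int = 1000,  # hypothetical value, not from the model card
) -> float:
    """Learning rate under an assumed linear-warmup, linear-decay schedule."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step))
```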

Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

CE Binary Classification Evaluator

Accuracy: 0.9740
F1 Score: 0.9737
Precision: 0.9836
Recall: 0.964
Average Precision: 0.9969
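
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The `f1_score` helper below is written for illustration and is not part of the evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.9836, 0.964  # values reported above
f1 = f1_score(precision, recall)
print(f"{f1:.4f}")  # prints 0.9737
```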

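On this test set, precision (0.9836) is slightly higher than recall (0.964): the model rarely flags a legitimate pair as a typosquat, while missing a somewhat larger share of true typosquats. Combined with an average precision of 0.9969, these results make it well-suited for cybersecurity applications.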
