

Typosquat CE detector

Model Details

Model Description

This model is a cross-encoder, fine-tuned from the character-level CANINE-c transformer, that performs binary classification to detect typosquatting domain names. Given a pair of domain names, it predicts whether one is a typographical variant (typosquat) of the other.

  • Developed by: Anvilogic
  • Model type: Cross-encoder binary classification
  • Maximum Sequence Length: 512 tokens
  • Language(s) (NLP): Multilingual
  • License: MIT
  • Finetuned from model: google/canine-c

Usage

Direct Usage (Sentence Transformers)

This model can be used directly in cybersecurity applications to identify malicious typosquatting domains by scoring how closely a domain name resembles a legitimate one.

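The following code loads the model and scores a pair of domains:

from sentence_transformers import CrossEncoder

# Load the fine-tuned cross-encoder from the Hugging Face Hub
model = CrossEncoder("Anvilogic/CE-typosquat-detect-Canine")

# Score one pair of domains; predict() returns one score per input pair
result = model.predict([("example.com", "exarnple.com")])
print(result)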
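
The predict call returns one score per pair. A minimal sketch of turning those scores into labels follows. It assumes the scores are probabilities in [0, 1] (the usual Sentence Transformers behaviour for single-output cross-encoders) and uses a hypothetical 0.5 threshold that this card does not specify. The helper label_pairs and the scores in the example are illustrative, not real model output.

```python
from typing import List, Sequence, Tuple

# Hypothetical cut-off; the card does not specify one. Tune it on validation data.
THRESHOLD = 0.5


def label_pairs(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    threshold: float = THRESHOLD,
) -> List[Tuple[str, str, float, bool]]:
    """Attach a typosquat flag to each (legitimate, candidate) pair."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return [
        (legit, candidate, float(score), float(score) >= threshold)
        for (legit, candidate), score in zip(pairs, scores)
    ]


# Made-up scores standing in for model.predict(pairs).
pairs = [("example.com", "exarnple.com"), ("example.com", "weather.org")]
illustrative_scores = [0.97, 0.02]  # not real model output
labeled = label_pairs(pairs, illustrative_scores)
for legit, candidate, score, is_typosquat in labeled:
    print(f"{candidate} vs {legit}: score={score:.2f} typosquat={is_typosquat}")
```

With a loaded model, illustrative_scores would be replaced by the array returned from model.predict(pairs).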

Downstream Usage

This model can be paired with an embedding model to strengthen typosquatting detection. First, the embedding model retrieves candidate domains similar to the input from a database of legitimate domains. Then, the cross-encoder scores each pair, confirming whether the input is a typosquat and identifying the legitimate domain it imitates.

For embedding, consider using: Anvilogic/Embedder-typosquat-detect
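
The two-stage flow described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: detect_typosquat, retrieve_candidates, score_pairs, and the 0.5 threshold are hypothetical names and values, and the (legitimate, candidate) pair order simply follows the usage example earlier in this card. The two stages are passed in as callables so the sketch runs without downloading any model.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def detect_typosquat(
    domain: str,
    retrieve_candidates: Callable[[str], List[str]],
    score_pairs: Callable[[Sequence[Tuple[str, str]]], Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not from the card
) -> Optional[Tuple[str, float]]:
    """Return (legitimate_domain, score) for the best match, or None."""
    # Stage 1: the embedding model proposes similar legitimate domains.
    candidates = retrieve_candidates(domain)
    if not candidates:
        return None
    # Stage 2: the cross-encoder scores each (legitimate, candidate) pair.
    pairs = [(legit, domain) for legit in candidates]
    scores = score_pairs(pairs)
    best_legit, best_score = max(zip(candidates, scores), key=lambda item: item[1])
    best_score = float(best_score)
    return (best_legit, best_score) if best_score >= threshold else None


# Stand-ins for the embedder and the cross-encoder (assumed behaviour).
def fake_retrieve(domain: str) -> List[str]:
    return ["example.com", "examples.net"]


def fake_score(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    return [0.95 if legit == "example.com" else 0.10 for legit, _ in pairs]


result = detect_typosquat("exarnple.com", fake_retrieve, fake_score)
print(result)
```

In practice, score_pairs could be the predict method of the loaded cross-encoder, and retrieve_candidates a nearest-neighbour search over embeddings of known legitimate domains.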

Bias, Risks, and Limitations

Users are advised to treat this model as a supporting signal rather than the sole indicator of domain security. Regular updates may be needed to maintain its performance against new and evolving types of domain spoofing.

Training Details

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.2.1
  • Transformers: 4.46.2
  • PyTorch: 2.2.2
  • Tokenizers: 0.20.3

Training Data

The model was fine-tuned using Anvilogic/CE-Typosquat-Training-Dataset, which contains pairs of domain names and their similarity labels. The dataset was filtered and converted to the parquet format for efficient processing.

Training Procedure

The model was optimized using the binary cross-entropy loss function with logits, nn.BCEWithLogitsLoss().
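
To make the loss concrete, here is a small pure-Python sketch of binary cross-entropy computed from raw logits, using the standard numerically stable form. It is meant to mirror what nn.BCEWithLogitsLoss() computes with its default mean reduction; the function name bce_with_logits and the example values are illustrative and not taken from the training code.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy computed directly from raw logits."""
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# A logit of 0 means sigmoid(0) = 0.5, so the loss equals log(2).
loss_uncertain = bce_with_logits([0.0], [1.0])
# Confident, correct predictions give a loss close to 0.
loss_confident = bce_with_logits([4.0, -4.0], [1.0, 0.0])
print(loss_uncertain, loss_confident)
```

Working on logits rather than probabilities avoids taking the logarithm of values that round to 0 or 1.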

Training Hyperparameters

  • Model Architecture: Cross-encoder fine-tuned from google/canine-c
  • Batch Size: 64
  • Epochs: 3
  • Learning Rate: 2e-5
  • Warmup Steps: 100
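
As an illustration of how the learning rate and warmup steps interact, the sketch below computes the learning rate at a given step. The card does not name the scheduler, so linear warmup followed by linear decay is an assumption here, and the total of 3000 training steps is a hypothetical value chosen only for the example.

```python
def lr_at_step(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,    # learning rate from the card
    warmup_steps: int = 100,  # warmup steps from the card
) -> float:
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


TOTAL_STEPS = 3000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 50, 100, 1550, 3000):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

Under this assumed schedule, the rate climbs to 2e-5 over the first 100 steps and then falls steadily for the rest of training.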

Evaluation

In the final evaluation after training, the model achieved the following metrics on the test set:

CE Binary Classification Evaluator

Accuracy: 0.9740
F1 Score: 0.9737
Precision: 0.9836
Recall: 0.964
Average Precision: 0.9969

These results show that, on the test set, the model flags typosquatting pairs with high precision (0.9836) and recall (0.964), which makes it well suited to cybersecurity applications.
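
For readers who want to see how these figures relate, the sketch below derives accuracy, precision, recall, and F1 from confusion-matrix counts, and checks that the reported F1 is the harmonic mean of the reported precision and recall. The counts are hypothetical, since the card does not publish the confusion matrix, and the function names are illustrative.

```python
from typing import Dict


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check using the precision and recall reported in this card.
reported_f1 = f1_from(0.9836, 0.964)
print(round(reported_f1, 4))

# Hypothetical counts, for illustration only.
example = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(example)
```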

Model size: 132M params (Safetensors, tensor type F32)