---
title: README
emoji: 🏢
colorFrom: pink
colorTo: purple
sdk: static
pinned: false
---

# **Anvilogic - Where AI Meets Cybersecurity**

Welcome to the official Hugging Face organization for [Anvilogic's](https://www.anvilogic.com/) advanced cybersecurity AI models!  
Founded in 2019, [Anvilogic](https://www.anvilogic.com/) specializes in AI-driven threat detection and automation, enhancing Security Operations Center (SOC) capabilities with scalable, data-driven solutions.

## Typosquatting Collection
Typosquatting is a form of cyberattack in which malicious actors register domain names that are visually or phonetically similar to legitimate ones in order to deceive users into visiting them. This collection provides models, datasets, and Spaces for detecting and flagging typosquatted domains. It comprises the following:

### Models

- **Embedder**: This model provides a representation for domain names and is used to mine similar domains.
- **Cross-Encoder**: This model compares two domain names and determines whether one is a typosquat of the other.
- **T5 Typosquat Detection**: This model is derived from T5 and trained on a new task. Its prompt consists of the prefix "Is the first domain a typosquat of the second:" followed by *TYPOSQUAT_DOMAIN* and *LEGITIMATE_DOMAIN*.
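
As a rough illustration of the T5 prompt format described above, the sketch below assembles a prompt string from two domains. The prefix is taken from the description; the single-space separator between the prefix and the two domains, and the helper name `build_t5_prompt`, are assumptions made for illustration, so check the model card for the exact format before relying on it.

```python
# Prefix quoted from the model description above.
T5_PREFIX = "Is the first domain a typosquat of the second:"


def build_t5_prompt(typosquat_domain: str, legitimate_domain: str) -> str:
    """Build a T5 prompt for a (suspected typosquat, legitimate) domain pair.

    Assumption: the two domains are appended after the prefix, separated by
    single spaces. The actual separator used in training may differ.
    """
    return f"{T5_PREFIX} {typosquat_domain.strip()} {legitimate_domain.strip()}"


prompt = build_t5_prompt("g00gle.com", "google.com")
print(prompt)
```

The order matters: the suspected typosquat comes first and the legitimate domain second, matching the wording of the prefix.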

### Datasets

- **Embedder Training Dataset**: A dataset formatted to train the embedding model, containing (Anchor, Positive) pairs of domain examples.
- **Cross-Encoder Training Dataset**: A dataset formatted to train the Cross-Encoder model, containing (Anchor, Positive, label) samples.
- **T5 Training Dataset**: A dataset formatted to train the T5 model with (prompt,response) pairs.
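
To make the three record shapes concrete, here is a minimal sketch using plain Python tuples. The field names mirror the descriptions above; the example values, the integer encoding of `label`, and the text of `response` are illustrative assumptions and are not taken from the actual datasets.

```python
from typing import NamedTuple


class EmbedderPair(NamedTuple):
    """(Anchor, Positive) pair for the embedding model."""
    anchor: str
    positive: str


class CrossEncoderSample(NamedTuple):
    """(Anchor, Positive, label) sample for the Cross-Encoder."""
    anchor: str
    positive: str
    label: int  # assumption: 1 = typosquat, 0 = not a typosquat


class T5Example(NamedTuple):
    """(prompt, response) pair for the T5 model."""
    prompt: str
    response: str  # assumption: a short textual answer; exact wording unknown


# Illustrative values only; not real rows from the datasets.
pair = EmbedderPair(anchor="paypal.com", positive="paypa1.com")
sample = CrossEncoderSample(anchor="paypal.com", positive="paypa1.com", label=1)
example = T5Example(
    prompt="Is the first domain a typosquat of the second: paypa1.com paypal.com",
    response="true",
)
```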

### Spaces

- **Embedder Typosquat Detect**: Allows users to retrieve the most similar domains from a pool of 4,000 of the most common domains.
- **Cross-Encoder (CE) Typosquat Detect**: Allows users to compare two domains using the Cross-Encoder. The model outputs a probability of typosquatting.
- **T5 Typosquat Detect**: Allows users to compare two domains using the T5 model. The model outputs a boolean value indicating whether the first domain is a typosquat of the second.
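
The retrieval step behind the Embedder Space can be sketched as a nearest-neighbor search over a pool of known domains. The example below is a toy, self-contained illustration: `toy_embed` uses character-bigram counts as a stand-in for the real Embedder model, and the four-domain pool is hypothetical. It only shows the shape of the lookup, not the model's actual behavior.

```python
import math
from collections import Counter


def toy_embed(domain: str) -> Counter:
    """Stand-in for the Embedder: character-bigram counts (NOT the real model)."""
    padded = f"^{domain.lower()}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query: str, pool: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k pool domains most similar to the query, best first."""
    q = toy_embed(query)
    scored = [(d, cosine(q, toy_embed(d))) for d in pool]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical pool; the Space uses a pool of 4,000 common domains.
pool = ["google.com", "github.com", "paypal.com", "amazon.com"]
top = most_similar("paypa1.com", pool, k=2)
print(top[0][0])  # the closest legitimate domain in the pool
```

In a full pipeline, the candidates returned by this step could then be passed to the Cross-Encoder or T5 model for a pairwise decision.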