---
base_model:
- meta-llama/Meta-Llama-3-8B
---

## Llama3 8B Fine-Tuned for Domain Generation Algorithm Detection

This model is a fine-tuned version of Meta's Llama3 8B, specifically adapted for detecting domains generated by **Domain Generation Algorithms (DGAs)**. Malware often uses DGAs to produce large numbers of constantly changing domain names for command-and-control (C&C) servers, making their detection a critical challenge in cybersecurity.

## Model Description

- **Base Model**: Llama3 8B
- **Task**: DGA Detection
- **Fine-Tuning Approach**: Supervised Fine-Tuning (SFT) with domain-specific data (a rough sketch of the training-example format follows this list).
- **Dataset**: A custom dataset comprising DGA domains from 68 malware families and legitimate domains from the Tranco list, covering both arithmetic-based and word-based DGAs.
- **Performance**:
  - **Accuracy**: 94%
  - **False Positive Rate (FPR)**: 4%
  - Excels in detecting hard-to-identify word-based DGAs.
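
As a rough illustration of the SFT setup, a training example might be rendered as shown below. The instruction wording, the `dga`/`legit` answer tokens, and the field layout are assumptions, not the confirmed training format; the notebooks linked in the GitHub repository are the authoritative source.

```python
# Hypothetical rendering of one (domain, label) pair into a supervised
# training string. The prompt wording and the "dga"/"legit" answer tokens
# are assumptions, not the card's confirmed format.
def to_sft_example(domain: str, is_dga: bool) -> str:
    answer = "dga" if is_dga else "legit"
    return (
        "Classify the following domain as 'dga' or 'legit'.\n"
        f"Domain: {domain}\n"
        f"Answer: {answer}"
    )

print(to_sft_example("xjwqkzlrtpvu.com", True))   # example DGA-looking domain
print(to_sft_example("wikipedia.org", False))     # example legitimate domain
```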

This model leverages the extensive semantic understanding of Llama3 to classify domains as either **malicious (DGA-generated)** or **legitimate** with high precision and recall.
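
A minimal inference sketch with the Transformers library is shown below. The repository id is a placeholder and the prompt reuses the hypothetical template above; adjust both to match the actual model files and the prompt format used in the training notebooks.

```python
# Minimal inference sketch. "reypapin/llama3-8b-dga" is a placeholder repo id,
# and the prompt is an assumed template -- check the training notebooks for
# the format the model was actually fine-tuned on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reypapin/llama3-8b-dga"  # placeholder, replace with the real repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Classify the following domain as 'dga' or 'legit'.\n"
    "Domain: xjwqkzlrtpvu.com\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer.strip())
```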

## Data

The model was trained on 2 million domains, evenly split between 1 million DGA-generated domains and 1 million legitimate domains. The training data is stored in the file **train_2M.csv**, and the model was evaluated on the family files located in the **Families_Test** folder.
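
For a quick look at the training split, something like the following works, assuming **train_2M.csv** holds one domain and one label per row; the actual column names and label encoding are not documented here and should be checked against the file itself.

```python
# Quick inspection of the training CSV. Column names and label encoding are
# assumptions -- print the columns and a few rows to see what the file contains.
import pandas as pd

df = pd.read_csv("train_2M.csv")
print(df.shape)      # expected: roughly 2,000,000 rows
print(df.columns)    # confirm the actual column names
print(df.head())     # inspect a few domains and their labels
```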

The GitHub repository https://github.com/reypapin/Domain-Name-Classification-with-LLM contains the notebooks that describe how the model was trained and evaluated.


## Article Reference

La O, R. L., Catania, C. A., & Parlanti, T. (2024). LLMs for Domain Generation Algorithm Detection. arXiv preprint arXiv:2411.03307.