---
license: apache-2.0
datasets:
  - FredZhang7/malicious-website-features-2.4M
widget:
  - text: https://chat.openai.com/
  - text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
  - accuracy
language:
  - af
  - en
  - et
  - sw
  - sv
  - sq
  - de
  - ca
  - hu
  - da
  - tl
  - so
  - fi
  - fr
  - cs
  - hr
  - cy
  - es
  - sl
  - tr
  - pl
  - pt
  - nl
  - id
  - sk
  - lt
  - 'no'
  - lv
  - vi
  - it
  - ro
  - ru
  - mk
  - bg
  - th
  - ja
  - ko
  - multilingual
---

# malphish-eater-v1

Please note that this model is **not** production-ready.


The classification task for v1 is split into two stages:

1. **URL features model**
   - 96.5%+ accuracy on training and validation data
   - 2,436,727 rows of labelled URLs
   - evaluation from v2: slightly overfitted, by roughly 0.8%
2. **Website features model**
   - 98.4% accuracy on training data and 98.9% on validation data
   - 911,180 rows of 42 features
   - evaluation from v2: leans slightly more on the URL-model output (the `bert_confidence` column) than on the other features

## Training

I applied 5-fold cross-validation (`cv=5`) on the training dataset to search for the best hyperparameters. Here's the parameter grid passed to scikit-learn's `GridSearchCV`:

```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
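For reference, here is a rough sketch of how such a search could be wired up with LightGBM's scikit-learn wrapper. `GridSearchCV` expects every grid entry to be a list of candidates, so fixed values go on the estimator; the dataset path, label column, and scoring choice below are illustrative assumptions rather than part of this repo, and early stopping is left out because it needs a separate validation set.

```python
# Hedged sketch: grid search over the LightGBM hyperparameters with cv=5.
# The parquet path and "is_malicious" label column are placeholders.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import GridSearchCV

df = pd.read_parquet("website_features.parquet")              # hypothetical local copy of the dataset
X, y = df.drop(columns=["is_malicious"]), df["is_malicious"]  # hypothetical label column

param_grid = {
    "boosting_type": ["gbdt", "dart"],
    "num_leaves": [15, 23, 31, 63],
    "learning_rate": [0.001, 0.002, 0.01, 0.02],
    "colsample_bytree": [0.5, 0.6, 0.7, 0.9],                 # sklearn-API name for feature_fraction
    "n_estimators": [500, 750, 800, 900, 1000, 1250, 2000],   # equivalent to num_boost_round
}

search = GridSearchCV(
    lgb.LGBMClassifier(objective="binary"),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```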

To reproduce the 98.4%-accurate model, follow the data analysis on the dataset page to filter out the unimportant features, then train a LightGBM model with the hyperparameters that worked best for this task:

```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```
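A minimal training sketch using the `params` dict above might look like the following. The file path, label column, and train/validation split are placeholders for illustration; `num_boost_round` and `early_stopping_rounds` are picked up from `params` because LightGBM treats them as aliases of `num_iterations` and `early_stopping_round`.

```python
# Hedged sketch: train the website-features model with the parameters above.
# The parquet path and "is_malicious" label column are placeholders.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("website_features_filtered.parquet")     # 42 selected features + label
X, y = df.drop(columns=["is_malicious"]), df["is_malicious"]  # hypothetical label column
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = lgb.Dataset(X_train, label=y_train)
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

# num_boost_round and early_stopping_rounds inside `params` are honored by lgb.train
# as aliases of num_iterations and early_stopping_round.
booster = lgb.train(params, dtrain, valid_sets=[dvalid])

booster.save_model("phishing_model_combined_0.984_train.txt")
```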

## URL Features

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
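Continuing from the snippet above, a minimal scoring sketch for a single URL could look like this; the resulting probability is what feeds the `bert_confidence` column of the second stage. Which class index corresponds to "malicious" is not stated here, so check `model.config.id2label` rather than assuming an order.

```python
# Hedged usage sketch: score one URL with the URL-features model.
# Check model.config.id2label to see which index means "malicious".
import torch

url = "https://chat.openai.com/"
inputs = tokenizer(url, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
print(model.config.id2label, probs.tolist())
```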

## Website Features

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

booster = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
```
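Once the booster is loaded, `predict` returns the probability of the positive class for each row of the 42 selected features (including `bert_confidence` from the URL model), in the same column order as training. The zero-filled row below is only a placeholder to show the call shape, not real preprocessing.

```python
# Hedged usage sketch: run the website-features booster on one feature row.
# A real row must contain the 42 training features in their original order.
import numpy as np

num_features = booster.num_feature()       # should be 42 for this model
row = np.zeros((1, num_features))          # placeholder values, not real features
prob_malicious = booster.predict(row)[0]   # binary objective -> probability of positive class
print(num_features, prob_malicious)
```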