---
license: apache-2.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---

It's very important to note that this model is not production-ready.
The classification task for v1 is split into two stages:

1. URL features model
   - **96.5%+ accurate** on training and validation data
   - 2,436,727 rows of labelled URLs
   - evaluation from v2: slightly overfitted, by roughly 0.8%
2. Website features model
   - **98.4% accurate** on training data, and **98.9% accurate** on validation data
   - 911,180 rows of 42 features
   - evaluation from v2: slightly biased towards the URL-derived feature (bert_confidence) relative to the other columns

## Training

I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the parameter grid passed to `sklearn`'s `GridSearchCV`:

```python
# every value is a list so GridSearchCV treats it as a search dimension
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```

To reproduce the 98.4% accurate model, you can follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features.
Then train a LightGBM model with the hyperparameters best suited to this task:

```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```

## URL Features

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```

## Website Features

```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# load the trained website-features model from this repository
model = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
```
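
Below is a minimal sketch of how the two stages could be chained at inference time. It assumes the transformer's last logit corresponds to the malicious class and that `bert_confidence` sits in the first column of the website feature vector; neither is guaranteed, so check the dataset page for the actual label mapping and column order before relying on this.

```python
import numpy as np
import torch
import lightgbm as lgb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stage 1: score the raw URL with the transformer model.
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
url_model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")

url = "https://chat.openai.com/"
inputs = tokenizer(url, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(url_model(**inputs).logits, dim=-1)[0]
bert_confidence = probs[-1].item()  # assumption: last index = malicious class

# Stage 2: feed the engineered website features to the LightGBM model.
website_model = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")

# Hypothetical feature vector: bert_confidence in the first column, zeros elsewhere.
# Replace the zeros with the real crawled-page features, in the column order
# used during training (documented on the dataset page).
features = np.zeros((1, website_model.num_feature()))
features[0, 0] = bert_confidence
score = website_model.predict(features)[0]

print(f"bert_confidence={bert_confidence:.3f}, website-model score={score:.3f}")
```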