FredZhang7
/

malphish-eater-v1

	---
	license: apache-2.0
	datasets:
	- FredZhang7/malicious-website-features-2.4M
	widget:
	- text: https://chat.openai.com/
	- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
	metrics:
	- accuracy
	language:
	- af
	- en
	- et
	- sw
	- sv
	- sq
	- de
	- ca
	- hu
	- da
	- tl
	- so
	- fi
	- fr
	- cs
	- hr
	- cy
	- es
	- sl
	- tr
	- pl
	- pt
	- nl
	- id
	- sk
	- lt
	- 'no'
	- lv
	- vi
	- it
	- ro
	- ru
	- mk
	- bg
	- th
	- ja
	- ko
	- multilingual
	---

	Important: this model is not production-ready.

	<br>

	The classification task for v1 is split into two stages:
	1. URL features model
	   - 96.5%+ accuracy on training and validation data
	   - 2,436,727 rows of labelled URLs
	   - Evaluation from v2: slightly overfitted, by roughly 0.8%
	2. Website features model
	   - 98.4% accuracy on training data and 98.9% on validation data
	   - 911,180 rows of 42 features
	   - Evaluation from v2: slightly biased towards the URL feature (`bert_confidence`) over the other columns
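
As a rough illustration of how the two stages could fit together, the sketch below feeds the stage-1 score into stage 2 as the `bert_confidence` column. The three callables, the `feature_order` argument, and the `url_length` feature are hypothetical stand-ins rather than part of this repository; only `bert_confidence` comes from the description above.

```python
from typing import Callable, Dict, Sequence


def two_stage_score(
    url: str,
    url_model: Callable[[str], float],
    extract_features: Callable[[str], Dict[str, float]],
    website_model: Callable[[Sequence[float]], float],
    feature_order: Sequence[str],
) -> float:
    """Chain stage 1 into stage 2. Every callable is a hypothetical stand-in."""
    features = dict(extract_features(url))
    # The stage-1 output becomes one input column of the stage-2 model.
    features["bert_confidence"] = url_model(url)
    # Tree models expect columns in a fixed order, so build the row explicitly.
    row = [features[name] for name in feature_order]
    return website_model(row)


# Toy stand-ins so the sketch runs without downloading any model.
score = two_stage_score(
    "https://example.com/login",
    url_model=lambda url: 0.9,
    extract_features=lambda url: {"url_length": float(len(url))},  # hypothetical feature
    website_model=lambda row: row[0],  # placeholder: echoes bert_confidence
    feature_order=["bert_confidence", "url_length"],
)
print(score)  # 0.9
```

In real use, `url_model` would wrap the transformer from the URL Features section and `website_model` would wrap the LightGBM booster from the Website Features section.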

	## Training
	I applied 5-fold cross-validation (`cv=5`) to the training dataset to search for the best hyperparameters.
	Here's the parameter grid passed to `sklearn`'s `GridSearchCV`:
	```python
	# GridSearchCV needs every grid value to be a list of candidates,
	# so the fixed settings are wrapped in single-element lists.
	params = {
	    'objective': ['binary'],
	    'metric': ['binary_logloss'],
	    'boosting_type': ['gbdt', 'dart'],
	    'num_leaves': [15, 23, 31, 63],
	    'learning_rate': [0.001, 0.002, 0.01, 0.02],
	    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
	    'early_stopping_rounds': [10, 20],
	    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
	}
	```
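
To get a sense of the search cost, the candidate lists above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes an exhaustive grid (which is what `GridSearchCV` performs) and counts one fit per fold, ignoring any final refit.

```python
from math import prod

# Candidate lists copied from the grid above; `objective` and `metric` are fixed.
param_grid = {
    "boosting_type": ["gbdt", "dart"],
    "num_leaves": [15, 23, 31, 63],
    "learning_rate": [0.001, 0.002, 0.01, 0.02],
    "feature_fraction": [0.5, 0.6, 0.7, 0.9],
    "early_stopping_rounds": [10, 20],
    "num_boost_round": [500, 750, 800, 900, 1000, 1250, 2000],
}

n_folds = 5  # cv=5
n_combinations = prod(len(values) for values in param_grid.values())
n_fits = n_combinations * n_folds

print(n_combinations)  # 1792
print(n_fits)          # 8960
```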
	To reproduce the 98.4% accurate model, follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features.
	Then train a LightGBM model using the hyperparameters best suited to this task:
	```python
	params = {
	    'objective': 'binary',
	    'metric': 'binary_logloss',
	    'boosting_type': 'gbdt',
	    'num_leaves': 31,
	    'learning_rate': 0.01,
	    'feature_fraction': 0.6,
	    'early_stopping_rounds': 10,
	    'num_boost_round': 800
	}
	```
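
A minimal training sketch with these settings might look like the following. It assumes the filtered features are already split into `X_train`, `y_train`, `X_valid`, and `y_valid` (hypothetical names), and it passes the round count and early stopping through `lgb.train` arguments rather than inside `params`. That is one common way to do it and not necessarily how the released model was trained.

```python
TUNED_PARAMS = {
    "objective": "binary",
    "metric": "binary_logloss",
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.01,
    "feature_fraction": 0.6,
    "early_stopping_rounds": 10,
    "num_boost_round": 800,
}


def split_training_controls(params):
    """Separate the round count and early-stopping patience from booster params."""
    booster_params = dict(params)
    num_boost_round = booster_params.pop("num_boost_round")
    stopping_rounds = booster_params.pop("early_stopping_rounds")
    return booster_params, num_boost_round, stopping_rounds


BOOSTER_PARAMS, NUM_BOOST_ROUND, STOPPING_ROUNDS = split_training_controls(TUNED_PARAMS)


def train_website_model(X_train, y_train, X_valid, y_valid):
    """Train a booster on pre-split data (argument names are hypothetical)."""
    # Imported here so the helper above works without LightGBM installed.
    import lightgbm as lgb

    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        BOOSTER_PARAMS,
        train_set,
        num_boost_round=NUM_BOOST_ROUND,
        valid_sets=[valid_set],
        # Stop when the validation log loss has not improved for 10 rounds.
        callbacks=[lgb.early_stopping(stopping_rounds=STOPPING_ROUNDS)],
    )
```

The returned booster can be written to disk with `booster.save_model(path)` and reloaded as shown in the Website Features section.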


	## URL Features
	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	# Load the URL classifier and its tokenizer from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
	model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
	```
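
One possible way to score a single URL with this model is sketched below. It assumes the model takes the raw URL string as input and exposes its class names through `config.id2label`, so check the model's config before relying on the label names. `score_url` and `softmax` are illustrative helpers, not part of the repository.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_url(url, model_id="FredZhang7/malware-phisher"):
    """Return a {label: probability} dict for one URL."""
    # Heavy dependencies are imported only when a URL is actually scored.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Assumption: the raw URL string is the model input.
    inputs = tokenizer(url, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

Calling `score_url("https://chat.openai.com/")` downloads the model on first use and returns one probability per class.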
	## Website Features
	```bash
	pip install lightgbm
	```
	```python
	import lightgbm as lgb

	# Load the trained booster from the model file in this repository
	model = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
	```
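
Once loaded, the booster can score a row of website features. The sketch below assumes the features arrive as a dict keyed by column name (a hypothetical input format) and orders them to match `Booster.feature_name()`. With the `binary` objective, `predict` returns a probability for the positive class; which class counts as positive depends on how the dataset labels were encoded.

```python
from typing import Dict, List, Sequence


def order_features(features: Dict[str, float], names: Sequence[str]) -> List[float]:
    """Arrange a feature dict into the column order the model was trained on."""
    missing = [name for name in names if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in names]


def predict_proba(
    features: Dict[str, float],
    model_file: str = "phishing_model_combined_0.984_train.txt",
) -> float:
    """Return the positive-class probability for one website."""
    # Imported here so order_features works without LightGBM installed.
    import lightgbm as lgb
    import numpy as np

    booster = lgb.Booster(model_file=model_file)
    row = order_features(features, booster.feature_name())
    # predict expects a 2-D array: one row per sample.
    return float(booster.predict(np.array([row]))[0])
```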