---
datasets:
- thehamkercat/telegram-spam-ham
- ucirvine/sms_spam
- SetFit/enron_spam
base_model:
- FacebookAI/roberta-base
pipeline_tag: text-classification
license: mit
language:
- en
metrics:
- accuracy
model-index:
- name: roBERTa-spam-detection
  results:
  - task:
      type: text-classification
    dataset:
      name: ucirvine/sms_spam
      type: ucirvine/sms_spam
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9503
    source:
      name: Validation via the ucirvine/sms_spam dataset in Google Colab
library_name: transformers
---
# Is Spam All We Need? A RoBERTa-Based Approach to Spam Detection

## Intro

This model is largely inspired by mshenoda's roberta-spam Hugging Face model (https://huggingface.co/mshenoda/roberta-spam).

However, instead of fine-tuning on all of the data sources the original author used, I only fine-tuned on the Telegram and Enron spam/ham datasets. The idea was to draw from a more diversified set of data sources, avoid overfitting to the original distribution, and have a fun exploratory NLP experiment. Fine-tuning was done by replicating the sentiment analysis Google Colab example provided on the RoBERTa resources page (https://huggingface.co/docs/transformers/main/en/model_doc/roberta#resources).

**NOTE**: This was done for an interview project, so if you find this by chance... hopefully it helps you too, but know there are **definitely** better resources out there... and that this was done in the span of one evening.
## Metrics

**Accuracy**: 0.9503

Thrilling, I know. I also just got the chills, especially since this performance is arguably worse than the original author's.

Granted, I only ran it for one epoch, and the training data is drawn from different distributions than the test set. I'm sure it would have been more "accurate" if I had just trained on the SMS data, but diversity is good. And it's fun to see how these choices impact the final result!
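
For reference, here is a rough sketch of how an accuracy number like this can be recomputed on the SMS set; the checkpoint id and dataset column names below are assumptions on my part (this is not the evaluation code from the notebook), so adjust them to match your setup.

```python
# Rough evaluation sketch (not the notebook itself); hub ids and column names are assumed.
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import pipeline

sms = load_dataset("ucirvine/sms_spam", split="train")  # the Hub copy ships a single split
clf = pipeline("text-classification", model="ggrizzly/roBERTa-spam-detection")

# The default config returns labels like "LABEL_0"/"LABEL_1"; map them back to 0/1.
preds = [int(out["label"].split("_")[-1]) for out in clf(sms["sms"])]
print(accuracy_score(sms["label"], preds))
```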
## Model Output

- 0 is ham
- 1 is spam
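
A minimal inference sketch is below; the checkpoint id is assumed from the notebook link at the bottom of this card, and the exact label strings depend on the saved config.

```python
# Minimal inference sketch; the checkpoint id is assumed from the notebook link below.
from transformers import pipeline

spam_filter = pipeline("text-classification", model="ggrizzly/roBERTa-spam-detection")

print(spam_filter("Congratulations! You've won a free cruise, reply WIN to claim."))
# e.g. [{'label': 'LABEL_1', 'score': ...}] -> LABEL_1 means spam, LABEL_0 means ham
```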
## Dataset(s)

The dataset is composed of messages labeled as ham or spam (0 or 1), merged from *two* data sources (a rough merging sketch follows the lists below):

1. Telegram Spam Ham: https://huggingface.co/datasets/thehamkercat/telegram-spam-ham/tree/main
2. Enron Spam: https://huggingface.co/datasets/SetFit/enron_spam/tree/main (only the message column and labels were used)

The dataset used for testing was from the original Kaggle competition (part of the interview project this was for):

1. SMS Spam Collection: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
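
For anyone curious, here is a rough sketch of how the two training sources could be merged with the `datasets` library; the column names (`message`, `text`, `text_type`) are assumptions about the Hub schemas, not something taken from the notebook.

```python
# Rough merge sketch; the column names used here are assumptions, so check the Hub schemas.
from datasets import concatenate_datasets, load_dataset

enron = load_dataset("SetFit/enron_spam", split="train")
telegram = load_dataset("thehamkercat/telegram-spam-ham", split="train")

# Normalize both sources to (text, label) with 0 = ham and 1 = spam.
enron = enron.map(lambda ex: {"text": ex["message"]})
enron = enron.remove_columns([c for c in enron.column_names if c not in ("text", "label")])
telegram = telegram.map(lambda ex: {"label": 1 if ex["text_type"] == "spam" else 0})
telegram = telegram.remove_columns([c for c in telegram.column_names if c not in ("text", "label")])

# If the label feature types differ (e.g. plain int vs ClassLabel), cast one side first,
# e.g. telegram = telegram.cast(enron.features)
train_ds = concatenate_datasets([enron, telegram]).shuffle(seed=42)
print(train_ds)
```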
## Dataset Class Distribution

|        | Total | Training      | Testing     |
|:------:|:-----:|:-------------:|:-----------:|
| Counts | 59267 | 53693 (90.6%) | 5574 (9.4%) |

|          | Total | Spam          | Ham           | Set   | % Total |
|:--------:|:-----:|:-------------:|:-------------:|:-----:|:-------:|
| Enron    | 33345 | 16852 (50.5%) | 16493 (49.5%) | Train | 56.2%   |
| Telegram | 20348 | 6011 (29.5%)  | 14337 (70.5%) | Train | 43.8%   |
| SMS      | 5574  | 747 (13.5%)   | 4827 (86.5%)  | Test  | 100%    |

|          | Distribution of number of characters per class label (100 bins) | Distribution of number of words per class label (100 bins) |
|:--------:|:---------------------------------------------------------------:|:----------------------------------------------------------:|
| SMS      | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/OjLvujmQyeQPlowW5lI5A.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/RFs92xoeIUDAsry6T1Ec4.png) |
| Enron (limiting a few outliers) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/Gd7le3W2U05DaQtjb971o.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/A40RySWIPWAcwSyKGh-rm.png) |
| Telegram | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/ZqMEzunZbhwqOkBUpzv81.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/644ef0eb1565b54e4a656946/v0Y3MRgXUjRUX0prULu0v.png) |
^ Note the tails; very interesting distributions. More importantly, it's good to see that [Benford's law](https://en.wikipedia.org/wiki/Benford's_law) is alive and well in these.
## Architecture

The model is a fine-tuned RoBERTa.

roberta-base: https://huggingface.co/roberta-base

paper: https://arxiv.org/abs/1907.11692
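
For completeness, here is a minimal sketch of what the fine-tuning setup looks like with `transformers`; the hyperparameters are illustrative placeholders rather than the values used in the notebook.

```python
# Minimal fine-tuning sketch: roberta-base with a 2-class head (0 = ham, 1 = spam).
# Hyperparameters are illustrative placeholders, not the notebook's actual values.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=2
)

def tokenize(batch):
    # Pad/truncate to a fixed length; short chat messages rarely hit the limit.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# train_ds would be the merged dataset from the sketch in the Dataset(s) section.
# tokenized_train = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-spam",
    num_train_epochs=1,  # the card only reports a single epoch of training
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)
# trainer.train()
```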
## Code

https://huggingface.co/ggrizzly/roBERTa-spam-detection/resolve/main/roberta_spam_classifier_fine_tuning_google_collab.ipynb