SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.
There are two ways to test our model:
1. Colab (see the test_example.py file for how the Colab notebook should look)
Start:
Download the files tfidf.py, svm.py, and data_cleaning.py.
Upload the files directly under Files in Colab, along with the test data.
Copy all the code below into Colab.
Before running the code, ensure you have all the required libraries installed:
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
Download the necessary NLTK resources for preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
- Clean the Dataset
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
You can use any dataset you want by changing the file path inside pd.read_csv(); see the local-file example after the snippet below.
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
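For instance, to run the model on your own data, point pd.read_csv() at a local file. The filename below is only a placeholder; the one requirement is a title column, since that is the column the TF-IDF step reads.
# Hypothetical local file; any CSV with a 'title' column works.
df = pd.read_csv("my_headlines.csv")
cleaned_df = clean(df)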
- Extract TF-IDF Features
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
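Optionally, sanity-check the transformed features before predicting. Assuming tfidf is a fitted scikit-learn TfidfVectorizer (which the transform call above implies), the result is a sparse matrix:
# One row per headline, one column per term in the trained vocabulary.
print(X_new_tfidf.shape)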
- Make Predictions
from svm import svm_model

# svm_model is the pre-trained classifier loaded from svm.py;
# predict labels for the new TF-IDF features.
predictions = svm_model.predict(X_new_tfidf)
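To inspect the results, you can attach the predictions to the cleaned dataframe. If your CSV also carries a ground-truth column (the name 'label' below is an assumption; adjust it to your dataset), you can score the model with scikit-learn:
# Pair each headline with its predicted class.
cleaned_df['prediction'] = predictions
print(cleaned_df[['title', 'prediction']].head())

# 'label' is an assumed column name; change it to match your dataset.
from sklearn.metrics import accuracy_score
if 'label' in cleaned_df.columns:
    print("Accuracy:", accuracy_score(cleaned_df['label'], predictions))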
2. Terminal
Start:
Open your terminal.
Clone the repo by using the following command:
git clone https://huggingface.co/CIS5190abcd/svm
Go to the svm directory using the following command:
cd svm
Run ls to check the files inside the svm folder. Make sure tfidf.py, svm.py, and data_cleaning.py exist in this directory. If not, run the following commands:
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
Rerun ls and double-check that all the required files (tfidf.py, svm.py, and data_cleaning.py) now exist.
Stay inside the svm directory until the end of these steps.
Installation
Before running the code, ensure you have all the required libraries installed:
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
Open Python directly in the terminal by typing the following command:
python
Download the necessary NLTK resources for preprocessing.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
After downloading all the required resources, do not exit Python.
How to use:
To make predictions on a new dataset with the existing SVM model, follow the steps below:
- Clean the Dataset
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
You can use any dataset you want by changing the file path inside pd.read_csv().
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
- Extract TF-IDF Features
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
- Make Predictions
from svm import svm_model

# svm_model is the pre-trained classifier loaded from svm.py;
# predict labels for the new TF-IDF features.
predictions = svm_model.predict(X_new_tfidf)
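If you want to keep the results, one option is to write them back to a CSV before leaving Python (the output filename below is just an example):
# Save headline/prediction pairs for later inspection.
cleaned_df['prediction'] = predictions
cleaned_df.to_csv("svm_predictions.csv", index=False)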
Run exit() if you want to leave Python.
Run cd .. if you want to exit the svm directory.