
SVM Model with TF-IDF

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.
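As a rough illustration of what TF-IDF features are, here is a minimal pure-Python sketch. It is not the code in tfidf.py: the function name `tfidf_vectors` and the sample headlines are made up for this example, and the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, is the one scikit-learn documents as its default. Whether this repository uses the same settings is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical headlines, used only to show the weighting.
headlines = [
    "markets rally as rates fall",
    "rates rise again",
    "local team wins final",
]
vectors = tfidf_vectors(headlines)
```

In the first headline, "markets" ends up with a larger weight than "rates" because "rates" also appears in the second headline. The classifier then works on these weighted vectors rather than on raw text.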

There are two ways to test our model:

1. Colab (see the test_example.py file for what the Colab notebook looks like)

Start


Download all the files.
Copy all the code below into Colab.
Before running the code, ensure you have all the required libraries installed:

pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub


Download the necessary NLTK resources for preprocessing.

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


  • Clean the Dataset

from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')


You can substitute any dataset you want by changing the file path inside pd.read_csv().


df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")


cleaned_df = clean(df)
  • Extract TF-IDF Features
from tfidf import tfidf


X_new_tfidf = tfidf.transform(cleaned_df['title'])
  • Make Predictions

from svm import svm_model
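The import above only loads the model, so one more call is needed to get labels. The sketch below shows how that call might look, assuming svm_model follows the scikit-learn estimator interface and exposes predict(); this README does not confirm that. The helper `predict_labels` and the `StandInModel` class are hypothetical and exist only so the example runs without the repository files.

```python
def predict_labels(model, features):
    """Call predict() on a fitted model and return the labels as a plain list."""
    # Assumption: the loaded object follows the scikit-learn estimator
    # interface, i.e. it has a predict(X) method returning one label per row.
    if not hasattr(model, "predict"):
        raise TypeError("expected an object with a predict() method")
    return list(model.predict(features))


class StandInModel:
    """Hypothetical stand-in so this sketch runs without the repository files."""

    def predict(self, features):
        # Toy rule for illustration only: label 1 when the first feature is positive.
        return [1 if row[0] > 0 else 0 for row in features]


labels = predict_labels(StandInModel(), [[0.4, 0.0], [0.0, 0.9]])
print(labels)
```

With the objects from the steps above, the equivalent call would presumably be predict_labels(svm_model, X_new_tfidf).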

2. Terminal

Start:


Open your terminal.
Clone the repo by using the following command:

git clone https://huggingface.co/CIS5190abcd/svm


Go to the svm directory using the following command:

cd svm


Run ls to check the files inside the svm folder. Make sure tfidf.py, svm.py, and data_cleaning.py exist in this directory. If not, run the following commands:

git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py


Rerun ls and double-check that all the required files (tfidf.py, svm.py, and data_cleaning.py) exist. The output should look like this:

[Image: ls output in the svm directory]

Stay inside the svm directory until you finish all the steps below.
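The same check can be scripted instead of reading the ls output by eye. This is a small sketch using only the standard library; the function name `missing_files` is made up for this example, and the file list is taken from the step above.

```python
from pathlib import Path

# The three files this README says must be present.
REQUIRED_FILES = ("tfidf.py", "svm.py", "data_cleaning.py")


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from inside the svm directory. Any file it reports as missing can be restored with the git checkout commands shown earlier.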

Installation


Before running the code, ensure you have all the required libraries installed:

pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub


Start the Python interpreter directly in the terminal by typing the following command:

python


Download the necessary NLTK resources for preprocessing.

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


After downloading the required resources, do not exit the Python interpreter.

How to use:

To run the existing SVM model on a new dataset, follow the steps below:

  • Clean the Dataset
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')


You can substitute any dataset you want by changing the file path inside pd.read_csv().


df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")


cleaned_df = clean(df)
  • Extract TF-IDF Features
from tfidf import tfidf


X_new_tfidf = tfidf.transform(cleaned_df['title'])
  • Make Predictions

from svm import svm_model
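The steps above read cleaned_df['title'], so a dataset you swap in presumably needs a title column. The sketch below checks a CSV header for that column before you load it. The helper name `has_column` is hypothetical, the in-memory CSV files are for illustration only, and this README does not state which other columns clean() expects.

```python
import csv
import io


def has_column(csv_file, column="title"):
    """Return True if the CSV header row contains `column`."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty list when the file has no rows
    return column in [name.strip() for name in header]


# In-memory CSV files used purely for illustration.
good = io.StringIO("title,label\nRates fall again,0\n")
bad = io.StringIO("headline,label\nRates fall again,0\n")
print(has_column(good), has_column(bad))
```

For a local file, open it with open(path, newline="") and pass the file object to has_column.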

Run exit() when you want to leave Python.

Run cd .. when you want to leave the svm directory.