# SVM Model with TF-IDF

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.

There are two ways to test our model:

# 1. Colab (see the test_example.py file for what the Colab notebook looks like)

## Start

1. Download the files ```tfidf.py```, ```svm.py``` and ```data_cleaning.py```.
2. Upload the files, along with the test data, directly under Files in Colab.
3. Copy all the code below into Colab.


Before running the code, ensure you have all the required libraries installed:

```python
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
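
If you want to confirm that the installation worked before moving on, a small check like the one below may help. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are hypothetical names that are not part of this repository, and the list assumes the usual import names for the packages in the pip command (for example, `beautifulsoup4` is imported as `bs4` and `scikit-learn` as `sklearn`).

```python
import importlib.util

# Import names for the packages in the pip command above (assumed list).
# Some import names differ from the pip names: beautifulsoup4 -> bs4,
# scikit-learn -> sklearn.
REQUIRED_MODULES = [
    "nltk", "bs4", "sklearn", "pandas", "datasets", "fsspec", "huggingface_hub",
]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

If anything is reported as missing, rerunning the pip command above should resolve it.
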

Download the necessary NLTK resources for preprocessing:

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```

- Clean the Dataset

```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```

You can use any dataset you want by changing the file name inside ```pd.read_csv()```.

```python
# Load the test data and apply the repository's cleaning step
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```

- Extract TF-IDF Features

```python
from tfidf import tfidf
# Transform the cleaned headlines into TF-IDF features
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions

```python
# Load the pre-trained SVM model
from svm import svm_model
```

# 2. Terminal

## Start:

1. Open your terminal.
2. Clone the repo using the following command:
   ```
   git clone https://huggingface.co/CIS5190abcd/svm
   ```
3. Go to the svm directory using the following command:
   ```
   cd svm
   ```
4. Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If not, run the following commands:
   ```
   git checkout origin/main -- tfidf.py
   git checkout origin/main -- svm.py
   git checkout origin/main -- data_cleaning.py
   ```
5. Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:

   ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
6. Stay inside the svm directory until you have finished all the steps below.

## Installation

Before running the code, ensure you have all the required libraries installed:

```
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```

Start Python, which can be opened directly in the terminal by typing the following command:

```
python
```

Download the necessary NLTK resources for preprocessing:

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```

After downloading all the required resources, **do not** exit the Python session.

## How to use:

To test a new dataset with the existing SVM model, follow the steps below:

- Clean the Dataset

```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```

You can use any dataset you want by changing the file name inside ```pd.read_csv()```.

```python
# Load the test data and apply the repository's cleaning step
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```

- Extract TF-IDF Features

```python
from tfidf import tfidf
# Transform the cleaned headlines into TF-IDF features
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions

```python
# Load the pre-trained SVM model
from svm import svm_model
```

Run ```exit()``` if you want to leave Python, and ```cd ..``` if you want to leave the svm directory.
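
The Make Predictions step stops at importing `svm_model`. As a rough sketch, and assuming `svm_model` is a fitted scikit-learn style estimator with a `predict` method (this README does not confirm that), the predicted labels could be obtained and summarized as shown below, before leaving Python. `summarize_predictions` and `StandInModel` are hypothetical names; the stand-in class exists only so that the snippet runs without the repository files.

```python
from collections import Counter


def summarize_predictions(model, features):
    """Call model.predict(features) and count how often each label occurs."""
    predictions = list(model.predict(features))
    return predictions, Counter(predictions)


class StandInModel:
    """Hypothetical placeholder for svm_model; always predicts label 0."""

    def predict(self, rows):
        return [0 for _ in rows]


# With the real objects from the steps above this would be (assumption):
#     predictions, counts = summarize_predictions(svm_model, X_new_tfidf)
predictions, counts = summarize_predictions(StandInModel(), [[0.1, 0.0], [0.0, 0.3]])
print(predictions)
print(dict(counts))
```

The label counts give a quick sanity check that the model is not assigning every row to a single class on your data.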