
SVM Model with TF-IDF

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.

Installation


Before running the code, ensure you have all the required libraries installed:

pip install nltk beautifulsoup4 scikit-learn pandas datasets


Download the NLTK resources needed for preprocessing:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')

How to Use:

  1. Data Cleaning
    The data_cleaning.py file contains a clean() function to preprocess the input dataset:
  • Removes HTML tags.
  • Removes non-alphanumeric characters and extra spaces.
  • Converts text to lowercase.
  • Removes stopwords.
  • Lemmatizes words.
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')


# Load your data
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)
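To make the cleaning steps listed above concrete, here is a minimal, dependency-free sketch of the first four of them applied to a single string. It is not the code in data_cleaning.py: the name `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption); NLTK's English list is far larger.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw: str) -> str:
    """Hypothetical single-string version of the cleaning steps."""
    # 1. Remove HTML tags, keeping the text between them.
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    text = " ".join(extractor.parts)
    # 2. Replace runs of non-alphanumeric characters with a single space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # 3. Lowercase and split on whitespace (this also collapses extra spaces).
    tokens = text.lower().split()
    # 4. Drop stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("<p>The <b>Markets</b> are   UP in Asia!</p>"))  # markets up asia
```

Applying a function like this to a whole column would typically be done with `df['title'].apply(clean_text)`, assuming the text lives in a `title` column as in the later examples.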
  2. TF-IDF Feature Extraction
    The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
from tfidf import tfidf

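# Fit on the training titles, then reuse the fitted vocabulary for the test titles.
# X_train and X_test are assumed to be DataFrames with a cleaned 'title' column.
X_train_tfidf = tfidf.fit_transform(X_train['title'])
X_test_tfidf = tfidf.transform(X_test['title'])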
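For readers who want to see what the fit and transform steps compute, the following pure-Python sketch reproduces TF-IDF weighting on a toy corpus. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. Whether tfidf.py relies on those defaults is an assumption, and the helper names `fit_idf` and `transform` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights from whitespace-tokenized training documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document.
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(doc, idf):
    """Map one document to an L2-normalized {term: weight} vector.

    Terms that were not seen while fitting are ignored.
    """
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Made-up training titles, for illustration only.
train_titles = ["markets rally asia", "markets fall europe", "team wins final"]
idf = fit_idf(train_titles)

# "today" never appeared in training, so it is dropped from the vector.
vec = transform("markets rally today", idf)
print(vec)
```

The rarer term `rally` ends up with a larger weight than `markets`, which appears in two of the three training titles. This is also why new data should go through `transform` only: calling `fit_transform` again would learn a different vocabulary and different IDF weights than the ones the model was trained on.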
  3. Training and Testing the SVM Model
    The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Train the SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
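The snippet above assumes that `X_train_tfidf`, `X_test_tfidf`, `y_train`, and `y_test` already exist. As a rough end-to-end illustration, the sketch below builds them from a handful of made-up titles and labels. Using `TfidfVectorizer` directly in place of the repository's `tfidf` object, and splitting with `train_test_split`, are assumptions rather than a description of svm.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up titles and labels, for illustration only.
titles = [
    "markets rally as stocks climb",
    "central bank raises interest rates",
    "stocks fall on weak earnings",
    "investors eye bond markets",
    "team wins championship final",
    "striker scores twice in derby",
    "coach praises young goalkeeper",
    "club signs new midfielder",
]
labels = ["business"] * 4 + ["sports"] * 4

# Hold out a quarter of the data, keeping both classes in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vectorizer on the training titles only, then reuse it for the test titles.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a linear-kernel SVM and evaluate it on the held-out titles.
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(X_train_tfidf, y_train)
y_pred = svm_model.predict(X_test_tfidf)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred):.4f}")
```

With so little data the accuracy figure is not meaningful; the point is only the order of operations: split, fit the vectorizer on the training portion, transform both portions, then train and evaluate.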
  4. Testing a New Dataset with the Pre-trained Model
    To test a new dataset, follow the steps below:
  • Clean the Dataset
from data_cleaning import clean
import pandas as pd

# Load your dataset
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)
  • Extract TF-IDF Features
from tfidf import tfidf

# Transform the cleaned dataset
X_new_tfidf = tfidf.transform(cleaned_df['title'])
  • Make Predictions
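from sklearn.metrics import accuracy_score
from svm import svm_model

# Make predictions
predictions = svm_model.predict(X_new_tfidf)

# Calculate the accuracy score.
# y_new must hold the true labels for the new dataset; its source is not shown here.
accuracy = accuracy_score(y_new, predictions)
print(f"Accuracy Score: {accuracy:.4f}")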