# SVM Model with TF-IDF

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.

## Installation
Before running the code, ensure you have all the required libraries installed:

```bash
pip install nltk beautifulsoup4 scikit-learn pandas datasets
```

Download the NLTK resources needed for preprocessing:

```python
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
```

# How to Use

1. Data Cleaning

The `data_cleaning.py` file contains a `clean()` function to preprocess the input dataset:

- Removes HTML tags.
- Removes non-alphanumeric characters and extra spaces.
- Converts text to lowercase.
- Removes stopwords.
- Lemmatizes words.

```python
from data_cleaning import clean
import pandas as pd

# Load your data
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)
```

2. TF-IDF Feature Extraction

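The vectorization code in this section expects `X_train` and `X_test` to exist already. A minimal sketch of one way to produce them with scikit-learn's `train_test_split` follows. It assumes the cleaned DataFrame has a `title` text column (the column used elsewhere in this README) and a label column, here hypothetically named `label`; the toy rows are invented for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the output of clean(df).
# The 'label' column name is an assumption, not taken from the repository.
cleaned_df = pd.DataFrame({
    "title": [
        "stock market rally", "team win final", "bank raise rate",
        "player score goal", "investor buy share", "coach sign striker",
    ],
    "label": ["business", "sport", "business", "sport", "business", "sport"],
})

# Keep the features as a DataFrame so that X_train['title'] works later on.
X = cleaned_df[["title"]]
y = cleaned_df["label"]

# Hold out 2 rows for testing; stratify so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```

Stratifying on the labels is a design choice that matters most for small or imbalanced datasets; for large balanced data a plain random split is usually sufficient.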
The `tfidf.py` file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.

```python
from tfidf import tfidf

# X_train and X_test are the train/test splits of the cleaned data.
# Fit the vocabulary on the training titles only, then reuse it for the test titles.
X_train_tfidf = tfidf.fit_transform(X_train['title'])
X_test_tfidf = tfidf.transform(X_test['title'])
```

3. Training and Testing the SVM Model

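The features passed to the SVM are sparse TF-IDF matrices. The sketch below illustrates why `fit_transform` is called on the training data but only `transform` on the test data: the vocabulary is learned from the training text alone, so both matrices share the same columns and unseen test words are dropped. It assumes the `tfidf` object is a scikit-learn `TfidfVectorizer` with default settings, which the source does not confirm; the example titles are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_titles = ["stock market rally", "team win final", "bank raise rate"]
test_titles = ["market crash", "team lose final"]

# Assumption: tfidf.py exposes something equivalent to this vectorizer.
tfidf = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training titles only.
X_train_tfidf = tfidf.fit_transform(train_titles)

# Reuse the learned vocabulary; words not seen in training are ignored.
X_test_tfidf = tfidf.transform(test_titles)

print(X_train_tfidf.shape)           # (3, 9)
print(X_test_tfidf.shape)            # (2, 9)
print("crash" in tfidf.vocabulary_)  # False
```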
The `svm.py` file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Train the SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```

4. Testing a New Dataset with the Pre-trained Model

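Reusing a trained model on new data requires the fitted vectorizer as well as the classifier, since the new text must be mapped onto the same feature columns. One common way to persist both is `joblib`, which is installed as a dependency of scikit-learn. The sketch below is an assumption about how this could be done, not a description of how this repository stores its model; the file names and toy data are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data, invented for illustration.
titles = ["stock market rally", "team win final", "bank raise rate", "player score goal"]
labels = ["business", "sport", "business", "sport"]

# Fit the vectorizer and the classifier once.
tfidf = TfidfVectorizer()
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(tfidf.fit_transform(titles), labels)

# Save both objects; the file names are hypothetical.
out_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(out_dir, "tfidf_vectorizer.joblib")
model_path = os.path.join(out_dir, "svm_model.joblib")
joblib.dump(tfidf, vectorizer_path)
joblib.dump(svm_model, model_path)

# Later, or in another process: load both and predict on new text.
loaded_tfidf = joblib.load(vectorizer_path)
loaded_model = joblib.load(model_path)
new_titles = ["bank raise rate", "team win final"]
predictions = loaded_model.predict(loaded_tfidf.transform(new_titles))
print(predictions)
```

Files written by `joblib` should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.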
To test a new dataset, follow the steps below:

- Clean the Dataset

```python
from data_cleaning import clean
import pandas as pd

# Load your dataset
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)
```

- Extract TF-IDF Features

```python
from tfidf import tfidf

# Transform the cleaned dataset with the already-fitted vectorizer
# (use transform, not fit_transform, so the features match the trained model)
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions

```python
from sklearn.metrics import accuracy_score
from svm import svm_model

# Make predictions
predictions = svm_model.predict(X_new_tfidf)

# True labels for the new dataset.
# 'label' is a hypothetical column name; replace it with your dataset's label column.
y_new = cleaned_df['label']

# Calculate accuracy score
accuracy = accuracy_score(y_new, predictions)
print(f"Accuracy Score: {accuracy:.4f}")
```
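The whole workflow (vectorize, train, predict, score) can also be wired into a single scikit-learn `Pipeline`, which keeps the vectorizer and classifier together and makes it harder to fit TF-IDF on test data by accident. This is an alternative sketch, not how the repository's `tfidf.py` and `svm.py` are organized; the toy titles and labels are made up, and real input would first go through `clean()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy, already-cleaned training data (invented for illustration).
train_titles = [
    "stock market rally", "bank raise rate", "investor buy share",
    "team win final", "player score goal", "coach sign striker",
]
train_labels = ["business", "business", "business", "sport", "sport", "sport"]

# Toy "new dataset" with known labels so accuracy can be computed.
new_titles = ["bank buy share", "team score goal"]
new_labels = ["business", "sport"]

# The pipeline fits TF-IDF and the linear SVM in one call.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear", random_state=42)),
])
pipeline.fit(train_titles, train_labels)

# predict() applies transform (not fit_transform) before classifying.
predictions = pipeline.predict(new_titles)
accuracy = accuracy_score(new_labels, predictions)
print(f"Accuracy Score: {accuracy:.4f}")
```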