# SVM Model with TF-IDF

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.

## Installation

Before running the code, make sure the required libraries are installed:

```bash
pip install nltk beautifulsoup4 scikit-learn pandas
```

Download the NLTK resources needed for preprocessing:

```python
import nltk

# Stopword list and WordNet data used by the cleaning step
nltk.download('stopwords')
nltk.download('wordnet')
```

## How to Use

1. Data Cleaning

The `data_cleaning.py` file contains a `clean()` function that preprocesses the input dataset. It:

- Removes HTML tags.
- Removes non-alphanumeric characters and extra spaces.
- Converts text to lowercase.
- Removes stopwords.
- Lemmatizes words.

```python
import pandas as pd

from data_cleaning import clean

# Load your data
df = pd.read_csv('test_data_random_subset.csv')

# Clean the data
cleaned_df = clean(df)
```

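The cleaning steps listed above can be sketched for a single string as follows. This is a minimal illustration, not the repository's `clean()` implementation: `clean_text`, `DEMO_STOPWORDS`, and the identity lemmatizer are hypothetical stand-ins chosen so the example runs without the NLTK downloads, and a regular expression replaces whatever HTML parsing the real function uses. Passing `set(stopwords.words('english'))` and `WordNetLemmatizer().lemmatize` from NLTK in place of the defaults would bring it closer to the steps described.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def clean_text(text, stop_words=DEMO_STOPWORDS, lemmatize=lambda word: word):
    """Apply the five cleaning steps to one string (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)         # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 2. remove non-alphanumerics
    text = text.lower()                          # 3. lowercase
    tokens = text.split()                        # split() also collapses extra spaces
    tokens = [t for t in tokens if t not in stop_words]  # 4. remove stopwords
    return " ".join(lemmatize(t) for t in tokens)        # 5. lemmatize


print(clean_text("<p>The CATS are   running, in the Garden!</p>"))
# cats running garden
```
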
2. TF-IDF Feature Extraction

The `tfidf.py` file contains the TF-IDF vectorization logic. It converts the cleaned text into numerical features suitable for training and testing the SVM model.

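```python
from tfidf import tfidf

# X_train and X_test are assumed to be DataFrames holding the cleaned
# training and test splits, each with a 'title' text column.

# Fit the vectorizer on the training text only, then reuse it on the test text
X_train_tfidf = tfidf.fit_transform(X_train['title'])
X_test_tfidf = tfidf.transform(X_test['title'])
```
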
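To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on toy titles. It assumes the `tfidf` object exported by `tfidf.py` behaves like a scikit-learn `TfidfVectorizer` with default settings, which this README does not confirm; the example titles are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a default TfidfVectorizer stands in for the object in tfidf.py.
tfidf = TfidfVectorizer()

train_titles = ["cheap flight deal", "flight delayed again", "new phone deal"]
test_titles = ["phone flight", "unknown words only"]

# fit_transform learns the vocabulary and IDF weights from the training text.
X_train_tfidf = tfidf.fit_transform(train_titles)

# transform reuses that vocabulary; words never seen in training are ignored.
X_test_tfidf = tfidf.transform(test_titles)

print(X_train_tfidf.shape)  # (3, 7): 3 titles, 7 distinct training terms
print(X_test_tfidf.shape)   # (2, 7): same columns as the training matrix
```
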
3. Training and Testing the SVM Model

The `svm.py` file contains the logic for training and testing the SVM model. It classifies text using the TF-IDF features.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Train the SVM model on the TF-IDF features
# (y_train and y_test are the labels matching X_train and X_test)
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
```

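The snippets in steps 2 and 3 rely on `X_train`, `X_test`, `y_train`, and `y_test`, which this README does not define. One plausible way to create them is scikit-learn's `train_test_split`, sketched here on a toy DataFrame. The `label` column name, the toy rows, and the 25% test size are assumptions for illustration only; the repository may split its data differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset; 'label' is a hypothetical column name.
cleaned_df = pd.DataFrame({
    "title": [
        "cheap flight deal", "flight delayed again",
        "airline cancels flight", "budget airline sale",
        "new phone released", "phone battery review",
        "tablet screen review", "laptop battery deal",
    ],
    "label": ["travel"] * 4 + ["tech"] * 4,
})

# Hold out 25% of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned_df[["title"]],
    cleaned_df["label"],
    test_size=0.25,
    stratify=cleaned_df["label"],
    random_state=42,
)

print(len(X_train), len(X_test))  # 6 2
```
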
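4. Testing a New Dataset with the Pre-trained Model

To test a new dataset, combine the steps above: clean the data with `clean()`, convert the text with the fitted TF-IDF vectorizer using `transform()` (not `fit_transform()`), and call `predict()` on the SVM model.
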
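As a rough sketch of that workflow, the example below saves a fitted vectorizer and model and then reloads them to score new text. This README does not say how the pre-trained model is stored, so the use of `joblib` and the file names `tfidf.joblib` and `svm_model.joblib` are assumptions. The first half only exists to create stand-in artifacts so the example is self-contained.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# --- Stand-in for the shipped artifacts (file names are hypothetical) ---
train_titles = ["cheap flight deal", "flight delayed again",
                "new phone released", "phone battery review"]
train_labels = ["travel", "travel", "tech", "tech"]

tfidf = TfidfVectorizer()
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(tfidf.fit_transform(train_titles), train_labels)

model_dir = tempfile.mkdtemp()
joblib.dump(tfidf, os.path.join(model_dir, "tfidf.joblib"))
joblib.dump(svm_model, os.path.join(model_dir, "svm_model.joblib"))

# --- Scoring a new dataset with the saved artifacts ---
loaded_tfidf = joblib.load(os.path.join(model_dir, "tfidf.joblib"))
loaded_model = joblib.load(os.path.join(model_dir, "svm_model.joblib"))

new_titles = ["flight deal today", "phone review"]  # assumed already cleaned

# Use transform, not fit_transform, so the columns match the trained model.
new_features = loaded_tfidf.transform(new_titles)
predictions = loaded_model.predict(new_features)
print(list(predictions))
```

Refitting the vectorizer on the new data would produce a different vocabulary, and the resulting feature columns would no longer line up with the ones the model was trained on.
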