CIS5190abcd / svm

yitingliii

Update README.md

9d8f216 verified 3 months ago

	# SVM Model with TF-IDF
	This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. It also includes utilities for data preprocessing and feature extraction.
	## Installation
	<br>Before running the code, ensure you have all the required libraries installed:

	```bash
	pip install nltk beautifulsoup4 scikit-learn pandas
	```
	<br> Download the NLTK resources needed for preprocessing:
	```python
	import nltk
	nltk.download('stopwords')
	nltk.download('wordnet')

	```
	## How to Use
	1. Data Cleaning
	<br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
	- Removes HTML tags.
	- Removes non-alphanumeric characters and extra spaces.
	- Converts text to lowercase.
	- Removes stopwords.
	- Lemmatizes words.

	```python
	from data_cleaning import clean
	import pandas as pd

	# Load your data
	df = pd.read_csv('test_data_random_subset.csv')

	# Clean the data
	cleaned_df = clean(df)

	```
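The five cleaning steps listed above can be sketched on a single string as follows. This is a minimal illustration using only the standard library, not the repository's `clean()` implementation: the `clean_text` name, the tiny `STOPWORDS` set, and the identity `lemmatize` default are all hypothetical stand-ins for what `data_cleaning.py` presumably does with BeautifulSoup and NLTK.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def clean_text(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Apply the five cleaning steps to one string (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags (regex approximation)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove non-alphanumeric characters
    text = text.lower()                          # convert to lowercase
    tokens = text.split()                        # split() also collapses extra spaces
    tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    return " ".join(lemmatize(t) for t in tokens)       # lemmatize each remaining word


print(clean_text("<p>The  Markets ARE rallying!</p>"))  # markets rallying
```

A real lemmatizer can be passed in through the `lemmatize` argument, for example `WordNetLemmatizer().lemmatize` from NLTK once the `wordnet` resource has been downloaded.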

	2. TF-IDF Feature Extraction
	<br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
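	```python
	from tfidf import tfidf

	# X_train and X_test are assumed to be cleaned DataFrames with a 'title' column.
	# Fit the vocabulary and IDF weights on the training data only,
	# then reuse them to transform the test data.
	X_train_tfidf = tfidf.fit_transform(X_train['title'])
	X_test_tfidf = tfidf.transform(X_test['title'])
	```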
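To show what fitting and transforming compute, here is a small pure-Python sketch of TF-IDF. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 row normalization, which matches the documented defaults of scikit-learn's `TfidfVectorizer`; whether `tfidf.py` uses those defaults is an assumption, and `fit_idf` and `transform` are hypothetical helper names.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights from the training documents."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(docs, idf):
    """Turn documents into L2-normalized TF-IDF rows (dicts of term -> weight)."""
    rows = []
    for doc in docs:
        # Terms not seen during fitting are dropped, as in transform() on test data.
        tf = Counter(t for t in doc.split() if t in idf)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


train_docs = ["stocks rally", "stocks fall", "team wins"]
test_docs = ["stocks soar"]

idf = fit_idf(train_docs)                # fit on training data only
train_rows = transform(train_docs, idf)
test_rows = transform(test_docs, idf)    # "soar" is unseen, so it is ignored
```

Fitting on the training set alone keeps the test features in the same vocabulary and weighting as the features the SVM was trained on.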
	3. Training and Testing the SVM Model
	<br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
	```python
	from sklearn.svm import SVC
	from sklearn.metrics import accuracy_score, classification_report

	# Train the SVM model (y_train and y_test are assumed to hold the class labels)
	svm_model = SVC(kernel='linear', random_state=42)
	svm_model.fit(X_train_tfidf, y_train)

	# Predict and evaluate
	y_pred = svm_model.predict(X_test_tfidf)
	accuracy = accuracy_score(y_test, y_pred)
	print(f"SVM Accuracy: {accuracy:.4f}")
	print(classification_report(y_test, y_pred))
	```
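The snippets in steps 2 and 3 assume that `X_train`, `X_test`, `y_train`, and `y_test` already exist. A minimal end-to-end sketch of how they might be produced and used is shown below. The toy titles and labels are made up for illustration, plain lists stand in for the DataFrame columns, and a fresh `TfidfVectorizer` stands in for the `tfidf` object from `tfidf.py`, whose configuration is not shown in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in data; the real pipeline would use the cleaned dataset from step 1.
titles = [
    "stocks rally on earnings", "markets fall on rate fears",
    "shares climb after earnings", "stocks slide as rates rise",
    "team wins championship game", "striker scores winning goal",
    "coach praises team after game", "team loses final game",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out 25% of the examples for testing, keeping both classes represented.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training titles only, then transform the test titles.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a linear SVM and predict labels for the held-out titles.
svm_model = SVC(kernel="linear", random_state=42)
svm_model.fit(X_train_tfidf, y_train)
y_pred = svm_model.predict(X_test_tfidf)
print(list(zip(X_test, y_pred)))
```

With a dataset this small the predictions are not meaningful; the point is only the order of operations: split, fit the vectorizer on the training portion, transform both portions, then train and predict.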

	4. Testing a New Dataset with the Pre-trained Model
	<br>To test a new dataset, combine the steps above:
	-