yitingliii committed on
Commit
9d8f216
1 Parent(s): fad1a71

Update README.md

  1. README.md +37 -14
README.md CHANGED
@@ -1,5 +1,5 @@
1
  # SVM Model with TF-IDF
2
- Step-by-step instructions:
3
  ## Installation
4
  <br>Before running the code, ensure you have all the required libraries installed:
5
 
@@ -14,18 +14,13 @@ nltk.download('wordnet')
14
 
15
  ```
16
  # How to Use:
17
- 1. Pre-Trained Model and Vectorizer
18
- <br> The repository includes:
19
- - model.pkl : The pre-trained SVM model
20
- - tfidf.pkl: The saved TF-IDF vectorizer used to transform the text data.
21
-
22
- 2. Testing a new dataset
23
- <br> To test the model with the new dataset, follow these steps:
24
- - Step 1: Prepare the dataset:
25
- <br> Ensure the dataset is in CSV format and has three columns: title, outlet and labels. The title column contains the text data to be classified.
26
-
27
- - Step 2: Preprocess the Data
28
- <br>Use the clean() function from data_cleaning.py to preprocess the text data:
29
 
30
  ```python
31
  from data_cleaning import clean
@@ -39,5 +34,33 @@ cleaned_df = clean(df)
39
 
40
  ```
41
 
42
- - Step 3: Load the pre-trained model and TF-IDF Vectorizer
43
 
 
1
  # SVM Model with TF-IDF
2
+ This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.
3
  ## Installation
4
  <br>Before running the code, ensure you have all the required libraries installed:
5
 
 
14
 
15
  ```
16
  # How to Use:
17
+ 1. Data Cleaning
18
+ <br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
19
+ - Removes HTML tags.
20
+ - Removes non-alphanumeric characters and extra spaces.
21
+ - Converts text to lowercase.
22
+ - Removes stopwords.
23
+ - Lemmatizes words.
24
 
25
  ```python
26
  from data_cleaning import clean
 
34
 
35
  ```
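The five cleaning steps listed for clean() can be sketched on a single string as below. This is a minimal, standard-library-only illustration and not the code in data_cleaning.py: the function name `clean_text`, the tiny `STOPWORDS` set, and the `lemmatize` hook are all assumptions made for the example. The real function presumably relies on NLTK's stopword list and WordNet lemmatizer, since the installation step downloads `wordnet`.

```python
import html
import re

# Tiny illustrative stopword set (an assumption); the repository's clean()
# presumably uses NLTK's full English list instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def clean_text(text, lemmatize=lambda word: word):
    """Apply the five cleaning steps to one string.

    `lemmatize` is a hypothetical hook: pass a real lemmatizer's method
    (for example nltk's WordNetLemmatizer().lemmatize) to get actual
    lemmatization. The default leaves every word unchanged.
    """
    text = re.sub(r"<[^>]+>", " ", text)         # 1. remove HTML tags
    text = html.unescape(text)                    #    decode entities such as &amp;
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. remove non-alphanumeric characters
    text = text.lower()                           # 3. convert to lowercase
    # 4. remove stopwords (split() also collapses extra spaces)
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(lemmatize(w) for w in words)  # 5. lemmatize each word


print(clean_text("<p>The Markets are <b>UP</b> &amp; rising in 2024!</p>"))
# prints: markets up rising 2024
```

Keeping the lemmatizer as a parameter makes the sketch runnable without downloading NLTK data, while still showing where that step sits in the pipeline.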
36
 
37
+ 2. TF-IDF Feature Extraction
38
+ <br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
39
+ ```python
40
+ from tfidf import tfidf
41
+
42
+ # Apply TF-IDF vectorization
43
+ X_train_tfidf = tfidf.fit_transform(X_train['title'])
44
+ X_test_tfidf = tfidf.transform(X_test['title'])
45
+ ```
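To make the numbers behind the TF-IDF step concrete, here is a small pure-Python sketch of the weighting. It uses smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which matches scikit-learn's `TfidfVectorizer` defaults. Whether tfidf.py keeps those defaults is an assumption, and `tfidf_vectors` is a hypothetical helper name, not something from the repository.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]        # term count * idf
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_vectors(["stocks rally", "stocks fall", "rain expected"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "rally" gets a higher weight than "stocks" because "stocks" also appears in the second document, so it is less distinctive.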
46
+ 3. Training and Testing the SVM Model
47
+ <br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
48
+ ```python
49
+ from sklearn.svm import SVC
50
+ from sklearn.metrics import accuracy_score, classification_report
51
+
52
+ # Train the SVM model
53
+ svm_model = SVC(kernel='linear', random_state=42)
54
+ svm_model.fit(X_train_tfidf, y_train)
55
+
56
+ # Predict and evaluate
57
+ y_pred = svm_model.predict(X_test_tfidf)
58
+ accuracy = accuracy_score(y_test, y_pred)
59
+ print(f"SVM Accuracy: {accuracy:.4f}")
60
+ print(classification_report(y_test, y_pred))
61
+ ```
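For readers unfamiliar with the evaluation step, the sketch below shows by hand what `accuracy_score` measures and the per-class precision and recall that `classification_report` summarizes. The labels and predictions are made-up illustration values, not results from this model, and the two helper functions are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the class was predicted
    actual = sum(t == label for t in y_true)     # times the class truly occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [0, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0]

print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
for label in sorted(set(y_true)):
    p, r = precision_recall(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
```

Accuracy alone can hide weak performance on a minority class, which is why the README's snippet prints the full classification report as well.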
62
+
63
+ 4. Testing a new dataset with the pre-trained model
64
+ <br>To test a new dataset, combine the steps above:
65
+ -
66