yitingliii committed
Commit ada45ca
1 Parent(s): ca7966e

Update README.md

Files changed (1):
  1. README.md +38 -55
README.md CHANGED
@@ -1,5 +1,27 @@
  # SVM Model with TF-IDF
- This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.:
  ## Installation
  <br>Before running the code, ensure you have all the required libraries installed:

@@ -7,72 +29,31 @@ This repository provides a pre-trained Support Vector Machine (SVM) model for te
  pip install nltk beautifulsoup4 scikit-learn pandas datasets
  ```
  <br> Download necessary NLTK resources for preprocessing.
- ```python
- import nltk
- nltk.download('stopwords')
- nltk.download('wordnet')
-
  ```
- # How to Use:
17
- 1. Data Cleaning
18
- <br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
19
- - Removes HTML tags.
20
- - Removes non-alphanumeric characters and extra spaces.
21
- - Converts text to lowercase.
22
- - Removes stopwords.
23
- - Lemmatizes words.
24
-
- ```python
- from data_cleaning import clean
- import pandas as pd
  import nltk
  nltk.download('stopwords')
-
-
- # Load your data
- df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
-
- # Clean the data
- cleaned_df = clean(df)
-
  ```
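The implementation of data_cleaning.py itself is not shown in this diff. As a rough illustration of the bulleted steps above, a clean() helper could look something like the sketch below; the clean_text helper and the assumption that the text lives in a 'title' column (as in the TF-IDF examples further down) are illustrative, not the repository's actual code.
```python
# Illustrative sketch only -- the real data_cleaning.py may differ in details.
import re

import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()      # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)                # drop non-alphanumeric characters
    text = re.sub(r"\s+", " ", text).strip().lower()           # collapse extra spaces, lowercase
    tokens = [t for t in text.split() if t not in STOPWORDS]   # remove stopwords
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)   # lemmatize words

def clean(df: pd.DataFrame, column: str = "title") -> pd.DataFrame:
    # Apply the text cleaning to the assumed text column and return a new DataFrame.
    df = df.copy()
    df[column] = df[column].astype(str).map(clean_text)
    return df
```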
-
- 2. TF-IDF Feature Extraction
- <br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
- ```python
- from tfidf import tfidf
-
- # Apply TF-IDF vectorization
- X_train_tfidf = tfidf.fit_transform(X_train['title'])
- X_test_tfidf = tfidf.transform(X_test['title'])
  ```
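The contents of tfidf.py are not part of this diff. Given the fit_transform/transform usage above, the imported tfidf object is presumably a scikit-learn TfidfVectorizer (or something equivalent); a minimal sketch under that assumption:
```python
# tfidf.py -- illustrative sketch only; the real module may configure the
# vectorizer differently (n-gram range, vocabulary size, etc.).
from sklearn.feature_extraction.text import TfidfVectorizer

# Shared vectorizer instance, matching the `from tfidf import tfidf` usage above.
tfidf = TfidfVectorizer()
```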
- 3. Training and Testing the SVM Model
- <br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
- ```python
- from sklearn.svm import SVC
- from sklearn.metrics import accuracy_score, classification_report
-
- # Train the SVM model
- svm_model = SVC(kernel='linear', random_state=42)
- svm_model.fit(X_train_tfidf, y_train)
-
- # Predict and evaluate
- y_pred = svm_model.predict(X_test_tfidf)
- accuracy = accuracy_score(y_test, y_pred)
- print(f"SVM Accuracy: {accuracy:.4f}")
- print(classification_report(y_test, y_pred))
  ```

- 4. Training a new dataset with pre-trained model
- <br>To test a new dataset, follow the steps below:

  - Clean the Dataset
  ```python
  from data_cleaning import clean
  import pandas as pd
-
- # Load your dataset
- df = pd.read_csv('test_data_random_subset.csv')

  # Clean the data
  cleaned_df = clean(df)
@@ -97,3 +78,5 @@ predictions = svm_model.predict(X_new_tfidf)

  ```

  # SVM Model with TF-IDF
+ This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction:
+
+ ## Start:
+ <br>Open your terminal.
+ <br> Clone the repo using the following command:
+ ```
+ git clone https://huggingface.co/CIS5190abcd/svm
+ ```
+ <br> Go to the svm directory using the following command:
+ ```
+ cd svm
+ ```
+ <br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If not, run the following commands:
+ ```
+ git checkout origin/main -- tfidf.py
+ git checkout origin/main -- svm.py
+ git checkout origin/main -- data_cleaning.py
+ ```
+ <br> Rerun ```ls``` and double-check that all the required files exist. The output should look like this:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
+ <br> Stay inside the svm directory until the end of this guide.
+
  ## Installation
  <br>Before running the code, ensure you have all the required libraries installed:

  pip install nltk beautifulsoup4 scikit-learn pandas datasets
  ```
  <br> Download necessary NLTK resources for preprocessing.
  ```
+ python
  import nltk
  nltk.download('stopwords')
+ nltk.download('wordnet')
  ```
+ <br> After downloading all the required packages, exit the Python interpreter:
  ```
+ exit()
  ```

+ ## How to use:
+ To train on a new dataset with the existing SVM model, follow the steps below:

  - Clean the Dataset
  ```python
  from data_cleaning import clean
  import pandas as pd
+ import nltk
+ nltk.download('stopwords')
+ ```
+ <br> You can use any dataset you want by changing the file name inside ```pd.read_csv()```.
+ ```
+ # Load your data
+ df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")

  # Clean the data
  cleaned_df = clean(df)
 
  ```

+
+
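The remaining steps of the new "How to use" section (TF-IDF vectorization and prediction) are unchanged in this commit and therefore elided from the diff; only their closing fence appears above. For orientation only, and judging from the corresponding steps in the previous README and the hunk context line (`predictions = svm_model.predict(X_new_tfidf)`), applying the pre-trained pipeline to the cleaned data looks roughly like the sketch below; it assumes `tfidf` and `svm_model` are the fitted vectorizer and trained SVM from the earlier steps.
```python
# Orientation-only sketch -- not the literal elided README content.
# Assumes `tfidf` (fitted vectorizer) and `svm_model` (trained SVM) from the steps above.
X_new_tfidf = tfidf.transform(cleaned_df['title'])   # vectorize the cleaned headlines
predictions = svm_model.predict(X_new_tfidf)         # classify with the pre-trained SVM
print(predictions[:10])                              # inspect the first few predicted labels
```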