# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
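
As background, the model follows the standard scikit-learn TF-IDF + linear SVM pattern. The sketch below is *not* the repository's pre-trained model; it is a minimal toy illustration of how the two pieces fit together (all data and variable names here are made up):

```python
# Toy illustration of the TF-IDF + SVM pattern (not the repo's trained model).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "stocks rally on strong earnings",
    "team wins championship game",
    "markets fall amid rate fears",
    "star player scores winning goal",
]
train_labels = ["business", "sports", "business", "sports"]

tfidf = TfidfVectorizer()            # learns the vocabulary and IDF weights
X_train = tfidf.fit_transform(train_texts)

svm_model = LinearSVC()              # linear SVM classifier
svm_model.fit(X_train, train_labels)

# New text must be transformed with the SAME fitted vectorizer, never
# re-fitted -- which is exactly what tfidf.transform() does in the steps below.
X_new = tfidf.transform(["quarterly earnings beat estimates"])
print(svm_model.predict(X_new))
```

The key point the steps below rely on: the vectorizer and the classifier are fitted once, and at test time you only call `transform` and `predict`.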

There are two ways to test our model: 
# 1. Colab (see the ```test_example.py``` file for what the Colab notebook looks like)
## Start
<br> Download the files ```tfidf.py```, ```svm.py``` and ```data_cleaning.py```.
<br> Upload the files, as well as the test data, directly under Files in Colab.
<br> Copy all the code below into Colab.
<br>Before running the code, ensure you have all the required libraries installed:

```python
!pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```

<br> Download the necessary NLTK resources for preprocessing. 

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```

<br> You can use any dataset you want by changing the file name inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```

- Extract TF-IDF Features
```python
from tfidf import tfidf

X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions
```python
from svm import svm_model

# Predict labels for the TF-IDF features extracted above
predictions = svm_model.predict(X_new_tfidf)
print(predictions)
```
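
If your test CSV also carries ground-truth labels, you can score the predictions with scikit-learn. A minimal sketch, assuming a label column exists; the column name ```label``` and the toy arrays below are placeholders, so substitute ```svm_model.predict(X_new_tfidf)``` and your real label column:

```python
from sklearn.metrics import accuracy_score

# Placeholder arrays -- in practice use:
#   predictions = svm_model.predict(X_new_tfidf)
#   true_labels = cleaned_df['label']   # hypothetical column name
predictions = ["business", "sports", "business"]
true_labels = ["business", "sports", "sports"]

acc = accuracy_score(true_labels, predictions)
print(f"accuracy: {acc:.2f}")  # accuracy: 0.67
```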



# 2. Terminal
## Start:
<br>Open your terminal. 
<br> Clone the repo by using the following command:
```
git clone https://huggingface.co/CIS5190abcd/svm
```
<br> Go to the svm directory using the following command:
```
cd svm
```
<br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If not, run the following commands:
```
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
```
<br> Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
<br> Stay inside the svm directory until the end. 

## Installation
<br>Before running the code, ensure you have all the required libraries installed:

```
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
<br> Start Python, which can be opened directly in the terminal, by typing the following command:
```
python
```
<br> Download the necessary NLTK resources for preprocessing. 

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
<br> After downloading all the required resources, **do not** exit the Python session. 


## How to use:
To test a new dataset with the existing SVM model, follow the steps below:

- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file name inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```

- Extract TF-IDF Features
```python
from tfidf import tfidf

X_new_tfidf = tfidf.transform(cleaned_df['title'])
```

- Make Predictions
```python
from svm import svm_model

# Predict labels for the TF-IDF features extracted above
predictions = svm_model.predict(X_new_tfidf)
print(predictions)
```
<br> Run ```exit()``` if you want to leave Python.
<br> Run ```cd ..``` if you want to exit the svm directory.