yitingliii committed
Commit 9c9929c • 1 Parent(s): e1bbe05
Update README.md
README.md CHANGED
@@ -2,16 +2,58 @@
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction:

There are two ways to test our model:
- # 1.Colab
## Start
<br> Download all the files.
<br> Copy all the code below into Colab
```python
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```

# 2. Terminal

This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction:

There are two ways to test our model:
+ # 1. Colab (see the file for how the Colab looks)
## Start
<br> Download all the files.
<br> Copy all the code below into Colab
+ <br>Before running the code, ensure you have all the required libraries installed:
+
```python
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```

+ <br> Download the necessary NLTK resources for preprocessing.
+
+ ```python
+ import nltk
+ nltk.download('stopwords')
+ nltk.download('wordnet')
+ nltk.download('omw-1.4')
+ ```
+ <br>Clean the Dataset
+ ```python
+ from data_cleaning import clean
+ import pandas as pd
+ import nltk
+ nltk.download('stopwords')
+ ```
+
+ <br> You can use any dataset you want by changing the file name inside `pd.read_csv()`.
+ ```python
+
+ df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")


+ cleaned_df = clean(df)
+
+ ```
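The commit does not show what `clean` does internally. As a rough illustration of the kind of headline cleaning such a helper might perform, here is a minimal sketch using only the standard library. The function name `clean_title`, the regular expressions, and the tiny stop-word set are all assumptions made for this example; the real `data_cleaning.clean` may work differently (for instance, it may use BeautifulSoup and the NLTK stop-word list downloaded above).

```python
import html
import re

# Hypothetical stand-in for the repository's `clean` helper; the real
# implementation in data_cleaning.py is not shown in this commit.
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
# Tiny illustrative stop-word set (NLTK's list is much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "for"}


def clean_title(title: str) -> str:
    """Strip HTML, lowercase, drop non-letters and stop words."""
    text = html.unescape(title)         # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)        # remove HTML tags
    text = text.lower()
    text = NON_ALPHA_RE.sub(" ", text)  # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_title("<b>Stocks</b> rally in the US &amp; Europe!"))
```

Applied to a DataFrame, a function like this would typically be mapped over the `title` column, which is the column the later TF-IDF step reads.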
+
+ - Extract TF-IDF Features
+ ```python
+ from tfidf import tfidf
+
+
+ X_new_tfidf = tfidf.transform(cleaned_df['title'])
+
+ ```
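The README calls `transform` rather than fitting a new vectorizer, which suggests that `tfidf` already carries a vocabulary and IDF weights learned at training time. The sketch below shows that split between fitting and transforming in plain Python. It uses the smoothed IDF formula and L2 normalisation that scikit-learn's `TfidfVectorizer` documents as its defaults; whether the repository's `tfidf` object uses those defaults is an assumption, and the helper names `fit_idf` and `transform` are made up for this example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Map one document to an L2-normalised TF-IDF vector (as a dict)."""
    # Terms that were not seen during fitting are ignored.
    counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


# Toy "training" titles, invented for illustration.
train_titles = ["stocks rally europe", "stocks fall asia", "team wins final"]
idf = fit_idf(train_titles)

vec = transform("stocks rally again", idf)
print(vec)
```

In this toy run, "again" is dropped because it was never seen during fitting, and "rally" gets a higher weight than "stocks" because it appears in fewer training titles.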
+
+ - Make Predictions
+ ```python
+
+ from svm import svm_model
+
+ ```
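The block above only imports `svm_model`; the commit does not show the prediction call itself. If `svm_model` is a fitted scikit-learn classifier (an assumption, not confirmed by the commit), the next step would presumably be something like `svm_model.predict(X_new_tfidf)`. To illustrate what a linear SVM does at prediction time, here is a minimal sketch with made-up weights. The kernel choice, the weights, the bias, and the label names are all hypothetical.

```python
def decision_score(features, weights, bias):
    """Linear SVM score: w . x + b over the non-zero features."""
    return sum(weights.get(term, 0.0) * value for term, value in features.items()) + bias


def predict(features, weights, bias, labels=("class_0", "class_1")):
    """Pick the positive label when the score is above zero."""
    return labels[1] if decision_score(features, weights, bias) > 0 else labels[0]


# Made-up weights for illustration; a real model learns these during training.
weights = {"stocks": 1.2, "rally": 0.8, "team": -1.5, "final": -0.9}
bias = -0.1

# Each input is a sparse TF-IDF vector keyed by term.
print(predict({"stocks": 0.6, "rally": 0.8}, weights, bias))
print(predict({"team": 0.7, "final": 0.7}, weights, bias))
```

The first headline scores above zero and the second below, so they fall on opposite sides of the decision boundary.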
+


# 2. Terminal