# SVM Model with TF-IDF
This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency–Inverse Document Frequency (TF-IDF) features, along with utilities for data preprocessing and feature extraction.

There are two ways to test our model:
# 1. Colab (see the test_example.py file for what the Colab notebook looks like)
## Start
<br> Download the files ```tfidf.py```, ```svm.py``` and ```data_cleaning.py```.
<br> Upload them directly under Files in Colab, along with your test data.
<br> Copy all the code below into Colab.
<br>Before running the code, ensure you have all the required libraries installed:
```python
!pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
<br> Download the necessary NLTK resources for preprocessing.
```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file path inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```
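The actual cleaning logic lives in `data_cleaning.py`; as a rough, hypothetical illustration of the kind of normalization typically applied to headlines (HTML stripping, lowercasing, punctuation removal) — the repo's real `clean()` may differ:

```python
import re
import pandas as pd

def clean_title(text: str) -> str:
    """Illustrative cleaning only; the repo's clean() in data_cleaning.py may differ."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = text.lower()                       # lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

df = pd.DataFrame({"title": ["<b>Stocks Rally!</b>", "Rain, again?"]})
df["title"] = df["title"].map(clean_title)
print(df["title"].tolist())  # ['stocks rally', 'rain again']
```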
- Extract TF-IDF Features
```python
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```
- Make Predictions
```python
from svm import svm_model

# svm_model is the pre-trained classifier; assuming it follows the
# scikit-learn API, predict labels for the TF-IDF features:
predictions = svm_model.predict(X_new_tfidf)
```
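Concretely, the TF-IDF-then-SVM pattern that the repo's `tfidf` and `svm_model` objects implement can be sketched with scikit-learn on toy data (the vectorizer and classifier below are trained here purely for illustration and are not the repo's pre-trained model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data standing in for the real headline corpus
titles = ["stocks rally on earnings", "team wins championship game",
          "market falls on rate fears", "star player scores twice"]
labels = ["business", "sports", "business", "sports"]

vectorizer = TfidfVectorizer()      # plays the role of the repo's `tfidf`
X = vectorizer.fit_transform(titles)

model = LinearSVC().fit(X, labels)  # plays the role of `svm_model`

# New text must go through the SAME fitted vectorizer before predicting
X_new = vectorizer.transform(["market rally continues"])
print(model.predict(X_new))         # ['business']
```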
# 2. Terminal
## Start
<br>Open your terminal.
<br> Clone the repo using the following command:
```
git clone https://huggingface.co/CIS5190abcd/svm
```
<br> Go to the ```svm``` directory using the following command:
```
cd svm
```
<br> Run ```ls``` to list the files inside the ```svm``` folder and make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If any are missing, run the following commands:
```
git checkout origin/main -- tfidf.py
git checkout origin/main -- svm.py
git checkout origin/main -- data_cleaning.py
```
<br> Rerun ```ls``` and double-check that all the required files (```tfidf.py```, ```svm.py``` and ```data_cleaning.py```) exist. The output should look like this:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
<br> Stay inside the ```svm``` directory until the end of this guide.
## Installation
<br>Before running the code, ensure you have all the required libraries installed:
```
pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
```
<br> Start a Python interpreter directly in the terminal by typing the following command:
```
python
```
<br> Download the necessary NLTK resources for preprocessing.
```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
```
<br> After downloading all the required resources, **do not** exit the Python interpreter.
## How to use
To run a new dataset through the existing SVM model, follow the steps below:
- Clean the Dataset
```python
from data_cleaning import clean
import pandas as pd
import nltk
nltk.download('stopwords')
```
<br> You can use any dataset you want by changing the file path inside ```pd.read_csv()```.
```python
df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
cleaned_df = clean(df)
```
- Extract TF-IDF Features
```python
from tfidf import tfidf
X_new_tfidf = tfidf.transform(cleaned_df['title'])
```
- Make Predictions
```python
from svm import svm_model

# svm_model is the pre-trained classifier; assuming it follows the
# scikit-learn API, predict labels for the TF-IDF features:
predictions = svm_model.predict(X_new_tfidf)
```
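Once you have predictions, a common follow-up is to save them alongside the titles for inspection. A minimal sketch — the toy values and the `svm_predictions.csv` filename below are illustrative, not part of the repo:

```python
import pandas as pd

# Hypothetical example: attach predictions to their titles and save to CSV
out = pd.DataFrame({"title": ["example headline one", "example headline two"],
                    "prediction": [0, 1]})
out.to_csv("svm_predictions.csv", index=False)
print(pd.read_csv("svm_predictions.csv").shape)  # (2, 2)
```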
Run ```exit()``` if you want to leave Python, and ```cd ..``` if you want to leave the ```svm``` directory.