CIS5190abcd/svm


yitingliii committed 25 days ago

Commit 1d6d48d • 1 Parent(s): f642add

Update README.md


Files changed (1)

README.md +55 -1


@@ -1,2 +1,56 @@
 # SVM Model with TF-IDF
-This model uses TF-IDF for feature extraction.

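+Step-by-step instructions:
+1. Install and import the required packages:
+<br>Before running the code, install the necessary packages (`nltk`, `beautifulsoup4`, `pandas`, `scikit-learn`), then import them and download the NLTK stopword and WordNet data.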
+```python
+import nltk
+nltk.download('stopwords')
+nltk.download('wordnet')
+from nltk.corpus import stopwords
+from nltk.stem import WordNetLemmatizer
+from bs4 import BeautifulSoup
+import re
+import pandas as pd
+from sklearn.svm import SVC
+from sklearn.metrics import accuracy_score, classification_report
+```
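+2. Clean the data:
+<br>Next, clean the headlines in the `title` column so the input has a consistent format: strip HTML, remove non-alphanumeric characters, collapse whitespace, lowercase, drop stop words, lemmatize, and drop duplicate titles.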
+```python
+def clean(df):
+    stop_words = set(stopwords.words('english'))
+    lemmatizer = WordNetLemmatizer()
+    cleaned_headlines = []
+    for headline in df['title']:
+        # Strip any HTML markup and keep only the visible text.
+        headline = BeautifulSoup(headline, 'html.parser').get_text()
+        # Remove everything except letters, digits, and whitespace.
+        headline = re.sub(r'[^a-zA-Z0-9\s]', '', headline)
+        # Collapse runs of whitespace into single spaces.
+        headline = re.sub(r'\s+', ' ', headline).strip()
+        headline = headline.lower()
+        words = headline.split()
+        # Drop English stop words and lemmatize the remaining tokens.
+        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
+        cleaned_headline = ' '.join(words)
+        cleaned_headlines.append(cleaned_headline)
+    df['title'] = cleaned_headlines
+    # Note: this modifies the input DataFrame in place.
+    df.drop_duplicates(subset=['title'], inplace=True)
+    return df
+```
+3. Run the SVM model:
+```python
+svm_model = SVC(kernel='linear', random_state=42)
+svm_model.fit(X_train_tfidf, y_train)
+y_pred = svm_model.predict(X_test_tfidf)
+accuracy = accuracy_score(y_test, y_pred)
+print(f"SVM Accuracy: {accuracy:.4f}")
+print(classification_report(y_test, y_pred))
+```
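
Step 3 uses `X_train_tfidf`, `X_test_tfidf`, `y_train`, and `y_test`, but the diff does not show how they are built. The sketch below is one plausible way to produce them with scikit-learn's `TfidfVectorizer` and `train_test_split`. The toy `titles` and `labels`, the split ratio, and the default vectorizer settings are all assumptions made for illustration, not the author's actual data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for the cleaned 'title' column and its labels.
titles = [
    "stock market rally continues",
    "team win championship final",
    "central bank raise interest rate",
    "player score late goal",
    "investor watch inflation report",
    "coach name new captain",
    "company report quarterly earnings",
    "club sign young striker",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Assumed split: 75% train, 25% test, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vocabulary and IDF weights on the training text only,
# then reuse them to transform the test text.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Same model settings as in step 3 of the README.
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)
y_pred = svm_model.predict(X_test_tfidf)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred):.4f}")
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary out of the features, which is the usual way to avoid leakage when reporting accuracy.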