Tymec commited on
Commit
370bb72
1 Parent(s): 183f8cd

Update README with images

Browse files
.gitignore CHANGED
@@ -196,4 +196,4 @@ pyrightconfig.json
196
  # Custom
197
  data/*
198
  !data/slang.json
199
- flagged/
 
196
  # Custom
197
  data/*
198
  !data/slang.json
199
+ !data/test.csv
README.md CHANGED
@@ -17,7 +17,7 @@ models:
17
  ---
18
 
19
 
20
- # Sentiment Analysis [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Tymec/sentiment-analysis)
21
 
22
 
23
  ### Table of Contents
@@ -64,8 +64,7 @@ To see the available commands and options, run:
64
  ```bash
65
  python -m app --help
66
  ```
67
-
68
- <!-- Image of the output -->
69
 
70
 
71
  ### Predict
@@ -79,8 +78,7 @@ Alternatively, you can pipe the text into the command:
79
  ```bash
80
  echo "I love this movie" | python -m app predict --model <model>
81
  ```
82
-
83
- <!-- Image of the output -->
84
 
85
 
86
  ### GUI
@@ -89,11 +87,10 @@ To launch the GUI, run the following command:
89
  python -m app gui --model <model>
90
  ```
91
  where `<model>` is the path to the trained model. Add the `--share` flag to create a publicly accessible link.
 
92
 
93
  After running the command, open the link from the terminal in your browser to access the GUI.
94
-
95
- <!-- Image of the output -->
96
- <!-- Image of the GUI -->
97
 
98
 
99
  ### Training
@@ -109,8 +106,7 @@ To see all available options, run:
109
  ```bash
110
  python -m app train --help
111
  ```
112
-
113
- <!-- Image of the output -->
114
 
115
 
116
  ### Evaluation
@@ -124,8 +120,7 @@ To see all available options, run:
124
  ```bash
125
  python -m app evaluate --help
126
  ```
127
-
128
- <!-- Image of the output -->
129
 
130
 
131
  ## Options
@@ -136,10 +131,10 @@ python -m app evaluate --help
136
  | sentiment140 | `data/sentiment140.csv` | | [Twitter Sentiment Analysis](https://www.kaggle.com/kazanova/sentiment140) |
137
  | amazonreviews | `data/amazonreviews.bz2` | only train is used | [Amazon Product Reviews](https://www.kaggle.com/bittlingmayer/amazonreviews) |
138
  | imdb50k | `data/imdb50k.csv` | | [IMDB Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) |
139
- | test | `data/test.csv` | required for `evaluate` | [Multiclass Sentiment Analysis](https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset) |
140
 
141
- #### Used for text preprocessing
142
- - [Slang Map](Https://www.kaggle.com/code/nmaguette/up-to-date-list-of-slangs-for-text-preprocessing)
143
 
144
 
145
  ### Vectorizers
@@ -164,8 +159,9 @@ The following environment variables can be set to customize the behavior of the
164
 
165
  ### Architecture
166
  The input text is first preprocessed and tokenized using `re` and `spaCy` where:
167
- - The text is cleaned up by removing any HTML tags and converting emojis to text
168
- - Stop words and punctuation are removed
 
169
  - URLs, email addresses and numbers are removed
170
  - Words are converted to lowercase
171
  - Lemmatization is performed (words are converted to their base form based on the surrounding context)
@@ -198,7 +194,6 @@ graph LR
198
  subgraph Classification
199
  direction LR
200
  D1[LogisticRegression]
201
- D2[LinearSVC]
202
  end
203
 
204
  Classification --> |sentiment|END:::hidden
@@ -211,9 +206,10 @@ graph LR
211
  The following pre-trained models are available for use:
212
  | Dataset | Vectorizer | Classifier | Features | Accuracy on test | Accuracy on self | Model |
213
  | --- | --- | --- | --- | --- | --- | --- |
214
- | `imdb50k` | `tfidf` | `LinearRegression` | 20 000 | 83.24% ± 0.99% | 89.24% ± 0.13% | [Here](models/imdb50k_tfidf_ft20000.pkl) |
215
- | `sentiment140` | `tfidf` | `LinearRegression` | 20 000 | 83.24% ± 0.99% | 77.32% ± 0.28% | [Here](models/sentiment140_tfidf_ft20000.pkl) |
216
- | `amazonreviews` | `tfidf` | `LinearRegression` | 20 000 | 82.17% ± 0.85% | | [Here](models/amazonreviews_tfidf_ft20000.pkl) |
 
217
 
218
  ## License
219
  Distributed under the MIT License. See [LICENSE](LICENSE) for more information.
 
17
  ---
18
 
19
 
20
+ # Sentiment Analysis [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://tymec-sentiment-analysis.hf.space)
21
 
22
 
23
  ### Table of Contents
 
64
  ```bash
65
  python -m app --help
66
  ```
67
+ ![help](assets/help.png)
 
68
 
69
 
70
  ### Predict
 
78
  ```bash
79
  echo "I love this movie" | python -m app predict --model <model>
80
  ```
81
+ ![predict-help](assets/predict.png)
 
82
 
83
 
84
  ### GUI
 
87
  python -m app gui --model <model>
88
  ```
89
  where `<model>` is the path to the trained model. Add the `--share` flag to create a publicly accessible link.
90
+ ![gui-help](assets/gui.png)
91
 
92
  After running the command, open the link from the terminal in your browser to access the GUI.
93
+ ![gui](assets/space.png)
 
 
94
 
95
 
96
  ### Training
 
106
  ```bash
107
  python -m app train --help
108
  ```
109
+ ![train-help](assets/train.png)
 
110
 
111
 
112
  ### Evaluation
 
120
  ```bash
121
  python -m app evaluate --help
122
  ```
123
+ ![evaluate-help](assets/evaluate.png)
 
124
 
125
 
126
  ## Options
 
131
  | sentiment140 | `data/sentiment140.csv` | | [Twitter Sentiment Analysis](https://www.kaggle.com/kazanova/sentiment140) |
132
  | amazonreviews | `data/amazonreviews.bz2` | only train is used | [Amazon Product Reviews](https://www.kaggle.com/bittlingmayer/amazonreviews) |
133
  | imdb50k | `data/imdb50k.csv` | | [IMDB Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) |
134
+ | test | `data/test.csv` | only used in `evaluate` | [Sentiment Analysis Evaluation Dataset](https://www.kaggle.com/datasets/prishasawhney/sentiment-analysis-evaluation-dataset) |
135
 
136
+ #### Other
137
+ During text preprocessing, this [slang map](Https://www.kaggle.com/code/nmaguette/up-to-date-list-of-slangs-for-text-preprocessing) is used to convert slang words to their formal form.
138
 
139
 
140
  ### Vectorizers
 
159
 
160
  ### Architecture
161
  The input text is first preprocessed and tokenized using `re` and `spaCy` where:
162
+ - Any HTML tags are removed
163
+ - Emojis and slang words are converted to their text form
164
+ - Stop words, punctuation and special characters are removed
165
  - URLs, email addresses and numbers are removed
166
  - Words are converted to lowercase
167
  - Lemmatization is performed (words are converted to their base form based on the surrounding context)
 
194
  subgraph Classification
195
  direction LR
196
  D1[LogisticRegression]
 
197
  end
198
 
199
  Classification --> |sentiment|END:::hidden
 
206
  The following pre-trained models are available for use:
207
  | Dataset | Vectorizer | Classifier | Features | Accuracy on test | Accuracy on self | Model |
208
  | --- | --- | --- | --- | --- | --- | --- |
209
+ | `imdb50k` | `tfidf` | `LinearRegression` | 20 000 | 75.63% ± 4.73% | 89.24% ± 0.13% (cv=5) | [Here](models/imdb50k_tfidf_ft20000.pkl) |
210
+ | `sentiment140` | `tfidf` | `LinearRegression` | 20 000 | 75.63% ± 4.73% | 77.32% ± 0.28% (cv=5) | [Here](models/sentiment140_tfidf_ft20000.pkl) |
211
+ | `amazonreviews` | `tfidf` | `LinearRegression` | 20 000 | 65.49% ± 7.03% | 90.08% ± 0.00% (cv=1) | [Here](models/amazonreviews_tfidf_ft20000.pkl) |
212
+
213
 
214
  ## License
215
  Distributed under the MIT License. See [LICENSE](LICENSE) for more information.
assets/evaluate.png ADDED
assets/gui.png ADDED
assets/help.png ADDED
assets/predict.png ADDED
assets/space.png ADDED
assets/train.png ADDED