Azarthehulk committed on
Commit
42ed916
1 Parent(s): 5533322

Upload text processing.ipynb (#1)

- Upload text processing.ipynb (2eac9cd0758a95206488e298e9d854b0c17f81f0)

Files changed (1)
  1. text processing.ipynb +1852 -0
text processing.ipynb ADDED
@@ -0,0 +1,1852 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "fc8cef68",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Text processing"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "code",
13
+ "execution_count": 1,
14
+ "id": "61636845",
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import pandas as pd"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": 6,
24
+ "id": "b7cc480a",
25
+ "metadata": {},
26
+ "outputs": [],
27
+ "source": [
28
+ "data=pd.read_csv(\"Reviews.csv\")"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 5,
34
+ "id": "2b61b374",
35
+ "metadata": {},
36
+ "outputs": [
37
+ {
38
+ "data": {
39
+ "text/plain": [
40
+ "Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',\n",
41
+ " 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],\n",
42
+ " dtype='object')"
43
+ ]
44
+ },
45
+ "execution_count": 5,
46
+ "metadata": {},
47
+ "output_type": "execute_result"
48
+ }
49
+ ],
50
+ "source": [
51
+ "data.columns"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": 7,
57
+ "id": "ffe1915d",
58
+ "metadata": {},
59
+ "outputs": [
60
+ {
61
+ "data": {
62
+ "text/plain": [
63
+ "(568454, 10)"
64
+ ]
65
+ },
66
+ "execution_count": 7,
67
+ "metadata": {},
68
+ "output_type": "execute_result"
69
+ }
70
+ ],
71
+ "source": [
72
+ "data.shape"
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "code",
77
+ "execution_count": 10,
78
+ "id": "7658aeba",
79
+ "metadata": {},
80
+ "outputs": [
81
+ {
82
+ "data": {
83
+ "text/plain": [
84
+ "201034 Raspberry taste TOO strong\n",
85
+ "511879 The whole family loves these!\n",
86
+ "8685 Well-rounded but weak\n",
87
+ "385899 1/10 as strong as other brands, won't disolve ...\n",
88
+ "47587 Perfect fudge!\n",
89
+ "5222 Great snack, however,\n",
90
+ "560412 Looks, smells and probably tastes like real ch...\n",
91
+ "479143 Delicious way to start the day\n",
92
+ "307484 A dessert in a cup!\n",
93
+ "189729 SUPERB! GREAT FOR BAKING!\n",
94
+ "Name: Summary, dtype: object"
95
+ ]
96
+ },
97
+ "execution_count": 10,
98
+ "metadata": {},
99
+ "output_type": "execute_result"
100
+ }
101
+ ],
102
+ "source": [
103
+ "data.Summary.sample(10)"
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "code",
108
+ "execution_count": 13,
109
+ "id": "379273cc",
110
+ "metadata": {},
111
+ "outputs": [
112
+ {
113
+ "data": {
114
+ "text/plain": [
115
+ "'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'"
116
+ ]
117
+ },
118
+ "execution_count": 13,
119
+ "metadata": {},
120
+ "output_type": "execute_result"
121
+ }
122
+ ],
123
+ "source": [
124
+ "data.Text[0]"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": 14,
130
+ "id": "38332201",
131
+ "metadata": {},
132
+ "outputs": [
133
+ {
134
+ "data": {
135
+ "text/plain": [
136
+ "'Good Quality Dog Food'"
137
+ ]
138
+ },
139
+ "execution_count": 14,
140
+ "metadata": {},
141
+ "output_type": "execute_result"
142
+ }
143
+ ],
144
+ "source": [
145
+ "data.Summary[0]"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": 15,
151
+ "id": "73a4c938",
152
+ "metadata": {},
153
+ "outputs": [
154
+ {
155
+ "data": {
156
+ "text/plain": [
157
+ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
158
+ ]
159
+ },
160
+ "execution_count": 15,
161
+ "metadata": {},
162
+ "output_type": "execute_result"
163
+ }
164
+ ],
165
+ "source": [
166
+ "import string\n",
167
+ "string.punctuation"
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": 30,
173
+ "id": "93270bf2",
174
+ "metadata": {},
175
+ "outputs": [],
176
+ "source": [
177
+ "# Remove every punctuation character listed in string.punctuation (shown above) from the text\n",
178
+ "def remove_punctuation(text):\n",
179
+ " punctuationfree=\"\".join([i for i in text if i not in string.punctuation])\n",
180
+ " return punctuationfree"
181
+ ]
182
+ },
183
+ {
184
+ "cell_type": "code",
185
+ "execution_count": 31,
186
+ "id": "95437c3b",
187
+ "metadata": {
188
+ "scrolled": true
189
+ },
190
+ "outputs": [
191
+ {
192
+ "data": {
193
+ "text/html": [
194
+ "<div>\n",
195
+ "<style scoped>\n",
196
+ " .dataframe tbody tr th:only-of-type {\n",
197
+ " vertical-align: middle;\n",
198
+ " }\n",
199
+ "\n",
200
+ " .dataframe tbody tr th {\n",
201
+ " vertical-align: top;\n",
202
+ " }\n",
203
+ "\n",
204
+ " .dataframe thead th {\n",
205
+ " text-align: right;\n",
206
+ " }\n",
207
+ "</style>\n",
208
+ "<table border=\"1\" class=\"dataframe\">\n",
209
+ " <thead>\n",
210
+ " <tr style=\"text-align: right;\">\n",
211
+ " <th></th>\n",
212
+ " <th>Id</th>\n",
213
+ " <th>ProductId</th>\n",
214
+ " <th>UserId</th>\n",
215
+ " <th>ProfileName</th>\n",
216
+ " <th>HelpfulnessNumerator</th>\n",
217
+ " <th>HelpfulnessDenominator</th>\n",
218
+ " <th>Score</th>\n",
219
+ " <th>Time</th>\n",
220
+ " <th>Summary</th>\n",
221
+ " <th>Text</th>\n",
222
+ " <th>clean_msg</th>\n",
223
+ " </tr>\n",
224
+ " </thead>\n",
225
+ " <tbody>\n",
226
+ " <tr>\n",
227
+ " <th>0</th>\n",
228
+ " <td>1</td>\n",
229
+ " <td>B001E4KFG0</td>\n",
230
+ " <td>A3SGXH7AUHU8GW</td>\n",
231
+ " <td>delmartian</td>\n",
232
+ " <td>1</td>\n",
233
+ " <td>1</td>\n",
234
+ " <td>5</td>\n",
235
+ " <td>1303862400</td>\n",
236
+ " <td>Good Quality Dog Food</td>\n",
237
+ " <td>I have bought several of the Vitality canned d...</td>\n",
238
+ " <td>I have bought several of the Vitality canned d...</td>\n",
239
+ " </tr>\n",
240
+ " <tr>\n",
241
+ " <th>1</th>\n",
242
+ " <td>2</td>\n",
243
+ " <td>B00813GRG4</td>\n",
244
+ " <td>A1D87F6ZCVE5NK</td>\n",
245
+ " <td>dll pa</td>\n",
246
+ " <td>0</td>\n",
247
+ " <td>0</td>\n",
248
+ " <td>1</td>\n",
249
+ " <td>1346976000</td>\n",
250
+ " <td>Not as Advertised</td>\n",
251
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
252
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
253
+ " </tr>\n",
254
+ " <tr>\n",
255
+ " <th>2</th>\n",
256
+ " <td>3</td>\n",
257
+ " <td>B000LQOCH0</td>\n",
258
+ " <td>ABXLMWJIXXAIN</td>\n",
259
+ " <td>Natalia Corres \"Natalia Corres\"</td>\n",
260
+ " <td>1</td>\n",
261
+ " <td>1</td>\n",
262
+ " <td>4</td>\n",
263
+ " <td>1219017600</td>\n",
264
+ " <td>\"Delight\" says it all</td>\n",
265
+ " <td>This is a confection that has been around a fe...</td>\n",
266
+ " <td>This is a confection that has been around a fe...</td>\n",
267
+ " </tr>\n",
268
+ " <tr>\n",
269
+ " <th>3</th>\n",
270
+ " <td>4</td>\n",
271
+ " <td>B000UA0QIQ</td>\n",
272
+ " <td>A395BORC6FGVXV</td>\n",
273
+ " <td>Karl</td>\n",
274
+ " <td>3</td>\n",
275
+ " <td>3</td>\n",
276
+ " <td>2</td>\n",
277
+ " <td>1307923200</td>\n",
278
+ " <td>Cough Medicine</td>\n",
279
+ " <td>If you are looking for the secret ingredient i...</td>\n",
280
+ " <td>If you are looking for the secret ingredient i...</td>\n",
281
+ " </tr>\n",
282
+ " <tr>\n",
283
+ " <th>4</th>\n",
284
+ " <td>5</td>\n",
285
+ " <td>B006K2ZZ7K</td>\n",
286
+ " <td>A1UQRSCLF8GW1T</td>\n",
287
+ " <td>Michael D. Bigham \"M. Wassir\"</td>\n",
288
+ " <td>0</td>\n",
289
+ " <td>0</td>\n",
290
+ " <td>5</td>\n",
291
+ " <td>1350777600</td>\n",
292
+ " <td>Great taffy</td>\n",
293
+ " <td>Great taffy at a great price. There was a wid...</td>\n",
294
+ " <td>Great taffy at a great price There was a wide...</td>\n",
295
+ " </tr>\n",
296
+ " </tbody>\n",
297
+ "</table>\n",
298
+ "</div>"
299
+ ],
300
+ "text/plain": [
301
+ " Id ProductId UserId ProfileName \\\n",
302
+ "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n",
303
+ "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n",
304
+ "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n",
305
+ "3 4 B000UA0QIQ A395BORC6FGVXV Karl \n",
306
+ "4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham \"M. Wassir\" \n",
307
+ "\n",
308
+ " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n",
309
+ "0 1 1 5 1303862400 \n",
310
+ "1 0 0 1 1346976000 \n",
311
+ "2 1 1 4 1219017600 \n",
312
+ "3 3 3 2 1307923200 \n",
313
+ "4 0 0 5 1350777600 \n",
314
+ "\n",
315
+ " Summary Text \\\n",
316
+ "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n",
317
+ "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n",
318
+ "2 \"Delight\" says it all This is a confection that has been around a fe... \n",
319
+ "3 Cough Medicine If you are looking for the secret ingredient i... \n",
320
+ "4 Great taffy Great taffy at a great price. There was a wid... \n",
321
+ "\n",
322
+ " clean_msg \n",
323
+ "0 I have bought several of the Vitality canned d... \n",
324
+ "1 Product arrived labeled as Jumbo Salted Peanut... \n",
325
+ "2 This is a confection that has been around a fe... \n",
326
+ "3 If you are looking for the secret ingredient i... \n",
327
+ "4 Great taffy at a great price There was a wide... "
328
+ ]
329
+ },
330
+ "execution_count": 31,
331
+ "metadata": {},
332
+ "output_type": "execute_result"
333
+ }
334
+ ],
335
+ "source": [
336
+ "# Store the punctuation-free text in a new column\n",
337
+ "data['clean_msg'] = data['Text'].apply(remove_punctuation)\n",
338
+ "data.head()"
339
+ ]
340
+ },
341
+ {
342
+ "cell_type": "code",
343
+ "execution_count": 28,
344
+ "id": "7f6436d5",
345
+ "metadata": {},
346
+ "outputs": [
347
+ {
348
+ "name": "stdout",
349
+ "output_type": "stream",
350
+ "text": [
351
+ " clean_msg \\\n",
352
+ "43350 Great tasting convenient coffeejust a little p... \n",
353
+ "136363 Really good coffee Good flavor and would buy a... \n",
354
+ "\n",
355
+ " Text \n",
356
+ "43350 Great tasting, convenient coffee..just a littl... \n",
357
+ "136363 Really good coffee. Good flavor and would buy ... \n"
358
+ ]
359
+ }
360
+ ],
361
+ "source": [
362
+ "print(data[['clean_msg','Text']].sample(2))"
363
+ ]
364
+ },
365
+ {
366
+ "cell_type": "code",
367
+ "execution_count": 41,
368
+ "id": "954459ba",
369
+ "metadata": {},
370
+ "outputs": [],
371
+ "source": [
372
+ "data['text_lower']=data['clean_msg'].apply(lambda x: x.lower())"
373
+ ]
374
+ },
375
+ {
376
+ "cell_type": "code",
377
+ "execution_count": 42,
378
+ "id": "29d12dbc",
379
+ "metadata": {},
380
+ "outputs": [
381
+ {
382
+ "data": {
383
+ "text/html": [
384
+ "<div>\n",
385
+ "<style scoped>\n",
386
+ " .dataframe tbody tr th:only-of-type {\n",
387
+ " vertical-align: middle;\n",
388
+ " }\n",
389
+ "\n",
390
+ " .dataframe tbody tr th {\n",
391
+ " vertical-align: top;\n",
392
+ " }\n",
393
+ "\n",
394
+ " .dataframe thead th {\n",
395
+ " text-align: right;\n",
396
+ " }\n",
397
+ "</style>\n",
398
+ "<table border=\"1\" class=\"dataframe\">\n",
399
+ " <thead>\n",
400
+ " <tr style=\"text-align: right;\">\n",
401
+ " <th></th>\n",
402
+ " <th>Text</th>\n",
403
+ " <th>clean_msg</th>\n",
404
+ " <th>text_lower</th>\n",
405
+ " </tr>\n",
406
+ " </thead>\n",
407
+ " <tbody>\n",
408
+ " <tr>\n",
409
+ " <th>330905</th>\n",
410
+ " <td>Excellent, easy, and tasty. It is usually chea...</td>\n",
411
+ " <td>Excellent easy and tasty It is usually cheaper...</td>\n",
412
+ " <td>excellent easy and tasty it is usually cheaper...</td>\n",
413
+ " </tr>\n",
414
+ " <tr>\n",
415
+ " <th>291052</th>\n",
416
+ " <td>This water is very good for the immune system....</td>\n",
417
+ " <td>This water is very good for the immune system ...</td>\n",
418
+ " <td>this water is very good for the immune system ...</td>\n",
419
+ " </tr>\n",
420
+ " <tr>\n",
421
+ " <th>389090</th>\n",
422
+ " <td>I enjoy these. I'm not sure the price is a gre...</td>\n",
423
+ " <td>I enjoy these Im not sure the price is a great...</td>\n",
424
+ " <td>i enjoy these im not sure the price is a great...</td>\n",
425
+ " </tr>\n",
426
+ " <tr>\n",
427
+ " <th>486142</th>\n",
428
+ " <td>Is Wellness relatively expensive? Yes. Is it w...</td>\n",
429
+ " <td>Is Wellness relatively expensive Yes Is it wor...</td>\n",
430
+ " <td>is wellness relatively expensive yes is it wor...</td>\n",
431
+ " </tr>\n",
432
+ " <tr>\n",
433
+ " <th>496694</th>\n",
434
+ " <td>Maybe the claims are true but who can really t...</td>\n",
435
+ " <td>Maybe the claims are true but who can really t...</td>\n",
436
+ " <td>maybe the claims are true but who can really t...</td>\n",
437
+ " </tr>\n",
438
+ " </tbody>\n",
439
+ "</table>\n",
440
+ "</div>"
441
+ ],
442
+ "text/plain": [
443
+ " Text \\\n",
444
+ "330905 Excellent, easy, and tasty. It is usually chea... \n",
445
+ "291052 This water is very good for the immune system.... \n",
446
+ "389090 I enjoy these. I'm not sure the price is a gre... \n",
447
+ "486142 Is Wellness relatively expensive? Yes. Is it w... \n",
448
+ "496694 Maybe the claims are true but who can really t... \n",
449
+ "\n",
450
+ " clean_msg \\\n",
451
+ "330905 Excellent easy and tasty It is usually cheaper... \n",
452
+ "291052 This water is very good for the immune system ... \n",
453
+ "389090 I enjoy these Im not sure the price is a great... \n",
454
+ "486142 Is Wellness relatively expensive Yes Is it wor... \n",
455
+ "496694 Maybe the claims are true but who can really t... \n",
456
+ "\n",
457
+ " text_lower \n",
458
+ "330905 excellent easy and tasty it is usually cheaper... \n",
459
+ "291052 this water is very good for the immune system ... \n",
460
+ "389090 i enjoy these im not sure the price is a great... \n",
461
+ "486142 is wellness relatively expensive yes is it wor... \n",
462
+ "496694 maybe the claims are true but who can really t... "
463
+ ]
464
+ },
465
+ "execution_count": 42,
466
+ "metadata": {},
467
+ "output_type": "execute_result"
468
+ }
469
+ ],
470
+ "source": [
471
+ "data[['Text','clean_msg','text_lower']].sample(5)"
472
+ ]
473
+ },
474
+ {
475
+ "cell_type": "code",
476
+ "execution_count": 45,
477
+ "id": "69f9cc57",
478
+ "metadata": {},
479
+ "outputs": [
480
+ {
481
+ "name": "stderr",
482
+ "output_type": "stream",
483
+ "text": [
484
+ "[nltk_data] Downloading package punkt to\n",
485
+ "[nltk_data] /Users/azarmohammad/nltk_data...\n",
486
+ "[nltk_data] Unzipping tokenizers/punkt.zip.\n"
487
+ ]
488
+ },
489
+ {
490
+ "data": {
491
+ "text/plain": [
492
+ "True"
493
+ ]
494
+ },
495
+ "execution_count": 45,
496
+ "metadata": {},
497
+ "output_type": "execute_result"
498
+ }
499
+ ],
500
+ "source": [
501
+ "import nltk\n",
502
+ "nltk.download('punkt')"
503
+ ]
504
+ },
505
+ {
506
+ "cell_type": "code",
507
+ "execution_count": 46,
508
+ "id": "e1005f2d",
509
+ "metadata": {},
510
+ "outputs": [],
511
+ "source": [
512
+ "#defining function for tokenization\n",
513
+ "import re\n",
514
+ "from nltk.tokenize import word_tokenize\n",
515
+ "\n",
516
+ "def tokenization(text):\n",
517
+ " tokens = word_tokenize(text)\n",
518
+ " #tokens = re.split(r'\\W+', text)\n",
519
+ " return tokens\n",
520
+ "\n"
521
+ ]
522
+ },
523
+ {
524
+ "cell_type": "code",
525
+ "execution_count": 47,
526
+ "id": "065bd725",
527
+ "metadata": {
528
+ "collapsed": true
529
+ },
530
+ "outputs": [
531
+ {
532
+ "data": {
533
+ "text/html": [
534
+ "<div>\n",
535
+ "<style scoped>\n",
536
+ " .dataframe tbody tr th:only-of-type {\n",
537
+ " vertical-align: middle;\n",
538
+ " }\n",
539
+ "\n",
540
+ " .dataframe tbody tr th {\n",
541
+ " vertical-align: top;\n",
542
+ " }\n",
543
+ "\n",
544
+ " .dataframe thead th {\n",
545
+ " text-align: right;\n",
546
+ " }\n",
547
+ "</style>\n",
548
+ "<table border=\"1\" class=\"dataframe\">\n",
549
+ " <thead>\n",
550
+ " <tr style=\"text-align: right;\">\n",
551
+ " <th></th>\n",
552
+ " <th>Id</th>\n",
553
+ " <th>ProductId</th>\n",
554
+ " <th>UserId</th>\n",
555
+ " <th>ProfileName</th>\n",
556
+ " <th>HelpfulnessNumerator</th>\n",
557
+ " <th>HelpfulnessDenominator</th>\n",
558
+ " <th>Score</th>\n",
559
+ " <th>Time</th>\n",
560
+ " <th>Summary</th>\n",
561
+ " <th>Text</th>\n",
562
+ " <th>clean_msg</th>\n",
563
+ " <th>text_lower</th>\n",
564
+ " <th>msg_tokenied</th>\n",
565
+ " </tr>\n",
566
+ " </thead>\n",
567
+ " <tbody>\n",
568
+ " <tr>\n",
569
+ " <th>0</th>\n",
570
+ " <td>1</td>\n",
571
+ " <td>B001E4KFG0</td>\n",
572
+ " <td>A3SGXH7AUHU8GW</td>\n",
573
+ " <td>delmartian</td>\n",
574
+ " <td>1</td>\n",
575
+ " <td>1</td>\n",
576
+ " <td>5</td>\n",
577
+ " <td>1303862400</td>\n",
578
+ " <td>Good Quality Dog Food</td>\n",
579
+ " <td>I have bought several of the Vitality canned d...</td>\n",
580
+ " <td>I have bought several of the Vitality canned d...</td>\n",
581
+ " <td>i have bought several of the vitality canned d...</td>\n",
582
+ " <td>[i, have, bought, several, of, the, vitality, ...</td>\n",
583
+ " </tr>\n",
584
+ " <tr>\n",
585
+ " <th>1</th>\n",
586
+ " <td>2</td>\n",
587
+ " <td>B00813GRG4</td>\n",
588
+ " <td>A1D87F6ZCVE5NK</td>\n",
589
+ " <td>dll pa</td>\n",
590
+ " <td>0</td>\n",
591
+ " <td>0</td>\n",
592
+ " <td>1</td>\n",
593
+ " <td>1346976000</td>\n",
594
+ " <td>Not as Advertised</td>\n",
595
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
596
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
597
+ " <td>product arrived labeled as jumbo salted peanut...</td>\n",
598
+ " <td>[product, arrived, labeled, as, jumbo, salted,...</td>\n",
599
+ " </tr>\n",
600
+ " <tr>\n",
601
+ " <th>2</th>\n",
602
+ " <td>3</td>\n",
603
+ " <td>B000LQOCH0</td>\n",
604
+ " <td>ABXLMWJIXXAIN</td>\n",
605
+ " <td>Natalia Corres \"Natalia Corres\"</td>\n",
606
+ " <td>1</td>\n",
607
+ " <td>1</td>\n",
608
+ " <td>4</td>\n",
609
+ " <td>1219017600</td>\n",
610
+ " <td>\"Delight\" says it all</td>\n",
611
+ " <td>This is a confection that has been around a fe...</td>\n",
612
+ " <td>This is a confection that has been around a fe...</td>\n",
613
+ " <td>this is a confection that has been around a fe...</td>\n",
614
+ " <td>[this, is, a, confection, that, has, been, aro...</td>\n",
615
+ " </tr>\n",
616
+ " <tr>\n",
617
+ " <th>3</th>\n",
618
+ " <td>4</td>\n",
619
+ " <td>B000UA0QIQ</td>\n",
620
+ " <td>A395BORC6FGVXV</td>\n",
621
+ " <td>Karl</td>\n",
622
+ " <td>3</td>\n",
623
+ " <td>3</td>\n",
624
+ " <td>2</td>\n",
625
+ " <td>1307923200</td>\n",
626
+ " <td>Cough Medicine</td>\n",
627
+ " <td>If you are looking for the secret ingredient i...</td>\n",
628
+ " <td>If you are looking for the secret ingredient i...</td>\n",
629
+ " <td>if you are looking for the secret ingredient i...</td>\n",
630
+ " <td>[if, you, are, looking, for, the, secret, ingr...</td>\n",
631
+ " </tr>\n",
632
+ " <tr>\n",
633
+ " <th>4</th>\n",
634
+ " <td>5</td>\n",
635
+ " <td>B006K2ZZ7K</td>\n",
636
+ " <td>A1UQRSCLF8GW1T</td>\n",
637
+ " <td>Michael D. Bigham \"M. Wassir\"</td>\n",
638
+ " <td>0</td>\n",
639
+ " <td>0</td>\n",
640
+ " <td>5</td>\n",
641
+ " <td>1350777600</td>\n",
642
+ " <td>Great taffy</td>\n",
643
+ " <td>Great taffy at a great price. There was a wid...</td>\n",
644
+ " <td>Great taffy at a great price There was a wide...</td>\n",
645
+ " <td>great taffy at a great price there was a wide...</td>\n",
646
+ " <td>[great, taffy, at, a, great, price, there, was...</td>\n",
647
+ " </tr>\n",
648
+ " </tbody>\n",
649
+ "</table>\n",
650
+ "</div>"
651
+ ],
652
+ "text/plain": [
653
+ " Id ProductId UserId ProfileName \\\n",
654
+ "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n",
655
+ "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n",
656
+ "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n",
657
+ "3 4 B000UA0QIQ A395BORC6FGVXV Karl \n",
658
+ "4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham \"M. Wassir\" \n",
659
+ "\n",
660
+ " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n",
661
+ "0 1 1 5 1303862400 \n",
662
+ "1 0 0 1 1346976000 \n",
663
+ "2 1 1 4 1219017600 \n",
664
+ "3 3 3 2 1307923200 \n",
665
+ "4 0 0 5 1350777600 \n",
666
+ "\n",
667
+ " Summary Text \\\n",
668
+ "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n",
669
+ "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n",
670
+ "2 \"Delight\" says it all This is a confection that has been around a fe... \n",
671
+ "3 Cough Medicine If you are looking for the secret ingredient i... \n",
672
+ "4 Great taffy Great taffy at a great price. There was a wid... \n",
673
+ "\n",
674
+ " clean_msg \\\n",
675
+ "0 I have bought several of the Vitality canned d... \n",
676
+ "1 Product arrived labeled as Jumbo Salted Peanut... \n",
677
+ "2 This is a confection that has been around a fe... \n",
678
+ "3 If you are looking for the secret ingredient i... \n",
679
+ "4 Great taffy at a great price There was a wide... \n",
680
+ "\n",
681
+ " text_lower \\\n",
682
+ "0 i have bought several of the vitality canned d... \n",
683
+ "1 product arrived labeled as jumbo salted peanut... \n",
684
+ "2 this is a confection that has been around a fe... \n",
685
+ "3 if you are looking for the secret ingredient i... \n",
686
+ "4 great taffy at a great price there was a wide... \n",
687
+ "\n",
688
+ " msg_tokenied \n",
689
+ "0 [i, have, bought, several, of, the, vitality, ... \n",
690
+ "1 [product, arrived, labeled, as, jumbo, salted,... \n",
691
+ "2 [this, is, a, confection, that, has, been, aro... \n",
692
+ "3 [if, you, are, looking, for, the, secret, ingr... \n",
693
+ "4 [great, taffy, at, a, great, price, there, was... "
694
+ ]
695
+ },
696
+ "execution_count": 47,
697
+ "metadata": {},
698
+ "output_type": "execute_result"
699
+ }
700
+ ],
701
+ "source": [
702
+ "#applying function to the column\n",
703
+ "data['msg_tokenied'] = data['text_lower'].apply(tokenization)\n",
704
+ "data.head()"
705
+ ]
706
+ },
707
+ {
708
+ "cell_type": "code",
709
+ "execution_count": 50,
710
+ "id": "ad786064",
711
+ "metadata": {
712
+ "scrolled": true
713
+ },
714
+ "outputs": [
715
+ {
716
+ "data": {
717
+ "text/html": [
718
+ "<div>\n",
719
+ "<style scoped>\n",
720
+ " .dataframe tbody tr th:only-of-type {\n",
721
+ " vertical-align: middle;\n",
722
+ " }\n",
723
+ "\n",
724
+ " .dataframe tbody tr th {\n",
725
+ " vertical-align: top;\n",
726
+ " }\n",
727
+ "\n",
728
+ " .dataframe thead th {\n",
729
+ " text-align: right;\n",
730
+ " }\n",
731
+ "</style>\n",
732
+ "<table border=\"1\" class=\"dataframe\">\n",
733
+ " <thead>\n",
734
+ " <tr style=\"text-align: right;\">\n",
735
+ " <th></th>\n",
736
+ " <th>msg_tokenied</th>\n",
737
+ " <th>clean_msg</th>\n",
738
+ " <th>text_lower</th>\n",
739
+ " <th>Text</th>\n",
740
+ " </tr>\n",
741
+ " </thead>\n",
742
+ " <tbody>\n",
743
+ " <tr>\n",
744
+ " <th>321594</th>\n",
745
+ " <td>[these, treats, are, my, 18, month, old, bosto...</td>\n",
746
+ " <td>These treats are my 18 month old Boston Terrie...</td>\n",
747
+ " <td>these treats are my 18 month old boston terrie...</td>\n",
748
+ " <td>These treats are my 18 month old Boston Terrie...</td>\n",
749
+ " </tr>\n",
750
+ " </tbody>\n",
751
+ "</table>\n",
752
+ "</div>"
753
+ ],
754
+ "text/plain": [
755
+ " msg_tokenied \\\n",
756
+ "321594 [these, treats, are, my, 18, month, old, bosto... \n",
757
+ "\n",
758
+ " clean_msg \\\n",
759
+ "321594 These treats are my 18 month old Boston Terrie... \n",
760
+ "\n",
761
+ " text_lower \\\n",
762
+ "321594 these treats are my 18 month old boston terrie... \n",
763
+ "\n",
764
+ " Text \n",
765
+ "321594 These treats are my 18 month old Boston Terrie... "
766
+ ]
767
+ },
768
+ "execution_count": 50,
769
+ "metadata": {},
770
+ "output_type": "execute_result"
771
+ }
772
+ ],
773
+ "source": [
774
+ "data[['msg_tokenied','clean_msg','text_lower','Text']].sample(1)"
775
+ ]
776
+ },
777
+ {
778
+ "cell_type": "code",
779
+ "execution_count": 56,
780
+ "id": "02f94734",
781
+ "metadata": {},
782
+ "outputs": [
783
+ {
784
+ "name": "stdout",
785
+ "output_type": "stream",
786
+ "text": [
787
+ "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.\n"
788
+ ]
789
+ }
790
+ ],
791
+ "source": [
792
+ "print(data.Text[0])"
793
+ ]
794
+ },
795
+ {
796
+ "cell_type": "code",
797
+ "execution_count": 57,
798
+ "id": "3640c44e",
799
+ "metadata": {},
800
+ "outputs": [
801
+ {
802
+ "name": "stdout",
803
+ "output_type": "stream",
804
+ "text": [
805
+ "I have bought several of the Vitality canned dog food products and have found them all to be of good quality The product looks more like a stew than a processed meat and it smells better My Labrador is finicky and she appreciates this product better than most\n"
806
+ ]
807
+ }
808
+ ],
809
+ "source": [
810
+ "print(data.clean_msg[0])"
811
+ ]
812
+ },
813
+ {
814
+ "cell_type": "code",
815
+ "execution_count": 58,
816
+ "id": "cef1cd92",
817
+ "metadata": {},
818
+ "outputs": [
819
+ {
820
+ "name": "stdout",
821
+ "output_type": "stream",
822
+ "text": [
823
+ "i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than most\n"
824
+ ]
825
+ }
826
+ ],
827
+ "source": [
828
+ "print(data.text_lower[0])"
829
+ ]
830
+ },
831
+ {
832
+ "cell_type": "code",
833
+ "execution_count": 60,
834
+ "id": "e9f164ba",
835
+ "metadata": {},
836
+ "outputs": [
837
+ {
838
+ "name": "stdout",
839
+ "output_type": "stream",
840
+ "text": [
841
+ "['i', 'have', 'bought', 'several', 'of', 'the', 'vitality', 'canned', 'dog', 'food', 'products', 'and', 'have', 'found', 'them', 'all', 'to', 'be', 'of', 'good', 'quality', 'the', 'product', 'looks', 'more', 'like', 'a', 'stew', 'than', 'a', 'processed', 'meat', 'and', 'it', 'smells', 'better', 'my', 'labrador', 'is', 'finicky', 'and', 'she', 'appreciates', 'this', 'product', 'better', 'than', 'most']\n"
842
+ ]
843
+ }
844
+ ],
845
+ "source": [
846
+ "print(data.msg_tokenied[0])"
847
+ ]
848
+ },
849
+ {
850
+ "cell_type": "code",
851
+ "execution_count": 62,
852
+ "id": "905d96e6",
853
+ "metadata": {},
854
+ "outputs": [
855
+ {
856
+ "name": "stderr",
857
+ "output_type": "stream",
858
+ "text": [
859
+ "[nltk_data] Downloading package stopwords to\n",
860
+ "[nltk_data] /Users/azarmohammad/nltk_data...\n",
861
+ "[nltk_data] Unzipping corpora/stopwords.zip.\n"
862
+ ]
863
+ }
864
+ ],
865
+ "source": [
866
+ "import nltk\n",
867
+ "#Stop words present in the library\n",
868
+ "nltk.download('stopwords')\n",
869
+ "stopwords = nltk.corpus.stopwords.words('english')\n"
870
+ ]
871
+ },
872
+ {
873
+ "cell_type": "code",
874
+ "execution_count": 64,
875
+ "id": "b6a95420",
876
+ "metadata": {
877
+ "collapsed": true
878
+ },
879
+ "outputs": [
880
+ {
881
+ "name": "stdout",
882
+ "output_type": "stream",
883
+ "text": [
884
+ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
885
+ ]
886
+ }
887
+ ],
888
+ "source": [
889
+ "print(stopwords)"
890
+ ]
891
+ },
892
+ {
893
+ "cell_type": "markdown",
894
+ "id": "f49f4947",
895
+ "metadata": {},
896
+ "source": [
897
+ "Stop words are the words that are generally filtered out before processing natural-language text. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and add little information to the text. Examples of English stop words are “the”, “a”, “an”, “so”, and “what”."
898
+ ]
899
+ },
900
+ {
901
+ "cell_type": "code",
902
+ "execution_count": 68,
903
+ "id": "2f5b0bff",
904
+ "metadata": {},
905
+ "outputs": [],
906
+ "source": [
907
+ "def remove_stopwords(text):\n",
908
+ " # text is a list of tokens; keep only the tokens that are not NLTK stop words\n",
909
+ " output= [x for x in text if x not in stopwords]\n",
910
+ " #print(output)\n",
911
+ " return output\n"
912
+ ]
913
+ },
914
+ {
915
+ "cell_type": "code",
916
+ "execution_count": 70,
917
+ "id": "2652f6a3",
918
+ "metadata": {
919
+ "collapsed": true
920
+ },
921
+ "outputs": [
922
+ {
923
+ "data": {
924
+ "text/html": [
925
+ "<div>\n",
926
+ "<style scoped>\n",
927
+ " .dataframe tbody tr th:only-of-type {\n",
928
+ " vertical-align: middle;\n",
929
+ " }\n",
930
+ "\n",
931
+ " .dataframe tbody tr th {\n",
932
+ " vertical-align: top;\n",
933
+ " }\n",
934
+ "\n",
935
+ " .dataframe thead th {\n",
936
+ " text-align: right;\n",
937
+ " }\n",
938
+ "</style>\n",
939
+ "<table border=\"1\" class=\"dataframe\">\n",
940
+ " <thead>\n",
941
+ " <tr style=\"text-align: right;\">\n",
942
+ " <th></th>\n",
943
+ " <th>Id</th>\n",
944
+ " <th>ProductId</th>\n",
945
+ " <th>UserId</th>\n",
946
+ " <th>ProfileName</th>\n",
947
+ " <th>HelpfulnessNumerator</th>\n",
948
+ " <th>HelpfulnessDenominator</th>\n",
949
+ " <th>Score</th>\n",
950
+ " <th>Time</th>\n",
951
+ " <th>Summary</th>\n",
952
+ " <th>Text</th>\n",
953
+ " <th>clean_msg</th>\n",
954
+ " <th>text_lower</th>\n",
955
+ " <th>msg_tokenied</th>\n",
956
+ " <th>no_stopwords</th>\n",
957
+ " </tr>\n",
958
+ " </thead>\n",
959
+ " <tbody>\n",
960
+ " <tr>\n",
961
+ " <th>0</th>\n",
962
+ " <td>1</td>\n",
963
+ " <td>B001E4KFG0</td>\n",
964
+ " <td>A3SGXH7AUHU8GW</td>\n",
965
+ " <td>delmartian</td>\n",
966
+ " <td>1</td>\n",
967
+ " <td>1</td>\n",
968
+ " <td>5</td>\n",
969
+ " <td>1303862400</td>\n",
970
+ " <td>Good Quality Dog Food</td>\n",
971
+ " <td>I have bought several of the Vitality canned d...</td>\n",
972
+ " <td>I have bought several of the Vitality canned d...</td>\n",
973
+ " <td>i have bought several of the vitality canned d...</td>\n",
974
+ " <td>[i, have, bought, several, of, the, vitality, ...</td>\n",
975
+ " <td>[bought, several, vitality, canned, dog, food,...</td>\n",
976
+ " </tr>\n",
977
+ " <tr>\n",
978
+ " <th>1</th>\n",
979
+ " <td>2</td>\n",
980
+ " <td>B00813GRG4</td>\n",
981
+ " <td>A1D87F6ZCVE5NK</td>\n",
982
+ " <td>dll pa</td>\n",
983
+ " <td>0</td>\n",
984
+ " <td>0</td>\n",
985
+ " <td>1</td>\n",
986
+ " <td>1346976000</td>\n",
987
+ " <td>Not as Advertised</td>\n",
988
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
989
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
990
+ " <td>product arrived labeled as jumbo salted peanut...</td>\n",
991
+ " <td>[product, arrived, labeled, as, jumbo, salted,...</td>\n",
992
+ " <td>[product, arrived, labeled, jumbo, salted, pea...</td>\n",
993
+ " </tr>\n",
994
+ " <tr>\n",
995
+ " <th>2</th>\n",
996
+ " <td>3</td>\n",
997
+ " <td>B000LQOCH0</td>\n",
998
+ " <td>ABXLMWJIXXAIN</td>\n",
999
+ " <td>Natalia Corres \"Natalia Corres\"</td>\n",
1000
+ " <td>1</td>\n",
1001
+ " <td>1</td>\n",
1002
+ " <td>4</td>\n",
1003
+ " <td>1219017600</td>\n",
1004
+ " <td>\"Delight\" says it all</td>\n",
1005
+ " <td>This is a confection that has been around a fe...</td>\n",
1006
+ " <td>This is a confection that has been around a fe...</td>\n",
1007
+ " <td>this is a confection that has been around a fe...</td>\n",
1008
+ " <td>[this, is, a, confection, that, has, been, aro...</td>\n",
1009
+ " <td>[confection, around, centuries, light, pillowy...</td>\n",
1010
+ " </tr>\n",
1011
+ " <tr>\n",
1012
+ " <th>3</th>\n",
1013
+ " <td>4</td>\n",
1014
+ " <td>B000UA0QIQ</td>\n",
1015
+ " <td>A395BORC6FGVXV</td>\n",
1016
+ " <td>Karl</td>\n",
1017
+ " <td>3</td>\n",
1018
+ " <td>3</td>\n",
1019
+ " <td>2</td>\n",
1020
+ " <td>1307923200</td>\n",
1021
+ " <td>Cough Medicine</td>\n",
1022
+ " <td>If you are looking for the secret ingredient i...</td>\n",
1023
+ " <td>If you are looking for the secret ingredient i...</td>\n",
1024
+ " <td>if you are looking for the secret ingredient i...</td>\n",
1025
+ " <td>[if, you, are, looking, for, the, secret, ingr...</td>\n",
1026
+ " <td>[looking, secret, ingredient, robitussin, beli...</td>\n",
1027
+ " </tr>\n",
1028
+ " <tr>\n",
1029
+ " <th>4</th>\n",
1030
+ " <td>5</td>\n",
1031
+ " <td>B006K2ZZ7K</td>\n",
1032
+ " <td>A1UQRSCLF8GW1T</td>\n",
1033
+ " <td>Michael D. Bigham \"M. Wassir\"</td>\n",
1034
+ " <td>0</td>\n",
1035
+ " <td>0</td>\n",
1036
+ " <td>5</td>\n",
1037
+ " <td>1350777600</td>\n",
1038
+ " <td>Great taffy</td>\n",
1039
+ " <td>Great taffy at a great price. There was a wid...</td>\n",
1040
+ " <td>Great taffy at a great price There was a wide...</td>\n",
1041
+ " <td>great taffy at a great price there was a wide...</td>\n",
1042
+ " <td>[great, taffy, at, a, great, price, there, was...</td>\n",
1043
+ " <td>[great, taffy, great, price, wide, assortment,...</td>\n",
1044
+ " </tr>\n",
1045
+ " </tbody>\n",
1046
+ "</table>\n",
1047
+ "</div>"
1048
+ ],
1049
+ "text/plain": [
1050
+ " Id ProductId UserId ProfileName \\\n",
1051
+ "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n",
1052
+ "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n",
1053
+ "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n",
1054
+ "3 4 B000UA0QIQ A395BORC6FGVXV Karl \n",
1055
+ "4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham \"M. Wassir\" \n",
1056
+ "\n",
1057
+ " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n",
1058
+ "0 1 1 5 1303862400 \n",
1059
+ "1 0 0 1 1346976000 \n",
1060
+ "2 1 1 4 1219017600 \n",
1061
+ "3 3 3 2 1307923200 \n",
1062
+ "4 0 0 5 1350777600 \n",
1063
+ "\n",
1064
+ " Summary Text \\\n",
1065
+ "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n",
1066
+ "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n",
1067
+ "2 \"Delight\" says it all This is a confection that has been around a fe... \n",
1068
+ "3 Cough Medicine If you are looking for the secret ingredient i... \n",
1069
+ "4 Great taffy Great taffy at a great price. There was a wid... \n",
1070
+ "\n",
1071
+ " clean_msg \\\n",
1072
+ "0 I have bought several of the Vitality canned d... \n",
1073
+ "1 Product arrived labeled as Jumbo Salted Peanut... \n",
1074
+ "2 This is a confection that has been around a fe... \n",
1075
+ "3 If you are looking for the secret ingredient i... \n",
1076
+ "4 Great taffy at a great price There was a wide... \n",
1077
+ "\n",
1078
+ " text_lower \\\n",
1079
+ "0 i have bought several of the vitality canned d... \n",
1080
+ "1 product arrived labeled as jumbo salted peanut... \n",
1081
+ "2 this is a confection that has been around a fe... \n",
1082
+ "3 if you are looking for the secret ingredient i... \n",
1083
+ "4 great taffy at a great price there was a wide... \n",
1084
+ "\n",
1085
+ " msg_tokenied \\\n",
1086
+ "0 [i, have, bought, several, of, the, vitality, ... \n",
1087
+ "1 [product, arrived, labeled, as, jumbo, salted,... \n",
1088
+ "2 [this, is, a, confection, that, has, been, aro... \n",
1089
+ "3 [if, you, are, looking, for, the, secret, ingr... \n",
1090
+ "4 [great, taffy, at, a, great, price, there, was... \n",
1091
+ "\n",
1092
+ " no_stopwords \n",
1093
+ "0 [bought, several, vitality, canned, dog, food,... \n",
1094
+ "1 [product, arrived, labeled, jumbo, salted, pea... \n",
1095
+ "2 [confection, around, centuries, light, pillowy... \n",
1096
+ "3 [looking, secret, ingredient, robitussin, beli... \n",
1097
+ "4 [great, taffy, great, price, wide, assortment,... "
1098
+ ]
1099
+ },
1100
+ "execution_count": 70,
1101
+ "metadata": {},
1102
+ "output_type": "execute_result"
1103
+ }
1104
+ ],
1105
+ "source": [
1106
+ "data['no_stopwords']= data['msg_tokenied'].apply(lambda x : remove_stopwords(x))"
1107
+ ]
1108
+ },
1109
+ {
1110
+ "cell_type": "code",
1111
+ "execution_count": 78,
1112
+ "id": "1d880d9f",
1113
+ "metadata": {},
1114
+ "outputs": [
1115
+ {
1116
+ "data": {
1117
+ "text/html": [
1118
+ "<div>\n",
1119
+ "<style scoped>\n",
1120
+ " .dataframe tbody tr th:only-of-type {\n",
1121
+ " vertical-align: middle;\n",
1122
+ " }\n",
1123
+ "\n",
1124
+ " .dataframe tbody tr th {\n",
1125
+ " vertical-align: top;\n",
1126
+ " }\n",
1127
+ "\n",
1128
+ " .dataframe thead th {\n",
1129
+ " text-align: right;\n",
1130
+ " }\n",
1131
+ "</style>\n",
1132
+ "<table border=\"1\" class=\"dataframe\">\n",
1133
+ " <thead>\n",
1134
+ " <tr style=\"text-align: right;\">\n",
1135
+ " <th></th>\n",
1136
+ " <th>no_stopwords</th>\n",
1137
+ " <th>clean_msg</th>\n",
1138
+ " </tr>\n",
1139
+ " </thead>\n",
1140
+ " <tbody>\n",
1141
+ " <tr>\n",
1142
+ " <th>47115</th>\n",
1143
+ " <td>[3rd, case, jello, instant, sugar, free, puddi...</td>\n",
1144
+ " <td>This is the 3rd case of Jello Instant Sugar Fr...</td>\n",
1145
+ " </tr>\n",
1146
+ " </tbody>\n",
1147
+ "</table>\n",
1148
+ "</div>"
1149
+ ],
1150
+ "text/plain": [
1151
+ " no_stopwords \\\n",
1152
+ "47115 [3rd, case, jello, instant, sugar, free, puddi... \n",
1153
+ "\n",
1154
+ " clean_msg \n",
1155
+ "47115 This is the 3rd case of Jello Instant Sugar Fr... "
1156
+ ]
1157
+ },
1158
+ "execution_count": 78,
1159
+ "metadata": {},
1160
+ "output_type": "execute_result"
1161
+ }
1162
+ ],
1163
+ "source": [
1164
+ "data[['no_stopwords','clean_msg']].sample()\n",
1165
+ "#data[['msg_tokenied','clean_msg','text_lower','Text']].sample(1)"
1166
+ ]
1167
+ },
1168
+ {
1169
+ "cell_type": "code",
1170
+ "execution_count": 80,
1171
+ "id": "e1a4f052",
1172
+ "metadata": {},
1173
+ "outputs": [
1174
+ {
1175
+ "name": "stdout",
1176
+ "output_type": "stream",
1177
+ "text": [
1178
+ "['bought', 'several', 'vitality', 'canned', 'dog', 'food', 'products', 'found', 'good', 'quality', 'product', 'looks', 'like', 'stew', 'processed', 'meat', 'smells', 'better', 'labrador', 'finicky', 'appreciates', 'product', 'better']\n"
1179
+ ]
1180
+ }
1181
+ ],
1182
+ "source": [
1183
+ "print(data.no_stopwords[0])"
1184
+ ]
1185
+ },
1186
+ {
1187
+ "cell_type": "code",
1188
+ "execution_count": 82,
1189
+ "id": "8a7ecdda",
1190
+ "metadata": {
1191
+ "collapsed": true
1192
+ },
1193
+ "outputs": [
1194
+ {
1195
+ "name": "stdout",
1196
+ "output_type": "stream",
1197
+ "text": [
1198
+ "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.\n"
1199
+ ]
1200
+ }
1201
+ ],
1202
+ "source": [
1203
+ "print(data.Text[0])"
1204
+ ]
1205
+ },
1206
+ {
1207
+ "cell_type": "code",
1208
+ "execution_count": 83,
1209
+ "id": "03cd965e",
1210
+ "metadata": {},
1211
+ "outputs": [],
1212
+ "source": [
1213
+ "\n",
1214
+ "#importing the Porter stemmer class from the nltk library\n",
1215
+ "from nltk.stem.porter import PorterStemmer\n",
1216
+ "#defining the object for stemming\n",
1217
+ "porter_stemmer = PorterStemmer()"
1218
+ ]
1219
+ },
1220
+ {
1221
+ "cell_type": "code",
1222
+ "execution_count": 84,
1223
+ "id": "c983954d",
1224
+ "metadata": {},
1225
+ "outputs": [
1226
+ {
1227
+ "data": {
1228
+ "text/plain": [
1229
+ "<PorterStemmer>"
1230
+ ]
1231
+ },
1232
+ "execution_count": 84,
1233
+ "metadata": {},
1234
+ "output_type": "execute_result"
1235
+ }
1236
+ ],
1237
+ "source": [
1238
+ "porter_stemmer"
1239
+ ]
1240
+ },
1241
+ {
1242
+ "cell_type": "markdown",
1243
+ "id": "5b57e251",
1244
+ "metadata": {},
1245
+ "source": [
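1246
+ "Stemming is the process of reducing a word to its stem, the base form to which suffixes and prefixes attach, by stripping those affixes. Unlike a lemma, a stem need not be a dictionary word: the Porter stemmer used here turns \"quality\" into \"qualiti\". Stemming is a common normalization step in natural language understanding (NLU) and natural language processing (NLP)."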
1247
+ ]
1248
+ },
1249
+ {
1250
+ "cell_type": "code",
1251
+ "execution_count": 86,
1252
+ "id": "54d61f4e",
1253
+ "metadata": {},
1254
+ "outputs": [],
1255
+ "source": [
1256
+ "#defining a function for stemming\n",
1257
+ "def stemming(text):\n",
1258
+ " stem_text = [porter_stemmer.stem(word) for word in text]\n",
1259
+ " return stem_text\n",
1260
+ "data['msg_stemmed']=data['no_stopwords'].apply(lambda x: stemming(x))"
1261
+ ]
1262
+ },
1263
+ {
1264
+ "cell_type": "code",
1265
+ "execution_count": 90,
1266
+ "id": "5f797052",
1267
+ "metadata": {},
1268
+ "outputs": [
1269
+ {
1270
+ "name": "stdout",
1271
+ "output_type": "stream",
1272
+ "text": [
1273
+ "['bought', 'sever', 'vital', 'can', 'dog', 'food', 'product', 'found', 'good', 'qualiti', 'product', 'look', 'like', 'stew', 'process', 'meat', 'smell', 'better', 'labrador', 'finicki', 'appreci', 'product', 'better']\n"
1274
+ ]
1275
+ }
1276
+ ],
1277
+ "source": [
1278
+ "\"\"\"\n",
1279
+ "Inflectional morphemes are suffixes that are added to a word to assign a particular grammatical property to that word.\n",
1280
+ "They are grammatical markers that indicate tense, number,\n",
1281
+ "person, and so on. In simpler terms, inflectional morphemes\n",
1282
+ "are morphemes that modify a word's tense, aspect, mood, person,\n",
1283
+ "number (singular and plural), gender, or case, without affecting the word's meaning or POS.\n",
1284
+ "\n",
1285
+ "\"\"\"\n",
1286
+ "\n",
1287
+ "print(data.msg_stemmed[0])"
1288
+ ]
1289
+ },
1290
+ {
1291
+ "cell_type": "markdown",
1292
+ "id": "66c04a5c",
1293
+ "metadata": {},
1294
+ "source": [
1295
+ "Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on its meaning and context. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word known as the lemma.\n",
1296
+ "\n",
1297
+ "\n",
1298
+ "\n",
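1299
+ "WordNet lemmatizer with NLTK: WordNet is a large, freely and publicly available lexical database for the English language that aims to establish structured semantic relationships between words. NLTK builds a lemmatizer on top of it, which is one of the earliest and most commonly used lemmatizers. Note that `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part-of-speech tag is passed, so verb forms such as \"appreciates\" are left unchanged."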
1300
+ ]
1301
+ },
1302
+ {
1303
+ "cell_type": "code",
1304
+ "execution_count": 96,
1305
+ "id": "95c6ab11",
1306
+ "metadata": {},
1307
+ "outputs": [
1308
+ {
1309
+ "name": "stdout",
1310
+ "output_type": "stream",
1311
+ "text": [
1312
+ "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n"
1313
+ ]
1314
+ },
1315
+ {
1316
+ "data": {
1317
+ "text/plain": [
1318
+ "True"
1319
+ ]
1320
+ },
1321
+ "execution_count": 96,
1322
+ "metadata": {},
1323
+ "output_type": "execute_result"
1324
+ }
1325
+ ],
1326
+ "source": [
1327
+ "import nltk\n",
1328
+ "nltk.download()"
1329
+ ]
1330
+ },
1331
+ {
1332
+ "cell_type": "code",
1333
+ "execution_count": 97,
1334
+ "id": "1a6116ab",
1335
+ "metadata": {},
1336
+ "outputs": [],
1337
+ "source": [
1338
+ "from nltk.stem import WordNetLemmatizer\n",
1339
+ "#defining the object for Lemmatization\n",
1340
+ "wordnet_lemmatizer = WordNetLemmatizer()"
1341
+ ]
1342
+ },
1343
+ {
1344
+ "cell_type": "code",
1345
+ "execution_count": 98,
1346
+ "id": "13674f48",
1347
+ "metadata": {},
1348
+ "outputs": [],
1349
+ "source": [
1350
+ "def lemmatizer(text):\n",
1351
+ " lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]\n",
1352
+ " return lemm_text\n",
1353
+ "\n",
1354
+ "data['msg_lemmatized']=data['no_stopwords'].apply(lambda x:lemmatizer(x))"
1355
+ ]
1356
+ },
1357
+ {
1358
+ "cell_type": "code",
1359
+ "execution_count": 101,
1360
+ "id": "83ee5f90",
1361
+ "metadata": {},
1362
+ "outputs": [
1363
+ {
1364
+ "name": "stdout",
1365
+ "output_type": "stream",
1366
+ "text": [
1367
+ "['bought', 'several', 'vitality', 'canned', 'dog', 'food', 'product', 'found', 'good', 'quality', 'product', 'look', 'like', 'stew', 'processed', 'meat', 'smell', 'better', 'labrador', 'finicky', 'appreciates', 'product', 'better']\n"
1368
+ ]
1369
+ }
1370
+ ],
1371
+ "source": [
1372
+ "print(data.msg_lemmatized[0])"
1373
+ ]
1374
+ },
1375
+ {
1376
+ "cell_type": "code",
1377
+ "execution_count": 102,
1378
+ "id": "ba7f4834",
1379
+ "metadata": {},
1380
+ "outputs": [
1381
+ {
1382
+ "data": {
1383
+ "text/plain": [
1384
+ "(568454, 16)"
1385
+ ]
1386
+ },
1387
+ "execution_count": 102,
1388
+ "metadata": {},
1389
+ "output_type": "execute_result"
1390
+ }
1391
+ ],
1392
+ "source": [
1393
+ "data.shape"
1394
+ ]
1395
+ },
1396
+ {
1397
+ "cell_type": "code",
1398
+ "execution_count": 124,
1399
+ "id": "9e6dc658",
1400
+ "metadata": {},
1401
+ "outputs": [
1402
+ {
1403
+ "data": {
1404
+ "text/html": [
1405
+ "<div>\n",
1406
+ "<style scoped>\n",
1407
+ " .dataframe tbody tr th:only-of-type {\n",
1408
+ " vertical-align: middle;\n",
1409
+ " }\n",
1410
+ "\n",
1411
+ " .dataframe tbody tr th {\n",
1412
+ " vertical-align: top;\n",
1413
+ " }\n",
1414
+ "\n",
1415
+ " .dataframe thead th {\n",
1416
+ " text-align: right;\n",
1417
+ " }\n",
1418
+ "</style>\n",
1419
+ "<table border=\"1\" class=\"dataframe\">\n",
1420
+ " <thead>\n",
1421
+ " <tr style=\"text-align: right;\">\n",
1422
+ " <th></th>\n",
1423
+ " <th>clean_msg</th>\n",
1424
+ " <th>text_lower</th>\n",
1425
+ " <th>msg_tokenied</th>\n",
1426
+ " <th>no_stopwords</th>\n",
1427
+ " <th>msg_stemmed</th>\n",
1428
+ " <th>msg_lemmatized</th>\n",
1429
+ " </tr>\n",
1430
+ " </thead>\n",
1431
+ " <tbody>\n",
1432
+ " <tr>\n",
1433
+ " <th>0</th>\n",
1434
+ " <td>I have bought several of the Vitality canned d...</td>\n",
1435
+ " <td>i have bought several of the vitality canned d...</td>\n",
1436
+ " <td>[i, have, bought, several, of, the, vitality, ...</td>\n",
1437
+ " <td>[bought, several, vitality, canned, dog, food,...</td>\n",
1438
+ " <td>[bought, sever, vital, can, dog, food, product...</td>\n",
1439
+ " <td>[bought, several, vitality, canned, dog, food,...</td>\n",
1440
+ " </tr>\n",
1441
+ " <tr>\n",
1442
+ " <th>1</th>\n",
1443
+ " <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
1444
+ " <td>product arrived labeled as jumbo salted peanut...</td>\n",
1445
+ " <td>[product, arrived, labeled, as, jumbo, salted,...</td>\n",
1446
+ " <td>[product, arrived, labeled, jumbo, salted, pea...</td>\n",
1447
+ " <td>[product, arriv, label, jumbo, salt, peanutsth...</td>\n",
1448
+ " <td>[product, arrived, labeled, jumbo, salted, pea...</td>\n",
1449
+ " </tr>\n",
1450
+ " <tr>\n",
1451
+ " <th>2</th>\n",
1452
+ " <td>This is a confection that has been around a fe...</td>\n",
1453
+ " <td>this is a confection that has been around a fe...</td>\n",
1454
+ " <td>[this, is, a, confection, that, has, been, aro...</td>\n",
1455
+ " <td>[confection, around, centuries, light, pillowy...</td>\n",
1456
+ " <td>[confect, around, centuri, light, pillowi, cit...</td>\n",
1457
+ " <td>[confection, around, century, light, pillowy, ...</td>\n",
1458
+ " </tr>\n",
1459
+ " <tr>\n",
1460
+ " <th>3</th>\n",
1461
+ " <td>If you are looking for the secret ingredient i...</td>\n",
1462
+ " <td>if you are looking for the secret ingredient i...</td>\n",
1463
+ " <td>[if, you, are, looking, for, the, secret, ingr...</td>\n",
1464
+ " <td>[looking, secret, ingredient, robitussin, beli...</td>\n",
1465
+ " <td>[look, secret, ingredi, robitussin, believ, fo...</td>\n",
1466
+ " <td>[looking, secret, ingredient, robitussin, beli...</td>\n",
1467
+ " </tr>\n",
1468
+ " <tr>\n",
1469
+ " <th>4</th>\n",
1470
+ " <td>Great taffy at a great price There was a wide...</td>\n",
1471
+ " <td>great taffy at a great price there was a wide...</td>\n",
1472
+ " <td>[great, taffy, at, a, great, price, there, was...</td>\n",
1473
+ " <td>[great, taffy, great, price, wide, assortment,...</td>\n",
1474
+ " <td>[great, taffi, great, price, wide, assort, yum...</td>\n",
1475
+ " <td>[great, taffy, great, price, wide, assortment,...</td>\n",
1476
+ " </tr>\n",
1477
+ " <tr>\n",
1478
+ " <th>5</th>\n",
1479
+ " <td>I got a wild hair for taffy and ordered this f...</td>\n",
1480
+ " <td>i got a wild hair for taffy and ordered this f...</td>\n",
1481
+ " <td>[i, got, a, wild, hair, for, taffy, and, order...</td>\n",
1482
+ " <td>[got, wild, hair, taffy, ordered, five, pound,...</td>\n",
1483
+ " <td>[got, wild, hair, taffi, order, five, pound, b...</td>\n",
1484
+ " <td>[got, wild, hair, taffy, ordered, five, pound,...</td>\n",
1485
+ " </tr>\n",
1486
+ " <tr>\n",
1487
+ " <th>6</th>\n",
1488
+ " <td>This saltwater taffy had great flavors and was...</td>\n",
1489
+ " <td>this saltwater taffy had great flavors and was...</td>\n",
1490
+ " <td>[this, saltwater, taffy, had, great, flavors, ...</td>\n",
1491
+ " <td>[saltwater, taffy, great, flavors, soft, chewy...</td>\n",
1492
+ " <td>[saltwat, taffi, great, flavor, soft, chewi, c...</td>\n",
1493
+ " <td>[saltwater, taffy, great, flavor, soft, chewy,...</td>\n",
1494
+ " </tr>\n",
1495
+ " <tr>\n",
1496
+ " <th>7</th>\n",
1497
+ " <td>This taffy is so good It is very soft and che...</td>\n",
1498
+ " <td>this taffy is so good it is very soft and che...</td>\n",
1499
+ " <td>[this, taffy, is, so, good, it, is, very, soft...</td>\n",
1500
+ " <td>[taffy, good, soft, chewy, flavors, amazing, w...</td>\n",
1501
+ " <td>[taffi, good, soft, chewi, flavor, amaz, would...</td>\n",
1502
+ " <td>[taffy, good, soft, chewy, flavor, amazing, wo...</td>\n",
1503
+ " </tr>\n",
1504
+ " <tr>\n",
1505
+ " <th>8</th>\n",
1506
+ " <td>Right now Im mostly just sprouting this so my ...</td>\n",
1507
+ " <td>right now im mostly just sprouting this so my ...</td>\n",
1508
+ " <td>[right, now, im, mostly, just, sprouting, this...</td>\n",
1509
+ " <td>[right, im, mostly, sprouting, cats, eat, gras...</td>\n",
1510
+ " <td>[right, im, mostli, sprout, cat, eat, grass, l...</td>\n",
1511
+ " <td>[right, im, mostly, sprouting, cat, eat, grass...</td>\n",
1512
+ " </tr>\n",
1513
+ " <tr>\n",
1514
+ " <th>9</th>\n",
1515
+ " <td>This is a very healthy dog food Good for their...</td>\n",
1516
+ " <td>this is a very healthy dog food good for their...</td>\n",
1517
+ " <td>[this, is, a, very, healthy, dog, food, good, ...</td>\n",
1518
+ " <td>[healthy, dog, food, good, digestion, also, go...</td>\n",
1519
+ " <td>[healthi, dog, food, good, digest, also, good,...</td>\n",
1520
+ " <td>[healthy, dog, food, good, digestion, also, go...</td>\n",
1521
+ " </tr>\n",
1522
+ " </tbody>\n",
1523
+ "</table>\n",
1524
+ "</div>"
1525
+ ],
1526
+ "text/plain": [
1527
+ " clean_msg \\\n",
1528
+ "0 I have bought several of the Vitality canned d... \n",
1529
+ "1 Product arrived labeled as Jumbo Salted Peanut... \n",
1530
+ "2 This is a confection that has been around a fe... \n",
1531
+ "3 If you are looking for the secret ingredient i... \n",
1532
+ "4 Great taffy at a great price There was a wide... \n",
1533
+ "5 I got a wild hair for taffy and ordered this f... \n",
1534
+ "6 This saltwater taffy had great flavors and was... \n",
1535
+ "7 This taffy is so good It is very soft and che... \n",
1536
+ "8 Right now Im mostly just sprouting this so my ... \n",
1537
+ "9 This is a very healthy dog food Good for their... \n",
1538
+ "\n",
1539
+ " text_lower \\\n",
1540
+ "0 i have bought several of the vitality canned d... \n",
1541
+ "1 product arrived labeled as jumbo salted peanut... \n",
1542
+ "2 this is a confection that has been around a fe... \n",
1543
+ "3 if you are looking for the secret ingredient i... \n",
1544
+ "4 great taffy at a great price there was a wide... \n",
1545
+ "5 i got a wild hair for taffy and ordered this f... \n",
1546
+ "6 this saltwater taffy had great flavors and was... \n",
1547
+ "7 this taffy is so good it is very soft and che... \n",
1548
+ "8 right now im mostly just sprouting this so my ... \n",
1549
+ "9 this is a very healthy dog food good for their... \n",
1550
+ "\n",
1551
+ " msg_tokenied \\\n",
1552
+ "0 [i, have, bought, several, of, the, vitality, ... \n",
1553
+ "1 [product, arrived, labeled, as, jumbo, salted,... \n",
1554
+ "2 [this, is, a, confection, that, has, been, aro... \n",
1555
+ "3 [if, you, are, looking, for, the, secret, ingr... \n",
1556
+ "4 [great, taffy, at, a, great, price, there, was... \n",
1557
+ "5 [i, got, a, wild, hair, for, taffy, and, order... \n",
1558
+ "6 [this, saltwater, taffy, had, great, flavors, ... \n",
1559
+ "7 [this, taffy, is, so, good, it, is, very, soft... \n",
1560
+ "8 [right, now, im, mostly, just, sprouting, this... \n",
1561
+ "9 [this, is, a, very, healthy, dog, food, good, ... \n",
1562
+ "\n",
1563
+ " no_stopwords \\\n",
1564
+ "0 [bought, several, vitality, canned, dog, food,... \n",
1565
+ "1 [product, arrived, labeled, jumbo, salted, pea... \n",
1566
+ "2 [confection, around, centuries, light, pillowy... \n",
1567
+ "3 [looking, secret, ingredient, robitussin, beli... \n",
1568
+ "4 [great, taffy, great, price, wide, assortment,... \n",
1569
+ "5 [got, wild, hair, taffy, ordered, five, pound,... \n",
1570
+ "6 [saltwater, taffy, great, flavors, soft, chewy... \n",
1571
+ "7 [taffy, good, soft, chewy, flavors, amazing, w... \n",
1572
+ "8 [right, im, mostly, sprouting, cats, eat, gras... \n",
1573
+ "9 [healthy, dog, food, good, digestion, also, go... \n",
1574
+ "\n",
1575
+ " msg_stemmed \\\n",
1576
+ "0 [bought, sever, vital, can, dog, food, product... \n",
1577
+ "1 [product, arriv, label, jumbo, salt, peanutsth... \n",
1578
+ "2 [confect, around, centuri, light, pillowi, cit... \n",
1579
+ "3 [look, secret, ingredi, robitussin, believ, fo... \n",
1580
+ "4 [great, taffi, great, price, wide, assort, yum... \n",
1581
+ "5 [got, wild, hair, taffi, order, five, pound, b... \n",
1582
+ "6 [saltwat, taffi, great, flavor, soft, chewi, c... \n",
1583
+ "7 [taffi, good, soft, chewi, flavor, amaz, would... \n",
1584
+ "8 [right, im, mostli, sprout, cat, eat, grass, l... \n",
1585
+ "9 [healthi, dog, food, good, digest, also, good,... \n",
1586
+ "\n",
1587
+ " msg_lemmatized \n",
1588
+ "0 [bought, several, vitality, canned, dog, food,... \n",
1589
+ "1 [product, arrived, labeled, jumbo, salted, pea... \n",
1590
+ "2 [confection, around, century, light, pillowy, ... \n",
1591
+ "3 [looking, secret, ingredient, robitussin, beli... \n",
1592
+ "4 [great, taffy, great, price, wide, assortment,... \n",
1593
+ "5 [got, wild, hair, taffy, ordered, five, pound,... \n",
1594
+ "6 [saltwater, taffy, great, flavor, soft, chewy,... \n",
1595
+ "7 [taffy, good, soft, chewy, flavor, amazing, wo... \n",
1596
+ "8 [right, im, mostly, sprouting, cat, eat, grass... \n",
1597
+ "9 [healthy, dog, food, good, digestion, also, go... "
1598
+ ]
1599
+ },
1600
+ "execution_count": 124,
1601
+ "metadata": {},
1602
+ "output_type": "execute_result"
1603
+ }
1604
+ ],
1605
+ "source": [
1606
+ "data.iloc[:10,10:16]"
1607
+ ]
1608
+ },
1609
+ {
1610
+ "cell_type": "code",
1611
+ "execution_count": 125,
1612
+ "id": "7666b9a9",
1613
+ "metadata": {},
1614
+ "outputs": [
1615
+ {
1616
+ "ename": "ModuleNotFoundError",
1617
+ "evalue": "No module named 'huggingface_hub'",
1618
+ "output_type": "error",
1619
+ "traceback": [
1620
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1621
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
1622
+ "Input \u001b[0;32mIn [125]\u001b[0m, in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mhuggingface_hub\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m login\n\u001b[1;32m 2\u001b[0m login()\n",
1623
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'huggingface_hub'"
1624
+ ]
1625
+ }
1626
+ ],
1627
+ "source": [
1628
+ "from huggingface_hub import login\n",
1629
+ "login()"
1630
+ ]
1631
+ },
1632
+ {
1633
+ "cell_type": "code",
1634
+ "execution_count": 126,
1635
+ "id": "1d4e9483",
1636
+ "metadata": {},
1637
+ "outputs": [
1638
+ {
1639
+ "ename": "SyntaxError",
1640
+ "evalue": "invalid syntax (1355118443.py, line 1)",
1641
+ "output_type": "error",
1642
+ "traceback": [
1643
+ "\u001b[0;36m Input \u001b[0;32mIn [126]\u001b[0;36m\u001b[0m\n\u001b[0;31m huggingface-cli login\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
1644
+ ]
1645
+ }
1646
+ ],
1647
+ "source": [
1648
+ "huggingface-cli login"
1649
+ ]
1650
+ },
1651
+ {
1652
+ "cell_type": "code",
1653
+ "execution_count": 127,
1654
+ "id": "30fffe0c",
1655
+ "metadata": {},
1656
+ "outputs": [
1657
+ {
1658
+ "ename": "ModuleNotFoundError",
1659
+ "evalue": "No module named 'huggingface_hub'",
1660
+ "output_type": "error",
1661
+ "traceback": [
1662
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1663
+ "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",
1664
+ "Input \u001b[0;32mIn [127]\u001b[0m, in \u001b[0;36m<cell line: 1>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mhuggingface_hub\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m HfApi\n\u001b[1;32m 2\u001b[0m api \u001b[38;5;241m=\u001b[39m HfApi()\n\u001b[1;32m 3\u001b[0m api\u001b[38;5;241m.\u001b[39mupload_folder(\n\u001b[1;32m 4\u001b[0m folder_path\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m/path/to/local/folder\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 5\u001b[0m path_in_repo\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmy-dataset/train\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 8\u001b[0m ignore_patterns\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m**/logs/*.txt\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[1;32m 9\u001b[0m )\n",
1665
+ "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'huggingface_hub'"
1666
+ ]
1667
+ }
1668
+ ],
1669
+ "source": [
1670
+ "from huggingface_hub import HfApi\n",
1671
+ "api = HfApi()\n",
1672
+ "api.upload_folder(\n",
1673
+ "folder_path=\"/path/to/local/folder\",\n",
1674
+ "path_in_repo=\"my-dataset/train\",\n",
1675
+ "repo_id=\"username/test-dataset\",\n",
1676
+ "repo_type=\"dataset\",\n",
1677
+ "ignore_patterns=\"**/logs/*.txt\",\n",
1678
+ ")"
1679
+ ]
1680
+ },
1681
+ {
1682
+ "cell_type": "code",
1683
+ "execution_count": 128,
1684
+ "id": "e66e1d40",
1685
+ "metadata": {},
1686
+ "outputs": [
1687
+ {
1688
+ "name": "stdout",
1689
+ "output_type": "stream",
1690
+ "text": [
1691
+ "Collecting package metadata (current_repodata.json): done\n",
1692
+ "Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
1693
+ "Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.\n",
1694
+ "Collecting package metadata (repodata.json): done\n",
1695
+ "Solving environment: / \n",
1696
+ "The environment is inconsistent, please check the package plan carefully\n",
1697
+ "The following packages are causing the inconsistency:\n",
1698
+ "\n",
1699
+ " - defaults/osx-64::anaconda==2022.05=py39_0\n",
1700
+ " - defaults/osx-64::conda-build==3.21.8=py39hecd8cb5_2\n",
1701
+ " - defaults/osx-64::bcrypt==3.2.0=py39h9ed2024_0\n",
1702
+ " - defaults/osx-64::scrapy==2.6.1=py39hecd8cb5_0\n",
1703
+ " - defaults/osx-64::twisted==22.2.0=py39hca72f7f_0\n",
1704
+ "done\n",
1705
+ "\n",
1706
+ "## Package Plan ##\n",
+ "\n",
+ " environment location: /opt/anaconda3\n",
+ "\n",
+ " added / updated specs:\n",
+ " - tokenizers==0.10.3\n",
+ " - transformers==4.14.1\n",
+ "\n",
+ "\n",
+ "The following packages will be downloaded:\n",
+ "\n",
+ " package | build\n",
+ " ---------------------------|-----------------\n",
+ " _anaconda_depends-2022.10 | py39_2 69 KB\n",
+ " anaconda-custom | py39_1 4 KB\n",
+ " ca-certificates-2023.01.10 | hecd8cb5_0 121 KB\n",
+ " cctools-949.0.1 | h9abeeb2_25 18 KB\n",
+ " cctools_osx-64-949.0.1 | hc7db93f_25 1.3 MB\n",
+ " certifi-2022.12.7 | py39hecd8cb5_0 151 KB\n",
+ " conda-23.1.0 | py39hecd8cb5_0 938 KB\n",
+ " conda-build-3.23.3 | py39hecd8cb5_0 568 KB\n",
+ " huggingface_hub-0.12.0 | py_0 185 KB huggingface\n",
+ " ld64-530 | h20443b4_25 16 KB\n",
+ " ld64_osx-64-530 | h70f3046_25 920 KB\n",
+ " ldid-2.1.2 | h2d21305_2 54 KB\n",
+ " libllvm14-14.0.6 | he552d86_0 21.3 MB\n",
+ " ninja-1.10.2 | hecd8cb5_5 9 KB\n",
+ " ninja-base-1.10.2 | haf03e11_5 118 KB\n",
+ " openssl-1.1.1s | hca72f7f_0 2.8 MB\n",
+ " patch-2.7.6 | h1de35cc_1001 128 KB\n",
+ " pip-22.3.1 | py39hecd8cb5_0 2.7 MB\n",
+ " pytorch-1.10.2 |cpu_py39h903acac_0 53.9 MB\n",
+ " ruamel.yaml-0.17.21 | py39hca72f7f_0 179 KB\n",
+ " ruamel.yaml.clib-0.2.6 | py39hca72f7f_1 126 KB\n",
+ " sacremoses-master | py_0 404 KB huggingface\n",
+ " tapi-1000.10.8 | ha1b3eb9_0 4.2 MB\n",
+ " tokenizers-0.10.3 | py39h7bafbf5_1 1.6 MB\n",
+ " transformers-4.14.1 | pyhd3eb1b0_0 1.0 MB\n",
+ " ------------------------------------------------------------\n",
+ " Total: 92.9 MB\n",
+ "\n",
+ "The following NEW packages will be INSTALLED:\n",
+ "\n",
+ " _anaconda_depends pkgs/main/osx-64::_anaconda_depends-2022.10-py39_2\n",
+ " cctools pkgs/main/osx-64::cctools-949.0.1-h9abeeb2_25\n",
+ " cctools_osx-64 pkgs/main/osx-64::cctools_osx-64-949.0.1-hc7db93f_25\n",
+ " huggingface_hub huggingface/noarch::huggingface_hub-0.12.0-py_0\n",
+ " ld64 pkgs/main/osx-64::ld64-530-h20443b4_25\n",
+ " ld64_osx-64 pkgs/main/osx-64::ld64_osx-64-530-h70f3046_25\n",
+ " ldid pkgs/main/osx-64::ldid-2.1.2-h2d21305_2\n",
+ " libllvm14 pkgs/main/osx-64::libllvm14-14.0.6-he552d86_0\n",
+ " ninja pkgs/main/osx-64::ninja-1.10.2-hecd8cb5_5\n",
+ " ninja-base pkgs/main/osx-64::ninja-base-1.10.2-haf03e11_5\n",
+ " patch pkgs/main/osx-64::patch-2.7.6-h1de35cc_1001\n",
+ " pip pkgs/main/osx-64::pip-22.3.1-py39hecd8cb5_0\n",
+ " pytorch pkgs/main/osx-64::pytorch-1.10.2-cpu_py39h903acac_0\n",
+ " ruamel.yaml pkgs/main/osx-64::ruamel.yaml-0.17.21-py39hca72f7f_0\n",
+ " ruamel.yaml.clib pkgs/main/osx-64::ruamel.yaml.clib-0.2.6-py39hca72f7f_1\n",
+ " sacremoses huggingface/noarch::sacremoses-master-py_0\n",
+ " tapi pkgs/main/osx-64::tapi-1000.10.8-ha1b3eb9_0\n",
+ " tokenizers pkgs/main/osx-64::tokenizers-0.10.3-py39h7bafbf5_1\n",
+ " transformers pkgs/main/noarch::transformers-4.14.1-pyhd3eb1b0_0\n",
+ "\n",
+ "The following packages will be UPDATED:\n",
+ "\n",
+ " ca-certificates 2022.3.29-hecd8cb5_1 --> 2023.01.10-hecd8cb5_0\n",
+ " certifi 2021.10.8-py39hecd8cb5_2 --> 2022.12.7-py39hecd8cb5_0\n",
+ " conda 4.14.0-py39hecd8cb5_0 --> 23.1.0-py39hecd8cb5_0\n",
+ " conda-build 3.21.8-py39hecd8cb5_2 --> 3.23.3-py39hecd8cb5_0\n",
+ " openssl 1.1.1n-hca72f7f_0 --> 1.1.1s-hca72f7f_0\n",
+ "\n",
+ "The following packages will be DOWNGRADED:\n",
+ "\n",
+ " anaconda 2022.05-py39_0 --> custom-py39_1\n",
+ "\n",
+ "\n",
+ "\n",
+ "Downloading and Extracting Packages\n",
+ "sacremoses-master | 404 KB | ##################################### | 100% \n",
+ "patch-2.7.6 | 128 KB | ##################################### | 100% \n",
+ "ruamel.yaml.clib-0.2 | 126 KB | ##################################### | 100% \n",
+ "transformers-4.14.1 | 1.0 MB | ##################################### | 100% \n",
+ "openssl-1.1.1s | 2.8 MB | ##################################### | 100% \n",
+ "pip-22.3.1 | 2.7 MB | ##################################### | 100% \n",
+ "ninja-base-1.10.2 | 118 KB | ##################################### | 100% \n",
+ "_anaconda_depends-20 | 69 KB | ##################################### | 100% \n",
+ "conda-23.1.0 | 938 KB | ##################################### | 100% \n",
+ "conda-build-3.23.3 | 568 KB | ##################################### | 100% \n",
+ "pytorch-1.10.2 | 53.9 MB | ##################################### | 100% \n",
+ "ruamel.yaml-0.17.21 | 179 KB | ##################################### | 100% \n",
+ "ca-certificates-2023 | 121 KB | ##################################### | 100% \n",
+ "anaconda-custom | 4 KB | ##################################### | 100% \n",
+ "ld64-530 | 16 KB | ##################################### | 100% \n",
+ "tapi-1000.10.8 | 4.2 MB | ##################################### | 100% \n",
+ "libllvm14-14.0.6 | 21.3 MB | ##################################### | 100% \n",
+ "cctools-949.0.1 | 18 KB | ##################################### | 100% \n",
+ "ld64_osx-64-530 | 920 KB | ##################################### | 100% \n",
+ "tokenizers-0.10.3 | 1.6 MB | ##################################### | 100% \n",
+ "ninja-1.10.2 | 9 KB | ##################################### | 100% \n",
+ "huggingface_hub-0.12 | 185 KB | ##################################### | 100% \n",
+ "cctools_osx-64-949.0 | 1.3 MB | ##################################### | 100% \n",
+ "ldid-2.1.2 | 54 KB | ##################################### | 100% \n",
+ "certifi-2022.12.7 | 151 KB | ##################################### | 100% \n",
+ "Preparing transaction: done\n",
+ "Verifying transaction: done\n",
+ "Executing transaction: done\n",
+ "Retrieving notices: ...working... done\n",
+ "\n",
+ "Note: you may need to restart the kernel to use updated packages.\n"
+ ]
+ }
+ ],
+ "source": [
+ "%conda install -c huggingface transformers==4.14.1 tokenizers==0.10.3 -y"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1c35fe28",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }