{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Sentiment Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://c.files.bbci.co.uk/2A16/production/_115547701_gettyimages-1229654243.jpg)\n", "\n", "Photo by [GETTY IMAGES]()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today's session covers sentiment analysis and introduces the Zindi NLP Project." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# I. Sentiment Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I.1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sentiment Analysis is a wide field with a single purpose: predicting how a person feels based on features. Those features can be a voice recording or a picture of a face, but most of the time in sentiment analysis they are text features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classic applications of sentiment analysis are the following:\n", "* Is this product review positive, neutral or negative?\n", "* Based on tweets on a topic, do people react positively?\n", "* Is this customer review positive or negative?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So in most cases, sentiment analysis is just a classification problem.\n", "\n", "The input features are derived from the input text, and the output targets are the classes to predict.\n", "It can be a binary classification (e.g. positive or negative review) or a multiclass classification (e.g. a 0 to 5 star rating)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I.2. 
Sentiment Analysis using NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sentiment Analysis can be done using NLP. Historically, several methods have been developed.\n", "\n", "Some basic methods rely on the polarity of individual words:\n", "* words like bad, wrong, lame, disgusting suggest negative polarity\n", "* words like good, amazing, great, delightful suggest positive polarity\n", "\n", "Unfortunately, language is more complicated than just polarity: for example, \"not bad at all\" would be scored as negative on word polarity alone, while it actually expresses a positive opinion.\n", "\n", "Modern approaches use Machine Learning models built on NLP features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do sentiment analysis, you already know all the tools:\n", "* Text preprocessing (tokenization, punctuation, stopwords, stemming/lemmatization, n-grams)\n", "* Feature computation (BOW, TF-IDF)\n", "* Classification (SVM, logistic regression...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I.3. Application: Zindi Covid-related Tweets Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now look at a short example of sentiment analysis on the Zindi Covid-related Tweets Challenge." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>tweet_id</th><th>safe_text</th><th>label</th><th>agreement</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>CL1KWCMY</td><td>Me &amp; The Big Homie meanboy3000 #MEANBOY #M...</td><td>0.0</td><td>1.000000</td></tr>\n",
" <tr><th>1</th><td>E3303EME</td><td>I'm 100% thinking of devoting my career to pro...</td><td>1.0</td><td>1.000000</td></tr>\n",
" <tr><th>2</th><td>M4IVFSMS</td><td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td><td>-1.0</td><td>1.000000</td></tr>\n",
" <tr><th>3</th><td>1DR6ROZ4</td><td>I mean if they immunize my kid with something ...</td><td>-1.0</td><td>1.000000</td></tr>\n",
" <tr><th>4</th><td>J77ENIIE</td><td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td><td>0.0</td><td>1.000000</td></tr>\n",
" <tr><th>5</th><td>OVNPOAUX</td><td>&lt;user&gt; a nearly 67 year old study when mental ...</td><td>1.0</td><td>0.666667</td></tr>\n",
" <tr><th>6</th><td>JDA2QDV5</td><td>Study of more than 95,000 kids finds no link b...</td><td>1.0</td><td>0.666667</td></tr>\n",
" <tr><th>7</th><td>S6UKR4OJ</td><td>psa: VACCINATE YOUR FUCKING KIDS</td><td>1.0</td><td>1.000000</td></tr>\n",
" <tr><th>8</th><td>V6IJATBE</td><td>Coughing extra on the shuttle and everyone thi...</td><td>1.0</td><td>0.666667</td></tr>\n",
" <tr><th>9</th><td>VB25IDQK</td><td>AIDS vaccine created at Oregon Health &amp;amp; Sc...</td><td>1.0</td><td>0.666667</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " tweet_id safe_text label \\\n", "0 CL1KWCMY Me & The Big Homie meanboy3000 #MEANBOY #M... 0.0 \n", "1 E3303EME I'm 100% thinking of devoting my career to pro... 1.0 \n", "2 M4IVFSMS #whatcausesautism VACCINES, DO NOT VACCINATE Y... -1.0 \n", "3 1DR6ROZ4 I mean if they immunize my kid with something ... -1.0 \n", "4 J77ENIIE Thanks to Catch me performing at La Nui... 0.0 \n", "5 OVNPOAUX a nearly 67 year old study when mental ... 1.0 \n", "6 JDA2QDV5 Study of more than 95,000 kids finds no link b... 1.0 \n", "7 S6UKR4OJ psa: VACCINATE YOUR FUCKING KIDS 1.0 \n", "8 V6IJATBE Coughing extra on the shuttle and everyone thi... 1.0 \n", "9 VB25IDQK AIDS vaccine created at Oregon Health & Sc... 1.0 \n", "\n", " agreement \n", "0 1.000000 \n", "1 1.000000 \n", "2 1.000000 \n", "3 1.000000 \n", "4 1.000000 \n", "5 0.666667 \n", "6 0.666667 \n", "7 1.000000 \n", "8 0.666667 \n", "9 0.666667 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Load the dataset and display some values\n", "df = pd.read_csv('../data/Train.csv')\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "safe_text 0\n", "label 1\n", "agreement 2\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>tweet_id</th><th>safe_text</th><th>label</th><th>agreement</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>CL1KWCMY</td><td>Me &amp;amp; The Big Homie meanboy3000 #MEANBOY #M...</td><td>0.0</td><td>1.0</td></tr>\n",
" <tr><th>1</th><td>E3303EME</td><td>I'm 100% thinking of devoting my career to pro...</td><td>1.0</td><td>1.0</td></tr>\n",
" <tr><th>2</th><td>M4IVFSMS</td><td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td><td>-1.0</td><td>1.0</td></tr>\n",
" <tr><th>3</th><td>1DR6ROZ4</td><td>I mean if they immunize my kid with something ...</td><td>-1.0</td><td>1.0</td></tr>\n",
" <tr><th>4</th><td>J77ENIIE</td><td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td><td>0.0</td><td>1.0</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " tweet_id safe_text label \\\n", "0 CL1KWCMY Me & The Big Homie meanboy3000 #MEANBOY #M... 0.0 \n", "1 E3303EME I'm 100% thinking of devoting my career to pro... 1.0 \n", "2 M4IVFSMS #whatcausesautism VACCINES, DO NOT VACCINATE Y... -1.0 \n", "3 1DR6ROZ4 I mean if they immunize my kid with something ... -1.0 \n", "4 J77ENIIE Thanks to Catch me performing at La Nui... 0.0 \n", "\n", " agreement \n", "0 1.0 \n", "1 1.0 \n", "2 1.0 \n", "3 1.0 \n", "4 1.0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A way to eliminate rows containing NaN values\n", "df = df[~df.isna().any(axis=1)]\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweet_id 0\n", "safe_text 0\n", "label 0\n", "agreement 0\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we have three classes:\n", "* -1 for a negative sentiment\n", "* 0 for a neutral sentiment\n", "* 1 for a positive sentiment\n", "\n", "Each tweet is a text of varying length. So now we will proceed as usual: preprocessing, TF-IDF and model building." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk import download\n", "\n", "# Download the stopwords corpus; run this once, then you may comment it out\n", "download('stopwords')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /Users/emmanuelkoupoh/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>tweet_id</th><th>safe_text</th><th>label</th><th>agreement</th><th>tokens</th><th>alpha</th><th>stop</th><th>stemmed</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>CL1KWCMY</td><td>Me &amp;amp; The Big Homie meanboy3000 #MEANBOY #M...</td><td>0.0</td><td>1.0</td><td>[Me, &amp;, amp, ;, The, Big, Homie, meanboy3000, ...</td><td>[Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...</td><td>[Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...</td><td>[me, amp, the, big, homi, meanboy, mb, mb, mmr...</td></tr>\n",
" <tr><th>1</th><td>E3303EME</td><td>I'm 100% thinking of devoting my career to pro...</td><td>1.0</td><td>1.0</td><td>[I, 'm, 100, %, thinking, of, devoting, my, ca...</td><td>[I, thinking, of, devoting, my, career, to, pr...</td><td>[I, thinking, devoting, career, proving, autis...</td><td>[i, think, devot, career, prove, autism, caus,...</td></tr>\n",
" <tr><th>2</th><td>M4IVFSMS</td><td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td><td>-1.0</td><td>1.0</td><td>[#, whatcausesautism, VACCINES, ,, DO, NOT, VA...</td><td>[whatcausesautism, VACCINES, DO, NOT, VACCINAT...</td><td>[whatcausesautism, VACCINES, DO, NOT, VACCINAT...</td><td>[whatcausesaut, vaccin, do, not, vaccin, your,...</td></tr>\n",
" <tr><th>3</th><td>1DR6ROZ4</td><td>I mean if they immunize my kid with something ...</td><td>-1.0</td><td>1.0</td><td>[I, mean, if, they, immunize, my, kid, with, s...</td><td>[I, mean, if, they, immunize, my, kid, with, s...</td><td>[I, mean, immunize, kid, something, wo, secret...</td><td>[i, mean, immun, kid, someth, wo, secretli, ki...</td></tr>\n",
" <tr><th>4</th><td>J77ENIIE</td><td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td><td>0.0</td><td>1.0</td><td>[Thanks, to, &lt;, user, &gt;, Catch, me, performing...</td><td>[Thanks, to, user, Catch, me, performing, at, ...</td><td>[Thanks, user, Catch, performing, La, Nuit, NY...</td><td>[thank, user, catch, perform, la, nuit, nyc, s...</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " tweet_id safe_text label \\\n", "0 CL1KWCMY Me & The Big Homie meanboy3000 #MEANBOY #M... 0.0 \n", "1 E3303EME I'm 100% thinking of devoting my career to pro... 1.0 \n", "2 M4IVFSMS #whatcausesautism VACCINES, DO NOT VACCINATE Y... -1.0 \n", "3 1DR6ROZ4 I mean if they immunize my kid with something ... -1.0 \n", "4 J77ENIIE Thanks to Catch me performing at La Nui... 0.0 \n", "\n", " agreement tokens \\\n", "0 1.0 [Me, &, amp, ;, The, Big, Homie, meanboy3000, ... \n", "1 1.0 [I, 'm, 100, %, thinking, of, devoting, my, ca... \n", "2 1.0 [#, whatcausesautism, VACCINES, ,, DO, NOT, VA... \n", "3 1.0 [I, mean, if, they, immunize, my, kid, with, s... \n", "4 1.0 [Thanks, to, <, user, >, Catch, me, performing... \n", "\n", " alpha \\\n", "0 [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M... \n", "1 [I, thinking, of, devoting, my, career, to, pr... \n", "2 [whatcausesautism, VACCINES, DO, NOT, VACCINAT... \n", "3 [I, mean, if, they, immunize, my, kid, with, s... \n", "4 [Thanks, to, user, Catch, me, performing, at, ... \n", "\n", " stop \\\n", "0 [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M... \n", "1 [I, thinking, devoting, career, proving, autis... \n", "2 [whatcausesautism, VACCINES, DO, NOT, VACCINAT... \n", "3 [I, mean, immunize, kid, something, wo, secret... \n", "4 [Thanks, user, Catch, performing, La, Nuit, NY... \n", "\n", " stemmed \n", "0 [me, amp, the, big, homi, meanboy, mb, mb, mmr... \n", "1 [i, think, devot, career, prove, autism, caus,... \n", "2 [whatcausesaut, vaccin, do, not, vaccin, your,... \n", "3 [i, mean, immun, kid, someth, wo, secretli, ki... \n", "4 [thank, user, catch, perform, la, nuit, nyc, s... 
" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.stem import PorterStemmer\n", "\n", "stop = stopwords.words('english')\n", "stemmer = PorterStemmer()\n", "\n", "# Preprocessing: tokenize, keep alphabetic tokens, drop stopwords, then stem\n", "# Note: the stopword list is lowercase, so capitalized tokens such as 'The' are not removed here\n", "df['tokens'] = df['safe_text'].apply(lambda text: word_tokenize(text, preserve_line=True))\n", "df['alpha'] = df['tokens'].apply(lambda x: [item for item in x if item.isalpha()])\n", "df['stop'] = df['alpha'].apply(lambda x: [item for item in x if item not in stop])\n", "df['stemmed'] = df['stop'].apply(lambda x: [stemmer.stem(item) for item in x])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(9999, 8)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# New shape of the DataFrame\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>a</th><th>aa</th><th>aaaaaaaand</th><th>aaasmtg</th><th>aack</th><th>aafpassembl</th><th>aap</th><th>aapglob</th><th>aaronhernandez</th><th>ab</th><th>...</th><th>мне</th><th>написать</th><th>о</th><th>оптимизмом</th><th>с</th><th>смотрю</th><th>стране</th><th>тут</th><th>чем</th><th>病院実習行くのにmmrと水疱瘡の抗体を調べたら</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>1</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>2</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>3</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" <tr><th>4</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>5 rows × 9399 columns</p>\n", "</div>
" ], "text/plain": [ " a aa aaaaaaaand aaasmtg aack aafpassembl aap aapglob \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " aaronhernandez ab ... мне написать о оптимизмом с смотрю \\\n", "0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " стране тут чем 病院実習行くのにmmrと水疱瘡の抗体を調べたら \n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 9399 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Compute the TF-IDF\n", "vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)\n", "tf_idf = vectorizer.fit_transform(df['stemmed']).toarray()\n", "pd.DataFrame(tf_idf, columns=vectorizer.get_feature_names()).head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 0.719\n", "rmse: 0.6931810730249348\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "from sklearn.metrics import accuracy_score, mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "\n", "# Split the data\n", "X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['label'], test_size=0.2, random_state=42)\n", "\n", "# Train the model\n", "lr = LogisticRegression()\n", "lr.fit(X_train, y_train)\n", "\n", "# Predict using the trained model\n", "y_pred_lr = lr.predict(X_test)\n", "\n", "# Estimate some metrics\n", "print('accuracy:', accuracy_score(y_pred_lr, y_test))\n", 
"print('rmse:', mean_squared_error(y_pred_lr, y_test, squared=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We built here a very simplistic model, still reaching an accuracy of about 70%. Feel free to improve this model as an exercise with all your Machine Learning knowledge and experience." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1., 0., 1.])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# All distinct values that have been predicted\n", "np.unique(y_pred_lr)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 0.7225\n", "rmse: 0.7024955515873392\n" ] } ], "source": [ "# Train the model\n", "model = SVC()\n", "model.fit(X_train, y_train)\n", "\n", "# Predict using the trained model\n", "y_pred = model.predict(X_test)\n", "\n", "# Estimate some metrics\n", "print('accuracy:', accuracy_score(y_pred, y_test))\n", "print('rmse:', mean_squared_error(y_pred, y_test, squared=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Diaporama", "kernelspec": { "display_name": "Python 3.8.9 ('venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6 (default, Aug 5 2022, 15:21:02) \n[Clang 14.0.0 (clang-1400.0.29.102)]" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "vscode": { "interpreter": { "hash": 
"1ab24538aa0da4b2d8c48eaca591ff7ffc54671225fb0511b432fd9e26a098ba" } } }, "nbformat": 4, "nbformat_minor": 2 }