{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Natural Language Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://c.files.bbci.co.uk/2A16/production/_115547701_gettyimages-1229654243.jpg)\n",
    "\n",
    "Photo by [GETTY IMAGES]()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Today is about sentiment analysis, and the introduction of the Zindi NLP Project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# I. Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I.1. Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sentiment Analysis is a wide field with a unique purpose: predict the feeling of a person based on features. Those features can be a voice recording or a face picture, bust most of the time in sentiment analysis this is text features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Classic applications of sentiment analysis are the following:\n",
    "* Is this product review positive, neutral or negative?\n",
    "* Based on tweets on a topic, do people react positively?\n",
    "* Is this customer review positive or negative?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So in most cases, sentiment analysis is just a classification.\n",
    "\n",
    "The input features are the input text, and the output targets are the classes to predict.\n",
    "It can be a binary classification (i.e. positive or negative review), or multiclass classification (e.g. 0 to 5 stars note)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I.2. Sentiment Analysis using NLP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sentiment Analysis can be done using NLP. Historically, several methods have been developed.\n",
    "\n",
    "Some basic methods would use the polarity of words:\n",
    "* words like bad, wrong, lame, disgusting suggest negative polarity\n",
    "* words like good, amazing, great, delightful suggest positive polarity\n",
    "\n",
    "Unfortunately, language is more complicated than just polarity: for example \"not bad at all\" would have a negative polarity while actually giving a good review.\n",
    "\n",
    "Modern methods use Machine Learning methods, based on NLP."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To do sentiment analysis, you already know all the tools:\n",
    "* Text preprocessing (tokenization, punctuation, stopwords, stemming/lemmatization, n-grams)\n",
    "* Feature computation (BOW, TF-IDF)\n",
    "* Classification (SVM, logistic regression...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## I.3. Application: Zindi Covid-related Tweets Challenge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see now a short example of sentiment analysis on Zindi Covid-related Tweets Challenge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>safe_text</th>\n",
       "      <th>label</th>\n",
       "      <th>agreement</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CL1KWCMY</td>\n",
       "      <td>Me &amp;amp; The Big Homie meanboy3000 #MEANBOY #M...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>E3303EME</td>\n",
       "      <td>I'm 100% thinking of devoting my career to pro...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>M4IVFSMS</td>\n",
       "      <td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1DR6ROZ4</td>\n",
       "      <td>I mean if they immunize my kid with something ...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>J77ENIIE</td>\n",
       "      <td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>OVNPOAUX</td>\n",
       "      <td>&lt;user&gt; a nearly 67 year old study when mental ...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>JDA2QDV5</td>\n",
       "      <td>Study of more than 95,000 kids finds no link b...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>S6UKR4OJ</td>\n",
       "      <td>psa: VACCINATE YOUR FUCKING KIDS</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>V6IJATBE</td>\n",
       "      <td>Coughing extra on the shuttle and everyone thi...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>VB25IDQK</td>\n",
       "      <td>AIDS vaccine created at Oregon Health &amp;amp; Sc...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.666667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   tweet_id                                          safe_text  label  \\\n",
       "0  CL1KWCMY  Me &amp; The Big Homie meanboy3000 #MEANBOY #M...    0.0   \n",
       "1  E3303EME  I'm 100% thinking of devoting my career to pro...    1.0   \n",
       "2  M4IVFSMS  #whatcausesautism VACCINES, DO NOT VACCINATE Y...   -1.0   \n",
       "3  1DR6ROZ4  I mean if they immunize my kid with something ...   -1.0   \n",
       "4  J77ENIIE  Thanks to <user> Catch me performing at La Nui...    0.0   \n",
       "5  OVNPOAUX  <user> a nearly 67 year old study when mental ...    1.0   \n",
       "6  JDA2QDV5  Study of more than 95,000 kids finds no link b...    1.0   \n",
       "7  S6UKR4OJ                   psa: VACCINATE YOUR FUCKING KIDS    1.0   \n",
       "8  V6IJATBE  Coughing extra on the shuttle and everyone thi...    1.0   \n",
       "9  VB25IDQK  AIDS vaccine created at Oregon Health &amp; Sc...    1.0   \n",
       "\n",
       "   agreement  \n",
       "0   1.000000  \n",
       "1   1.000000  \n",
       "2   1.000000  \n",
       "3   1.000000  \n",
       "4   1.000000  \n",
       "5   0.666667  \n",
       "6   0.666667  \n",
       "7   1.000000  \n",
       "8   0.666667  \n",
       "9   0.666667  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Load the dataset and display some values\n",
    "df = pd.read_csv('../data/Train.csv')\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tweet_id     0\n",
       "safe_text    0\n",
       "label        1\n",
       "agreement    2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>safe_text</th>\n",
       "      <th>label</th>\n",
       "      <th>agreement</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CL1KWCMY</td>\n",
       "      <td>Me &amp;amp; The Big Homie meanboy3000 #MEANBOY #M...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>E3303EME</td>\n",
       "      <td>I'm 100% thinking of devoting my career to pro...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>M4IVFSMS</td>\n",
       "      <td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1DR6ROZ4</td>\n",
       "      <td>I mean if they immunize my kid with something ...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>J77ENIIE</td>\n",
       "      <td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   tweet_id                                          safe_text  label  \\\n",
       "0  CL1KWCMY  Me &amp; The Big Homie meanboy3000 #MEANBOY #M...    0.0   \n",
       "1  E3303EME  I'm 100% thinking of devoting my career to pro...    1.0   \n",
       "2  M4IVFSMS  #whatcausesautism VACCINES, DO NOT VACCINATE Y...   -1.0   \n",
       "3  1DR6ROZ4  I mean if they immunize my kid with something ...   -1.0   \n",
       "4  J77ENIIE  Thanks to <user> Catch me performing at La Nui...    0.0   \n",
       "\n",
       "   agreement  \n",
       "0        1.0  \n",
       "1        1.0  \n",
       "2        1.0  \n",
       "3        1.0  \n",
       "4        1.0  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# A way to eliminate rows containing NaN values\n",
    "df = df[~df.isna().any(axis=1)]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tweet_id     0\n",
       "safe_text    0\n",
       "label        0\n",
       "agreement    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, we have a two classes:\n",
    "* -1 for a negative sentiment\n",
    "* 0 for a neutral sentiment\n",
    "* 1 for a positive sentiment\n",
    "\n",
    "Each review is a text, more or less long. So now we will do as usual: preprocessing, TF-IDF and model building."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk import download\n",
    "\n",
    "# Download stopwords, execute it just once then may comment\n",
    "download('stopwords')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /Users/emmanuelkoupoh/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>safe_text</th>\n",
       "      <th>label</th>\n",
       "      <th>agreement</th>\n",
       "      <th>tokens</th>\n",
       "      <th>alpha</th>\n",
       "      <th>stop</th>\n",
       "      <th>stemmed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CL1KWCMY</td>\n",
       "      <td>Me &amp;amp; The Big Homie meanboy3000 #MEANBOY #M...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>[Me, &amp;, amp, ;, The, Big, Homie, meanboy3000, ...</td>\n",
       "      <td>[Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...</td>\n",
       "      <td>[Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...</td>\n",
       "      <td>[me, amp, the, big, homi, meanboy, mb, mb, mmr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>E3303EME</td>\n",
       "      <td>I'm 100% thinking of devoting my career to pro...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>[I, 'm, 100, %, thinking, of, devoting, my, ca...</td>\n",
       "      <td>[I, thinking, of, devoting, my, career, to, pr...</td>\n",
       "      <td>[I, thinking, devoting, career, proving, autis...</td>\n",
       "      <td>[i, think, devot, career, prove, autism, caus,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>M4IVFSMS</td>\n",
       "      <td>#whatcausesautism VACCINES, DO NOT VACCINATE Y...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>[#, whatcausesautism, VACCINES, ,, DO, NOT, VA...</td>\n",
       "      <td>[whatcausesautism, VACCINES, DO, NOT, VACCINAT...</td>\n",
       "      <td>[whatcausesautism, VACCINES, DO, NOT, VACCINAT...</td>\n",
       "      <td>[whatcausesaut, vaccin, do, not, vaccin, your,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1DR6ROZ4</td>\n",
       "      <td>I mean if they immunize my kid with something ...</td>\n",
       "      <td>-1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>[I, mean, if, they, immunize, my, kid, with, s...</td>\n",
       "      <td>[I, mean, if, they, immunize, my, kid, with, s...</td>\n",
       "      <td>[I, mean, immunize, kid, something, wo, secret...</td>\n",
       "      <td>[i, mean, immun, kid, someth, wo, secretli, ki...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>J77ENIIE</td>\n",
       "      <td>Thanks to &lt;user&gt; Catch me performing at La Nui...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>[Thanks, to, &lt;, user, &gt;, Catch, me, performing...</td>\n",
       "      <td>[Thanks, to, user, Catch, me, performing, at, ...</td>\n",
       "      <td>[Thanks, user, Catch, performing, La, Nuit, NY...</td>\n",
       "      <td>[thank, user, catch, perform, la, nuit, nyc, s...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   tweet_id                                          safe_text  label  \\\n",
       "0  CL1KWCMY  Me &amp; The Big Homie meanboy3000 #MEANBOY #M...    0.0   \n",
       "1  E3303EME  I'm 100% thinking of devoting my career to pro...    1.0   \n",
       "2  M4IVFSMS  #whatcausesautism VACCINES, DO NOT VACCINATE Y...   -1.0   \n",
       "3  1DR6ROZ4  I mean if they immunize my kid with something ...   -1.0   \n",
       "4  J77ENIIE  Thanks to <user> Catch me performing at La Nui...    0.0   \n",
       "\n",
       "   agreement                                             tokens  \\\n",
       "0        1.0  [Me, &, amp, ;, The, Big, Homie, meanboy3000, ...   \n",
       "1        1.0  [I, 'm, 100, %, thinking, of, devoting, my, ca...   \n",
       "2        1.0  [#, whatcausesautism, VACCINES, ,, DO, NOT, VA...   \n",
       "3        1.0  [I, mean, if, they, immunize, my, kid, with, s...   \n",
       "4        1.0  [Thanks, to, <, user, >, Catch, me, performing...   \n",
       "\n",
       "                                               alpha  \\\n",
       "0  [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...   \n",
       "1  [I, thinking, of, devoting, my, career, to, pr...   \n",
       "2  [whatcausesautism, VACCINES, DO, NOT, VACCINAT...   \n",
       "3  [I, mean, if, they, immunize, my, kid, with, s...   \n",
       "4  [Thanks, to, user, Catch, me, performing, at, ...   \n",
       "\n",
       "                                                stop  \\\n",
       "0  [Me, amp, The, Big, Homie, MEANBOY, MB, MBS, M...   \n",
       "1  [I, thinking, devoting, career, proving, autis...   \n",
       "2  [whatcausesautism, VACCINES, DO, NOT, VACCINAT...   \n",
       "3  [I, mean, immunize, kid, something, wo, secret...   \n",
       "4  [Thanks, user, Catch, performing, La, Nuit, NY...   \n",
       "\n",
       "                                             stemmed  \n",
       "0  [me, amp, the, big, homi, meanboy, mb, mb, mmr...  \n",
       "1  [i, think, devot, career, prove, autism, caus,...  \n",
       "2  [whatcausesaut, vaccin, do, not, vaccin, your,...  \n",
       "3  [i, mean, immun, kid, someth, wo, secretli, ki...  \n",
       "4  [thank, user, catch, perform, la, nuit, nyc, s...  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import PorterStemmer\n",
    "\n",
    "stop = stopwords.words('english')\n",
    "stemmer = PorterStemmer()\n",
    "\n",
    "# Perform preprocessing\n",
    "df['tokens'] = df['safe_text'].apply(lambda df: word_tokenize(df, preserve_line=True))\n",
    "df['alpha'] = df['tokens'].apply(lambda x: [item for item in x if item.isalpha()])\n",
    "df['stop'] = df['alpha'].apply(lambda x: [item for item in x if item not in stop])\n",
    "df['stemmed'] = df['stop'].apply(lambda x: [stemmer.stem(item) for item in x])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(9999, 8)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# New dimension of the DataFrame\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>aa</th>\n",
       "      <th>aaaaaaaand</th>\n",
       "      <th>aaasmtg</th>\n",
       "      <th>aack</th>\n",
       "      <th>aafpassembl</th>\n",
       "      <th>aap</th>\n",
       "      <th>aapglob</th>\n",
       "      <th>aaronhernandez</th>\n",
       "      <th>ab</th>\n",
       "      <th>...</th>\n",
       "      <th>мне</th>\n",
       "      <th>написать</th>\n",
       "      <th>о</th>\n",
       "      <th>оптимизмом</th>\n",
       "      <th>с</th>\n",
       "      <th>смотрю</th>\n",
       "      <th>стране</th>\n",
       "      <th>тут</th>\n",
       "      <th>чем</th>\n",
       "      <th>病院実習行くのにmmrと水疱瘡の抗体を調べたら</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 9399 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     a   aa  aaaaaaaand  aaasmtg  aack  aafpassembl  aap  aapglob  \\\n",
       "0  0.0  0.0         0.0      0.0   0.0          0.0  0.0      0.0   \n",
       "1  0.0  0.0         0.0      0.0   0.0          0.0  0.0      0.0   \n",
       "2  0.0  0.0         0.0      0.0   0.0          0.0  0.0      0.0   \n",
       "3  0.0  0.0         0.0      0.0   0.0          0.0  0.0      0.0   \n",
       "4  0.0  0.0         0.0      0.0   0.0          0.0  0.0      0.0   \n",
       "\n",
       "   aaronhernandez   ab  ...  мне  написать    о  оптимизмом    с  смотрю  \\\n",
       "0             0.0  0.0  ...  0.0       0.0  0.0         0.0  0.0     0.0   \n",
       "1             0.0  0.0  ...  0.0       0.0  0.0         0.0  0.0     0.0   \n",
       "2             0.0  0.0  ...  0.0       0.0  0.0         0.0  0.0     0.0   \n",
       "3             0.0  0.0  ...  0.0       0.0  0.0         0.0  0.0     0.0   \n",
       "4             0.0  0.0  ...  0.0       0.0  0.0         0.0  0.0     0.0   \n",
       "\n",
       "   стране  тут  чем  病院実習行くのにmmrと水疱瘡の抗体を調べたら  \n",
       "0     0.0  0.0  0.0                      0.0  \n",
       "1     0.0  0.0  0.0                      0.0  \n",
       "2     0.0  0.0  0.0                      0.0  \n",
       "3     0.0  0.0  0.0                      0.0  \n",
       "4     0.0  0.0  0.0                      0.0  \n",
       "\n",
       "[5 rows x 9399 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# Compute the TF-IDF\n",
    "vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)\n",
    "tf_idf = vectorizer.fit_transform(df['stemmed']).toarray()\n",
    "pd.DataFrame(tf_idf, columns=vectorizer.get_feature_names()).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy: 0.719\n",
      "rmse: 0.6931810730249348\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.metrics import accuracy_score, mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Split the data\n",
    "X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['label'], test_size=0.2, random_state=42)\n",
    "\n",
    "# Train the model\n",
    "lr = LogisticRegression()\n",
    "lr.fit(X_train, y_train)\n",
    "\n",
    "# Predict using the trained model\n",
    "y_pred_lr = lr.predict(X_test)\n",
    "\n",
    "# Estimate some metrics\n",
    "print('accuracy:', accuracy_score(y_pred_lr, y_test))\n",
    "print('rmse:', mean_squared_error(y_pred_lr, y_test, squared=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We built here a very simplistic model, still reaching an accuracy of about 70%. Feel free to improve this model as an exercise with all your Machine Learning knowledge and experience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-1.,  0.,  1.])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# All distinct values that have been predicted\n",
    "np.unique(y_pred_lr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy: 0.7225\n",
      "rmse: 0.7024955515873392\n"
     ]
    }
   ],
   "source": [
    "# Train the model\n",
    "model = SVC()\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Predict using the trained model\n",
    "y_pred = model.predict(X_test)\n",
    "\n",
    "# Estimate some metrics\n",
    "print('accuracy:', accuracy_score(y_pred, y_test))\n",
    "print('rmse:', mean_squared_error(y_pred, y_test, squared=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Diaporama",
  "kernelspec": {
   "display_name": "Python 3.8.9 ('venv': venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6 (default, Aug  5 2022, 15:21:02) \n[Clang 14.0.0 (clang-1400.0.29.102)]"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "1ab24538aa0da4b2d8c48eaca591ff7ffc54671225fb0511b432fd9e26a098ba"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}