{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [], "machine_shape": "hm" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **Understanding Named Entity Recognition Data**\n" ], "metadata": { "id": "nqrTKIyfYRRa" } }, { "cell_type": "markdown", "source": [ "# **Objective**\n", "\n", "The objective of this notebook is to understand the NER dataset better and extract meaningful information from it. To achieve this, we follow an Exploratory Data Analysis (EDA) procedure.\n", "\n", "The main sections of this notebook are organized as follows:\n", "\n", "- Load the NER data from Kaggle.\n", "- Observations about the whole dataset.\n", "- Select the relevant columns.\n", "- Identify the unique entity tags in the dataset.\n", "- Data cleansing.\n", "- The distribution of top unigrams after removing stop words.\n", "- The distribution of top bigrams after removing stop words.\n", "- Conclusion\n" ], "metadata": { "id": "KJbZYeNbYfkt" } }, { "cell_type": "markdown", "source": [ "# Imports and Setup" ], "metadata": { "id": "GX1Gm0sVTU4O" } }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "qIFLx0_wimTB" }, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option('max_colwidth',150)\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from datetime import datetime as dt\n", "from string import punctuation\n", "import re\n", "import os\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from IPython.core.interactiveshell import InteractiveShell\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "source": [ "# Download the Datasets" ], "metadata": { "id": "QqvaLRjVjIj3" } }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "T8aMPXC7t_VX" }, "outputs": [], "source": [ "pathdir = \"/content/data\"" ] }, { "cell_type": "code", "execution_count": 
3, "metadata": { "id": "x3CI3PtUp2lW" }, "outputs": [], "source": [ "def download_dataset():\n", "    # Download only when the file is not already in the data folder\n", "    if not os.path.isfile(os.path.join(pathdir, 'ner.csv')):\n", "\n", "        # Downloading Annotated Corpus for Named Entity Recognition dataset\n", "        !gdown https://drive.google.com/uc?id=13y8JNgL5TQ4x-yufpBOv3QBsEiE051sE\n", "\n", "        if not os.path.exists(pathdir):\n", "            # Make a data folder to store the data\n", "            !mkdir data\n", "\n", "        !mv /content/ner.csv ./data\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "zS6WbHz8wHzu" }, "outputs": [], "source": [ "download_dataset()" ] }, { "cell_type": "markdown", "source": [ "# Load Data" ], "metadata": { "id": "liJiX3Xf2hQh" } }, { "cell_type": "code", "source": [ "# Specify the path to the data file\n", "\n", "filepath = '/content/data/ner.csv'\n", "data = pd.read_csv(filepath, encoding = \"latin1\", on_bad_lines='skip')\n" ], "metadata": { "id": "LMwtt2rJnNhB" }, "execution_count": 5, "outputs": [] }, { "cell_type": "code", "source": [ "# Verify that the data is loaded correctly\n", "data.head().T" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 834 }, "id": "g4VoxOSnnOs9", "outputId": "1c39d739-e530-48c5-e995-301fa5859baf" }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " 0 1 2 3 \\\n", "Unnamed: 0 0 1 2 3 \n", "lemma thousand of demonstr have \n", "next-lemma of demonstr have march \n", "next-next-lemma demonstr have march through \n", "next-next-pos NNS VBP VBN IN \n", "next-next-shape lowercase lowercase lowercase lowercase \n", "next-next-word demonstrators have marched through \n", "next-pos IN NNS VBP VBN \n", "next-shape lowercase lowercase lowercase lowercase \n", "next-word of demonstrators have marched \n", "pos NNS IN NNS VBP \n", "prev-iob __START1__ O O O \n", "prev-lemma __start1__ thousand of demonstr \n", "prev-pos __START1__ NNS IN NNS \n", "prev-prev-iob __START2__ __START1__ O O \n", "prev-prev-lemma __start2__ 
__start1__ thousand of \n", "prev-prev-pos __START2__ __START1__ NNS IN \n", "prev-prev-shape wildcard wildcard capitalized lowercase \n", "prev-prev-word __START2__ __START1__ Thousands of \n", "prev-shape wildcard capitalized lowercase lowercase \n", "prev-word __START1__ Thousands of demonstrators \n", "sentence_idx 1.0 1.0 1.0 1.0 \n", "shape capitalized lowercase lowercase lowercase \n", "word Thousands of demonstrators have \n", "tag O O O O \n", "\n", " 4 \n", "Unnamed: 0 4 \n", "lemma march \n", "next-lemma through \n", "next-next-lemma london \n", "next-next-pos NNP \n", "next-next-shape capitalized \n", "next-next-word London \n", "next-pos IN \n", "next-shape lowercase \n", "next-word through \n", "pos VBN \n", "prev-iob O \n", "prev-lemma have \n", "prev-pos VBP \n", "prev-prev-iob O \n", "prev-prev-lemma demonstr \n", "prev-prev-pos NNS \n", "prev-prev-shape lowercase \n", "prev-prev-word demonstrators \n", "prev-shape lowercase \n", "prev-word have \n", "sentence_idx 1.0 \n", "shape lowercase \n", "word marched \n", "tag O " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "source": [ "# In total, the data has 1050795 rows and 25 columns\n", "data.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iJZa9dP1vGeN", "outputId": "0f3773db-e348-4886-9393-cf550ac30d62" }, "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(1050795, 25)" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "code", "source": [ "data.info()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XwYxq7Wqx8QH", "outputId": "49f95da9-57cb-44b8-bff6-7b2c54388815" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 1050795 entries, 0 to 1050794\n", "Data columns (total 25 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Unnamed: 0 1050795 non-null int64 \n", " 1 lemma 1050795 non-null object \n", " 2 next-lemma 1050795 non-null object \n", " 3 next-next-lemma 1050795 non-null object \n", " 4 next-next-pos 1050795 non-null object \n", " 5 next-next-shape 1050795 non-null object \n", " 6 next-next-word 1050795 non-null object \n", " 7 next-pos 1050795 non-null object \n", " 8 next-shape 1050794 non-null object \n", " 9 next-word 1050794 non-null object \n", " 10 pos 1050794 non-null object \n", " 11 prev-iob 1050794 non-null object \n", " 12 prev-lemma 1050794 non-null object \n", " 13 prev-pos 1050794 non-null object \n", " 14 prev-prev-iob 1050794 non-null object \n", " 15 prev-prev-lemma 1050794 non-null object \n", " 16 prev-prev-pos 1050794 non-null object \n", " 17 prev-prev-shape 1050794 non-null object \n", " 18 prev-prev-word 1050794 non-null object \n", " 19 prev-shape 1050794 non-null object \n", " 20 prev-word 1050794 non-null object \n", " 21 sentence_idx 1050794 non-null float64\n", " 22 shape 1050794 non-null object \n", " 23 word 1050794 non-null object \n", " 24 
tag 1050794 non-null object \n", "dtypes: float64(1), int64(1), object(23)\n", "memory usage: 200.4+ MB\n" ] } ] }, { "cell_type": "code", "source": [ "data.isnull().sum()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "o6obun2r48jC", "outputId": "f72d6bc6-ac49-4eff-e37b-996716cfcf73" }, "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Unnamed: 0 0\n", "lemma 0\n", "next-lemma 0\n", "next-next-lemma 0\n", "next-next-pos 0\n", "next-next-shape 0\n", "next-next-word 0\n", "next-pos 0\n", "next-shape 1\n", "next-word 1\n", "pos 1\n", "prev-iob 1\n", "prev-lemma 1\n", "prev-pos 1\n", "prev-prev-iob 1\n", "prev-prev-lemma 1\n", "prev-prev-pos 1\n", "prev-prev-shape 1\n", "prev-prev-word 1\n", "prev-shape 1\n", "prev-word 1\n", "sentence_idx 1\n", "shape 1\n", "word 1\n", "tag 1\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "source": [ "# Observations about the whole data\n", "\n", "- The data has 25 columns and 1050795 rows.\n", "- 17 of the columns contain a null value, one missing entry each. 
\n", "- The column data types are int64 (1), float64 (1), and object (23).\n", "\n" ], "metadata": { "id": "EzYWiTEN5tnh" } }, { "cell_type": "markdown", "source": [ "# Select only the sentence_idx, word and tag columns" ], "metadata": { "id": "B9QsrxPE0SPS" } }, { "cell_type": "code", "source": [ "ner_data = data[['sentence_idx', 'word', 'tag']]" ], "metadata": { "id": "dWK0fXlR0jek" }, "execution_count": 10, "outputs": [] }, { "cell_type": "code", "source": [ "ner_data.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jatQyuv654PV", "outputId": "7f497c40-a0fa-41b7-e8b8-a698ca828544" }, "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(1050795, 3)" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "code", "source": [ "ner_data.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "FerXPCTA59DG", "outputId": "0ba0a7e8-8eec-4d32-a439-0d4d475519ac" }, "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sentence_idx word tag\n", "0 1.0 Thousands O\n", "1 1.0 of O\n", "2 1.0 demonstrators O\n", "3 1.0 have O\n", "4 1.0 marched O" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "code", "source": [ "ner_data.isnull().sum()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "R1R7mjz91LgG", "outputId": "c46556ce-eed8-4f71-de8b-24726ec00480" }, "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "sentence_idx 1\n", "word 1\n", "tag 1\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "code", "source": [ "# Drop the rows that contain null values\n", "ner_data = ner_data.dropna()" ], "metadata": { "id": "2MQUCtH71R3Y" }, "execution_count": 14, "outputs": [] }, { "cell_type": "code", "source": [ "# Total number of unique sentences\n", "len(ner_data['sentence_idx'].unique())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d9yp68G95lYQ", "outputId": "b918a22c-b93c-4f24-a563-2d691ca4a642" }, "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "35177" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "code", "source": [ "# Total number of unique words\n", "len(ner_data['word'].unique())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_RDZ0EwW2Kzo", "outputId": "e8965f1a-7cc3-4355-87e5-afdf1966b0ef" }, "execution_count": 16, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "30172" ] }, "metadata": {}, "execution_count": 16 } ] }, { "cell_type": "code", "source": [ "# Total number of unique tags\n", "len(ner_data['tag'].unique())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CYF3NaEo2ZCl", "outputId": "d98ccd59-7eba-442f-93f5-b3ba18ff4441" }, "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "17" ] }, "metadata": {}, "execution_count": 17 } ] }, { "cell_type": "code", "source": [ "# Tag counts without the first (most frequent) entry, the O tag\n", "ner_data['tag'].value_counts(dropna=False)[1:]" ], "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "id": "FsJsZvUdAqwf", "outputId": "ae3412f0-f4b8-4edf-a296-5928456f41f8" }, "execution_count": 18, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "B-geo 37525\n", "B-tim 20193\n", "B-org 20184\n", "I-per 17382\n", "B-per 17011\n", "I-org 16537\n", "B-gpe 16392\n", "I-geo 7409\n", "I-tim 6298\n", "B-art 434\n", "B-eve 348\n", "I-eve 297\n", "I-art 280\n", "I-gpe 229\n", "B-nat 226\n", "I-nat 76\n", "Name: tag, dtype: int64" ] }, "metadata": {}, "execution_count": 18 } ] }, { "cell_type": "markdown", "source": [ "## Meaning of BIO Taggers\n", "- The IOB format (short for inside, outside, beginning), also commonly referred to as the BIO format, is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. named-entity recognition).\n", "\n", " - B represent Beginning of an entity\n", " - I represent Inside an entity\n", " - O represent Outside entity\n", "\n", "## Essential info about entities in the datasets:\n", "\n", " geo = Geographical Entity\n", " org = Organization\n", " per = Person\n", " gpe = Geopolitical Entity\n", " tim = Time indicator\n", " art = Artifact\n", " eve = Event\n", " nat = Natural Phenomenon\n" ], "metadata": { "id": "Buh6FDMCMeLN" } }, { "cell_type": "markdown", "source": [ "## Observation about the data\n", "\n", "- The data has totally 35177 sentences\n", "- The data has totally 30172 unique words\n", "- The data has totally 17 unique tags. 
The tag names and their total count values are:\n", " O 889973\n", " B-geo 37525\n", " B-tim 20193\n", " B-org 20184\n", " I-per 17382\n", " B-per 17011\n", " I-org 16537\n", " B-gpe 16392\n", " I-geo 7409\n", " I-tim 6298\n", " B-art 434\n", " B-eve 348\n", " I-eve 297\n", " I-art 280\n", " I-gpe 229\n", " B-nat 226\n", " I-nat 76\n" ], "metadata": { "id": "cdjbymGQqHUs" } }, { "cell_type": "code", "source": [ "plt.figure(figsize=(12,6))\n", "publication_plot = sns.countplot(\n", " data=ner_data,\n", " x='tag',\n", " palette='Set1',\n", " order = ner_data['tag'].value_counts()[1:].index\n", ")\n", "\n", "plt.xticks(\n", " rotation=45, \n", " horizontalalignment='right',\n", " fontweight='light',\n", " fontsize='x-large' \n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 453 }, "id": "NseIit5Cyiuz", "outputId": "aa05bafb-0cc9-4516-9932-b8aa3dfd5e40" }, "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]),\n", " )" ] }, "metadata": {}, "execution_count": 19 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "# observation \n", "- As we can see from the above chart,the classes are unbalanced. Geographical entity, time indicator, organizations and persons are heavily represented." ], "metadata": { "id": "kGxYry7F1rtz" } }, { "cell_type": "code", "source": [ "b=[]\n", "for i in range(5):\n", " a = data[data['sentence_idx'] == i+1]['word']\n", " b.append(' '.join(a))\n", "b[0].split('.')\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "D5ERpStz-R_1", "outputId": "afc5fa91-dd9e-45cd-c81e-c6d6ae367fbe" }, "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country ',\n", " ' Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country ',\n", " '']" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "code", "source": [ "# concat words and build sentences\n", "sentences=[]\n", "def concat_words(df):\n", " for i in df['sentence_idx'].unique():\n", " sent = df[df['sentence_idx'] == i]['word']\n", " sentences.append(' '.join(sent))\n", "\n", " return sentences\n", "sentences = concat_words(ner_data)\n" ], "metadata": { "id": "Nls2VWIB-TBl" }, "execution_count": 21, "outputs": [] }, { "cell_type": "code", "source": [ "len(sentences)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dg4QOS6lGGZy", "outputId": "08fffd9f-0b0f-48f5-80cc-a420c6515ea1" }, "execution_count": 22, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "35177" ] }, "metadata": {}, "execution_count": 22 } ] }, { "cell_type": "code", "source": [ "sentences[0:3]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": 
"lW2ahsK1Ghw2", "outputId": "7b6a207f-908a-44d1-a29d-d14917e9d930" }, "execution_count": 23, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country . Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .',\n", " 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \" Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \"',\n", " 'They marched from the Houses of Parliament to a rally in Hyde Park . They marched from the Houses of Parliament to a rally in Hyde Park .']" ] }, "metadata": {}, "execution_count": 23 } ] }, { "cell_type": "code", "source": [ "# convert into dataframe\n", "df_sentences = pd.DataFrame(sentences)\n", "df_sentences.rename(columns={0:'sentences'},inplace=True)\n", "df_sentences.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Ml80ma0PG7YY", "outputId": "7bd1201a-56d4-4111-b7dd-03b55abfd2e9" }, "execution_count": 24, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " sentences\n", "0 Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country . ...\n", "1 Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Sto...\n", "2 They marched from the Houses of Parliament to a rally in Hyde Park . 
They marched from the Houses of Parliament to a rally in Hyde Park .\n", "3 Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 . Police put the number of marchers at 10,000 while organizer...\n", "4 The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton . The pr..." ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 24 } ] }, { "cell_type": "markdown", "source": [ "## Data cleansing" ], "metadata": { "id": "11pAAgaBDtfe" } }, { "cell_type": "code", "source": [ "def remove_special_char(df):\n", "    # All punctuation marks except '.' and '?'\n", "    special_char = list(punctuation)\n", "    for e in ['.','?']:\n", "        special_char.remove(e)\n", "    # NOTE: str.replace matches the entries below literally and case-sensitively,\n", "    # so the two whitespace patterns are not applied as regular expressions and\n", "    # capitalized forms such as 'Mr' are left in place.\n", "    special_char.append(\"\\n+\")\n", "    special_char.append(\"\\s+\")\n", "    special_char.append(\"said\")\n", "    special_char.append(\"says\")\n", "    special_char.append(\"say\")\n", "    special_char.append(\"mr\")\n", "\n", "    def deep_clean(sentence):\n", "        sentence = str(sentence)\n", "        sentence =sentence.strip()\n", "        # Strip anything that looks like an HTML tag\n", "        sentence = re.sub('<[^>]*>', '', sentence)\n", "        for char in special_char:\n", "            sentence = sentence.replace(char, '')\n", "        return sentence\n", "\n", "    df['sentences'] = df['sentences'].apply(deep_clean)\n", "    return df" ], "metadata": { "id": "MeGBdMriVbb2" }, "execution_count": 25, "outputs": [] }, { "cell_type": "code", "source": [ "df_sentences = remove_special_char(df_sentences)" ], "metadata": { "id": "ndoFFfu5tkq8" }, "execution_count": 26, "outputs": [] }, { "cell_type": "markdown", "source": [ "## The distribution of word count in the sentences" ], "metadata": { "id": "kC6tPSqyMxY9" } }, { "cell_type": "code", "source": [ "df_sentences['word_count'] = df_sentences['sentences'].apply(lambda x: len(x.split()))" ], "metadata": { "id": "Ncq5ZSsYM5Wo" }, "execution_count": 27, "outputs": [] }, { "cell_type": "code", "source": [ "df_sentences['word_count'].describe([0.1,0.25,0.5,0.75,0.95])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eABFHtEiNiU1", "outputId": "9e3a31fd-7390-43b0-d039-3bff5fc65738" }, "execution_count": 28, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "count 35177.000000\n", "mean 28.302499\n", "std 14.566342\n", "min 1.000000\n", "10% 13.000000\n", "25% 18.000000\n", "50% 25.000000\n", "75% 36.000000\n", "95% 58.000000\n", "max 130.000000\n", "Name: 
word_count, dtype: float64" ] }, "metadata": {}, "execution_count": 28 } ] }, { "cell_type": "code", "source": [ "df_sentences[df_sentences['word_count']<6]['sentences'].count()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "74Wk77BON7gA", "outputId": "908f5896-9eb3-4db6-fd42-5bf83ee86c1b" }, "execution_count": 29, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "101" ] }, "metadata": {}, "execution_count": 29 } ] }, { "cell_type": "code", "source": [ "df_sentences[df_sentences['word_count']<6]['sentences'].head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uf81OifmNyrt", "outputId": "a24add60-2ea7-4910-b29e-64f91803ff45" }, "execution_count": 30, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1594 John Garang John Garang\n", "2491 IRAQPOVERTY Washington IRAQPOVERTY Washington \n", "4809 Janice Karpinski Janice Karpinski\n", "8411 The The\n", "12943 The assassination occurred Tuesday .\n", "Name: sentences, dtype: object" ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "code", "source": [ "df_sentences[df_sentences['word_count']>100]['sentences'].count()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "S09nSgudRbLX", "outputId": "83ded79d-2d86-4c77-ce83-0b5fa82367de" }, "execution_count": 31, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "12" ] }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "code", "source": [ "sns.histplot(df_sentences['word_count'],\n", " bins=10)\n", "\n", "\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 297 }, "id": "9F2gx419NGqZ", "outputId": "d051663f-d64f-47b8-a5ac-c24bdf964d4f" }, "execution_count": 32, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 32 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "- The word count of the sentence is between 1 and 130. \n", "- 75% of the data word count is 36. \n", "- There are 100 sentences their word count is less than 6 and 13 sentences\n", "their word count is above 100." ], "metadata": { "id": "xR6u2QloSQ8O" } }, { "cell_type": "markdown", "source": [ "## The distribution of top unigrams after removing stop words\n" ], "metadata": { "id": "pV2AVfGTc7dy" } }, { "cell_type": "code", "source": [ "def get_top_n_words(corpus, n=None, language=None):\n", " if language=='english':\n", " vec = CountVectorizer(stop_words = 'english').fit(corpus)\n", " else:\n", " vec = CountVectorizer().fit(corpus)\n", "\n", " bag_of_words = vec.transform(corpus)\n", " sum_words = bag_of_words.sum(axis=0) \n", " words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]\n", " words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)\n", " return words_freq[:n]\n" ], "metadata": { "id": "AyF-kwr-TXyM" }, "execution_count": 33, "outputs": [] }, { "cell_type": "code", "source": [ "common_words = get_top_n_words(df_sentences['sentences'], 20, 'english')\n", "for word, freq in common_words:\n", " print(word, freq)\n", "df1 = pd.DataFrame(common_words, columns = ['Word' , 'count'])\n", "df1.groupby('Word').sum()['count'].sort_values(ascending=False).plot.bar()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 695 }, "id": "MYRCv0MiXKEC", "outputId": "ac18bb44-8436-41dd-9cf1-2dc1f6d63e2b" }, "execution_count": 34, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "officials 3396\n", "president 3335\n", "mr 3106\n", "government 3015\n", "killed 2892\n", "people 2821\n", "new 2123\n", "united 2091\n", "military 2026\n", "country 1962\n", "police 1930\n", "minister 1836\n", "iraq 1820\n", "security 1683\n", "states 1546\n", "year 1494\n", "tuesday 1384\n", "group 
1382\n", "forces 1337\n", "world 1333\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 34 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "## The distribution of top bigrams after removing stop words\n", "\n", "\n" ], "metadata": { "id": "FCniAY7Qhp3O" } }, { "cell_type": "code", "source": [ "def get_top_n_bigram(corpus, n=None):\n", " vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)\n", " bag_of_words = vec.transform(corpus)\n", " sum_words = bag_of_words.sum(axis=0) \n", " words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]\n", " words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)\n", " return words_freq[:n]\n" ], "metadata": { "id": "QFwx-V00gzfh" }, "execution_count": 35, "outputs": [] }, { "cell_type": "code", "source": [ "common_words = get_top_n_bigram(df_sentences['sentences'], 25)\n", "for word, freq in common_words:\n", " print(word, freq)\n", "df1 = pd.DataFrame(common_words, columns = ['Word' , 'count'])\n", "df1.groupby('Word').sum()['count'].sort_values(ascending=False).plot.bar()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 806 }, "id": "oMMy6ktQiUSO", "outputId": "ed5c40d1-81ff-452d-982c-b8b39caa73d7" }, "execution_count": 36, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "united states 1385\n", "prime minister 1082\n", "united nations 589\n", "president bush 500\n", "bird flu 415\n", "human rights 409\n", "european union 350\n", "news agency 323\n", "north korea 317\n", "mr bush 310\n", "security council 308\n", "security forces 302\n", "white house 293\n", "gaza strip 276\n", "foreign minister 266\n", "people killed 252\n", "new york 246\n", "west bank 242\n", "nuclear weapons 238\n", "nuclear program 206\n", "militant group 190\n", "middle east 185\n", "secretary state 185\n", "roadside bomb 181\n", "foreign ministry 179\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 36 }, { 
"output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": { "needs_background": "light" } } ] }, { "cell_type": "markdown", "source": [ "## Conclusions:\n", "\n", "- The above explanatory data analysis proof that entity classes disirbutions are unbalanced. Geographical entity, time indicator, organizations and persons are heavily represented.\n", "\n", "- The EDA shows that there are sentences with few numbers of word counts, these needs to be cleaned as these articles might not have a complete sentence.\n", "\n", "- Some of sentences are duplicated more than one." ], "metadata": { "id": "y3rvWHwgujZj" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "\n", "## Acknowledgements\n", "\n", "- The code get_top_n_bigram function is adapted from [towardsdatascience](https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a)" ], "metadata": { "id": "lw1wqV8C95xi" } } ] }