{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"collapsed_sections": [],
"machine_shape": "hm"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# **Understanding Named Entity Recognition Data**\n"
],
"metadata": {
"id": "nqrTKIyfYRRa"
}
},
{
"cell_type": "markdown",
"source": [
"# **Objective**\n",
"\n",
"The objective of this notebook is to understand the NER dataset better and extract meaningful information from it. To achieve this, we follow an Exploratory Data Analysis (EDA) procedure.\n",
"\n",
"The main sections of this notebook are organized as follows:\n",
"\n",
"- Load the NER data from Kaggle.\n",
"- Observations about the whole dataset.\n",
"- Select the relevant columns.\n",
"- Identify the unique entity tags in the dataset.\n",
"- Data cleansing.\n",
"- The distribution of top unigrams after removing stop words.\n",
"- The distribution of top bigrams after removing stop words.\n",
"- Conclusion\n"
],
"metadata": {
"id": "KJbZYeNbYfkt"
}
},
{
"cell_type": "markdown",
"source": [
"# Imports and Setup"
],
"metadata": {
"id": "GX1Gm0sVTU4O"
}
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "qIFLx0_wimTB"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.set_option('max_colwidth',150)\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime as dt\n",
"from string import punctuation\n",
"import re\n",
"import os\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from IPython.core.interactiveshell import InteractiveShell\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"source": [
"# Download the Datasets"
],
"metadata": {
"id": "QqvaLRjVjIj3"
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "T8aMPXC7t_VX"
},
"outputs": [],
"source": [
"pathdir = \"/content/data\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "x3CI3PtUp2lW"
},
"outputs": [],
"source": [
"def download_dataset():\n",
"    # Skip the download if the dataset is already in the data folder\n",
"    if not os.path.isfile(os.path.join(pathdir, 'ner.csv')):\n",
"\n",
"        # Download the Annotated Corpus for Named Entity Recognition dataset\n",
"        !gdown https://drive.google.com/uc?id=13y8JNgL5TQ4x-yufpBOv3QBsEiE051sE\n",
"\n",
"        if not os.path.exists(pathdir):\n",
"            # Make a data folder to store the data\n",
"            !mkdir data\n",
"\n",
"        # Move the downloaded file into the data folder\n",
"        !mv /content/ner.csv ./data\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "zS6WbHz8wHzu"
},
"outputs": [],
"source": [
"download_dataset()"
]
},
{
"cell_type": "markdown",
"source": [
"# Load Data"
],
"metadata": {
"id": "liJiX3Xf2hQh"
}
},
{
"cell_type": "code",
"source": [
"#specify the path to data location\n",
"\n",
"filepath = '/content/data/ner.csv'\n",
"data = pd.read_csv(filepath, encoding = \"latin1\", on_bad_lines='skip')\n"
],
"metadata": {
"id": "LMwtt2rJnNhB"
},
"execution_count": 5,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#Verify that the data is loaded correctly\n",
"data.head().T"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 834
},
"id": "g4VoxOSnnOs9",
"outputId": "1c39d739-e530-48c5-e995-301fa5859baf"
},
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 0 1 2 3 \\\n",
"Unnamed: 0 0 1 2 3 \n",
"lemma thousand of demonstr have \n",
"next-lemma of demonstr have march \n",
"next-next-lemma demonstr have march through \n",
"next-next-pos NNS VBP VBN IN \n",
"next-next-shape lowercase lowercase lowercase lowercase \n",
"next-next-word demonstrators have marched through \n",
"next-pos IN NNS VBP VBN \n",
"next-shape lowercase lowercase lowercase lowercase \n",
"next-word of demonstrators have marched \n",
"pos NNS IN NNS VBP \n",
"prev-iob __START1__ O O O \n",
"prev-lemma __start1__ thousand of demonstr \n",
"prev-pos __START1__ NNS IN NNS \n",
"prev-prev-iob __START2__ __START1__ O O \n",
"prev-prev-lemma __start2__ __start1__ thousand of \n",
"prev-prev-pos __START2__ __START1__ NNS IN \n",
"prev-prev-shape wildcard wildcard capitalized lowercase \n",
"prev-prev-word __START2__ __START1__ Thousands of \n",
"prev-shape wildcard capitalized lowercase lowercase \n",
"prev-word __START1__ Thousands of demonstrators \n",
"sentence_idx 1.0 1.0 1.0 1.0 \n",
"shape capitalized lowercase lowercase lowercase \n",
"word Thousands of demonstrators have \n",
"tag O O O O \n",
"\n",
" 4 \n",
"Unnamed: 0 4 \n",
"lemma march \n",
"next-lemma through \n",
"next-next-lemma london \n",
"next-next-pos NNP \n",
"next-next-shape capitalized \n",
"next-next-word London \n",
"next-pos IN \n",
"next-shape lowercase \n",
"next-word through \n",
"pos VBN \n",
"prev-iob O \n",
"prev-lemma have \n",
"prev-pos VBP \n",
"prev-prev-iob O \n",
"prev-prev-lemma demonstr \n",
"prev-prev-pos NNS \n",
"prev-prev-shape lowercase \n",
"prev-prev-word demonstrators \n",
"prev-shape lowercase \n",
"prev-word have \n",
"sentence_idx 1.0 \n",
"shape lowercase \n",
"word marched \n",
"tag O "
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"source": [
"# In total, the data has 1,050,795 rows and 25 columns\n",
"data.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iJZa9dP1vGeN",
"outputId": "0f3773db-e348-4886-9393-cf550ac30d62"
},
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1050795, 25)"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "code",
"source": [
"data.info()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XwYxq7Wqx8QH",
"outputId": "49f95da9-57cb-44b8-bff6-7b2c54388815"
},
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1050795 entries, 0 to 1050794\n",
"Data columns (total 25 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Unnamed: 0 1050795 non-null int64 \n",
" 1 lemma 1050795 non-null object \n",
" 2 next-lemma 1050795 non-null object \n",
" 3 next-next-lemma 1050795 non-null object \n",
" 4 next-next-pos 1050795 non-null object \n",
" 5 next-next-shape 1050795 non-null object \n",
" 6 next-next-word 1050795 non-null object \n",
" 7 next-pos 1050795 non-null object \n",
" 8 next-shape 1050794 non-null object \n",
" 9 next-word 1050794 non-null object \n",
" 10 pos 1050794 non-null object \n",
" 11 prev-iob 1050794 non-null object \n",
" 12 prev-lemma 1050794 non-null object \n",
" 13 prev-pos 1050794 non-null object \n",
" 14 prev-prev-iob 1050794 non-null object \n",
" 15 prev-prev-lemma 1050794 non-null object \n",
" 16 prev-prev-pos 1050794 non-null object \n",
" 17 prev-prev-shape 1050794 non-null object \n",
" 18 prev-prev-word 1050794 non-null object \n",
" 19 prev-shape 1050794 non-null object \n",
" 20 prev-word 1050794 non-null object \n",
" 21 sentence_idx 1050794 non-null float64\n",
" 22 shape 1050794 non-null object \n",
" 23 word 1050794 non-null object \n",
" 24 tag 1050794 non-null object \n",
"dtypes: float64(1), int64(1), object(23)\n",
"memory usage: 200.4+ MB\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"data.isnull().sum()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "o6obun2r48jC",
"outputId": "f72d6bc6-ac49-4eff-e37b-996716cfcf73"
},
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Unnamed: 0 0\n",
"lemma 0\n",
"next-lemma 0\n",
"next-next-lemma 0\n",
"next-next-pos 0\n",
"next-next-shape 0\n",
"next-next-word 0\n",
"next-pos 0\n",
"next-shape 1\n",
"next-word 1\n",
"pos 1\n",
"prev-iob 1\n",
"prev-lemma 1\n",
"prev-pos 1\n",
"prev-prev-iob 1\n",
"prev-prev-lemma 1\n",
"prev-prev-pos 1\n",
"prev-prev-shape 1\n",
"prev-prev-word 1\n",
"prev-shape 1\n",
"prev-word 1\n",
"sentence_idx 1\n",
"shape 1\n",
"word 1\n",
"tag 1\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"source": [
"# Observations about the whole dataset\n",
"\n",
"- The data has 25 columns and 1,050,795 rows.\n",
"- 17 columns contain null values, with exactly one null value each.\n",
"- The column data types are int64 (1), float64 (1), and object (23).\n",
"\n"
],
"metadata": {
"id": "EzYWiTEN5tnh"
}
},
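{
"cell_type": "markdown",
"source": [
"Because every affected column reports exactly one null value, it is plausible that a single malformed row accounts for all of them. The cell below is a minimal sketch for checking this, assuming the `data` DataFrame loaded above; the name `null_rows` is introduced here for illustration only.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Keep only the rows that contain at least one null value\n",
"null_rows = data[data.isnull().any(axis=1)]\n",
"print(f'Rows with at least one null value: {len(null_rows)}')\n",
"\n",
"# Transpose so that each column of the original data is shown on its own line\n",
"null_rows.T"
],
"metadata": {},
"execution_count": null,
"outputs": []
},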
{
"cell_type": "markdown",
"source": [
"# Select only the sentence_idx, word, and tag columns"
],
"metadata": {
"id": "B9QsrxPE0SPS"
}
},
{
"cell_type": "code",
"source": [
"ner_data = data[['sentence_idx', 'word', 'tag']]"
],
"metadata": {
"id": "dWK0fXlR0jek"
},
"execution_count": 10,
"outputs": []
},
{
"cell_type": "code",
"source": [
"ner_data.shape"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jatQyuv654PV",
"outputId": "7f497c40-a0fa-41b7-e8b8-a698ca828544"
},
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(1050795, 3)"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "code",
"source": [
"ner_data.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "FerXPCTA59DG",
"outputId": "0ba0a7e8-8eec-4d32-a439-0d4d475519ac"
},
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" sentence_idx word tag\n",
"0 1.0 Thousands O\n",
"1 1.0 of O\n",
"2 1.0 demonstrators O\n",
"3 1.0 have O\n",
"4 1.0 marched O"
]
},
"metadata": {},
"execution_count": 12
}
]
},
{
"cell_type": "code",
"source": [
"ner_data.isnull().sum()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "R1R7mjz91LgG",
"outputId": "c46556ce-eed8-4f71-de8b-24726ec00480"
},
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"sentence_idx 1\n",
"word 1\n",
"tag 1\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "code",
"source": [
"# Drop rows with null values\n",
"ner_data = ner_data.dropna()"
],
"metadata": {
"id": "2MQUCtH71R3Y"
},
"execution_count": 14,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Total number of unique sentences\n",
"len(ner_data['sentence_idx'].unique())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d9yp68G95lYQ",
"outputId": "b918a22c-b93c-4f24-a563-2d691ca4a642"
},
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"35177"
]
},
"metadata": {},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"source": [
"# Total number of unique words\n",
"len(ner_data['word'].unique())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_RDZ0EwW2Kzo",
"outputId": "e8965f1a-7cc3-4355-87e5-afdf1966b0ef"
},
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"30172"
]
},
"metadata": {},
"execution_count": 16
}
]
},
{
"cell_type": "code",
"source": [
"# Total number of unique tags\n",
"len(ner_data['tag'].unique())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CYF3NaEo2ZCl",
"outputId": "d98ccd59-7eba-442f-93f5-b3ba18ff4441"
},
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"17"
]
},
"metadata": {},
"execution_count": 17
}
]
},
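{
"cell_type": "code",
"source": [
"# Count tokens per tag; value_counts() sorts in descending order, so [1:] skips the dominant 'O' tag\n",
"ner_data['tag'].value_counts(dropna=False)[1:]"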
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FsJsZvUdAqwf",
"outputId": "ae3412f0-f4b8-4edf-a296-5928456f41f8"
},
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"B-geo 37525\n",
"B-tim 20193\n",
"B-org 20184\n",
"I-per 17382\n",
"B-per 17011\n",
"I-org 16537\n",
"B-gpe 16392\n",
"I-geo 7409\n",
"I-tim 6298\n",
"B-art 434\n",
"B-eve 348\n",
"I-eve 297\n",
"I-art 280\n",
"I-gpe 229\n",
"B-nat 226\n",
"I-nat 76\n",
"Name: tag, dtype: int64"
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "markdown",
"source": [
"## Meaning of the BIO Tags\n",
"- The IOB format (short for inside, outside, beginning), also commonly referred to as the BIO format, is a common format for tagging tokens in chunking tasks in computational linguistics (e.g., named-entity recognition).\n",
"\n",
"  - B marks the beginning of an entity\n",
"  - I marks a token inside an entity\n",
"  - O marks a token outside any entity\n",
"\n",
"## Essential information about the entities in the dataset:\n",
"\n",
"- geo = Geographical Entity\n",
"- org = Organization\n",
"- per = Person\n",
"- gpe = Geopolitical Entity\n",
"- tim = Time indicator\n",
"- art = Artifact\n",
"- eve = Event\n",
"- nat = Natural Phenomenon\n"
],
"metadata": {
"id": "Buh6FDMCMeLN"
}
},
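{
"cell_type": "markdown",
"source": [
"As a rough illustration of the tag scheme above, each tag other than `O` can be split into its IOB prefix and its entity type. The sketch below assumes the `ner_data` DataFrame defined earlier; `tag_parts`, `iob_prefix`, and `entity_type` are hypothetical names introduced for this example.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Split tags such as 'B-geo' into a prefix ('B') and an entity type ('geo').\n",
"# The 'O' tag has no hyphen, so its entity type is left missing.\n",
"tag_parts = ner_data['tag'].str.split('-', n=1, expand=True)\n",
"tag_parts.columns = ['iob_prefix', 'entity_type']\n",
"\n",
"# Count tokens per entity type; missing values (the 'O' tokens) are excluded by default\n",
"tag_parts['entity_type'].value_counts()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},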
{
"cell_type": "markdown",
"source": [
"## Observations about the data\n",
"\n",
"- The data has a total of 35,177 sentences.\n",
"- The data has a total of 30,172 unique words.\n",
"- The data has a total of 17 unique tags. The tags and their total counts are:\n",
"\n",
"| Tag | Count |\n",
"| --- | ---: |\n",
"| O | 889973 |\n",
"| B-geo | 37525 |\n",
"| B-tim | 20193 |\n",
"| B-org | 20184 |\n",
"| I-per | 17382 |\n",
"| B-per | 17011 |\n",
"| I-org | 16537 |\n",
"| B-gpe | 16392 |\n",
"| I-geo | 7409 |\n",
"| I-tim | 6298 |\n",
"| B-art | 434 |\n",
"| B-eve | 348 |\n",
"| I-eve | 297 |\n",
"| I-art | 280 |\n",
"| I-gpe | 229 |\n",
"| B-nat | 226 |\n",
"| I-nat | 76 |\n"
],
"metadata": {
"id": "cdjbymGQqHUs"
}
},
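{
"cell_type": "markdown",
"source": [
"The sentence count above can be complemented by looking at how long the sentences are. The sketch below assumes `ner_data` as defined earlier and treats every row as one token; `sentence_lengths` is a name introduced here for illustration.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Number of tokens (rows) in each sentence\n",
"sentence_lengths = ner_data.groupby('sentence_idx')['word'].count()\n",
"\n",
"# Summary statistics: count, mean, min, quartiles, max\n",
"sentence_lengths.describe()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},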
{
"cell_type": "code",
"source": [
"plt.figure(figsize=(12,6))\n",
"publication_plot = sns.countplot(\n",
" data=ner_data,\n",
" x='tag',\n",
" palette='Set1',\n",
" order = ner_data['tag'].value_counts()[1:].index\n",
")\n",
"\n",
"plt.xticks(\n",
" rotation=45, \n",
" horizontalalignment='right',\n",
" fontweight='light',\n",
" fontsize='x-large' \n",
")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 453
},
"id": "NseIit5Cyiuz",
"outputId": "aa05bafb-0cc9-4516-9932-b8aa3dfd5e40"
},
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]),\n",
" )"
]
},
"metadata": {},
"execution_count": 19
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 864x432 with 1 Axes>"
],
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAuAAAAGRCAYAAAAkSAbwAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3debhdVXn48e9LQhhkSIAYAwkEIUoRBTUCilVEhUCl4FgGhZ8TVsE6VcGpIIjiVAoOWFQmq0UKVcCimCKOlSFRZEYioIQiREDBWqnA+v2x1ubunNx7c5Pcs/a54ft5nvPcc9be5553z+9ee+21I6WEJEmSpDrW6joASZIk6bHEBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqmhy1wHUttlmm6U5c+Z0HYYkSZLWYIsWLfptSmn6cMMecwn4nDlzWLhwYddhSJIkaQ0WEb8aaZhNUCRJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkiiZ3HUDXFs7buesQmLfwiq5DkCRJUiXWgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkV9S0Bj4h1I+KKiPh5RFwXER8q5WdExK0RcVV57VTKIyJOjojFEXF1RDyj9b8OjYiby+vQVvkzI+Ka8p2TIyL6NT2SJEnSeJjcx//9ILBHSukPEbE28KOI+FYZ9u6U0rk94+8NzC2vXYBTgF0iYhPgaGAekIBFEXFBSum+Ms4bgcuBi4D5wLeQJEmSBlTfasBT9ofyce3ySqN8ZT/grPK9y4CpETET2AtYkFK6tyTdC4D5ZdhGKaXLUkoJOAvYv1/TI0mSJI2HvrYBj4hJEXEVcDc5ib68DDq+NDM5MSLWKWVbALe3vr6klI1WvmSY8uHiOCwiFkbEwqVLl672dEmSJEmrqq8JeErp4ZTSTsAsYOeI2AF4L7Ad8CxgE+DIfsZQ4jg1pTQvpTRv+vTp/f45SZIkaURVekFJKf0OuBSYn1K6szQzeRA4Hdi5jHYHMLv1tVmlbLTyWcOUS5IkSQOrn72gTI+IqeX9esCLgRtL221KjyX7A9eWr1wAHFJ6Q9kV+H1K6U7gYmDPiJgWEdOAPYGLy7D7I2LX8r8OAc7v1/RIkiRJ46GfvaDMBM6MiEnkRP+clNI3I+K7ETEdCOAq4G/L+BcB+wCLgT8CrwVIKd0bEccBV5bxjk0p3VvevwU4A1iP3PuJPaBIkiRpoPUtAU8pXQ08fZjyPUYYPwGHjzDsNOC0YcoXAjusXqSSJElSPT4JU5IkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqqG8JeESsGxFXRMTPI+K6iPhQKd86Ii6PiMUR8bWImFLK1ymfF5fhc1r/672l/KaI2KtVPr+ULY6Io/o1LZIkSdJ46WcN+IPAHimlHYGdgPkRsSvwMeDElNK2wH3A68v4rwfuK+UnlvGIiO2BA4CnAPOBz0XEpIiYBHwW2BvYHjiwjCtJkiQNrL4l4C
n7Q/m4dnklYA/g3FJ+JrB/eb9f+UwZ/sKIiFJ+dkrpwZTSrcBiYOfyWpxSuiWl9H/A2WVcSZIkaWD1tQ14qam+CrgbWAD8EvhdSumhMsoSYIvyfgvgdoAy/PfApu3ynu+MVC5JkiQNrL4m4Cmlh1NKOwGzyDXW2/Xz90YSEYdFxMKIWLh06dIuQpAkSZKASr2gpJR+B1wKPBuYGhGTy6BZwB3l/R3AbIAyfGPgnnZ5z3dGKh/u909NKc1LKc2bPn36uEyTJEmStCr62QvK9IiYWt6vB7wYuIGciL+ijHYocH55f0H5TBn+3ZRSKuUHlF5StgbmAlcAVwJzS68qU8g3al7Qr+mRJEmSxsPkFY+yymYCZ5beStYCzkkpfTMirgfOjogPAz8DvlTG/xLw5YhYDNxLTqhJKV0XEecA1wMPAYenlB4GiIgjgIuBScBpKaXr+jg9kiRJ0mrrWwKeUroaePow5beQ24P3lv8JeOUI/+t44Phhyi8CLlrtYCVJkqRKfBKmJEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklSRCbgkSZJUkQm4JEmSVJEJuCRJklRR3xLwiJgdEZdGxPURcV1EvK2UHxMRd0TEVeW1T+s7742IxRFxU0Ts1SqfX8oWR8RRrfKtI+LyUv61iJjSr+mRJEmSxkM/a8AfAt6VUtoe2BU4PCK2L8NOTCntVF4XAZRhBwBPAeYDn4uISRExCfgssDewPXBg6/98rPyvbYH7gNf3cXokSZKk1da3BDyldGdK6afl/QPADcAWo3xlP+DslNKDKaVbgcXAzuW1OKV0S0rp/4Czgf0iIoA9gHPL988E9u/P1EiSJEnjo0ob8IiYAzwduLwUHRERV0fEaRExrZRtAdze+tqSUjZS+abA71JKD/WUS5IkSQOr7wl4RGwAnAe8PaV0P3AKsA2wE3An8KkKMRwWEQsjYuHSpUv7/XOSJEnSiPqagEfE2uTk+ysppX8HSCndlVJ6OKX0CPAFchMTgDuA2a2vzyplI5XfA0yNiMk95ctJKZ2aUpqXUpo3ffr08Zk4SZIkaRX0sxeUAL4E3JBS+sdW+czWaC8Fri3vLwAOiIh1ImJrYC5wBXAlMLf0eDKFfKPmBSmlBFwKvKJ8/1Dg/H5NjyRJkjQeJq94lFW2G/Aa4JqIuKqUvY/ci8lOQAJuA94EkFK6LiLOAa4n96ByeErpYYCIOAK4GJgEnJZSuq78vyOBsyPiw8DPyAm/JEmSNLD6loCnlH4ExDCDLhrlO8cDxw9TftFw30sp3cJQExZJkiRp4PkkTEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqaIxJeARcclYyiRJkiSNbvJoAyNiXWB9YLOImAZEGbQRsEWfY5MkSZLWOKMm4MCbgLcDmwOLGErA7wc+08e4JEmSpDXSqAl4Sukk4KSIeGtK6dOVYpIkSZLWWCuqAQcgpfTpiHgOMKf9nZTSWX2KS5IkSVojjSkBj4gvA9sAVwEPl+IEmIBLkiRJK2Gs3RDOA3ZLKb0lpfTW8vq70b4QEbMj4t
KIuD4irouIt5XyTSJiQUTcXP5OK+URESdHxOKIuDointH6X4eW8W+OiENb5c+MiGvKd06OiFg+EkmSJGlwjDUBvxZ4wkr+74eAd6WUtgd2BQ6PiO2Bo4BLUkpzgUvKZ4C9gbnldRhwCuSEHTga2AXYGTi6SdrLOG9sfW/+SsYoSZIkVTWmJijAZsD1EXEF8GBTmFL665G+kFK6E7izvH8gIm4gd124H7B7Ge1M4HvAkaX8rJRSAi6LiKkRMbOMuyCldC9ARCwA5kfE94CNUkqXlfKzgP2Bb41xmiaU+R/8Wtch8O3j/maF4xz05QMqRDK6r77m7K5DkCRJGtFYE/BjVudHImIO8HTgcmBGSc4BfgPMKO+3AG5vfW1JKRutfMkw5ZIkSdLAGmsvKN9f1R+IiA2A84C3p5TubzfTTimliEir+r9XIobDyM1a2HLLLfv9c5IkSdKIxvoo+gci4v7y+lNEPBwR94/he2uTk++vpJT+vRTfVZqWUP7eXcrvAGa3vj6rlI1WPmuY8uWklE5NKc1LKc2bPn36isKWJEmS+mZMCXhKacOU0kYppY2A9YCXA58b7TulR5IvATeklP6xNegCoOnJ5FDg/Fb5IaU3lF2B35emKhcDe0bEtHLz5Z7AxWXY/RGxa/mtQ1r/S5IkSRpIY+0F5VEp+waw1wpG3Q14DbBHRFxVXvsAJwAvjoibgReVzwAXAbcAi4EvAG8pv3cvcBxwZXkd29yQWcb5YvnOL1lDb8CUJEnSmmOsD+J5WevjWuR+wf802ndSSj8CRuqX+4XDjJ+Aw0f4X6cBpw1TvhDYYbQ4JEmSpEEy1l5Q9m29fwi4jdxtoCRJkqSVMNZeUF7b70AkSZKkx4Kx9oIyKyK+HhF3l9d5ETFrxd+UJEmS1DbWmzBPJ/dSsnl5XVjKJEmSJK2EsSbg01NKp6eUHiqvMwA71JYkSZJW0lgT8Hsi4tURMam8Xg3c08/AJEmSpDXRWBPw1wGvAn4D3Am8Avh/fYpJkiRJWmONtRvCY4FDU0r3AUTEJsAnyYm5JEmSpDEaaw3405rkGx59OuXT+xOSJEmStOYaawK+VkRMaz6UGvCx1p5LkiRJKsaaRH8K+ElE/Fv5/Erg+P6EJEmSJK25xvokzLMiYiGwRyl6WUrp+v6FJUmSJK2ZxtyMpCTcJt2SJEnSahhrG3BJkiRJ48AEXJIkSarIBFySJEmqyK4E9Zhz0kH/3HUIALztq2/qOgRJktQBa8AlSZKkikzAJUmSpIpsgiINqNuO2brrEACYc8ytXYcgSdIaxRpwSZIkqSITcEmSJKkiE3BJkiSpItuAS1otd965b9chADBz5oVdhyBJ0phYAy5JkiRVZA24pMeEUw5+dtch8Oav/KTrECRJA8AacEmSJKkiE3BJkiSpIhNwSZIkqSITcEmSJKkiE3BJkiSpIhNwSZIkqaK+JeARcVpE3B0R17bKjomIOyLiqvLapzXsvRGxOCJuioi9WuXzS9niiDiqVb51RFxeyr8WEVP6NS2SJEnSeOlnDfgZwPxhyk9MKe1UXhcBRMT2wAHAU8p3PhcRkyJiEvBZYG9ge+DAMi7Ax8r/2ha4D3h9H6dFkiRJGhd9S8BTSj8A7h3j6PsBZ6eUHkwp3QosBnYur8UppVtSSv8HnA3sFxEB7AGcW75/JrD/uE6AJEmS1AddtAE/IiKuLk1UppWyLYDbW+MsKWUjlW8K/C6l9FBPuSRJkjTQaifgpwDbADsBdwKfqvGjEXFYRCyMiIVLly6t8ZOSJEnSsKom4Cmlu1JKD6eUHgG+QG5iAnAHMLs16qxSNlL5PcDUiJjcUz7S756aUpqXUpo3ffr08ZkYSZIkaRVUTcAjYmbr40uBpoeUC4ADImKdiNgamAtcAVwJzC09nkwh36h5QUopAZcCryjfPxQ4v8Y0SJIkSatj8opHWTUR8a/A7sBmEbEEOBrYPSJ2AhJwG/AmgJTSdRFxDnA98BBweErp4f
J/jgAuBiYBp6WUris/cSRwdkR8GPgZ8KV+TYskSZI0XvqWgKeUDhymeMQkOaV0PHD8MOUXARcNU34LQ01YJEmSpAnBJ2FKkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkV9S0Bj4jTIuLuiLi2VbZJRCyIiJvL32mlPCLi5IhYHBFXR8QzWt85tIx/c0Qc2ip/ZkRcU75zckREv6ZFkiRJGi/9rAE/A5jfU3YUcElKaS5wSfkMsDcwt7wOA06BnLADRwO7ADsDRzdJexnnja3v9f6WJEmSNHD6loCnlH4A3NtTvB9wZnl/JrB/q/yslF0GTI2ImcBewIKU0r0ppfuABcD8MmyjlNJlKaUEnNX6X5IkSdLAqt0GfEZK6c7y/jfAjPJ+C+D21nhLStlo5UuGKR9WRBwWEQsjYuHSpUtXbwokSZKk1dDZTZil5jpV+q1TU0rzUkrzpk+fXuMnJUmSpGHVTsDvKs1HKH/vLuV3ALNb480qZaOVzxqmXJIkSRpotRPwC4CmJ5NDgfNb5YeU3lB2BX5fmqpcDOwZEdPKzZd7AheXYfdHxK6l95NDWv9LkiRJGliT+/WPI+Jfgd2BzSJiCbk3kxOAcyLi9cCvgFeV0S8C9gEWA38EXguQUro3Io4DrizjHZtSam7sfAu5p5X1gG+VlyRJkjTQ+paAp5QOHGHQC4cZNwGHj/B/TgNOG6Z8IbDD6sQoSZIk1eaTMCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKTMAlSZKkikzAJUmSpIpMwCVJkqSKJncdgCRpyF0nXdp1CMx42wu6DkGS1mjWgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFZmAS5IkSRWZgEuSJEkVmYBLkiRJFfkgHknSSjv11FO7DoHDDjus6xAkaZVYAy5JkiRVZAIuSZIkVWQCLkmSJFVkAi5JkiRVZAIuSZIkVWQCLkmSJFXUSQIeEbdFxDURcVVELCxlm0TEgoi4ufydVsojIk6OiMURcXVEPKP1fw4t498cEYd2MS2SJEnSyuiyBvwFKaWdUkrzyuejgEtSSnOBS8pngL2BueV1GHAK5IQdOBrYBdgZOLpJ2iVJkqRBNUhNUPYDzizvzwT2b5WflbLLgKkRMRPYC1iQUro3pXQfsACYXztoSZIkaWV0lYAn4DsRsSgimkeZzUgp3Vne/waYUd5vAdze+u6SUjZSuSRJkjSwunoU/XNTSndExOOBBRFxY3tgSilFRBqvHytJ/mEAW2655Xj9W0mSJGmldVIDnlK6o/y9G/g6uQ33XaVpCeXv3WX0O4DZra/PKmUjlQ/3e6emlOallOZNnz59PCdFkiRJWinVE/CIeFxEbNi8B/YErgUuAJqeTA4Fzi/vLwAOKb2h7Ar8vjRVuRjYMyKmlZsv9yxlkiRJ0sDqognKDODrEdH8/ldTSt+OiCuBcyLi9cCvgFeV8S8C9gEWA38EXguQUro3Io4DrizjHZtSurfeZEiSJEkrr3oCnlK6BdhxmPJ7gBcOU56Aw0f4X6cBp413jJIkSVK/DFI3hJIkSdIazwRckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSa
rIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqsgEXJIkSarIBFySJEmqyARckiRJqmjCJ+ARMT8iboqIxRFxVNfxSJIkSaOZ3HUAqyMiJgGfBV4MLAGujIgLUkrXdxuZJGkQLJy3c9chMG/hFaMOn//Br1WKZHTfPu5vug5BesyY0Ak4sDOwOKV0C0BEnA3sB5iAS5I0jg768gFdhwDAV19z9qjDTzronytFMrq3ffVNow6/7ZitK0UysjnH3LrCce68c98KkYxu5swLuw5h3E30JihbALe3Pi8pZZIkSdJAipRS1zGssoh4BTA/pfSG8vk1wC4ppSN6xjsMOKx8fDJw0ziHshnw23H+n+NtIsQIxjnejHN8Gef4mQgxgnGON+McX8Y5fvoR41YppenDDZjoTVDuAGa3Ps8qZctIKZ0KnNqvICJiYUppXr/+/3iYCDGCcY434xxfxjl+JkKMYJzjzTjHl3GOn9oxTvQmKFcCcyNi64iYAhwAXNBxTJIkSdKIJnQNeErpoYg4ArgYmAScllK6ruOwJEmSpBFN6AQcIKV0EXBRx2H0rXnLOJoIMYJxjjfjHF/GOX4mQoxgnOPNOMeXcY6fqjFO6JswJUmSpIlmorcBlyRJkiYUE3BJkiSpIhNwSZIkqSITcEmSJKkiE3CpiIgJsT30xjlR4pYGSURE673b0GNAREz4nt+0ciJi7fL3cV3H0sudzhpiUA8ggxrXcFJKjwBExIubjXbQRES04nx5+/MgmQgnCU0C1k7EBt1EihUGM96I2DAi1k4ppYjYOyJ2GsRtSOMnIjYpy/yhiHhhRDyj65iGM4jby3AiYvOI2Lm8PzAijuk4pOVExJyI2C6l9OeIeDlwbERs0HVcbQN3UOzSaEnCIG0YrcRhu4h4QURsOagHkFayuGdEPKXreIbTXu4R8VHgQmD2IC1zeDT5TuX9B4HPAk/tNqrl9ZwkHBwRG6eUHhmkJDwi1kpDfbBOHuATrmXmWWv5D9y6Wf4+LSL2iYinRsSUkuQO0nKfCVwLPC8iDgD+A3hit1EtrzU/t4yIJ0bEDl3HNJxWnLMjYseI2CYippWygVjuZZmfC7y5LPMFwBO6jWp5PfskImL9LuMZSalJvhB4X0S8F/gKcHu3US2rxHgicHFEvB34N+BnKaU/dBvZsuwHvCgrf5M0vAr4C+BB4NqU0jdLeaQBmWER8VLgDOB3wEzg3cBXU0pLu4yr0TM/dwYuBU4DTkopLe40uBFExBOBw4ELUkrf7zqekUTEU4F/AD4zaHH2LPenARcA1wOvSin9oT18QGI8HHgBMA24BXhnSumBLuNr9MR5ELA9sC7w7ZTSf3Ya3DAi4mXAP5eP9wDfBt6fUvqfQVju8GjCeBrwMmAD4A0ppdO7jWpZzXEmIvYDPlKKtwTOA05NKf1Xd9ENacX5UnKcU4A/AncC70kpXdVpgEVEbAJ8BtgFmA28KaV0ekRMSik93G10Wc+2/g5yrPOALwPfTyl9r8PwllOO6f8ObA4cl1I6upQPUo70HPKDdbYDPpBSOiEiJqeUHuo4tEcNxBnqIGit/J8g1yzuCrwB+GxEfKGMk7queYpsBvB+4D3AC8k7v08Ah5dhneqpAf17YA/gz+T5+a6I2LbL+IYTEQcClwMvAe7oOJwRlYTxM+Qd302lbCBqQ3uW+9uAdwAPA/OBcyJio0GoCW/F+DHydrQIOB14PXB2RGzcYXiPasX5SeBTwLOB5wDfiYiPRMRmXcbXKPukzYC3A+8i7zvPA54HfD4iNijLvet9Z1PDeD6wIfB/wH/HgLULLseZvYCvAp8GngW8ETgEmNNhaMsocb4IOAv4HP
Ak4BTgRcDuHYb2qJJw3UvevjcF7gI2KOUPd70varS29RPIlWkLycf1twNHRcTmHYa3jDLPbgMCWAo8qSS7zTrR6Txt7Wd+Xf7eChwUEduXJkiTOgpteSklX+UFvIp89r5r+bwJ8BZgCbnmtsvYmqsVa5Frbj4KPK41/Ehyjf0xwOO7npclpmOAe4F9gb2ADwJ/IO+st+k6vp5YX0a+NPm/wNNL2aSu4xomzpcAvwH+BPxV7/oxCC/gaOA+4KXkhPGjwA3Ad4ANyzhrdRzjXwI3A7uVz/OBB4DDesbrdL4CLycf5J7VzDPgdcBDwJFdxUg++Db7pEnkmvmzm30PsDbw98DPyLV4jxuQ+bkP8DXg1eRk8YGy7U8eZtzO1lHylYSPlffbAL8g1343w9fueD6uVV5fAv6xlM0kJ2afaY23cZdxlhgOAa4py/5L5BPudzf79673Ra04nwf8kqH8Yx75JPGQQYqzFe8s4Lkl5nOB53QdU098GwJbAc8HLgGuA55ShjXLflqnMXY9kwbpBXwI+H57RSdfmv4gcBWwZUdxNQe6fcrB7DLgCmCrnvHeQ05wPwFM7zDOADYDrgbe2jPOESV5+BzwpI7m57A7MuDF5JqHm4HtRhu34zj3IJ/dXwjs3Dv/O17u08tyf1Nr+Nrkqx9LSswbDMC8PQi4orx/KTkRe1P5vBFwUFex9cR5JPDdZn615vU7yrb+5I7j24fcjvrbwGU9w6aQk/ArgG/QqjCoHGMzz55GrgE9qDXsjLLs96Mk4WUftWOH83R9csL4auBxZbv5fGs6Xgv8dZfLvRXrfwL7kyur7iCfODRx7lu2+ykdLvOtyScF7yqfNyXX2C8iX61pTmr/H6XipcN5uR/wvfL+VWW9fHP5vEHZ1jo5oWnNz6nArPK+SWLnk5Pwcxiq0DgO+GBHMT4emAE8oTVsb3ISfg2wfSl7D/DxLtbPR+PqcoUblFdrRTqanIA1tXTNAp0HPAI8r8MY9yDXep5XdnqPAO+j5wyOXOt8F7BZ5fjaJy0zyMnCb4GDS9mU1vAzyTXj/0jlk5qeOF9ETr4OAdZpzedLySdcTRJevSa8J87nkWtCX8lQ8jof+FXZ6T2rw/WyHecsYOOyk3v3MKaf6EkAAB1lSURBVMPPKevtf/RuY7XjLQe4C8lJzqPJdxm2C7m2fqeu5msrlneX7Xnj8rlJEnco5bt3GNtzyU2MTiVXCvyRXMPY3tankO9X+D6weYex7khOBk9sz8fy/nTyycx7yU0+HgZ26CDGZ7S271OAfyXf3HYKQ8eoSeQE8p/oKHEgn8g08Xwb+C75/onPUmrmgXXKfP14s2/tIM5nkivVTiEfj5rYppGPQVeSTxhOKPulTiqEWvG+mny/zMHA7ynJdxm2B/kk9ikdxNU+ofoh+YTwO2V7Wq8M2wu4kdyM81vkXGXnjmL8QVkfvw/8XWucvUvcvyMfix7ueh/f2Q93OtEj1yzuWTbEt/SUb0eu1esk0SHXKn6oHRf5sv6DwDtZPgnftHJ80Xr/afLdxpATrR8xlGxNKX9PLBvyfcARoy2TPsb5UXLbsJ+QE5mFwJ5l2D7ks+VFdHMgbsd5Arn5xvUl1rubA0VZX28jH6h36zjOz5QDxFZl2f4HQyc1zc7xA8DXyzrxyUrLfKRt/VnA/5Ttvb2TXh/4Jvmu+Wo19KPE2TSV+SCtk2qGmiU8v4sYye19/w54e2u+fQj4KfnqVjvBXRvYpPb62ax75GRwaVnW3xhhek4iX6K+nA4OyuR7On7e2h++lZyE/RfLNus5npyUd3X1cBNycvW+8vkF5NrPm9vzldx++dcdxrlp2QfdC1zYXhfL36nk+youJV+hqXbFY5RtfRvyScH/NfO3lK9Driw4r+Y+qSe2l5ArKo4mV1B8k5xwfxBYv4zzPPIJ1xcotcyVY9yXXAnw9+QTghPLNn9ka5xnlf3UGV3EuFzMXQfQwUJq73T3IZ/FvQOYUco+UDaA95
JvJtqOfJb/kxorP3AAMLu8D3I3c/8LLAZe0zPux0qsb2sf4KhYq8iySdguwPcoVwrINbaLyM1mmuR7Evnu6Z3JNeC/ofKlafKNLXcDzyyfDy4b6vzWOHuRT7rO7HBdfTs5cWjaBB5e4ty/Nc588onYcZVjay/3meQTmL8sn3cqO+tTyQe6SeTk4evAm8g1ZT8HNqoY48HkS47vp9x/APxNmZ+fLuvqvuSrS1czVNNc9SShxPD6sk/aqpR9jHwCdhK5lvTpwEUV90lvBJ7Y+rwdOVFYAry2Vb4h+eD2c+BkOm6n3DMNW5BvWr6TnCg8ek9Na5zN+71OjhLfFPI9KP/RKvsUuZLg++Qa2/PL/qDrphInk0+2NybXJr+n7E9/Qq6dP29A4twLuJh8deMlrfJm216b3MSnWrOOnvXtYHKe8R7gqaXszeQTmm+QK1gOJucf1zJ08lA1CSd307kIeFv5vAH55OpGciXA+xmqCV+bbq4Yzy0x/m35PJ18hfjKso9/X8/4y93z0ck62nUAnU14Pqj9glzD8ANyEvNC8tnmO8iXKX5bVrIf9XvlJ9ca7ADcT0nAW8NOLSvRxymXKFvDPlKGvaX2htkTxwHk/kD/haHL/FPISeTPy8Zwdnl/U5neN5YdS9VLqeRuyN7Wivt3LNvWrqm53bWjnUlTa/c1ylUPcvvA+4E3ls8btXZ6u3QRZ/nt95J7azi9xLxWK94/kGtEF5Qd4S/KsAPLOtC3KzUsm3x/iny1ZVH53T+Sb2QM8gHuenK3eT8mX5pstvWqO+myT/o1uebuF2Xf8+oy7ITWweSann1S35Z9WbcuAbZulc0gn0QtBf61Z/wNyE1Ofg18sqN1ctgKCHITqTvLcn5Gq7yTg3FrW2mW49PJNbaHtMZ5LfnE65vk5oWdNpMoMe1LvnrUnHBPJV+p+Qr5atyxwNwBWeZ7kJPwnwB7tco72a+33n+cXElxaVnm15O77KTsm/6TXLn24zJPO9knld98Irkp3DRyZcvN5KtcU8p8vbXsu9btcJ3crMS0Oflk+0byfRNPIFf8PAJ8qKv4Roy76wA6WlhvJp+xzyufX8nyNYvbkxOw3Vo7yr6v/JQaGHLN9zat8s+TTxJex/JJ+DHAX3Q4P9clJw0PAJf3DJtMru3+BLkW5+Otncmp5Jq89SvGuj65fdgh5DasDzB01rwWOaHs7Qmji531ZPLJ4T7AXw0T51vJl9omdxUnufboRPIl6Z+0ypvaxTlleX+efKLY1Dyd3s/lzrIHuqdQ2nOXZT+FXOP9J+DlZZwnkJvObNaKvXby/Sby1aCmB55mn/Sy1jgbl3X2qZX3SZuUv8+k3PRZ5tUnyAe6j/aMvyFwFK1a84rzsVl+u5FP/k8kt1lupmF2mc8/pvsa2hk9n6eTa4//ufb6t4I4p9GT4JJPVL8PTB2A+Jpl/hxyk4gPk5870Azfq+xv/ovSzLDjeGeRmzo9q3xel9ym/6eUJl2l/InAel3tk1pxrE05ASefeJ/N0D0pJ5Nvvr2YyvedDRNn09T1U+QmhFPL54+QryrcXbaxwekxrOsAOlhIQU4ImpvEXk6uWTysfJ7KMIkBfa5dpvRwUP5uVg6+X2XZS79fJCcOb6CjHgWaeThM2SbkxGop+ebQEXcWZR7/I/nMv283lYy0zMoG+iNyDe0bWuUbkS/31W7OsVycZaf8XXKt530se0POdPIJz/tqxTjKcp9DPgF8hNLTQClfrvkBOfn5JLm2+akV4n1NmYcLyCcLk1rDTie3pV3u8nMXO2hyLedx5f0BtG7ConSnNZb1Zpxjap/cPYHczOi7DN2D8PiyHf+c5ZPwTrpGLH9fVvYt3ykxLyHX4DW9N8wuy/46OurthHxCuIRcCdHuYvJAci9Rz+hqPvbEuSP5itGJwN6t8v3ItbZN87j2jbfVm0CWZX4fubbzB2WdPLE13nzyQ8FuAF7Y4fx8b9l3f6Ns1038zY2hCxlqrtmuSKgyT1
vxPL7sMzfqGX4huQKjGe8k8s2jM2rE1xPjpvT09kbOny4BvtAqOwn4295pGYRX5wFUWFjNjq3dj/YCcnvafWjVLJbhbyffoNf32kR6LkGW90377/3JNd5fZNma8C+STxiOoIMknOXbTG7CUPv5TciXIReRbw4dbvrmkG8kWkQfb3bqiXMrWpdEyXdD30Nur97U6M0uO8YrqFjT0BPntuTL+zPL52eSaxcuJ1/aX4+cCDXtf7uKczqt3mvINbNNTwJHtMontba7meQu9a6lT0lPa31bi9wc5tPky6O/bI2zbvm7C7lGpNqd+sPNy/J5CvnA+07y1aL21Y4gNy97JxV6k+iNrZTNLX/fQE5uLmxtN00Svgg4ufa8HCbWvyTXcL+ufF6fnND+ilw7unkp34rcHGlO5fia7WEH4DDy5fyfkWsQdyQnYmfR6ju9o/n4aK0r+ca7C8mJ+Fcp7anLunD2ACzz55b9ZFOJ9jTyCdjvgS+1xvtrcs19tWXes98M8v1a/0O+eb45bjY9ymxPx72tlTj2K+vk9eTmL01LgfXIPZz8pEzHyeSTnurdM5cYryM31VtAziuadfZE8n087y4x3s2APXfk0enoOoCKC6yp/Vib3FXWtWXlafcsMrXsaE6oGNccSo8Q5LP4/6bUepPb2j3C8kn42eS2jFX7BGXZM/J/ICeGvyore9Pd4LSyk15IPplZ7kQGeDKVHhZETvZvKzvkG8jJzCRyTdPNZdg1Jd7LqNCudoQ4P0K+0XYJ+bL+oeSD3yvJB76ryzr7Q3KteLU4h1nuPy07tWvIScTG5NqSE8hdOx0+3P8gn1z0fbk3B4SyPR9Tln1v93g7kQ/a1RPwVgyPbx003lMOJn9i2RsbNyBflanWnpp8Q9Np5f1LyScx25bPryMfgHuT8FPLuln9+QOtuKeQ7985vnx+Ytm+P0NuH/og+YaxrcrwmiewzXLeoOx/mvs3NiVfqbmE3BvTueTmMQuBLTqYh02c69Nq01u28ReRrxzeQL4S8hnyyWKX3fOuRT6x/3z5PIfcxPBM8g3B/8uyNeFd9UXf9GSzDrld/4PtuMqwHUvs8zqcn88kV/AdWfbnF5f9UtOxwizy/v/n5K56u+gxaB75JOZoco5xFflkumnSswu5InAxuUKt8+5kR5yWrgPo40Jqn3nuTb6hcpfyeWvyTUI3kc86Nyxl3yInN01b1b5f9iG3411MrtV8kKGbrpqz4nYS3m6OMrPDeXssuQb5YHJ7+s+RE6/3lOGbkW/G/DXwNx0u99eU5X4g+WlYZ5CT2+YA/Uzy5bP3kRONZp7XaFfbezf83eQTsEPIZ/APA39fhj+R3Kb2feSeO6rF2RPzMWV+vpZ89eirZRs6lnzA3pTc/vIRSvvqDpZ5s60/u3zeiHwSdh25XeB25Ee6f7PsnKudaA2zzP+boZ54diMf2H7K0MMsti37hUW1ljX5ROnlZRl+v/w9qGec1zOUhDfNUaZT8TL0KPHvTG4jvwH5CtcXSvna5AR3KblmbDL1L+vvXdbBK8py3bdnvEPIl8sfKa/ZNeIbIc6Lyrr4Q3Jt47QybCq5rfW55CZ891Iqtzpc5jPIN7CuR66Vb04eZ5FPsh9hKEGvtcx7t/U7GWpWtDa5ScSfyc1hX0zu3eg/yrbe1Q3125AT7w+0ynYjN5e5BXhBKduQfIyv/lAg8pWj3YCjWmVTy/7ol6396XqlvPN7FEadnq4D6NNCaq/8L2nt1K5vHdx2Jl+q/AVDd8b/hA5qQMntUR8h14I0N2FOZuiS+r7k5Pwchm6G6Oqph1uVHfMrWmVBvkT+CEN9aW9GTti62pn8NbkW+XU95ceTTwz2GeF7tWu+dyc/KKK37/m3lfn5VyN8r3acW5BrvA/uKW967nhR+TybfENhrYRxtG39OaV8Y/KT2R4gX/U6m1x719RCVmtuVt7/NfkA/Aj5xrAdS/lLyUnvfeRa56uo1NvJMPE2fehe1ipbp/X+9e
Rk5weU2vFBepFPrn/OUBvlueRehU7uIl5yEvtHcg38a8o6+AitG2pb4+5AR5fMy7r5h7K9PJfcS8dd5KSn90bM3ek4+S5xNCcOu5KvEjZd+s0h39T6d7R68akQz2jb+k6lfHLZTzbPIvinsn4+2l1v5Xm4Lfm4fjut5LYM243cBeZNzX6+o+U8s+wbH2bofpmmsrRJwm8q60FnPcKt1DR1HUCfF9inygL5MPkS9C3khPu5Zfh0co3iW8hnoVVrFslnwmuRk4GvMPQQi6a5TDsJfxm51rmzmu8Sx3bky3rtflXXIt80+B1yMjml5zu1dyY7kNt9PsJQLXI7efgh8M0BWD+fT75E/luGuhdsbsSdTL5p6F/KetJprwhlB30P5cEvPfPzMuCcYb5T8xL/SNt601XaRuSk4qfkWqfmQFe16yxyrzC3k69ofJ7cDOoXwNPK8G3IvTb8Lblb1OpXO8o6+O6yL7oP+LfWsHazhDeSKw2q1tSOEHNzIG5fOfwfcp/fGzLUjrn6UxnJTXR+zFDXp5uX7f7zw4zbZVeym5CTxOZK5jRyreJnu46RcrLcU9Ys62Zbfjb5qtLryj70OPJV7a4eAjXctn4jQ0n4FPLVxP8BPtz6Xif955OvZP6G3LxoZs+w55BPxn5Gq2eWyvE9rizbm4Hv9M4vchJ+A/kqQmddIq7UNHUdQB8X1gvJZ+7PbpXtTW7TdBNDl6h7z+q76HKu2ZF8gNzW9xRa7f8oT2Okw5tyWrFsSm77fWxvPOQH7PzLAMS4Afly7i+BBa3yZkM9kdbT0TqO9f3kxPYHDLVdbmp0zgDO7zrGJiZyUvu5VlnTX/rpXS73Ubb1b5dtvakFbW4WvYp842DVZKwcxJbS6oWBoYdc3Ehulz5cTzNd7JPWYuheiQdoJeFleHOpd8PKcU1h+W5Ym+R7Ttm2m8e5N893uJbcVOIZNWNtxTe77Iu2Ku+XAP/cGv7/GICrCOQaxl+QK6aeQG6+0Y7zZZSbWCvHNYvcPGPPVlmTdG9Z9j+zyVfpvkFOen9R9quddDW5gm39epZNwpvmKFV732rF1e4d6r1lX3RS77Imt63uurnRhuRmow/QugGYoWP7xlS+sXq1pqfrAPq4oF5SDspze8pfSj7jvLF1YO6sqyd6znbJCdlV5DPmueSb3n5FPrsbiP4ryTX2S8rBuelVYj3y5fOPVowj2n97ytob6rnkhGIKOam4jA6fcFnia3fxdhQ5Ufw88IRSti65+cEXKsa03PwsnyeV+ff3Jc73t79DrjXr5KErJYaxbOvt5ijHlm2q2s3W5bf3I1/tmN4zv3ckJ4g/ZKg5St+7PS1/R6xZbH1uHlZ1LrlW9DhyglPlRupWHNuR7zu4nHxyOrc1D+eQaz9P79m2jiDXmnWW4Ja4f0Y+KbyVfMNqU+kyg9zjyWu63r+Ta8CvIj9j4NayP2oSm83ISfAhHcT1xLLMv0dph9wqv4Pc21GzPm/D0FXtznq+GMO2/iOGrnqtR76a9Ait9tcVY53c8/kfyM0NT6aDK+4j7JN6Y3w1pTlhq2xgnrw75mntOoA+LsT55INyk2S3z/J+UnaIC4Ht+hxH+6mQo9XcfJqhNqlHkmvCf00+m++sp4ZWrL1J2bkltvPJtYk/JNc01bxUvlyPC635vX75+2py8nAN+RL0V8iXqZoDS1dt6XvnZ3Oi9Uty10/nlJirxTnC/GwOHOuQk9cTy3JfQD4R+yG5RqezJjIrsa0/pZRtRL7aVPVBMcCTyEniq3vKNyHXjN1D68mw/V7mrLhm8Qxyc4n1yDdmLiUnZndSuacGcuJyD7lJ1ofKNn16a3kuJD/htllfO7n3ZJT4v0tOsM7oKT+BfINw9a7ceuIIcqXFOeQnMJ7fM/z4sp1v1VF825KbOP6IoZsB7yKfcA1bcdDx/BzLtn5da3tbj9wcpW8P1GPlcpCjyc31TqNUClWabyvaJ51Z9knrk4/tS4GLul7eqzy9XQfQ54
X5X+QbcTZvlc0gJ49vKQe7v+3j7zfJ4KrU3OxC7mliq8rzbGrvBteajleybL+q7yoHxIvLxrtMO8w+x/k8co81zc1/wVCt0n7k3gY2bm2ovyAnD+2eZGr0drKi+fnFVtzvIdeafI/WzaOV4lzR/DyXnITPICdjF5NvcGov9y6T8DFt6/T5YD3a/ydf2v92mXcvapVvTK4FfQ65Ru/jlebZWGoWH50e8iX+l1G/d46nkW9gPL5V9mbyQ1emk5sf7NPP5bqKcbf353PJV95uIPeK8VryUwV/T8WHAbGCKx7kRPdG8on1UeSa+VPJJzyddudW5mGThD+PfFztss38eGzrn6gU66rkIB8n37tQ8yE7K7NPWp/8bILb6KDLznGZ3q4D6MMCbDdH2JZck/wL8oN3DiLX3H2nDF9EqUXpQxxNkrWyNTdd7lD+hlxTcwe5JqSdrL6CfDn/iGG+137QTq0bWJ9EbuPZu6G+iuWfcLkBw1+y6uuJwqrMT3JN+BXkNnib9q7TgzA/R/h+9eR7ULb1nphGvIpQ3j+ZXCP/Q/IDv15Jvrnpv8i1YJcAX6w4D1dYs9jli5xc3w2c21P+BfIVozvKOvsPHcY4Ws3i1uQrhNMZ6ur2RnJt8tep8ETYVkyj1S5uRX4A0Hrkfpa/TL7n42clzh26XhdKnO0kfPdWeRc3BU6IbZ3VuHpEB4+XX5l9UpmPA/eEyzFPa9cBrMZCGq1m8VXk9muPI7ddO498uedm8llpc5llAeWR9OMcWxPHhKm5IXeJ9AD5aXGHkZOufyrDnlwOGssl3z3/o+o09G6o5ET7T8PFWdaFapesVnZ+9uz0ji477tqX/8Y0P6l8kjjI23pPTKNdRdgfOK+8fzq5+c6t5JOGixm6ofVbwLHN9yvN34GqWeyJbU6Zp5dQEi5yzewfyF11vpTcp/uvGKHLzj7HN5aaxTN6vrMF5epc5VhXVLv4mdb6OoWhp+9W7zlmBdPRXl+f31EME2JbZxVzkK63/0HeJ43rdHYdwCounLHULL615zszaJ3Nkbsr+w09N26NY4wDX3PTiukN5HZ/+7fKjiXX3DwR+AuGHiIwUBtBz4b6l80yHm6HRqVLVqs6P3veV7/8t7Lzs1I8A7+tt35nRVcR3the1mV93KRV9jFy++ou+qkemJrFYWJrTgy/S+5i8i6WrcWdQU4y3lc5rrHWLDYni53vOxlj7eIgLPcVTMdc8sOCrqF0K1z59wd+W2cC5SCjLOOB3CeN2zR2HcAqLJSVrVnsfcjB9uTLa/9NH7soYsBrblpx7k6+OejYnvIryDcEPlgOJMd2Ed8Yp6G9obZ3hsMl4X29ZLW685Nlk/Dql/9Wdn72OY4Jsa33/OZYryK0k51dyScXt9eKcwzL/fldxTFKbAvIzyA4spmH5D7yNyZf5h+1mdQ4xzNhrm6uYDlP2NrFEvd5dNTt3KBv60yQHGQF0zCw+6Rxmb6uA1jJhbHaNbXkhyIcSJ9rw8pvDWTNTU+Mc8vB6xKGumo7j9yWdi/yww2+Qb6E9pddrwMrmI7vkGsldp/I83MQDoZdz8+Jtq2PMO/GfBWBfIIxCH1Bd1qzuILYtiEn4T8A9uhZL35FpUSMCV6z2FrOE752kZ6HvnU8HwduW58IOcgY5/FA7pNWe9q6DmAlFsLuTMCaWgas5maEGJuN9PvkZg+L2gczcm3jw3TQB+wqTEdzk1MnD91wfo7L707Ibb0n1jFdRWAATraGib3TmsWVmK87k5/Y+b9UvHLAGlCzOMy8fH7X8UzU16Bv6xMhBxnDNAzsPmm1pqvrAFZiAUzYmloGpOZmDPP3P8ndYh1Yypq2i9uSbwrcp+s4xzAd25H7qu60H2Dn52rPuwm5rQ8zHZ1flVnF2DutWRzDfL2I3KvRg5SnclaOYcLXLPbMyzWudrGD+Tiw2/pEyEHGMA0Du09a1Vdzt/aEEBHbAp8j90c8mfy0wJenlG4rw59M7t7ptSmls7qKcz
gRMZfc7+v6wDuB55NX/ueklH7WZWyNiNgGOIU8Xz+UUrokIgK4gHym/IKU0sNdxrgyImJSl/E6P1frtybstt5WpuPT5C7oDkop/bTjkNYIZfl/nJzgXtdRDHPJ6+hzgWNSSh8r2/dk8n7+m+Qn7n6xi/jGKiK2Iz9o513N9qWVN+jb+kTIQR5r1uo6gJWRUlpM7uP3z8AO5Mdf3xYRzXQ8TO7q57cdhTiilNLN5NjvJ3dF9GFyjcPArPgppV+SY/wT8A8R8QLyg0yeBLwwpfRwREzqMsaV0XVy6/xcrd+asNt6W5mOd5Cb8vy843DWGCmlm4BXdJV8lxhuJj/c6UfAX0XEHin7M/khZVuSr4INtJTSjeSrdLd1HctENujb+kTIQR5rJlQNeGMi1ywOQs3NirTO5PciX/Z/akrpzxExOaX0ULfRTTzOz1U3kbf14XR9VUbjz5pFDWdQt/WJkIM8VkzIBByW2emtAxwDHEGuKduhJDcDufIDRMTapZZkYJWN9HDgnSmlh0wWV4/zc9VN5G1djw1lHT2JfH/C+uTke1G3UUnDmwg5yGPBhE3AwZrFWpyf48v5ufLc1jXorFmUtDImdAIO1ixKjxVu6xp01ixKGqsJn4C3eUCWHhvc1iVJE9kalYBLkiRJg25CdUMoSZIkTXQm4JIkSVJFJuCSJElSRSbgkiRJUkUm4JIkImJqRLyl6zgk6bHABFySBDAVMAGXpAomdx2AJGkgnABsExFXAZcCTwOmAWsDH0gpnQ8QER8EXg0sBW4HFqWUPtlNyJI0MZmAS5IAjgJ2SCntFBGTgfVTSvdHxGbAZRFxATAPeDmwIzkx/ymwqLOIJWmCMgGXJPUK4CMR8TzgEWALYAawG3B+SulPwJ8i4sIOY5SkCcsEXJLU62BgOvDMlNKfI+I2YN1uQ5KkNYc3YUqSAB4ANizvNwbuLsn3C4CtSvmPgX0jYt2I2AB4SQdxStKEZw24JImU0j0R8eOIuBa4EtguIq4BFgI3lnGuLG3BrwbuAq4Bft9VzJI0UUVKqesYJEkTRERskFL6Q0SsD/wAOCyl9NOu45KkicQacEnSyjg1IrYntwk/0+RbklaeNeCSJElSRd6EKUmSJFVkAi5JkiRVZAIuSZIkVWQCLkmSJFVkAi5JkiRVZAIuSZIkVfT/AWVkrd0JMpQaAAAAAElFTkSuQmCC\n"
},
"metadata": {
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"# Observation\n",
"- The chart above shows that the classes are imbalanced: geographical entities, time indicators, organizations, and persons are heavily represented."
],
"metadata": {
"id": "kGxYry7F1rtz"
}
},
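{
"cell_type": "code",
"source": [
"# Preview: rebuild the first five sentences by joining their words\n",
"b = []\n",
"for i in range(5):\n",
"    a = data[data['sentence_idx'] == i + 1]['word']\n",
"    b.append(' '.join(a))\n",
"# Split the first rebuilt sentence on '.' to inspect it\n",
"b[0].split('.')\n"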
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "D5ERpStz-R_1",
"outputId": "afc5fa91-dd9e-45cd-c81e-c6d6ae367fbe"
},
"execution_count": 20,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country ',\n",
" ' Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country ',\n",
" '']"
]
},
"metadata": {},
"execution_count": 20
}
]
},
{
"cell_type": "code",
"source": [
"# Concatenate the words that share a sentence_idx into one sentence string\n",
"def concat_words(df):\n",
"    sentences = []  # local list, so repeated calls do not accumulate results\n",
"    for i in df['sentence_idx'].unique():\n",
"        sent = df[df['sentence_idx'] == i]['word']\n",
"        sentences.append(' '.join(sent))\n",
"    return sentences\n",
"\n",
"sentences = concat_words(ner_data)\n"
],
"metadata": {
"id": "Nls2VWIB-TBl"
},
"execution_count": 21,
"outputs": []
},
{
"cell_type": "code",
"source": [
"len(sentences)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "dg4QOS6lGGZy",
"outputId": "08fffd9f-0b0f-48f5-80cc-a420c6515ea1"
},
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"35177"
]
},
"metadata": {},
"execution_count": 22
}
]
},
{
"cell_type": "code",
"source": [
"sentences[0:3]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lW2ahsK1Ghw2",
"outputId": "7b6a207f-908a-44d1-a29d-d14917e9d930"
},
"execution_count": 23,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country . Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .',\n",
" 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \" Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Stop the Bombings . \"',\n",
" 'They marched from the Houses of Parliament to a rally in Hyde Park . They marched from the Houses of Parliament to a rally in Hyde Park .']"
]
},
"metadata": {},
"execution_count": 23
}
]
},
{
"cell_type": "code",
"source": [
"# Convert the list of sentences into a single-column DataFrame\n",
"df_sentences = pd.DataFrame({'sentences': sentences})\n",
"df_sentences.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "Ml80ma0PG7YY",
"outputId": "7bd1201a-56d4-4111-b7dd-03b55abfd2e9"
},
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" sentences\n",
"0 Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country . ...\n",
"1 Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as \" Bush Number One Terrorist \" and \" Sto...\n",
"2 They marched from the Houses of Parliament to a rally in Hyde Park . They marched from the Houses of Parliament to a rally in Hyde Park .\n",
"3 Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 . Police put the number of marchers at 10,000 while organizer...\n",
"4 The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton . The pr..."
]
},
"metadata": {},
"execution_count": 24
}
]
},
{
"cell_type": "markdown",
"source": [
"## Data cleansing"
],
"metadata": {
"id": "11pAAgaBDtfe"
}
},
{
"cell_type": "code",
"source": [
"def remove_special_char(df):\n",
"    # Punctuation to strip; '.' and '?' are kept as sentence boundaries\n",
"    special_char = [c for c in punctuation if c not in '.?']\n",
"    # Filler words removed as whole words (matching is case-sensitive)\n",
"    filler_words = re.compile(r'\\b(?:said|says|say|mr)\\b')\n",
"\n",
"    def deep_clean(sentence):\n",
"        sentence = str(sentence).strip()\n",
"        sentence = re.sub(r'<[^>]*>', '', sentence)  # drop HTML-like tags\n",
"        for char in special_char:\n",
"            sentence = sentence.replace(char, '')\n",
"        sentence = filler_words.sub('', sentence)\n",
"        # Collapse runs of whitespace (including newlines) into one space\n",
"        return re.sub(r'\\s+', ' ', sentence).strip()\n",
"\n",
"    df['sentences'] = df['sentences'].apply(deep_clean)\n",
"    return df"
],
"metadata": {
"id": "MeGBdMriVbb2"
},
"execution_count": 25,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df_sentences = remove_special_char(df_sentences)"
],
"metadata": {
"id": "ndoFFfu5tkq8"
},
"execution_count": 26,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## The distribution of word count in the sentences"
],
"metadata": {
"id": "kC6tPSqyMxY9"
}
},
{
"cell_type": "code",
"source": [
"df_sentences['word_count'] = df_sentences['sentences'].apply(lambda x: len(x.split()))"
],
"metadata": {
"id": "Ncq5ZSsYM5Wo"
},
"execution_count": 27,
"outputs": []
},
{
"cell_type": "code",
"source": [
"df_sentences['word_count'].describe([0.1,0.25,0.5,0.75,0.95])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eABFHtEiNiU1",
"outputId": "9e3a31fd-7390-43b0-d039-3bff5fc65738"
},
"execution_count": 28,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"count 35177.000000\n",
"mean 28.302499\n",
"std 14.566342\n",
"min 1.000000\n",
"10% 13.000000\n",
"25% 18.000000\n",
"50% 25.000000\n",
"75% 36.000000\n",
"95% 58.000000\n",
"max 130.000000\n",
"Name: word_count, dtype: float64"
]
},
"metadata": {},
"execution_count": 28
}
]
},
{
"cell_type": "code",
"source": [
"df_sentences[df_sentences['word_count']<6]['sentences'].count()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "74Wk77BON7gA",
"outputId": "908f5896-9eb3-4db6-fd42-5bf83ee86c1b"
},
"execution_count": 29,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"101"
]
},
"metadata": {},
"execution_count": 29
}
]
},
{
"cell_type": "code",
"source": [
"df_sentences[df_sentences['word_count']<6]['sentences'].head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uf81OifmNyrt",
"outputId": "a24add60-2ea7-4910-b29e-64f91803ff45"
},
"execution_count": 30,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1594 John Garang John Garang\n",
"2491 IRAQPOVERTY Washington IRAQPOVERTY Washington \n",
"4809 Janice Karpinski Janice Karpinski\n",
"8411 The The\n",
"12943 The assassination occurred Tuesday .\n",
"Name: sentences, dtype: object"
]
},
"metadata": {},
"execution_count": 30
}
]
},
{
"cell_type": "code",
"source": [
"df_sentences[df_sentences['word_count']>100]['sentences'].count()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "S09nSgudRbLX",
"outputId": "83ded79d-2d86-4c77-ce83-0b5fa82367de"
},
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"12"
]
},
"metadata": {},
"execution_count": 31
}
]
},
{
"cell_type": "code",
"source": [
"# Histogram of per-sentence word counts\n",
"sns.histplot(df_sentences['word_count'], bins=10)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"id": "9F2gx419NGqZ",
"outputId": "d051663f-d64f-47b8-a5ac-c24bdf964d4f"
},
"execution_count": 32,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
]
},
"metadata": {},
"execution_count": 32
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"