{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "eBpjBBZc6IvA" }, "source": [ "# Fatima Fellowship Coding Challenge (Pick 1)\n", "\n", "Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests. These coding challenges are not meant to take too long; do NOT spend more than 4-6 hours on them -- you can submit whatever you have.\n", "\n", "**How to submit**: Please make a copy of this Colab notebook, add your code and results, and submit your Colab notebook along with your application. If you have never used a Colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw)." ] }, { "cell_type": "markdown", "source": [ "\n", "\n", "---\n", "\n", "\n", "### **Important**: Before you get started, please make sure to make a **copy of this notebook** and set sharing permissions so that **anyone with the link can view**. Otherwise, we will NOT be able to assess your application.\n", "\n", "\n", "\n", "---\n", "\n" ], "metadata": { "id": "lQNUZjvuRt-m" } }, { "cell_type": "markdown", "metadata": { "id": "braBzmRpMe7_" }, "source": [ "# 1. 
Deep Learning for Vision" ] }, { "cell_type": "markdown", "metadata": { "id": "1IWw-NZf5WfF" }, "source": [ "**Generated by AI detector**: Train a model to detect if images are generated by AI\n", "\n", "* Find a dataset of natural images and images generated by AI (here is one such dataset on the [Hugging Face Hub](https://huggingface.co/datasets/competitions/aiornot)), but you're welcome to use any dataset you've found.\n", "* Create a training and test set.\n", "* Build a neural network (using TensorFlow, PyTorch, or any framework you like)\n", "* Train it to classify the image as being generated by an AI or not until a reasonable accuracy is reached\n", "* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.\n", "* Look at some of the images that were classified incorrectly. Please explain what you might do to improve your model's performance on these images in the future (you do not need to implement these suggestions)\n", "\n", "**Submission instructions**: Please write your code below and include some examples of images that were classified incorrectly." ] }, { "cell_type": "code", "source": [ "### WRITE YOUR CODE TO TRAIN THE MODEL HERE\n", "print('Hi')" ], "metadata": { "id": "K2GJaYBpw91T", "outputId": "f26681e5-f682-42d2-e837-f949a159c779", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Hi\n" ] } ] }, { "cell_type": "markdown", "source": [ "**Write up**: \n", "* Link to the model on Hugging Face Hub: \n", "* Include some examples of misclassified images. Please explain what you might do to improve your model's performance on these images in the future (you do not need to implement these suggestions)\n", "\n", "[Please put your write up here]" ], "metadata": { "id": "qSeLed2JxvGI" } }, { "cell_type": "markdown", "metadata": { "id": "sFU9LTOyMiMj" }, "source": [ "# 2. 
Deep Learning for NLP\n", "\n", "**Fake news classifier**: Train a text classification model to detect fake news articles!\n", "\n", "* Download the dataset here: https://www.kaggle.com/datasets/sadikaljarif/fake-news-detection-dataset-english (if you'd like, you can also look at fake news datasets in other languages, which are available on the Hugging Face Hub)\n", "* Develop an NLP model for classification that uses a pretrained language model and the *text* of the article. It should *NOT* use the URL\n", "* Fine-tune your model on the dataset, and generate an ROC curve (reporting the AUC) for your model on a test set of your choice.\n", "* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.\n", "* *Answer the following question*: Look at some of the news articles that were classified incorrectly. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)" ] }, { "cell_type": "code", "source": [ "# Install the required libraries\n", "!pip install opendatasets\n", "!pip install pandas\n", "!pip install -q kaggle\n", "!pip install transformers\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MvlmVtfz8LY5", "outputId": "14ca9f13-661c-45b7-9ed0-62c1ce5aaef4" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Collecting opendatasets\n", " Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)\n", "Requirement already satisfied: kaggle in /usr/local/lib/python3.9/dist-packages (from opendatasets) (1.5.13)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from opendatasets) (4.65.0)\n", "Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from opendatasets) (8.1.3)\n", "Requirement 
already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (2.25.1)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (2022.12.7)\n", "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (1.15.0)\n", "Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (2.8.2)\n", "Requirement already satisfied: python-slugify in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (8.0.1)\n", "Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/dist-packages (from kaggle->opendatasets) (1.26.14)\n", "Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/dist-packages (from python-slugify->kaggle->opendatasets) (1.3)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle->opendatasets) (4.0.0)\n", "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->kaggle->opendatasets) (2.10)\n", "Installing collected packages: opendatasets\n", "Successfully installed opendatasets-0.1.22\n", "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.9/dist-packages (1.4.4)\n", "Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.9/dist-packages (from pandas) (1.22.4)\n", "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.9/dist-packages (from pandas) (2022.7.1)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/dist-packages (from python-dateutil>=2.8.1->pandas) (1.15.0)\n", "Looking in indexes: https://pypi.org/simple, 
https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Collecting transformers\n", " Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)\n", " 6.3/6.3 MB 33.6 MB/s eta 0:00:00\n", "Requirement already satisfied: filelock in /usr/local/lib/python3.9/dist-packages (from transformers) (3.9.0)\n", "Collecting tokenizers!=0.11.3,<0.14,>=0.11.1\n", " Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)\n", " 7.6/7.6 MB 81.8 MB/s eta 0:00:00\n", "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.9/dist-packages (from transformers) (4.65.0)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from transformers) (2.25.1)\n", "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (6.0)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (1.22.4)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (23.0)\n", "Collecting huggingface-hub<1.0,>=0.11.0\n", " Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)\n", " 199.2/199.2 KB 21.8 MB/s eta 0:00:00\n", "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (2022.6.2)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.5.0)\n", "Requirement already satisfied: chardet<5,>=3.0.2 in 
/usr/local/lib/python3.9/dist-packages (from requests->transformers) (4.0.0)\n", "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2.10)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (1.26.14)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2022.12.7)\n", "Installing collected packages: tokenizers, huggingface-hub, transformers\n", "Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.26.1\n" ] } ] }, { "cell_type": "code", "source": [ "# Import libraries\n", "import opendatasets as od\n", "from tensorflow.keras.models import Model, Sequential\n", "from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, Input\n", "from tensorflow.keras.callbacks import EarlyStopping\n", "from tensorflow.python.ops.numpy_ops import np_utils\n", "from transformers import BertModel, TFBertModel\n", "import tensorflow as tf\n", "from tensorflow.keras.optimizers import Adam\n", "from transformers import BertTokenizer, BertForSequenceClassification, AdamW\n", "from sklearn.metrics import roc_auc_score\n", "from torch.utils.data import DataLoader, RandomSampler, SequentialSampler\n", "\n", "from tensorflow.keras import regularizers\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import confusion_matrix\n", "\n", "import pandas as pd\n", "from matplotlib import rcParams\n", "import seaborn as sns\n", "import numpy as np\n", "from PIL import Image\n", "from sklearn.model_selection import train_test_split\n", "from matplotlib import pyplot as plt\n", "from transformers import AutoTokenizer" ], "metadata": { "id": "OaxYb0_T8Wn4" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Load the datasets\n", 
"fake_news=pd.read_csv(\"/Fake.csv\")\n", "true_news=pd.read_csv(\"/content/True.csv\")" ], "metadata": { "id": "bJU3ck0SIQqx" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "#Exploring the datasets\n", "fake_news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "fhucg9DT1VDX", "outputId": "a9be23df-d443-4a0e-dd41-f5547f674810" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 Donald Trump Sends Out Embarrassing New Year’... \n", "1 Drunk Bragging Trump Staffer Started Russian ... \n", "2 Sheriff David Clarke Becomes An Internet Joke... \n", "3 Trump Is So Obsessed He Even Has Obama’s Name... \n", "4 Pope Francis Just Called Out Donald Trump Dur... \n", "\n", " text subject \\\n", "0 Donald Trump just couldn t wish all Americans ... News \n", "1 House Intelligence Committee Chairman Devin Nu... News \n", "2 On Friday, it was revealed that former Milwauk... News \n", "3 On Christmas day, Donald Trump announced that ... News \n", "4 Pope Francis used his annual Christmas Day mes... News \n", "\n", " date \n", "0 December 31, 2017 \n", "1 December 31, 2017 \n", "2 December 30, 2017 \n", "3 December 29, 2017 \n", "4 December 25, 2017 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 51 } ] }, { "cell_type": "code", "source": [ "#Exploring the datasets\n", "true_news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "KcSJxAah1gv2", "outputId": "0d13a388-7693-437e-933e-c153043b3037" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 As U.S. budget fight looms, Republicans flip t... \n", "1 U.S. military to accept transgender recruits o... \n", "2 Senior U.S. Republican senator: 'Let Mr. Muell... \n", "3 FBI Russia probe helped by Australian diplomat... \n", "4 Trump wants Postal Service to charge 'much mor... \n", "\n", " text subject \\\n", "0 WASHINGTON (Reuters) - The head of a conservat... politicsNews \n", "1 WASHINGTON (Reuters) - Transgender people will... politicsNews \n", "2 WASHINGTON (Reuters) - The special counsel inv... politicsNews \n", "3 WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews \n", "4 SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews \n", "\n", " date \n", "0 December 31, 2017 \n", "1 December 29, 2017 \n", "2 December 31, 2017 \n", "3 December 30, 2017 \n", "4 December 29, 2017 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 52 } ] }, { "cell_type": "code", "source": [ "# Bar chart of article counts per subject (true news).\n", "# figsize is passed to plot() directly: setting rcParams after plotting only affects later figures.\n", "true_news['subject'].value_counts().plot(kind='barh', figsize=(7, 10))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 594 }, "id": "0eLZrf9F1qi2", "outputId": "b68de132-4fc1-4a2e-8317-a81d1cbf0b7c" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "# Bar chart of article counts per subject (fake news).\n", "# figsize is passed to plot() directly: setting rcParams after plotting only affects later figures.\n", "fake_news['subject'].value_counts().plot(kind='barh', figsize=(10, 7))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 594 }, "id": "POwJJJTV14r0", "outputId": "c8818dd5-cf98-4fa1-d37c-9c7b40dcdb6a" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "# Assign a label to each dataset\n", "fake_news[\"label\"] = \"fake\"\n", "true_news[\"label\"] = \"true\"\n", "\n", "# Merge the fake and true news datasets to create the final one\n", "final_news_dataset = pd.concat([fake_news, true_news])\n", "\n", "# Shuffle the rows and reset the index\n", "final_news_dataset = final_news_dataset.sample(frac=1).reset_index(drop=True)\n", "\n", "# Explore the final dataset\n", "final_news_dataset.head(10)\n", "\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 363 }, "id": "noYYMxvF2Ag8", "outputId": "ce5142c8-631f-4420-e201-63d769ec11d6" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 U.S. responds in court fight over illegal Indo... \n", "1 Numbskull Republican Ignores History, Says Re... \n", "2 US-UK DIRTY WAR: ‘Latin American-style’ Death ... \n", "3 SUPREME COURT JUSTICE Goes All Creepy Predicti... \n", "4 Hillary Clinton: ‘Israel First’ (and no peace ... \n", "5 Boiler Room EP #119 – Zombie Disneyland & The ... \n", "6 EU Parliament calls on Myanmar to free Reuters... \n", "7 Donald Trump Releases Statement On Cruz Sex S... \n", "8 McConnell Just ADMITTED The NRA Must Approve ... \n", "9 Putin says question of who hacked Democratic p... \n", "\n", " text subject \\\n", "0 BOSTON (Reuters) - U.S. immigration officials ... politicsNews \n", "1 Republican Rep. Ted Poe (R-Texas) spoke with F... News \n", "2 Patrick Henningsen 21st Century WireThis week... Middle-east \n", "3 What the heck is wrong with these loony libera... politics \n", "4 Robert Fantina CounterpunchAlthough the United... US_News \n", "5 Tune in to the Alternate Current Radio Network... US_News \n", "6 BRUSSELS (Reuters) - The president of the Euro... worldnews \n", "7 With the sex scandal allegations piling up aga... News \n", "8 We could already make the assumption that Sena... 
News \n", "9 MOSCOW (Reuters) - Russian President Vladimir ... politicsNews \n", "\n", " date label \n", "0 December 21, 2017 true \n", "1 January 20, 2016 fake \n", "2 July 14, 2016 fake \n", "3 Jul 10, 2016 fake \n", "4 January 18, 2016 fake \n", "5 July 29, 2017 fake \n", "6 December 14, 2017 true \n", "7 March 25, 2016 fake \n", "8 July 6, 2016 fake \n", "9 December 23, 2016 true " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 55 } ] }, { "cell_type": "code", "source": [ "final_news_dataset.isnull().sum()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "p6QU5NZQ4Mbz", "outputId": "579159ab-f47d-4cb9-fa66-d728a4a42107" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "title 0\n", "text 0\n", "subject 0\n", "date 0\n", "label 0\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 56 } ] }, { "cell_type": "code", "source": [ "# Reducing additional features\n", "final_news_dataset.drop([\"subject\",\"date\"], axis=1)\n", "\n", "# Exploring labelwise value counts\n", "final_news_dataset.label.value_counts()\n", "\n", "#viewing the processed data\n", "sns.set_theme(style=\"whitegrid\")\n", "sns.countplot(x=final_news_dataset[\"label\"])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zXrtchOF4R6T", "outputId": "d0e6aac8-922a-4715-d7ec-5680cc1e3de5" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "fake 23481\n", "true 21417\n", "Name: label, dtype: int64" ] }, "metadata": {}, "execution_count": 57 } ] }, { "cell_type": "code", "source": [], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 629 }, "id": "6HCRcYsl5Bgf", "outputId": "62e0a242-d791-43e3-8a04-3c66e88c3b58" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 58 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnkAAAJSCAYAAAC/YtNUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAAsTAAALEwEAmpwYAAAc2UlEQVR4nO3df6xX9X3H8de9wL1W1F7RYi+0i9ZOw0YM6b1rs7WsG8aAjbW1PwZBY1qnmyZ2P1xFO1foKLoBtslsnXbR1CxBSbPGMpiV1pl1btOmshh3Zzeco27KnU7QgaiX671nfxhuvK3WW+B7v/i+j8df3vM5N+d9TDh55pz7/Z6OpmmaAABQSme7BwAA4PATeQAABYk8AICCRB4AQEEiDwCgoOntHuBIMzo6mn379mXGjBnp6Oho9zgAAK+raZoMDw9n5syZ6ewcf+9O5P2Yffv2Zfv27e0eAwBgwk477bQce+yx47aJvB8zY8aMJK/8z+rq6mrzNAAAr2///v3Zvn37WL+8msj7MQce0XZ1daW7u7vN0wAAvLHX+hMzH7wAAChI5AEAFCTyAAAKEnkAAAWJPACAgkQeAEBBIg8AoCCRBwBQkMgDAChI5AEAFCTyAAAKEnkAAAWJPACAgkQeAEBBIg8AoCCRBwBQkMgDAChI5AEAFCTyAAAKEnkAAAWJPACAgkQeAEBBIg8AoCCRB1DU6MvD7R4BpqQj5d/e9HYPAEBrdE6fkW3rLm73GDDl9K24pd0jJHEnDwCgJJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkHQH2D4+0ewSYkvzbAyqb3u4BSLpmTMvyFRvaPQZMObevO7/dIwC0jDt5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIImJfKeffbZXHLJJVm8eHE+/OEP5/LLL8/u3buTJA899FDOPffcLF68OBdddFF27do19nutWAMAmAomJfI6Ojpy8cUXZ+vWrdm8eXPe+c535vrrr8/o6GiuvPLKrFy5Mlu3bk1/f3+uv/76JGnJGgDAVDEpkdfT05P3ve99Yz8vWLAgO3fuzMDAQLq7u9Pf358kWbZsWe6+++4kackaAMBUMX2yDzg6Opo77rgjixYtyuDgYObMmTO2NmvWrIyOjua5555ryVpPT8+E5xwYGDi0E/0Z9PX1TdqxgPG2bdvW7hFaxrUF2udIuLZMeuR98YtfzNFHH50LLrgg3/3udyf78BM2f/78dHd3t3sMoMWEENAKk3VtGRoaet0bU5MaeWvXrs3jjz+em2++OZ2dnent7c3OnTvH1nfv3p3Ozs709PS0ZA0AYKqYtK9Q+fKXv5yBgYHceOON6erqSvLK3bKXXnopDz74YJJk48aNWbJkScvWAACmikm5k/foo4/ma1/7Wk4++eQsW7YsSfKOd7wjN954Y9atW5dVq1ZlaGgoc+fOzfr165MknZ2dh30NAGCq6Giapmn3EEeSA8+2J/tv8pav2DBpxwJecfu689s9QsttW3dxu0eAKadvxS2Tdqyf1i3eeAEAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8
gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFTVrkrV27NosWLcrpp5+e7du3j21ftGhRlixZko985CP5yEc+kvvuu29s7aGHHsq5556bxYsX56KLLsquXbsOeQ0AYCqYtMg788wzs2HDhsydO/cn1m644YZs2rQpmzZtysKFC5Mko6OjufLKK7Ny5cps3bo1/f39uf766w9pDQBgqpi0yOvv709vb++E9x8YGEh3d3f6+/uTJMuWLcvdd999SGsAAFPF9HYPkCSf/exn0zRN+vr6csUVV+S4447L4OBg5syZM7bPrFmzMjo6mueee+6g13p6eibztAAA2qbtkbdhw4b09vZm//79ufbaa7N69eoj4vHqwMDApB2rr69v0o4FjLdt27Z2j9Ayri3QPkfCtaXtkXfgEW5XV1eWL1+eyy67bGz7zp07x/bbvXt3Ojs709PTc9BrP4v58+enu7v7EM4MeDMQQkArTNa1ZWho6HVvTLX1K1ReeOGF7N27N0nSNE3uuuuuzJs3L8krkfXSSy/lwQcfTJJs3LgxS5YsOaQ1AICpYtLu5K1Zsybf+c538swzz+TTn/50enp6cvPNN+czn/lMRkZGMjo6mlNPPTWrVq1KknR2dmbdunVZtWpVhoaGMnfu3Kxfv/6Q1gAApoqOpmmadg9xJDlw23OyH9cuX7Fh0o4FvOL2dee3e4SW27bu4naPAFNO34pbJu1YP61bvPECAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIJEHgBAQSIPAKAgkQcAUJDIAwAoSOQBABQk8gAAChJ5AAAFiTwAgIImHHm33nrra27/+te/ftiGAQDg8Jhw5N14442vuf2mm246bMMAAHB4TH+jHe6///4kyejoaB544IE0TTO29sQTT2TmzJmtmw4AgIPyhpF3zTXXJEmGhobyh3/4h2PbOzo68ra3vS1/9Ed/1LrpAAA4KG8Yeffee2+SZMWKFVm3bl3LBwIA4NC9YeQd8OrAGx0dHbfW2elDugAAR5IJR96//uu/ZvXq1fn3f//3DA0NJUmapklHR0d++MMftmxAAAB+dhOOvKuvvjq//uu/nuuuuy5HHXVUK2cCAOAQTTjynnzyyfz+7/9+Ojo6WjkPAACHwYT/mO6ss87KP/zDP7RyFgAADpMJ38kbGhrK5Zdfnr6+vpx44onj1nzqFgDgyDLhyHv3u9+dd7/73a2cBQCAw2TCkXf55Ze3cg4AAA6jCUfegdebvZZf/uVfPizDAABweEw48g683uyAZ599NsPDw
znppJPyt3/7t4d9MAAADt6EI+/A680OGBkZyU033ZSZM2ce9qEAADg0B/0+smnTpuXSSy/NLbfccjjnAQDgMDikl87+4z/+oy9HBgA4Ak34ce0HP/jBcUH34osvZv/+/Vm1alVLBgMA4OBNOPLWr18/7ue3vOUtOeWUU3LMMccc9qEAADg0E4689773vUmS0dHRPPPMMznxxBPT2XlIT3sBAGiRCVfa888/nxUrVuSMM87Ir/7qr+aMM87IVVddlb1797ZyPgAADsKEI2/NmjV58cUXs3nz5jz88MPZvHlzXnzxxaxZs6aV8wEAcBAm/Lj2vvvuyz333JO3vOUtSZJTTjklf/Inf5KzzjqrZcMBAHBwJnwnr7u7O7t37x637dlnn01XV9dhHwoAgEMz4Tt5n/jEJ3LRRRflU5/6VObMmZOdO3fmtttuyyc/+clWzgcAwEGYcORddtllOemkk7J58+Y8/fTTmT17di6++GKRBwBwBJrw49prr702p5xySm677bbcddddue2223Lqqafm2muvbeV8AAAchAlH3pYtWzJ//vxx2+bPn58tW7Yc9qEAADg0E468jo6OjI6Ojts2MjLyE9sAAGi/CUdef39//uzP/mws6kZHR/OVr3wl/f39LRsOAICDM+EPXlxzzTX57d/+7XzgAx/InDlzMjg4mLe97W25+eabWzkfAAAHYcKR9/a3vz133nlnHn744QwODqa3tzdnnHGG99cCAByBJhx5SdLZ2ZkFCxZkwYIFLRoHAIDDwW04AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFDQpETe2rVrs2jRopx++unZvn372PYdO3Zk6dKlWbx4cZYuXZof/ehHLV0DAJgqJiXyzjzzzGzYsCFz584dt33VqlVZvnx5tm7dmuXLl2flypUtXQMAmComJfL6+/vT29s7btuuXbvyyCOP5JxzzkmSnHPOOXnkkUeye/fulqwBAEwl09t14MHBwZx00kmZNm1akmTatGmZPXt2BgcH0zTNYV+bNWtWe04UAKAN2hZ5R7qBgYFJO1ZfX9+kHQsYb9u2be0eoWVcW6B9joRrS9sir7e3N0899VRGRkYybdq0jIyM5Omnn05vb2+apjnsaz+r+fPnp7u7uwVnDhxJhBDQCpN1bRkaGnrdG1Nt+wqVE044IfPmzcuWLVuSJFu2bMm8efMya9aslqwBAEwlHU3TNK0+yJo1a/Kd73wnzzzzTI4//vj09PTkb/7mb/LYY4/l6quvzp49e3Lcccdl7dq1ede73pUkLVmbiANFPNl38pav2DBpxwJecfu689s9QsttW3dxu0eAKadvxS2Tdqyf1i2TEnlvJiIPpg6RB7TCkRJ53ngBAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkA
QAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCprd7gCRZtGhRurq60t3dnST57Gc/m4ULF+ahhx7KypUrMzQ0lLlz52b9+vU54YQTkuSg1wAApoIj5k7eDTfckE2bNmXTpk1ZuHBhRkdHc+WVV2blypXZunVr+vv7c/311yfJQa8BAEwVR0zk/biBgYF0d3env78/SbJs2bLcfffdh7QGADBVHBGPa5NXHtE2TZO+vr5cccUVGRwczJw5c8bWZ82aldHR0Tz33HMHvdbT0zPheQYGBg7LeU1EX1/fpB0LGG/btm3tHqFlXFugfY6Ea8sREXkbNmxIb29v9u/fn2uvvTarV6/OWWed1daZ5s+fP/Y3gkBdQghohcm6tgwNDb3ujakj4nFtb29vkqSrqyvLly/PP//zP6e3tzc7d+4c22f37t3p7OxMT0/PQa8BAEwVbY+8F154IXv37k2SNE2Tu+66K/Pmzcv8+fPz0ksv5cEHH0ySbNy4MUuWLEmSg14DAJgq2v64dteuXfnMZz6TkZGRjI6O5tRTT82qVavS2dmZdevWZdWqVeO+CiXJQa8BAEwVbY+8d77znfnWt771mmvvec97snnz5sO6BgAwFbT9cS0AAIefyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABYk8AICCRB4AQEEiDwCgIJEHAFCQyAMAKEjkAQAUJPIAAAoSeQAABZWNvB07dmTp0qVZvHhxli5dmh/96EftHgkAYNKUjbxVq1Zl+fLl2bp1a5YvX56VK1e2eyQAgEkzvd0DtMKuXbvyyCOP5Otf/3qS5JxzzskXv/jF7N69O7Nmzfqpv9s0TZJk//79LZ/z1Y47esakHg9IhoaG2j1C6x11bLsngClnMq8tB3rlQL+8WsnIGxwczEknnZRp06YlSaZNm5bZs2dncHDwDSNveHg4SbJ9+/aWz/lql3z41Ek9HpAMDAy0e4TWe/8F7Z4Appx2XFuGh4dz1FFHjdtWMvIOxcyZM3PaaadlxowZ6ejoaPc4AACvq2maDA8PZ+bMmT+xVjLyent789RTT2VkZCTTpk3LyMhInn766fT29r7h73Z2dubYYz3eAADeHH78Dt4BJT94ccIJJ2TevHnZsmVLkmTLli2ZN2/eGz6qBQCooqN5rb/UK+Cxxx7L1VdfnT179uS4447L2rVr8653vavdYwEATIqykQcAMJWVfFwLADDViTwAgIJEHgBAQSIPAKAgkQev4Stf+cqkv9oOqO+ee+7J2WefnY9+9KP5z//8z9fc5/vf/34+9
rGPTfJkVCTy4DV89atfHXvF3au9/PLLbZgGqGLjxo35nd/5nXzrW9/ytV60XMk3XsCh+OM//uMkybJly9LZ2Zm5c+fm+OOPz44dO7Jv377ceOON+fjHP57vf//7SZInnnhi3M/f+973ctNNN2X//v2ZMWNGPve5z2XBggXtOh3gCHHddddl27Zt2bFjR26//fbMnj07O3bsyPDwcH7u534u1113Xd761reO+509e/bk8ssvz6JFi/KpT30qd955Z26//faMjIzkmGOOyRe+8AWxyOtrgJ9w2mmnNc8//3zTNE1z1VVXNeedd16zb9++pmma5r//+7+b9773vWP7vvrnxx9/vPmN3/iNZu/evU3TNM327dubD37wg5M7PHDEuuCCC5p77723aZqm2bVr19j2L3/5y8369eubpmmaBx54oDnvvPOaJ554ojnvvPOab3/7203TNM0PfvCD5pJLLmmGhoaapmmav/u7v2uWLl06yWfAm4k7eTABS5YsydFHH/2G+9133335r//6r5x//vlj215++eU888wzOfHEE1s5IvAms2nTpmzevDnDw8N54YUXcvLJJ4+t/e///m8uvPDCrF27Nv39/UmSe++9N//2b/+WT37yk0leeTH9nj172jE6bxIiDybg1YE3ffr0NK96UczQ0NC4fRcuXJh169ZN2mzAm8+DDz6YO+64Ixs3bsysWbOyefPmfOMb3xhbf+tb35q3v/3t+fu///uxyGuaJh//+Mfzu7/7u+0amzcZH7yA1zBz5sw8//zzr7l24oknZnh4OI8//niSZMuWLWNr73//+3Pffffl0UcfHdv28MMPt3ZY4E1nz549OeaYY9LT05P9+/fnm9/85rj1rq6u/Pmf/3n+4z/+I2vWrEnTNFm0aFE2bdqU//mf/0mSjIyMZGBgoB3j8ybhTh68hosuuigXXnhhjjrqqMydO3fc2vTp03PNNdfk05/+dGbNmpVf+7VfG1s7+eSTs379+lxzzTV56aWXMjw8nPe85z0544wzJvkMgCPZwoUL89d//ddZvHhxjj/++PT39+df/uVfxu3T1dWVG264IVdeeWU+//nPZ/Xq1fm93/u9XHbZZRkZGcnw8HCWLFmS+fPnt+ksONJ1NK9+7gQAQAke1wIAFCTyAAAKEnkAAAWJPACAgkQeAEBBIg/gZ7Ro0aL80z/90xvud/rpp499n+LP6lB+FyAReQAAJYk8AICCRB7AQXr44YezdOnS9Pf35wMf+EBWr16d/fv3j9vne9/7Xs4888y8733vy9q1azM6Ojq29ld/9Vc5++yz80u/9Ev5zd/8zTz55JOTfQpAYSIP4CB1dnbmc5/7XB544IFs3Lgx999/f26//fZx+3z3u9/NN7/5zdx555259957x95Res899+RrX/tavvrVr+b+++9PX19f/uAP/qAdpwEUJfIADtL8+fOzYMGCTJ8+Pe94xzuydOnS/OAHPxi3zyWXXJKenp7MmTMnF154YbZs2ZIk2bhxY37rt34rp556aqZPn55LL700P/zhD93NAw6b6e0eAODNaseOHfnTP/3TDAwM5MUXX8zIyEh+8Rd/cdw+vb29Y/89d+7cPP3000mSnTt35rrrrsvatWvH1pumyVNPPZW5c+dOzgkApYk8gIP0hS98Ib/wC7+QL33pSznmmGNy2223ZevWreP2GRwczM///M8neSXsZs+eneSV+Lv00ktz7rnnTvrcwNTgcS3AQdq3b19mzpyZmTNn5rHHHssdd9zxE/vceuut+b//+78MDg7mL//yL/OhD30oSbJs2bL8xV/8RR599NEkyd69e/Ptb397UucHanMnD+AgXXXVVfn85z+fW2+9NfPmzcuHPvShPPDAA+P2OfPMM/Oxj30szz//fM4777x84hOfSJKcddZZ2bdvX6644oo8+eSTOfbYY/Mrv/IrOfvss9txKkBBHU3TNO0eAgCAw8vjWgCAgkQeAEBBIg8AoCCRBwBQk
MgDAChI5AEAFCTyAAAKEnkAAAWJPACAgv4fdbB9EaexJy4AAAAASUVORK5CYII=\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "# Reduce the dataset to the two columns the model needs: text and label\n", "# (join title and body with a space so the last title word and the first body word do not run together)\n", "final_news_dataset[\"text\"] = final_news_dataset[\"title\"] + \" \" + final_news_dataset[\"text\"]\n", "# .copy() gives an independent DataFrame, so the assignment below does not trigger SettingWithCopyWarning\n", "dataset = final_news_dataset[[\"text\", \"label\"]].copy()\n", "\n", "# Map the string labels to integers: true -> 1, fake -> 0\n", "dataset['label'] = dataset['label'].map({'true': 1, 'fake': 0})" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KhsDEe-I5OiT", "outputId": "fefc48c1-6839-433f-c79a-881344b898f8" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Store the features and labels (max_len is the token sequence length)\n", "max_len=100\n", "text=dataset[\"text\"]\n", "label=dataset[\"label\"]" ], "metadata": { "id": "icH5EXWj6l-9" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "\n", "# Load the tokenizer and model\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", "model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WqdRXa1XEvPr", "outputId": "14ac8ce1-4fbb-4fd2-a7f2-0fb9fa5cba8f" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 
'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']\n", "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ] }, { "cell_type": "code", "source": [ "text_train, text_test, label_train, label_test = train_test_split(text, label, stratify = label, test_size = 0.2, random_state = 50)" ], "metadata": { "id": "B580Edzq7OKu" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import transformers\n", "\n", "def tokenize_text(input_text):\n", " # Initialize a BERT tokenizer with the pretrained model\n", " tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')\n", " \n", " # Tokenize the input text\n", " tokenized_text = tokenizer.batch_encode_plus(\n", " input_text,\n", " max_length=100,\n", " add_special_tokens=True,\n", " padding='max_length',\n", " truncation=True,\n", " return_attention_mask=True,\n", " return_token_type_ids=False,\n", " verbose=True\n", " )\n", " \n", " # Return the tokenized text\n", " return tokenized_text" ], "metadata": { "id": "_VmBjJZs9BXu" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ 
"data_train_token = tokenize_text(text_train)\n", "data_test_token = tokenize_text(text_test)" ], "metadata": { "id": "YbrRMAyG9EYN" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import tensorflow as tf\n", "from tensorflow.keras.layers import Input, Dense, Dropout\n", "from tensorflow.keras.models import Model\n", "import transformers\n", "\n", "def create_model(maxlen):\n", " # Load the BERT model and tokenizer\n", " bert_model = transformers.TFBertModel.from_pretrained('bert-base-uncased')\n", " bert_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')\n", " \n", " # Define input layers for BERT inputs\n", " input_ids = Input(shape=(maxlen,), dtype=tf.int32)\n", " input_mask = Input(shape=(maxlen,), dtype=tf.int32)\n", " \n", " # Use the BERT model to encode the input text\n", " bert_layer = bert_model([input_ids, input_mask])[1]\n", " \n", " # Apply dropout regularization\n", " x = Dropout(0.5)(bert_layer)\n", " \n", " # Add a fully connected layer with activation function tanh\n", " x = Dense(64, activation='tanh')(x)\n", " \n", " # Apply dropout regularization again\n", " x = Dropout(0.2)(x)\n", " \n", " # Add a final output layer with sigmoid activation function\n", " x = Dense(1, activation='sigmoid')(x)\n", " \n", " # Define the model with inputs and outputs\n", " model = Model(inputs=[input_ids, input_mask], outputs=x)\n", " \n", " # Return the model\n", " return model" ], "metadata": { "id": "R_ypyyBA-_oh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Checking out the model\n", "\n", "model=create_model(100)\n", "model.summary()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1R3mS0IK_G9l", "outputId": "0e1eb00c-bfd5-4680-f033-c791f3fa2f42" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Some layers from the model checkpoint at bert-base-uncased were not used when initializing 
TFBertModel: ['mlm___cls', 'nsp___cls']\n", "- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.\n", "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Model: \"model_1\"\n", "__________________________________________________________________________________________________\n", " Layer (type) Output Shape Param # Connected to \n", "==================================================================================================\n", " input_3 (InputLayer) [(None, 100)] 0 [] \n", " \n", " input_4 (InputLayer) [(None, 100)] 0 [] \n", " \n", " tf_bert_model_1 (TFBertModel) TFBaseModelOutputWi 109482240 ['input_3[0][0]', \n", " thPoolingAndCrossAt 'input_4[0][0]'] \n", " tentions(last_hidde \n", " n_state=(None, 100, \n", " 768), \n", " pooler_output=(Non \n", " e, 768), \n", " past_key_values=No \n", " ne, hidden_states=N \n", " one, attentions=Non \n", " e, cross_attentions \n", " =None) \n", " \n", " dropout_76 (Dropout) (None, 768) 0 ['tf_bert_model_1[0][1]'] \n", " \n", " dense_2 (Dense) (None, 64) 49216 ['dropout_76[0][0]'] \n", " \n", " dropout_77 (Dropout) (None, 64) 0 ['dense_2[0][0]'] \n", " \n", " dense_3 (Dense) (None, 1) 65 ['dropout_77[0][0]'] \n", " \n", "==================================================================================================\n", "Total params: 109,531,521\n", 
"Trainable params: 109,531,521\n", "Non-trainable params: 0\n", "__________________________________________________________________________________________________\n" ] } ] }, { "cell_type": "code", "source": [ "import numpy as np\n", "import tensorflow as tf\n", "import matplotlib.pyplot as plt\n", "\n", "# Set up the Adam optimizer with a small learning rate for fine-tuning\n", "optimizer = tf.keras.optimizers.legacy.Adam(\n", "    learning_rate=1e-05,\n", "    epsilon=1e-08,\n", "    decay=0.01,\n", "    clipnorm=1.0\n", ")\n", "\n", "\n", "# Compile the model with binary cross-entropy loss and accuracy metric\n", "model.compile(\n", "    optimizer=optimizer,\n", "    loss='binary_crossentropy',\n", "    metrics=['accuracy']\n", ")\n", "\n", "# Set up an early stopping callback on the validation loss.\n", "# A loss should be minimized, so mode is 'min'; patience is kept below\n", "# the number of epochs so that the callback can actually trigger.\n", "callback = tf.keras.callbacks.EarlyStopping(\n", "    monitor='val_loss',\n", "    mode='min',\n", "    verbose=1,\n", "    patience=3,\n", "    baseline=0.4,\n", "    min_delta=0.0001,\n", "    restore_best_weights=False\n", ")\n" ], "metadata": { "id": "GPVqHrW2BBBe" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Convert the tokenizer output (Python lists) to NumPy arrays and pass the inputs\n", "# as a list in the order [input_ids, attention_mask]; this avoids relying on the\n", "# auto-generated input layer names (input_3 and input_4 in the summary above)\n", "train_ids = np.array(data_train_token['input_ids'])\n", "train_mask = np.array(data_train_token['attention_mask'])\n", "\n", "history = model.fit(x=[train_ids, train_mask], y=label_train.values, epochs=10, validation_split=0.2, batch_size=30, callbacks=[callback])" ], "metadata": { "id": "m8etkxDDUfnc" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from sklearn.metrics import roc_auc_score, roc_curve\n", "\n", "# make 
predictions on the tokenized test data (the model expects token IDs and attention masks, not raw text)\n", "test_ids = np.array(data_test_token['input_ids'])\n", "test_mask = np.array(data_test_token['attention_mask'])\n", "test_pred = model.predict([test_ids, test_mask]).ravel()\n", "\n", "# calculate AUC score on the test data\n", "auc_score = roc_auc_score(label_test, test_pred)\n", "\n", "# plot ROC curve\n", "fpr, tpr, _ = roc_curve(label_test, test_pred)\n", "plt.plot(fpr, tpr)\n", "plt.title('ROC Curve (AUC = {:.2f})'.format(auc_score))\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.show()" ], "metadata": { "id": "aLSwrYrYZN_D" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import numpy as np\n", "from 
sklearn.metrics import confusion_matrix\n", "from mlxtend.plotting import plot_confusion_matrix\n", "import matplotlib.pyplot as plt\n", "\n", "# Threshold the predicted probabilities at 0.5 to get class labels\n", "y_pred = (np.asarray(test_pred) >= 0.5).astype(int).ravel()\n", "\n", "conf_matrix = confusion_matrix(label_test, y_pred)\n", "fig, ax = plot_confusion_matrix(conf_mat=conf_matrix, figsize=(6, 6), cmap=plt.cm.Greens)\n", "plt.xlabel('Predictions', fontsize=18)\n", "plt.ylabel('Actuals', fontsize=18)\n", "plt.title('Confusion Matrix', fontsize=18)\n", "plt.show()\n" ], "metadata": { "id": "LIE2zuvMPDfN" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "\n", "# # Evaluate the model and generate an AUC curve\n", "# # model.eval()\n", "# y_true = []\n", "# y_pred = []\n", "# with torch.no_grad():\n", "# for batch in test_dataloader:\n", "# input_ids, attention_mask, labels = batch\n", "# outputs = model(input_ids, attention_mask=attention_mask)\n", "# logits = outputs.logits\n", "# probs = torch.softmax(logits, dim=1)[:, 1]\n", "# y_true.extend(labels.numpy())\n", "# y_pred.extend(probs.numpy())\n", "# auc = roc_auc_score(y_true, y_pred)\n", "# print(f'AUC: {auc}')\n" ], "metadata": { "id": "nOoRwd7tFFO_" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "testcase = \"New York City is set to reopen its public schools for in-person learning in the fall with no remote option for students, Mayor Bill de Blasio announced on Monday, making it the largest school district in the country to offer no virtual learning. 
The announcement came as the city has achieved its goal of vaccinating at least one million residents against Covid-19 and as public health officials have said that it is safe for schools to fully reopen.\"" ], "metadata": { "id": "T8o_10CHa9qW" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# batch_encode_plus expects a list of texts, so wrap the single string in a list\n", "test_token = tokenize_text([testcase])" ], "metadata": { "id": "xmzHCwfZbqV7" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Pass the inputs as a list in the order [input_ids, attention_mask], then threshold the probability at 0.5\n", "test_prob = model.predict([np.array(test_token['input_ids']), np.array(test_token['attention_mask'])])\n", "test_text_pred = np.where(test_prob >= 0.5, 1, 0)" ], "metadata": { "id": "4ysgjuEYb1eS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Label 0 is fake news, label 1 is true news\n", "if test_text_pred[0][0] == 0:\n", "    print(\"Fake news\")\n", "else:\n", "    print(\"True News\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cM8-uH_Kb8Ok", "outputId": "94931573-ddb5-4516-8a52-b60bfc405186" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fake news\n" ] } ] }, { "cell_type": "markdown", "source": [ "**Write up**: \n", "* Link to the model on Hugging Face Hub: \n", "* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)\n", "\n", "[Please put your write up here]" ], "metadata": { "id": "kpInVUMLyJ24" } }, { "cell_type": "markdown", "source": [], "metadata": { "id": "ZTSnl1RBoCCy" } }, { "cell_type": "markdown", "metadata": { "id": "jTfHpo6BOmE8" }, "source": [ "# 3. 
Deep RL / Robotics" ] }, { "cell_type": "markdown", "metadata": { "id": "saB64bbTXWgZ" }, "source": [ "**RL for Classical Control:** Using any of the [classical control](https://github.com/openai/gym/blob/master/docs/environments.md#classic-control) environments from OpenAI's `gym`, implement a deep NN that learns an optimal policy which maximizes the reward of the environment.\n", "\n", "* Describe the NN you implemented and the behavior you observe from the agent as the model converges (or diverges).\n", "* Plot the reward as a function of steps (or epochs).\n", "Compare your results to a random agent.\n", "* Discuss whether you think your model has learned the optimal policy and potential methods for improving it and/or where it might fail.\n", "* (Optional) [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.\n", "\n", "\n", "You may use any frameworks you like, but you must implement your NN on your own (no pre-defined/trained models like [`stable_baselines`](https://stable-baselines.readthedocs.io/en/master/)).\n", "\n", "You may use any simulator other than `gym` _however_:\n", "* The environment has to be similar to the classical control environments (or more complex like [`robosuite`](https://github.com/ARISE-Initiative/robosuite)).\n", "* You cannot choose a game/Atari/text-based environment. The purpose of this challenge is to demonstrate an understanding of basic kinematic/dynamic systems." 
] }, { "cell_type": "code", "source": [ "### WRITE YOUR CODE TO TRAIN THE MODEL HERE" ], "metadata": { "id": "CUhkTcoeynVv" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Write up**: \n", "* (Optional) link to the model on Hugging Face Hub: \n", "* Discuss whether you think your model has learned the optimal policy and potential methods for improving it and/or where it might fail.\n", "\n", "[Please put your write up here]" ], "metadata": { "id": "bWllPZhJyotg" } }, { "cell_type": "markdown", "metadata": { "id": "rbrRbrISa5J_" }, "source": [ "# 4. Theory / Linear Algebra " ] }, { "cell_type": "markdown", "metadata": { "id": "KFkLRCzTXTzL" }, "source": [ "**Implement Contrastive PCA** Read [this paper](https://www.nature.com/articles/s41467-018-04608-8) and implement contrastive PCA in Python.\n", "\n", "* First, please discuss what kind of dataset it would make sense to use this method on\n", "* Implement the method in Python (do not use previous implementations of the method if they already exist)\n", "* Then create a synthetic dataset and apply the method to the synthetic data. Compare with standard PCA.\n" ] }, { "cell_type": "markdown", "source": [ "**Write up**: Discuss what kind of dataset it would make sense to use Contrastive PCA on\n", "\n", "[Please put your write up here]" ], "metadata": { "id": "TpyqWl-ly0wy" } }, { "cell_type": "code", "source": [ "### WRITE YOUR CODE HERE" ], "metadata": { "id": "1CQzUSfQywRk" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# 5. 
Systems" ], "metadata": { "id": "dlqmZS5Hy6q-" } }, { "cell_type": "markdown", "source": [ "**Inference on the edge**: Measure the inference times in various computationally-constrained settings\n", "\n", "* Pick a few different speech detection models (we suggest looking at models on the [Hugging Face Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads))\n", "* Simulate different memory constraints and CPU allocations that are realistic for edge devices that might run such models, such as smart speakers or microcontrollers, and measure what is the average inference time of the models under these conditions \n", "* How does the inference time vary with (1) choice of model (2) available system memory (3) available CPU (4) size of input?\n", "\n", "Are there any surprising discoveries? (Note that this coding challenge is fairly open-ended, so we will be considering the amount of effort invested in discovering something interesting here)." ], "metadata": { "id": "QW_eiDFw1QKm" } }, { "cell_type": "code", "source": [ "### WRITE YOUR CODE HERE" ], "metadata": { "id": "OYp94wLP1kWJ" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Write up**: What surprising discoveries do you see?\n", "\n", "[Please put your write up here]" ], "metadata": { "id": "yoHmutWx2jer" } } ] }