oracat
/

bert-paper-classifier-arxiv

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "1c71aba7-c0f3-4378-9b63-55529e0994b4",
+   "metadata": {},
+   "source": [
+    "# Data\n",
+    "\n",
+    "Мы используем следующий датасет для файнтюнинга:\n",
+    "\n",
+    "- [arXiv papers](https://www.kaggle.com/datasets/neelshah18/arxivdataset)\n",
+    "\n",
+    "Среди статей на arXiv есть также статьи по вычислительной биологии, геномике, etc.\n",
+    "\n",
+    "Среди альтернатив — [датасет](https://zenodo.org/record/7695390) из [недавнего исследования](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1.full.pdf) с названиями и лейблами статей из PubMed. В нём 20 миллионов статей, но приведены только заголовки (без абстрактов).\n",
+    "\n",
+    "В данном ноутбуке мы используем данные и теги с arXiv."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e9874f4a-3898-4c89-a0f7-04eeabf2b389",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Models\n",
+    "\n",
+    "В качестве базовой модели мы используем BERT, натренированный на биомедицинских данных (из PubMed). \n",
+    "\n",
+    "- [BiomedNLP-PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "991e48e7-897f-45a3-8a0b-539ea67b4eb5",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2f130f05-21ee-46f9-889f-488e8c676aba",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "757a0582-1b8c-4f1c-b26f-544688e391f4",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import transformers\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from tqdm import tqdm\n",
+    "\n",
+    "import torch\n",
+    "from datasets import Dataset, ClassLabel\n",
+    "from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification\n",
+    "from transformers import TrainingArguments, Trainer\n",
+    "from transformers import pipeline\n",
+    "import evaluate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03847b87-d096-49a5-b6e2-023fa08b94c2",
+   "metadata": {},
+   "source": [
+    "# Load data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b3e902ea-4e0f-4d76-b27b-59e472b2b556",
+   "metadata": {},
+   "source": [
+    "Загрузим данные для файнтюнинга — в частности, нам понадобятся названия статей, их абстракты и теги."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "1be8f69e-bd7d-4ca9-ba9f-044b8e7bc497",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df = pd.read_json(\"arxivData.json\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "791edb3c-a96d-4042-b35d-c8097bbbef79",
+   "metadata": {},
+   "source": [
+    " "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d5b6158a-728e-4ada-bcdc-a4a49328f002",
+   "metadata": {},
+   "source": [
+    "Совместим заголовки и абстракты и сохраним текст в соответствующей колонке:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "c8709a7b-becf-4f19-8b4f-8773cd5c60f1",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df['text'] = df['title'] + \"\\n\" + df['summary']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "ed0ed687-6439-494a-a5a8-c572bc2e4059",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>author</th>\n",
+       "      <th>day</th>\n",
+       "      <th>id</th>\n",
+       "      <th>link</th>\n",
+       "      <th>month</th>\n",
+       "      <th>summary</th>\n",
+       "      <th>tag</th>\n",
+       "      <th>title</th>\n",
+       "      <th>year</th>\n",
+       "      <th>text</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>[{'name': 'Ahmed Osman'}, {'name': 'Wojciech S...</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1802.00209v1</td>\n",
+       "      <td>[{'rel': 'alternate', 'href': 'http://arxiv.or...</td>\n",
+       "      <td>2</td>\n",
+       "      <td>We propose an architecture for VQA which utili...</td>\n",
+       "      <td>[{'term': 'cs.AI', 'scheme': 'http://arxiv.org...</td>\n",
+       "      <td>Dual Recurrent Attention Units for Visual Ques...</td>\n",
+       "      <td>2018</td>\n",
+       "      <td>Dual Recurrent Attention Units for Visual Ques...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>[{'name': 'Ji Young Lee'}, {'name': 'Franck De...</td>\n",
+       "      <td>12</td>\n",
+       "      <td>1603.03827v1</td>\n",
+       "      <td>[{'rel': 'alternate', 'href': 'http://arxiv.or...</td>\n",
+       "      <td>3</td>\n",
+       "      <td>Recent approaches based on artificial neural n...</td>\n",
+       "      <td>[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...</td>\n",
+       "      <td>Sequential Short-Text Classification with Recu...</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>Sequential Short-Text Classification with Recu...</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                              author  day            id  \\\n",
+       "0  [{'name': 'Ahmed Osman'}, {'name': 'Wojciech S...    1  1802.00209v1   \n",
+       "1  [{'name': 'Ji Young Lee'}, {'name': 'Franck De...   12  1603.03827v1   \n",
+       "\n",
+       "                                                link  month  \\\n",
+       "0  [{'rel': 'alternate', 'href': 'http://arxiv.or...      2   \n",
+       "1  [{'rel': 'alternate', 'href': 'http://arxiv.or...      3   \n",
+       "\n",
+       "                                             summary  \\\n",
+       "0  We propose an architecture for VQA which utili...   \n",
+       "1  Recent approaches based on artificial neural n...   \n",
+       "\n",
+       "                                                 tag  \\\n",
+       "0  [{'term': 'cs.AI', 'scheme': 'http://arxiv.org...   \n",
+       "1  [{'term': 'cs.CL', 'scheme': 'http://arxiv.org...   \n",
+       "\n",
+       "                                               title  year  \\\n",
+       "0  Dual Recurrent Attention Units for Visual Ques...  2018   \n",
+       "1  Sequential Short-Text Classification with Recu...  2016   \n",
+       "\n",
+       "                                                text  \n",
+       "0  Dual Recurrent Attention Units for Visual Ques...  \n",
+       "1  Sequential Short-Text Classification with Recu...  "
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "df.head(2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce1de806-a4d2-4e58-a3a8-f3542392f22e",
+   "metadata": {},
+   "source": [
+    "## Labels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5183517-8b02-47bc-812a-415b5651e07d",
+   "metadata": {},
+   "source": [
+    "Будем использовать категории из arXiv'а, такие как `astro-ph` для статей по астрофизике или `cs.CV` для computer vision (computer science)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "ba4e7197-23b6-4cb4-9b44-620c6b730eb7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total: 126 labels such as adap-org, astro-ph, ..., stat.OT\n"
+     ]
+    }
+   ],
+   "source": [
+    "df['category'] = [eval(i)[0]['term'].strip() for i in df['tag']]\n",
+    "categories = np.unique(df['category'])\n",
+    "num_labels = len(categories)\n",
+    "print(f\"Total: {num_labels} labels such as {categories[0]}, {categories[1]}, ..., {categories[-1]}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "1508a6d9-856d-4ecf-a0f3-895d3ffbe99b",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>category</th>\n",
+       "      <th>category_index</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>adap-org</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>astro-ph</td>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>astro-ph.CO</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>astro-ph.EP</td>\n",
+       "      <td>3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>astro-ph.GA</td>\n",
+       "      <td>4</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "      category  category_index\n",
+       "0     adap-org               0\n",
+       "1     astro-ph               1\n",
+       "2  astro-ph.CO               2\n",
+       "3  astro-ph.EP               3\n",
+       "4  astro-ph.GA               4"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pd.DataFrame({\n",
+    "    \"category\": categories,\n",
+    "    \"category_index\": np.arange(num_labels),\n",
+    "}).head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "5c082c3a-7b0e-4320-b62d-f75a6c9f2398",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "df = pd.DataFrame({\n",
+    "    \"category\": categories,\n",
+    "    \"category_index\": np.arange(num_labels),\n",
+    "}).set_index(\"category\").join(df.set_index(\"category\"), how=\"right\", sort=False).reset_index()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76d8ccb9-a993-4d82-9dd3-689380e92e55",
+   "metadata": {},
+   "source": [
+    "# Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "a0c154f7-d2fa-46a1-8b69-57174bf00632",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
+    "print(device)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bf6513d-664d-4b94-8b05-7e8df205e3ec",
+   "metadata": {},
+   "source": [
+    "Токенайзер (название + абстракт -> токены):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "12fa49a7-2ac5-4f78-84fe-93305926692e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "tokenizer = AutoTokenizer.from_pretrained(\"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ea1b4e5-9067-4292-ba12-8f560bbf26fd",
+   "metadata": {},
+   "source": [
+    "Сама модель, в которой `AutoModelForSequenceClassification` заменит голову для задачи классификации:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "d6eb92bc-c293-47ad-b9cc-2a63e8f1de69",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']\n",
+      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
+      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
+      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+     ]
+    }
+   ],
+   "source": [
+    "model = AutoModelForSequenceClassification.from_pretrained(\"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\", num_labels=num_labels).to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "f5c79846-e6fc-42c0-bb8d-949678f5e60a",
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "BertForSequenceClassification(\n",
+      "  (bert): BertModel(\n",
+      "    (embeddings): BertEmbeddings(\n",
+      "      (word_embeddings): Embedding(30522, 768, padding_idx=0)\n",
+      "      (position_embeddings): Embedding(512, 768)\n",
+      "      (token_type_embeddings): Embedding(2, 768)\n",
+      "      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+      "      (dropout): Dropout(p=0.1, inplace=False)\n",
+      "    )\n",
+      "    (encoder): BertEncoder(\n",
+      "      (layer): ModuleList(\n",
+      "        (0-11): 12 x BertLayer(\n",
+      "          (attention): BertAttention(\n",
+      "            (self): BertSelfAttention(\n",
+      "              (query): Linear(in_features=768, out_features=768, bias=True)\n",
+      "              (key): Linear(in_features=768, out_features=768, bias=True)\n",
+      "              (value): Linear(in_features=768, out_features=768, bias=True)\n",
+      "              (dropout): Dropout(p=0.1, inplace=False)\n",
+      "            )\n",
+      "            (output): BertSelfOutput(\n",
+      "              (dense): Linear(in_features=768, out_features=768, bias=True)\n",
+      "              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+      "              (dropout): Dropout(p=0.1, inplace=False)\n",
+      "            )\n",
+      "          )\n",
+      "          (intermediate): BertIntermediate(\n",
+      "            (dense): Linear(in_features=768, out_features=3072, bias=True)\n",
+      "            (intermediate_act_fn): GELUActivation()\n",
+      "          )\n",
+      "          (output): BertOutput(\n",
+      "            (dense): Linear(in_features=3072, out_features=768, bias=True)\n",
+      "            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)\n",
+      "            (dropout): Dropout(p=0.1, inplace=False)\n",
+      "          )\n",
+      "        )\n",
+      "      )\n",
+      "    )\n",
+      "    (pooler): BertPooler(\n",
+      "      (dense): Linear(in_features=768, out_features=768, bias=True)\n",
+      "      (activation): Tanh()\n",
+      "    )\n",
+      "  )\n",
+      "  (dropout): Dropout(p=0.1, inplace=False)\n",
+      "  (classifier): Linear(in_features=768, out_features=126, bias=True)\n",
+      ")\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ce6eefc-91ce-4486-9568-b686d04adcc7",
+   "metadata": {},
+   "source": [
+    "# Training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "71add72c-eafb-491a-8820-31ce7336524f",
+   "metadata": {},
+   "source": [
+    "## Data Loaders"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a0b579c-998a-4d2e-bf0e-d4c7406d22da",
+   "metadata": {},
+   "source": [
+    "Для работы с `transformers`, возможно, будет удобнее использовать библиотеку `datasets` для работы с данными."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "47b0e14a-866b-49ac-8b95-49a91a0bcc22",
+   "metadata": {},
+   "source": [
+    "Создадим (hugging face) [датасет](https://huggingface.co/docs/datasets/tabular_load#pandas-dataframes):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "dc1a3f33-0ef9-43c9-ab5f-eb9ae304b897",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "np.random.seed(42)\n",
+    "train_indices = np.sort(np.random.choice(np.arange(len(df)), size=37_000, replace=False))\n",
+    "test_indices = np.array([i for i in np.arange(len(df)) if i not in train_indices])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "d948f8a6-1a7a-4baa-88a0-418596a1f275",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "train_df = df.loc[:,[\"text\", \"category\"]].iloc[train_indices]\n",
+    "test_df = df.loc[:,[\"text\", \"category\"]].iloc[test_indices]\n",
+    "\n",
+    "train_ds = Dataset.from_pandas(train_df, split=\"train\")\n",
+    "test_ds = Dataset.from_pandas(test_df, split=\"test\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "50242a35-3067-41e5-8de8-f7e6a4fb6e9c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Map:   0%|          | 0/37000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Map:   0%|          | 0/4000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "def tokenize_text(row):\n",
+    "    return tokenizer(\n",
+    "        row[\"text\"],\n",
+    "        max_length=512,\n",
+    "        truncation=True,\n",
+    "        padding='max_length',\n",
+    "    )\n",
+    "\n",
+    "train_ds = train_ds.map(tokenize_text, batched=True)\n",
+    "test_ds = test_ds.map(tokenize_text, batched=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "id": "35d454d1-fbdc-4847-8b60-4c6c442364b1",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Map:   0%|          | 0/37000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Map:   0%|          | 0/4000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Casting the dataset:   0%|          | 0/37000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Casting the dataset:   0%|          | 0/4000 [00:00<?, ? examples/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "labels_map = ClassLabel(num_classes=num_labels, names=list(categories))\n",
+    "\n",
+    "def transform_labels(row):\n",
+    "    # default name for a label (label or label_ids)\n",
+    "    return {\"label\": labels_map.str2int(row[\"category\"])}\n",
+    "\n",
+    "# OR: \n",
+    "# \n",
+    "# labels_map = pd.Series(\n",
+    "#     np.arange(num_labels),\n",
+    "#     index=categories,\n",
+    "# )\n",
+    "# \n",
+    "# def transform_labels(row):\n",
+    "#     return {\"label\": labels_map[row[\"category\"]]}\n",
+    "\n",
+    "train_ds = train_ds.map(transform_labels, batched=True)\n",
+    "test_ds = test_ds.map(transform_labels, batched=True)\n",
+    "\n",
+    "train_ds = train_ds.cast_column('label', labels_map)\n",
+    "test_ds = test_ds.cast_column('label', labels_map)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f3862ef-ed78-461f-ba68-8f059f01d355",
+   "metadata": {},
+   "source": [
+    " "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "811c5fe3-218e-4187-878d-65abc157f802",
+   "metadata": {},
+   "source": [
+    "## Prepare training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 110,
+   "id": "d2160c7d-4130-47ae-9d6d-6684e4ba7e9b",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']\n",
+      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
+      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
+      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
+      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+     ]
+    }
+   ],
+   "source": [
+    "model = AutoModelForSequenceClassification.from_pretrained(\n",
+    "    \"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\", \n",
+    "    num_labels=num_labels,\n",
+    "    id2label={i:labels_map.names[i] for i in range(len(categories))},\n",
+    "    label2id={labels_map.names[i]:i for i in range(len(categories))},\n",
+    ").to(device)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 111,
+   "id": "72e74c2b-89d7-4c17-8df1-dcfd40ead01e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "tokenizer = AutoTokenizer.from_pretrained(\"microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebb91037-fbdf-4453-87de-6da5eec3304f",
+   "metadata": {},
+   "source": [
+    "Будем вычислять accuracy:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 112,
+   "id": "630f6fa5-4c53-4962-b36d-5ee9aad6e29d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "metric = evaluate.load(\"accuracy\")\n",
+    "\n",
+    "def compute_metrics(eval_pred):\n",
+    "    logits, labels = eval_pred\n",
+    "    predictions = np.argmax(logits, axis=-1)\n",
+    "    return metric.compute(predictions=predictions, references=labels)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 113,
+   "id": "f64425b7-72b7-466a-8e3e-cd7624893139",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "training_args = TrainingArguments(\n",
+    "    output_dir=\"bert-paper-classifier-arxiv\", \n",
+    "    evaluation_strategy=\"epoch\",\n",
+    "    per_device_train_batch_size=64,\n",
+    "    num_train_epochs=10,\n",
+    "    logging_steps=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "id": "b850cd9b-eb36-40ec-8cf2-26206fedcf27",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "trainer = Trainer(\n",
+    "    model=model,\n",
+    "    args=training_args,\n",
+    "    train_dataset=train_ds,\n",
+    "    eval_dataset=test_ds,\n",
+    "    compute_metrics=compute_metrics,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e6b88166-d82e-4502-acef-494fbb206d30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ed8c94a-e3ef-47f9-96a8-c112eb7f11bc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Convert to a python file and run training:\n",
+    "#! jupyter nbconvert finetuning-arxiv.ipynb --to python"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc8dad7d-8105-4f37-9087-615314c35afb",
+   "metadata": {},
+   "source": [
+    "# Save and share"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "38d24722-d5c6-40ac-b568-3cd7fd9f225e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "trainer.args.hub_model_id = \"bert-paper-classifier-arxiv\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "9530790c-bc63-48f4-9a01-8c534fa90e00",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('bert-paper-classifier/tokenizer_config.json',\n",
+       " 'bert-paper-classifier/special_tokens_map.json',\n",
+       " 'bert-paper-classifier/vocab.txt',\n",
+       " 'bert-paper-classifier/added_tokens.json',\n",
+       " 'bert-paper-classifier/tokenizer.json')"
+      ]
+     },
+     "execution_count": 50,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "tokenizer.save_pretrained(\"bert-paper-classifier-arxiv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 116,
+   "id": "0498df97-cd2c-4732-9d07-ee2013f8bd55",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "trainer.save_model(\"bert-paper-classifier-arxiv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7af12b9e-0d77-48ec-af6f-38556e13b067",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Запушим модель на HF Hub:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "5de0e91f-bc23-4413-b22e-5aa32b09ef12",
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "To https://huggingface.co/oracat/bert-paper-classifier\n",
+      "   915ccf0..862abb7  main -> main\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     ]
+    }
+   ],
+   "source": [
+    "trainer.push_to_hub()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5093aee3-106e-43e9-a9c7-413d059ebb27",
+   "metadata": {},
+   "source": [
+    " "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1a1029f-543c-409e-9aaf-35bcefe49988",
+   "metadata": {},
+   "source": [
+    "# Inference"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e7b0cd5a-2e17-49f3-b2a9-5ae4e8511969",
+   "metadata": {},
+   "source": [
+    "Теперь попробуем загрузить модель с HF Hub:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "b7fe37b9-61a9-4796-af24-092f6722cd61",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "36afc9d465f54c80ab01698f5a687388",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading (…)okenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "df18b9d22fc14a0c81e8cb557f88a848",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/225k [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "4ba2236cf89d4159bcc9740d4654b16d",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading (…)/main/tokenizer.json:   0%|          | 0.00/679k [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "cae249ea1c2946a89fffdb80ff1d7b7b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "b860284eb1ff4cb08b5c8d54ab1a33b9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading (…)lve/main/config.json:   0%|          | 0.00/6.04k [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "3607b2b6f85b49b0a03844df69077d7e",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "inference_tokenizer = AutoTokenizer.from_pretrained(\"oracat/bert-paper-classifier-arxiv\")\n",
+    "inference_model = AutoModelForSequenceClassification.from_pretrained(\"oracat/bert-paper-classifier-arxiv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "34495235-4dca-4635-b468-5b15647a6682",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "pipe = pipeline(\"text-classification\", model=inference_model, tokenizer=inference_tokenizer, top_k=None)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "052b5070-c1ee-4419-8a6d-127925c95cce",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "def top_pct(preds, threshold=.95):\n",
+    "    preds = sorted(preds, key=lambda x: -x[\"score\"])\n",
+    "    \n",
+    "    cum_score = 0\n",
+    "    for i, item in enumerate(preds):\n",
+    "        cum_score += item[\"score\"]\n",
+    "        if cum_score >= threshold:\n",
+    "            break\n",
+    "\n",
+    "    preds = preds[:(i+1)]\n",
+    "    \n",
+    "    return preds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "ed3545b6-e043-4dfb-aeb2-7559eac37f7c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "def format_predictions(preds) -> str:\n",
+    "    \"\"\"\n",
+    "    Prepare predictions and their scores for printing to the user\n",
+    "    \"\"\"\n",
+    "    out = \"\"\n",
+    "    for i, item in enumerate(preds):\n",
+    "        out += f\"{i+1}. {item['label']} (score {item['score']:.2f})\\n\"\n",
+    "    return out"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "870d593a-a298-4d55-87b0-cb2813cc1fad",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. cs.LG (score 0.88)\n",
+      "2. cs.AI (score 0.07)\n",
+      "3. cs.NE (score 0.03)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\n",
+    "    format_predictions(\n",
+    "        top_pct(\n",
+    "            pipe(\"Attention Is All You Need\\nThe dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.\")[0]\n",
+    "        )\n",
+    "    )\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "408f015e-be23-46a6-9e91-503fdccecf11",
+   "metadata": {},
+   "source": [
+    " "
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}