Spaces: Kirili4ik/chat-with-Kirill

Kirili4ik committed on Oct 25, 2021

Commit ae84b44 • 1 Parent(s): c5dc12a

init

Files changed (9)

Fine_tune_RuDialoGPT3_on_telegram_chat.ipynb +689 -0
LICENSE +21 -0
README.md +26 -28
app.py +242 -0
how-to-export-chat.jpg +0 -0
how-to-upload-json.jpg +0 -0
requirements.txt +2 -0
sample1.jpg +0 -0
sample2.jpg +0 -0

Fine_tune_RuDialoGPT3_on_telegram_chat.ipynb ADDED

	@@ -0,0 +1,689 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Fine tune RuDialoGPT3 on telegram chat",
+      "provenance": [],
+      "collapsed_sections": [
+        "uPZXtklAd0Cd",
+        "ESogNuUOEmj_",
+        "psXZnJk0Eo3J"
+      ],
+      "toc_visible": true,
+      "include_colab_link": true
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/Kirili4ik/ruDialoGpt3-finetune-colab/blob/main/Fine_tune_RuDialoGPT3_on_telegram_chat.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ocoQoLlek3cb"
+      },
+      "source": [
+        "# Fine-Tuning RuDialoGPT3 on your Telegram chat"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_ptkarFllCDr"
+      },
+      "source": [
+        "Here is ready-to-run code for fine-tuning a RuDialoGPT3 model on **your Telegram chat** using HuggingFace and PyTorch.\n",
+        "\n",
+        "The starting point is RuDialoGPT-3, which [@Grossmend](https://github.com/Grossmend) trained on Russian forums. That training took 12 days on 4x RTX 2080 Ti (2 epochs over a 32GB text corpus). The training procedure for the dialogue model is described in Grossmend's [blogpost](https://habr.com/ru/company/icl_services/blog/548244/) (in Russian).\n",
+        "\n",
+        "I built a simple pipeline and fine-tuned that model on my own exported Telegram chat (~30 MB of JSON). Getting the data out of Telegram and fine-tuning a model turns out to be very easy, so I made this notebook!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "GAB9ev-Gd8lH"
+      },
+      "source": [
+        "If you just want to try talking to my fine-tuned model, go **straight to the Inference section**."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "uPZXtklAd0Cd"
+      },
+      "source": [
+        "## Uploading your data for fine-tuning"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "VL5BXKmva2-Q"
+      },
+      "source": [
+        "# installing huggingface datasets and accelerate \n",
+        "! pip install datasets transformers[sentencepiece]\n",
+        "! pip install accelerate\n",
+        "\n",
+        "# [optional] Login to google drive to save models\n",
+        "from google.colab import drive\n",
+        "drive.mount('/content/drive')\n",
+        "\n",
+        "# [optional] Login to wandb to track model's behaviour\n",
+        "'''! pip install wandb\n",
+        "! wandb login\n",
+        "wandb.init(project=\"fine tune RuDialoGPT2 on KirArChat\")'''"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "cellView": "form",
+        "id": "Iq78W4qhrYmN"
+      },
+      "source": [
+        "#@title Imports\n",
+        "import sys\n",
+        "import re\n",
+        "import json\n",
+        "\n",
+        "from sklearn.model_selection import train_test_split\n",
+        "from tqdm import tqdm\n",
+        "\n",
+        "import torch\n",
+        "from transformers import TextDataset, DataCollatorForLanguageModeling\n",
+        "from torch.utils.data import DataLoader\n",
+        "\n",
+        "from accelerate import Accelerator\n",
+        "from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7fRNBMkYnAUV"
+      },
+      "source": [
+        "The next cell downloads the model and tokenizer from HuggingFace.\n",
+        "\n",
+        "You can start from my version or from @Grossmend's: \"Grossmend/rudialogpt3_medium_based_on_gpt2\". You can also start from any other DialoGPT trained on your language, provided it uses the same |x|y|text prefix notation."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "fn9KxEnfaxwo"
+      },
+      "source": [
+        "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+        "\n",
+        "checkpoint = \"Kirili4ik/ruDialoGpt3-medium-finetuned-telegram\"   \n",
+        "tokenizer =  AutoTokenizer.from_pretrained(checkpoint)\n",
+        "model = AutoModelForCausalLM.from_pretrained(checkpoint)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "SulpoPQxpJrK",
+        "cellView": "form"
+      },
+      "source": [
+        "#@title Utility functions\n",
+        "def get_length_param(text: str, tokenizer) -> str:\n",
+        "    \"\"\"Maps text to 1 of 4 buckets based on length after encoding.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    text: str\n",
+        "        The text to be given 1 of 4 length parameters.\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to compute the length of the text after encoding.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "\n",
+        "    Returns\n",
+        "    -------\n",
+        "    len_param: str\n",
+        "        One of four buckets: \n",
+        "        '1' for short, '2' for medium, '3' for long texts and '-' for all others. \n",
+        "    \"\"\"\n",
+        "    tokens_count = len(tokenizer.encode(text))\n",
+        "    if tokens_count <= 15:\n",
+        "        len_param = '1'\n",
+        "    elif tokens_count <= 50:\n",
+        "        len_param = '2'\n",
+        "    elif tokens_count <= 256:\n",
+        "        len_param = '3'\n",
+        "    else:\n",
+        "        len_param = '-'\n",
+        "    return len_param\n",
+        "\n",
+        "\n",
+        "def get_user_param(text: dict, machine_name_in_chat: str) -> str:\n",
+        "    \"\"\"Return '1' if the machine sent the message and '0' if a human did.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    text: Dict[..., 'from', ...]\n",
+        "        Dict containing field 'from' with the name of the user who sent the message\n",
+        "\n",
+        "    machine_name_in_chat: str\n",
+        "        Str with the name of the machine - it will be predicted\n",
+        "    \"\"\"\n",
+        "    if text['from'] == machine_name_in_chat:\n",
+        "        return '1'  # machine\n",
+        "    else:\n",
+        "        return '0'  # human\n",
+        "\n",
+        "\n",
+        "def build_text_file(data_json: list, dest_path: str, \n",
+        "                    tokenizer, machine_name_in_chat='Кирилл Гельван'):\n",
+        "    \"\"\"Create a text file for training in special format for ruDialoGPT-3.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    data_json: list of dict\n",
+        "        Messages, each a dict with 'text' (message) and 'from' (user who sent the message)\n",
+        "        \n",
+        "    dest_path: str\n",
+        "        Path of the text file to write\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to compute the length of the text after encoding.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "\n",
+        "    machine_name_in_chat: str\n",
+        "        Name of the user whose messages the model should learn to produce\n",
+        "    \"\"\"\n",
+        "    new_data = ''\n",
+        "    for i in range(len(data_json) - 1):\n",
+        "        message, next_message = data_json[i], data_json[i+1]\n",
+        "        # skip pairs where either message is empty or not plain text\n",
+        "        if message['text'] == '' or type(message['text']) != str:\n",
+        "            continue\n",
+        "        if next_message['text'] == '' or type(next_message['text']) != str:\n",
+        "            continue\n",
+        "\n",
+        "        user   = get_user_param(message, machine_name_in_chat=machine_name_in_chat)\n",
+        "        # the length bucket describes the NEXT message, i.e. the expected reply\n",
+        "        length = get_length_param(next_message['text'], tokenizer)\n",
+        "        message_text = re.sub(r\"\\n\", \". \", message['text'])\n",
+        "        new_data += f\"|{user}|{length}|{message_text}{tokenizer.eos_token}\" + \"\\n\"\n",
+        "\n",
+        "    # the context manager closes the file, so the data is flushed to disk\n",
+        "    with open(dest_path, 'w', encoding='utf-8') as f:\n",
+        "        f.write(new_data)\n",
+        "\n",
+        "\n",
+        "def load_dataset(train_path, test_path, tokenizer):\n",
+        "    \"\"\"Creates train and test PyTorch datasets and collate_fn using HuggingFace.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    train_path: str\n",
+        "        String containing path to train data\n",
+        "        \n",
+        "    test_path: str\n",
+        "        String containing path to test data\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to encode the text files.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "    \"\"\"\n",
+        "    train_dataset = TextDataset(\n",
+        "          tokenizer  = tokenizer,\n",
+        "          file_path  = train_path,\n",
+        "          block_size = 256)\n",
+        "     \n",
+        "    test_dataset = TextDataset(\n",
+        "          tokenizer  = tokenizer,\n",
+        "          file_path  = test_path,\n",
+        "          block_size = 256)   \n",
+        "    \n",
+        "    data_collator = DataCollatorForLanguageModeling(\n",
+        "        tokenizer=tokenizer, mlm=False\n",
+        "    )\n",
+        "    return train_dataset, test_dataset, data_collator"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "wS5aTe48GF_N"
+      },
+      "source": [
+        "1) Export your Telegram chat as JSON\n",
+        "\n",
+        "![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-export-chat.jpg)\n",
+        "\n",
+        "2) Upload it to Colab\n",
+        "\n",
+        "![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-upload-json.jpg)\n",
+        "\n",
+        "3) The next cell creates the train and test sets from it\n",
+        "\n",
+        "4) :tada:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "19JKNqTS2Nu7",
+        "cellView": "form"
+      },
+      "source": [
+        "#@markdown Your telegram chat json path 'ChatExport.../YourChatName.json':\n",
+        "path_to_telegram_chat_json = 'example: /content/drive/MyDrive/char27.json' #@param {type : \"string\"}\n",
+        "#@markdown Name of the user to predict by GPT-3:\n",
+        "machine_name_in_chat = 'example: Kirill Gelvan' #@param {type : \"string\"}\n",
+        "\n",
+        "\n",
+        "with open(path_to_telegram_chat_json) as f: data = json.load(f)['messages']\n",
+        "\n",
+        "# the test set is the first 10% of the chat, the train set is the remaining 90%\n",
+        "train, test = data[int(len(data)*0.1):], data[:int(len(data)*0.1)]\n",
+        "\n",
+        "# pass the chosen name through; otherwise the function's default name is used\n",
+        "build_text_file(train, 'train_dataset.txt', tokenizer, machine_name_in_chat=machine_name_in_chat)\n",
+        "build_text_file(test,  'test_dataset.txt', tokenizer, machine_name_in_chat=machine_name_in_chat)\n",
+        "\n",
+        "print(\"Train dataset length: \" + str(len(train)) + \" samples\")\n",
+        "print(\"Test dataset length: \"  + str(len(test)) + \" samples\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "qO1-aAHF6TxB"
+      },
+      "source": [
+        "# let's look at our data\n",
+        "! head -n 10 train_dataset.txt"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "J6dMhVaeIO8x"
+      },
+      "source": [
+        "Here the first number is the speaker number: '1' for GPT and '0' for the person. \n",
+        "\n",
+        "The second number is the length of the expected answer: '1' for short, '2' for medium, '3' for long texts and '-' for all others. \n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "-ty6A-qTzhya"
+      },
+      "source": [
+        "# Create PyTorch Datasets\n",
+        "train_dataset, test_dataset, data_collator = load_dataset('train_dataset.txt', 'test_dataset.txt', tokenizer)\n",
+        "\n",
+        "# Create PyTorch Dataloaders\n",
+        "train_loader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=data_collator)\n",
+        "test_loader = DataLoader(test_dataset, batch_size=2, collate_fn=data_collator)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "NWhfc7ElAbkY"
+      },
+      "source": [
+        "# this cell checks one forward pass on a single batch\n",
+        "try:\n",
+        "    batch = next(iter(train_loader))\n",
+        "    print({k: v.shape for k, v in batch.items()})\n",
+        "\n",
+        "    outputs = model(**batch)\n",
+        "except:\n",
+        "    print(\"Unexpected error:\", sys.exc_info()[0])\n",
+        "    raise"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ESogNuUOEmj_"
+      },
+      "source": [
+        "## Fine-tuning"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "mZBWIviea2-Y",
+        "cellView": "form"
+      },
+      "source": [
+        "#@title Fine-tuning params\n",
+        "num_epochs = 3 #@param {type:\"integer\"}\n",
+        "optimizer = AdamW(model.parameters(), lr=3e-5) #@param\n",
+        "save_checkpoint_path = 'example: drive/MyDrive/GPT2_checkpoint-more-data-2ep.pt' #@param {type:\"string\"}\n",
+        "\n",
+        "\n",
+        "# one optimizer step per batch, so count batches (len of the loader), not samples\n",
+        "num_training_steps = num_epochs * len(train_loader)\n",
+        "lr_scheduler = get_scheduler(\n",
+        "    \"linear\",\n",
+        "    optimizer=optimizer,\n",
+        "    num_warmup_steps=100,\n",
+        "    num_training_steps=num_training_steps\n",
+        ")\n",
+        "\n",
+        "accelerator = Accelerator()\n",
+        "train_dl, test_dl, model, optimizer = accelerator.prepare(\n",
+        "    train_loader, test_loader, model, optimizer\n",
+        ")\n",
+        "# wandb.watch(model, log=\"all\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "rEV3EcZOCOhw"
+      },
+      "source": [
+        "progress_bar = tqdm(range(num_training_steps))\n",
+        "\n",
+        "for epoch in range(num_epochs):\n",
+        "    \n",
+        "    ### TRAIN EPOCH\n",
+        "    model.train()\n",
+        "    for batch in train_dl:\n",
+        "        optimizer.zero_grad()\n",
+        "        outputs = model(**batch)\n",
+        "        loss = outputs.loss\n",
+        "        accelerator.backward(loss)\n",
+        "        \n",
+        "        # wandb.log({'train_loss':loss.item()})\n",
+        "        optimizer.step()\n",
+        "        lr_scheduler.step()\n",
+        "        progress_bar.update(1)\n",
+        "\n",
+        "    ### SAVE\n",
+        "    torch.save({\n",
+        "            'model_state_dict': model.state_dict(),\n",
+        "    }, save_checkpoint_path)\n",
+        "    \n",
+        "    ### VALIDATE ONCE\n",
+        "    cum_loss = 0\n",
+        "    model.eval()\n",
+        "    with torch.inference_mode():\n",
+        "        for batch in test_dl:\n",
+        "            outputs = model(**batch)\n",
+        "            cum_loss += float(outputs.loss.item())\n",
+        "    \n",
+        "    print(cum_loss/len(test_loader))\n",
+        "    # wandb.log({'val_mean_loss':cum_loss/len(test_loader)})"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "psXZnJk0Eo3J"
+      },
+      "source": [
+        "## Inference"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "cellView": "form",
+        "id": "3N13Nwd1axA4"
+      },
+      "source": [
+        "#@title Installs and Utility functions\n",
+        "\n",
+        "%%capture\n",
+        "# installing huggingface datasets and accelerate \n",
+        "! pip install datasets transformers[sentencepiece]\n",
+        "! pip install accelerate\n",
+        "\n",
+        "def get_length_param(text: str, tokenizer) -> str:\n",
+        "    \"\"\"Maps text to 1 of 4 buckets based on length after encoding.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    text: str\n",
+        "        The text to be given 1 of 4 length parameters.\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to compute the length of the text after encoding.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "\n",
+        "    Returns\n",
+        "    -------\n",
+        "    len_param: str\n",
+        "        One of four buckets: \n",
+        "        '1' for short, '2' for medium, '3' for long texts and '-' for all others. \n",
+        "    \"\"\"\n",
+        "    tokens_count = len(tokenizer.encode(text))\n",
+        "    if tokens_count <= 15:\n",
+        "        len_param = '1'\n",
+        "    elif tokens_count <= 50:\n",
+        "        len_param = '2'\n",
+        "    elif tokens_count <= 256:\n",
+        "        len_param = '3'\n",
+        "    else:\n",
+        "        len_param = '-'\n",
+        "    return len_param\n",
+        "\n",
+        "\n",
+        "def get_user_param(text: dict, machine_name_in_chat: str) -> str:\n",
+        "    \"\"\"Return '1' if the machine sent the message and '0' if a human did.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    text: Dict[..., 'from', ...]\n",
+        "        Dict containing field 'from' with the name of the user who sent the message\n",
+        "\n",
+        "    machine_name_in_chat: str\n",
+        "        Str with the name of the machine - it will be predicted\n",
+        "    \"\"\"\n",
+        "    if text['from'] == machine_name_in_chat:\n",
+        "        return '1'  # machine\n",
+        "    else:\n",
+        "        return '0'  # human\n",
+        "\n",
+        "\n",
+        "def build_text_file(data_json: list, dest_path: str, \n",
+        "                    tokenizer, machine_name_in_chat='Кирилл Гельван'):\n",
+        "    \"\"\"Create a text file for training in special format for ruDialoGPT-3.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    data_json: list of dict\n",
+        "        Messages, each a dict with 'text' (message) and 'from' (user who sent the message)\n",
+        "        \n",
+        "    dest_path: str\n",
+        "        Path of the text file to write\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to compute the length of the text after encoding.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "\n",
+        "    machine_name_in_chat: str\n",
+        "        Name of the user whose messages the model should learn to produce\n",
+        "    \"\"\"\n",
+        "    new_data = ''\n",
+        "    for i in range(len(data_json) - 1):\n",
+        "        message, next_message = data_json[i], data_json[i+1]\n",
+        "        # skip pairs where either message is empty or not plain text\n",
+        "        if message['text'] == '' or type(message['text']) != str:\n",
+        "            continue\n",
+        "        if next_message['text'] == '' or type(next_message['text']) != str:\n",
+        "            continue\n",
+        "\n",
+        "        user   = get_user_param(message, machine_name_in_chat=machine_name_in_chat)\n",
+        "        # the length bucket describes the NEXT message, i.e. the expected reply\n",
+        "        length = get_length_param(next_message['text'], tokenizer)\n",
+        "        message_text = re.sub(r\"\\n\", \". \", message['text'])\n",
+        "        new_data += f\"|{user}|{length}|{message_text}{tokenizer.eos_token}\" + \"\\n\"\n",
+        "\n",
+        "    # the context manager closes the file, so the data is flushed to disk\n",
+        "    with open(dest_path, 'w', encoding='utf-8') as f:\n",
+        "        f.write(new_data)\n",
+        "\n",
+        "\n",
+        "def load_dataset(train_path, test_path, tokenizer):\n",
+        "    \"\"\"Creates train and test PyTorch datasets and collate_fn using HuggingFace.\n",
+        "\n",
+        "    Parameters\n",
+        "    ----------\n",
+        "    train_path: str\n",
+        "        String containing path to train data\n",
+        "        \n",
+        "    test_path: str\n",
+        "        String containing path to test data\n",
+        "\n",
+        "    tokenizer: HuggingFace tokenizer \n",
+        "        Tokenizer used to encode the text files.\n",
+        "        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html\n",
+        "    \"\"\"\n",
+        "    train_dataset = TextDataset(\n",
+        "          tokenizer  = tokenizer,\n",
+        "          file_path  = train_path,\n",
+        "          block_size = 256)\n",
+        "     \n",
+        "    test_dataset = TextDataset(\n",
+        "          tokenizer  = tokenizer,\n",
+        "          file_path  = test_path,\n",
+        "          block_size = 256)   \n",
+        "    \n",
+        "    data_collator = DataCollatorForLanguageModeling(\n",
+        "        tokenizer=tokenizer, mlm=False\n",
+        "    )\n",
+        "    return train_dataset, test_dataset, data_collator"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "vvsSRglEA0kt"
+      },
+      "source": [
+        "import torch\n",
+        "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+        "\n",
+        "# Download checkpoint:\n",
+        "checkpoint = \"Kirili4ik/ruDialoGpt3-medium-finetuned-telegram\"   \n",
+        "tokenizer =  AutoTokenizer.from_pretrained(checkpoint)\n",
+        "model = AutoModelForCausalLM.from_pretrained(checkpoint)\n",
+        "\n",
+        "# [optional] Insert your checkpoint if needed:\n",
+        "'''from google.colab import drive\n",
+        "drive.mount('/content/drive')\n",
+        "checkpoint = torch.load('drive/MyDrive/GPT2_checkpoint.pt', map_location='cpu')\n",
+        "model.load_state_dict(checkpoint['model_state_dict'])'''\n",
+        "\n",
+        "model = model.to('cpu')\n",
+        "model.eval()\n",
+        "print()"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "MGdCxVnOhK_K"
+      },
+      "source": [
+        "### INFERENCE\n",
+        "\n",
+        "chat_history_ids = torch.zeros((1, 0), dtype=torch.int)\n",
+        "\n",
+        "while True:\n",
+        "    \n",
+        "    next_who = input(\"Whose phrase? (H/G)\\t\")  # H for Human, G for GPT\n",
+        "\n",
+        "    # In case Human\n",
+        "    if next_who == \"H\":\n",
+        "        input_user = input(\"===> Human: \")\n",
+        "        \n",
+        "        # encode the new user input, add parameters and return a tensor in Pytorch\n",
+        "        new_user_input_ids = tokenizer.encode(f\"|0|{get_length_param(input_user, tokenizer)}|\" \\\n",
+        "                                              + input_user + tokenizer.eos_token, return_tensors=\"pt\")\n",
+        "        # append the new user input tokens to the chat history\n",
+        "        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)\n",
+        "\n",
+        "    if next_who == \"G\":\n",
+        "\n",
+        "        next_len = input(\"Phrase len? 1/2/3/-\\t\")\n",
+        "        # encode the prefix for the bot's turn: speaker '1' plus the requested length bucket\n",
+        "        new_user_input_ids = tokenizer.encode(f\"|1|{next_len}|\", return_tensors=\"pt\")\n",
+        "        # append the prefix tokens to the chat history\n",
+        "        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)\n",
+        "        \n",
+        "        # print(tokenizer.decode(chat_history_ids[-1])) # uncomment to see full gpt input\n",
+        "        \n",
+        "        # save previous len\n",
+        "        input_len = chat_history_ids.shape[-1]\n",
+        "        # generate a response; you can read about the parameters at hf.co/blog/how-to-generate\n",
+        "        # (mask_token_id, unk_token_id and device are not generation options, so they are omitted)\n",
+        "        chat_history_ids = model.generate(\n",
+        "            chat_history_ids,\n",
+        "            num_return_sequences=1,                     # use for more variants, but have to print [i]\n",
+        "            max_length=512,\n",
+        "            no_repeat_ngram_size=3,\n",
+        "            do_sample=True,\n",
+        "            top_k=50,\n",
+        "            top_p=0.9,\n",
+        "            temperature = 0.6,                          # lower values make sampling closer to greedy\n",
+        "            eos_token_id=tokenizer.eos_token_id,\n",
+        "            pad_token_id=tokenizer.pad_token_id\n",
+        "        )\n",
+        "        \n",
+        "        # pretty print last output tokens from bot\n",
+        "        print(f\"===> GPT-3:  {tokenizer.decode(chat_history_ids[:, input_len:][0], skip_special_tokens=True)}\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "mjEQiv5TMjZW"
+      },
+      "source": [
+        ""
+      ],
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
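
Note on the training format: the notebook's `build_text_file` writes one line per message in the form `|speaker|length|text<eos>`, where the length bucket is computed from the next message (the expected reply). The following is a minimal sketch of that formatting, not the notebook's code. It assumes a whitespace word counter (`count_tokens`) in place of the HuggingFace tokenizer and a placeholder `EOS` string; both are illustrative stand-ins, as are the sample names in `chat`.

```python
import re

EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for len(tokenizer.encode(text))."""
    return len(text.split())


def length_bucket(text: str) -> str:
    """Same thresholds as get_length_param in the notebook."""
    n = count_tokens(text)
    if n <= 15:
        return "1"
    if n <= 50:
        return "2"
    if n <= 256:
        return "3"
    return "-"


def format_pairs(messages, machine_name):
    """Return one training line per usable (message, next message) pair."""
    lines = []
    for msg, nxt in zip(messages, messages[1:]):
        # skip pairs where either text is empty or not a plain string
        if not all(isinstance(m["text"], str) and m["text"] for m in (msg, nxt)):
            continue
        speaker = "1" if msg["from"] == machine_name else "0"
        text = re.sub(r"\n", ". ", msg["text"])
        # the bucket describes the reply, not the current message
        lines.append(f"|{speaker}|{length_bucket(nxt['text'])}|{text}{EOS}")
    return lines


chat = [
    {"from": "Alice", "text": "hi\nhow are you?"},
    {"from": "Bot", "text": "fine"},
    {"from": "Alice", "text": ["not", "a", "string"]},  # pair is skipped
]
result = format_pairs(chat, machine_name="Bot")
print(result)
```

With the real tokenizer the bucket boundaries fall at different text lengths, since subword tokens are counted rather than whitespace-separated words.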

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2021 Kirill Gelvan
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,37 +1,35 @@
----
-title: Chat With Kirill
-emoji: 🐨
-colorFrom: pink
-colorTo: indigo
-sdk: gradio
-app_file: app.py
-pinned: false
----
-# Configuration
-`title`: _string_
-Display title for the Space
-`emoji`: _string_
-Space emoji (emoji-only character allowed)
-`colorFrom`: _string_
-Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
-`colorTo`: _string_
-Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
-`sdk`: _string_
-Can be either `gradio` or `streamlit`
-`sdk_version` : _string_
-Only applicable for `streamlit` SDK.
-See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.
-`app_file`: _string_
-Path to your main application file (which contains either `gradio` or `streamlit` Python code).
-Path is relative to the root of the repository.
-`pinned`: _boolean_
-Whether the Space stays on top of your list.

+# ruDialoGpt3 Colab for fine-tuning on a Telegram chat
+This is a ready-to-use Colab tutorial for fine-tuning the ruDialoGpt3 model on your Telegram chat using HuggingFace and PyTorch.
+- 🤗 [Model page](https://huggingface.co/Kirili4ik/ruDialoGpt3-medium-finetuned-telegram)
+- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fnAVURjyZRK9VQg1Co_-SKUQnRES8l9R?usp=sharing) Colab tutorial
+I fine-tuned RuDialoGPT-3, which [@Grossmend](https://github.com/Grossmend) trained on Russian forums. The training procedure of the model for dialogue is described in Grossmend's [blogpost](https://habr.com/ru/company/icl_services/blog/548244/) (in Russian). **I have created a simple pipeline and fine-tuned that model on my own exported Telegram chat (~30 MB of JSON, 3 hours of fine-tuning).** It is in fact very easy to get the data from Telegram and fine-tune a model:
+1) Export your Telegram chat as JSON
+![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-export-chat.jpg)
+2) Upload it to Colab
+![](https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/how-to-upload-json.jpg)
+3) The code will create a dataset for you
+4) Wait a bit!
+5) :tada: (Inference and smile)
+Or you can go straight to Google Colab and play with my fine-tuned model:
+<details>
+  <summary><b>A couple of dialogue samples:</b>
+  </summary>
+  <img src="https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/sample1.jpg">
+  <img src="https://raw.githubusercontent.com/Kirili4ik/ruDialoGpt3-finetune-colab/main/sample2.jpg">
+</details>
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fnAVURjyZRK9VQg1Co_-SKUQnRES8l9R?usp=sharing#scrollTo=psXZnJk0Eo3J) Inference part
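
Note on step 3 of the README: the notebook reads the `messages` list from the exported JSON and uses the first 10% of the chat as the test set and the remaining 90% as the train set. Below is a minimal sketch of that split. The in-memory `raw` export and the helper name `split_chat` are assumptions made for illustration; a real export would be read from the uploaded file instead.

```python
import json

# Tiny in-memory stand-in for a Telegram export (hypothetical content).
raw = json.dumps(
    {"messages": [{"from": f"user{i % 2}", "text": f"msg {i}"} for i in range(20)]}
)
messages = json.loads(raw)["messages"]


def split_chat(messages, test_fraction=0.1):
    """Return (train, test): the first test_fraction of the chat is the test set."""
    cut = int(len(messages) * test_fraction)
    return messages[cut:], messages[:cut]


train, test = split_chat(messages)
print(len(train), len(test))
```

Splitting by position rather than at random keeps neighbouring messages together, which matters because each training line depends on the message that follows it.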

app.py ADDED

	@@ -0,0 +1,242 @@

+import re
+import torch
+import gradio as gr
+from transformers import AutoModelForCausalLM, AutoTokenizer
+def get_length_param(text: str, tokenizer) -> str:
+    """Maps text to 1 of 4 buckets based on length after encoding.
+    Parameters
+    ----------
+    text: str
+        The text to be given 1 of 4 length parameters.
+    tokenizer: HuggingFace tokenizer
+        Tokenizer used to compute the length of the text after encoding.
+        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html
+    Returns
+    -------
+    len_param: str
+        One of four buckets:
+        '1' for short, '2' for medium, '3' for long texts and '-' for all others.
+    """
+    tokens_count = len(tokenizer.encode(text))
+    if tokens_count <= 15:
+        len_param = '1'
+    elif tokens_count <= 50:
+        len_param = '2'
+    elif tokens_count <= 256:
+        len_param = '3'
+    else:
+        len_param = '-'
+    return len_param
+def get_user_param(text: dict, machine_name_in_chat: str) -> str:
+    """Returns '1' if the message was sent by the machine and '0' if by a human.
+    Parameters
+    ----------
+    text: Dict[..., 'from', ...]
+        Dict containing field 'from' with the name of the user who sent the message
+    machine_name_in_chat: str
+        Name of the user the model imitates; this user's messages are the ones to be predicted
+    """
+    if text['from'] == machine_name_in_chat:
+        return '1'  # machine
+    else:
+        return '0'  # human
+def build_text_file(data_json: list, dest_path: str,
+                    tokenizer, machine_name_in_chat='Кирилл Гельван'):
+    """Create a text file for training in the special format for ruDialoGPT-3.
+    Parameters
+    ----------
+    data_json: list of dict
+        Messages in chat order; each dict contains 'text' (message) and
+        'from' (user who sent the message)
+    dest_path: str
+        Path of the file to write the data to
+    tokenizer: HuggingFace tokenizer
+        Tokenizer used to compute the length of the text after encoding.
+        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html
+    machine_name_in_chat: str
+        Name of the user the model imitates
+    """
+    lines = []
+    for i in range(len(data_json) - 1):
+        message, next_message = data_json[i], data_json[i+1]
+        # skip pairs where either message is empty or is not plain text
+        if message['text'] == '' or not isinstance(message['text'], str):
+            continue
+        if next_message['text'] == '' or not isinstance(next_message['text'], str):
+            continue
+        user   = get_user_param(message, machine_name_in_chat=machine_name_in_chat)
+        length = get_length_param(next_message['text'], tokenizer)
+        message_text = re.sub(r"\n", ". ", message['text'])
+        lines.append(f"|{user}|{length}|{message_text}{tokenizer.eos_token}\n")
+    # the context manager closes the file even if writing fails
+    with open(dest_path, 'w', encoding='utf-8') as f:
+        f.write(''.join(lines))
+def load_dataset(train_path, test_path, tokenizer):
+    """Creates train and test PyTorch datasets and collate_fn using HuggingFace.
+    Parameters
+    ----------
+    train_path: str
+        String containing path to train data
+    test_path: str
+        String containing path to test data
+    tokenizer: HuggingFace tokenizer
+        Tokenizer used to compute the length of the text after encoding.
+        For more info see https://huggingface.co/transformers/main_classes/tokenizer.html
+    """
+    train_dataset = TextDataset(
+          tokenizer  = tokenizer,
+          file_path  = train_path,
+          block_size = 256)
+    test_dataset = TextDataset(
+          tokenizer  = tokenizer,
+          file_path  = test_path,
+          block_size = 256)
+    data_collator = DataCollatorForLanguageModeling(
+        tokenizer=tokenizer, mlm=False
+    )
+    return train_dataset, test_dataset, data_collator
+def chat_function(message, length_of_the_answer, who_is_next, creativity):   # model, tokenizer
+    input_user = message
+    if length_of_the_answer == 'short':
+        next_len = '1'
+    elif length_of_the_answer == 'medium':
+        next_len = '2'
+    elif length_of_the_answer == 'long':
+        next_len = '3'
+    else:
+        next_len = '-'
+    # 'G': the model (Kirill) speaks next; 'H': the human speaks next
+    if who_is_next == 'Kirill':
+        next_who = 'G'
+    else:
+        next_who = 'H'
+    history = gr.get_state() or []
+    # start from an empty history, or restore the token ids saved on the previous turn
+    chat_history_ids = torch.zeros((1, 0), dtype=torch.long) if history == [] else torch.tensor(history[-1][2], dtype=torch.long)
+    # encode the new user input, add parameters and return a tensor in PyTorch
+    if len(input_user) != 0:
+        new_user_input_ids = tokenizer.encode(f"|0|{get_length_param(input_user, tokenizer)}|" \
+                                              + input_user + tokenizer.eos_token, return_tensors="pt")
+        # append the new user input tokens to the chat history
+        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)
+    else:
+        input_user = '-'
+    if next_who == "G":
+        # encode the prefix of the model's reply: speaker '1' plus the requested length bucket
+        new_user_input_ids = tokenizer.encode(f"|1|{next_len}|", return_tensors="pt")
+        # append the reply prefix to the chat history
+        chat_history_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1)
+        print(tokenizer.decode(chat_history_ids[-1])) # debug: print the full GPT input
+        # save previous len
+        input_len = chat_history_ids.shape[-1]
+        # generate a response; PS you can read about the parameters at hf.co/blog/how-to-generate
+        chat_history_ids = model.generate(
+            chat_history_ids,
+            num_return_sequences=1,                     # use for more variants, but have to print [i]
+            max_length=512,
+            no_repeat_ngram_size=3,
+            do_sample=True,
+            top_k=50,
+            top_p=0.9,
+            temperature = float(creativity),                          # sampling temperature; must be > 0 because do_sample=True
+            mask_token_id=tokenizer.mask_token_id,
+            eos_token_id=tokenizer.eos_token_id,
+            unk_token_id=tokenizer.unk_token_id,
+            pad_token_id=tokenizer.pad_token_id,
+            device='cpu'
+        )
+        response = tokenizer.decode(chat_history_ids[:, input_len:][0], skip_special_tokens=True)
+    else:
+        response = '-'
+    history.append((input_user, response, chat_history_ids.tolist()))
+    gr.set_state(history)
+    html = "<div class='chatbot'>"
+    for user_msg, resp_msg, _ in history:
+        if user_msg != '-':
+            html += f"<div class='user_msg'>{user_msg}</div>"
+        if resp_msg != '-':
+            html += f"<div class='resp_msg'>{resp_msg}</div>"
+    html += "</div>"
+    return html
+# Download checkpoint:
+checkpoint = "Kirili4ik/ruDialoGpt3-medium-finetuned-telegram"
+tokenizer =  AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint)
+model = model.eval()
+checkbox_group = gr.inputs.CheckboxGroup(['Kirill', 'Me'], default=['Kirill'], type="value", label=None)
+inputs = gr.inputs.Textbox(lines=1, label="???")
+outputs =  gr.outputs.Textbox(label="Kirill (GPT-2):")
+title = "Chat with Kirill (in Russian)"
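+# English: "You can chat with me here. But a bot answers instead of me. Leave the message empty to let Kirill keep talking. More about the technique at the link below."
+description = "Тут можно поболтать со мной. Но вместо меня бот. Оставь message пустым, чтобы Кирилл продолжил говорить. Подробнее о технике по ссылке внизу."
+article = "<p style='text-align: center'><a href='https://github.com/Kirili4ik/ruDialoGpt3-finetune-colab'>GitHub with fine-tuning GPT-2 on your chat</a></p>"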
+examples = [
+            ["Привет, как дела?", 'medium', 'Kirill', 0.6],
+            ["Сколько тебе лет?", 'medium', 'Kirill', 0.3],
+]
+iface = gr.Interface(chat_function,
+                     [
+                         "text",
+                         gr.inputs.Radio(["short", "medium", "long"], default='medium'),
+                         gr.inputs.Radio(["Kirill", "Me"], default='Kirill'),
+                         gr.inputs.Slider(0, 1, default=0.6)
+                     ],
+                     "html",
+                     title=title, description=description, article=article, examples=examples,
+                     css= """
+                            .chatbot {display:flex;flex-direction:column}
+                            .user_msg, .resp_msg {padding:4px;margin-bottom:4px;border-radius:4px;width:80%}
+                            .user_msg {background-color:cornflowerblue;color:white;align-self:start}
+                            .resp_msg {background-color:lightgray;align-self:self-end}
+                          """,
+                     allow_screenshot=True,
+                     allow_flagging=False
+                    )
+iface.launch()
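
The prompt layout that `chat_function` assembles can be illustrated without loading the model. The sketch below is a simplification under stated assumptions: it concatenates strings, whereas `app.py` encodes each piece and concatenates token ids; the names `length_bucket`, `format_turn` and `build_prompt` are hypothetical; and `"</s>"` only stands in for `tokenizer.eos_token`. The bucket thresholds mirror `get_length_param`.

```python
from typing import List, Tuple

EOS = "</s>"  # placeholder; app.py uses tokenizer.eos_token


def length_bucket(tokens_count: int) -> str:
    """Map a token count to the length parameter (same thresholds as get_length_param)."""
    if tokens_count <= 15:
        return '1'
    if tokens_count <= 50:
        return '2'
    if tokens_count <= 256:
        return '3'
    return '-'


def format_turn(speaker: str, length: str, text: str, eos: str = EOS) -> str:
    """One finished dialogue turn: |speaker|length|text<eos>."""
    return f"|{speaker}|{length}|{text}{eos}"


def build_prompt(turns: List[Tuple[str, int, str]], reply_length: str,
                 eos: str = EOS) -> str:
    """Join past turns (speaker, token count, text) and open the model's reply.

    The trailing "|1|<length>|" has no text and no EOS: the model is expected
    to continue from there until it emits EOS.
    """
    history = "".join(
        format_turn(speaker, length_bucket(count), text, eos)
        for speaker, count, text in turns
    )
    return history + f"|1|{reply_length}|"


prompt = build_prompt([("0", 4, "Привет, как дела?")], reply_length="2")
print(prompt)
```

Here speaker `0` is the human and `1` is the model, matching `get_user_param`; the reply length `"2"` corresponds to the `medium` option in the interface.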

how-to-export-chat.jpg ADDED Viewed

how-to-upload-json.jpg ADDED Viewed

requirements.txt ADDED Viewed

	@@ -0,0 +1,2 @@


+transformers
+torch

sample1.jpg ADDED Viewed

sample2.jpg ADDED Viewed