{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Gq9-Z9DSkT14" }, "source": [ "In this notebook, I show how to fine-tune an NLLB-200 machine translation model for a new language.\n", "\n", "The new language will be [Tyvan](https://en.wikipedia.org/wiki/Tuvan_language), and I will use a Tyvan-Russian parallel corpus as the training data.\n", "\n", "I am running this notebook on Google Colab with a T4 GPU that has 15 GB of memory. If you run it elsewhere, you may want to adjust the batch size so that you avoid OOM errors while keeping the GPU well utilized." ] }, { "cell_type": "markdown", "metadata": { "id": "_iBrOtwcjnml" }, "source": [ "# 0. Preliminaries" ] }, { "cell_type": "markdown", "metadata": { "id": "CZgBsg-Xjpu8" }, "source": [ "I run this notebook in Google Colab (which is ephemeral), and to read the dataset and to write the resulting model I use Google Drive, which I mount in the cell below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "XuO8k6JIwK5c" }, "outputs": [], "source": [ "# from google.colab import drive\n", "# import os\n", "# if not os.path.exists('/gd'):\n", "# drive.mount('/gd')" ] }, { "cell_type": "code", "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Kt2LZHZ41oTC", "outputId": "488ef1dd-bd31-4519-bed5-11f29ed0b746" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Mounted at /content/drive\n" ] } ] }, { "cell_type": "code", "source": [ "import os\n", "os.chdir('/content/drive/MyDrive/MT_Research')" ], "metadata": { "id": "euW7z2h_2EQy" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "dc8NcXYHj2Zj" }, "source": [ "Installing dependencies:\n", "* `transformers`, as a neural network framework\n", "* `sentencepiece`, a backend for my tokenizer (the algorithm for converting a text into symbols from the model's
vocabulary)\n", "* `sacremoses`, a package required for text preprocessing with which NLLB models were pretrained.\n", "* `sacrebleu`, a package for evaluating translation models" ] }, { "cell_type": "code", "source": [], "metadata": { "id": "rlfoDgf-16XW" }, "execution_count": 3, "outputs": [] }, { "cell_type": "code", "source": [ "# import locale\n", "# def gpe(x=None):\n", "# return \"UTF-8\"\n", "# locale.getpreferredencoding = gpe" ], "metadata": { "id": "qPjx54id5ko8" }, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xu8BrYo292Nx", "outputId": "60fe97e6-9f0b-41d7-c9f9-9152c140b54c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.9/119.9 kB\u001b[0m \u001b[31m7.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.0/58.0 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.6/7.6 MB\u001b[0m \u001b[31m60.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m547.8/547.8 kB\u001b[0m \u001b[31m26.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m897.5/897.5 kB\u001b[0m \u001b[31m43.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m106.7/106.7 kB\u001b[0m \u001b[31m9.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m39.9/39.9 MB\u001b[0m \u001b[31m22.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.9/64.9 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m58.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", "cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.\n", "google-colab 1.0.0 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.\n", "ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.\u001b[0m\u001b[31m\n", "\u001b[0m" ] } ], "source": [ "!pip install sentencepiece transformers==4.33 datasets sacremoses sacrebleu -q" ] }, { "cell_type": "markdown", "metadata": { "id": "OqdSSIVLlCir" }, "source": [ "# 1. Exploring the data\n", "\n", "In this section, I try to understand what is the training data that I have, and how suitable it is for fine-tuning a NLLB model." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "CTyDFaZf984A" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "A6zKA3Fd-M39", "outputId": "2161868a-bcd6-4ae4-b966-d48eebcd75c9" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "(15021, 2)\n", "Index(['English', 'Yoruba'], dtype='object')\n" ] } ], "source": [ "\n", "train = pd.read_csv('train.csv')\n", "test= pd.read_csv('test.csv',)\n", "print(train.shape)\n", "print(train.columns)" ] }, { "cell_type": "code", "source": [ "pd.options.display.max_colwidth = 100" ], "metadata": { "id": "VrJAf8LlfU8S" }, "execution_count": 8, "outputs": [] }, { "cell_type": "code", "source": [ "print(test.shape)\n", "print(test.columns)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ul5Q3UON4ym4", "outputId": "52bbcbc6-3bcc-4881-8e87-d4829013bf2b" }, "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "(1669, 2)\n", "Index(['English', 'Yoruba'], dtype='object')\n" ] } ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "J6ZCvaV1lkkd" }, "outputs": [], "source": [ "# trans_df.sample(10)" ] }, { "cell_type": "code", "source": [ "train.isnull().sum()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XKQJ1VPEQ8Kt", "outputId": "24ef5ba5-7ff1-44e7-af55-b82ff55b6f39" }, "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "English 0\n", "Yoruba 0\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "3sNJergv-ap2", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d431d280-c41b-4581-d8ed-bbf15d953459" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "English 0\n", "Yoruba 0\n", "dtype: int64" ] 
}, "metadata": {}, "execution_count": 12 } ], "source": [ "test.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "nKH5EGKY-c0q" }, "outputs": [], "source": [ "# df_train = trans_df[trans_df.split=='train'].copy() # 49000 items\n", "# df_dev = trans_df[trans_df.split=='dev'].copy() # 500 items\n", "# df_test = trans_df[trans_df.split=='test'].copy() # 500 items" ] }, { "cell_type": "markdown", "source": [ "Data Cleaning" ], "metadata": { "id": "nFXYtDweKmlr" } }, { "cell_type": "markdown", "metadata": { "id": "K6qHP-DAA4YD" }, "source": [ "# 2. How well does the data fit into a NLLB tokenizer?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "2xL261VQtyLl" }, "outputs": [], "source": [ "from transformers import NllbTokenizer\n", "from tqdm.auto import tqdm, trange" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "05GfWpzKtvcz", "colab": { "base_uri": "https://localhost:8080/", "height": 272, "referenced_widgets": [ "a3cce029e48b451ab7e3b3cd6de9fbb5", "df100e6cd14344d9850db922a5e13450", "48c6fab6b4c6428abcc5208231fbe054", "caa478c43f7c4034b483a2e2697265fb", "e99da02112534d7c9a1eac3bb414ece1", "ff797d41acda4a5f930c379d86514042", "5a0058193e6e4ad19169f63a65f3bf51", "26469d725ff040279e89d6553ca51218", "ff9672ff7b154d1f811984b723c497d1", "fa16d40023fe4587986cfee853426a48", "665cc2bf3da6485aac757b3db5e0472c", "a092f5d3de9b4de183c6c24bbb94fd61", "0b3e1204ef4049e492b98f6a04f23800", "40bdb135046f4cdea1bfca440fec082b", "1fdff1ba87df48338d0d5eb3b47ec3a0", "9deec01ab0ce44d890a3195b8ae2a3ab", "45ffe8bd2c21400187966e6fa7fb1aa9", "ce35a79a4256487f9c8129cb88696fc1", "7b74fe6ab70a4dfd8ca86fc41d4aa11c", "0b3ac9ae5cec4aefbd73834d3e5c6e7e", "c999c707f6174d0f8d1d7c4488dc06eb", "2fad97512f17469aa5c6ed607e72247b", "23f7497f85ee46d5ba558b61c883b431", "2fea688288aa4b0e9ce4b9a3d26d2d6a", "917536ea52a84e1ea2e708a5786bc79d", "debb6444da7c4410b2a137fb2f2db62f", "b465d4f10c10450baa1ed660487b3929", 
"7ced34a412a74e0ea53e07e6624d3b85", "0492f2a503284dbdb105b8c443dc53b0", "fd5e94baf2e445e89e6ea77f14e22647", "f5674c11b43e4eec9964626fb3ba60e8", "e1af2035af3c4c2dba8ae5da974978e9", "08f92e34af284860a3d72c6c24899aec" ] }, "outputId": "e37a9422-4f71-4504-ccea-92e7d40729c8" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n", " warnings.warn(\n", "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", "You will be able to reuse this secret in all of your notebooks.\n", "Please note that authentication is recommended but still optional to access public models or datasets.\n", " warnings.warn(\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "sentencepiece.bpe.model: 0%| | 0.00/4.85M [00:00\n", "
\n", " | Yoruba | yor_words | yor_toks | English | eng_words | eng_toks\n",
"14800 | Ó dára láti “ya” owó láti inu o kò-owó rẹ, ṣugbọn ní gbàti o bá ṣe eyi, ri daju pe o gbá èlé lo... | [Ó, dára, láti, “, ya, ”, owó, láti, inu, o, kò, -, owó, rẹ, ,, ṣugbọn, ní, gbàti, o, bá, ṣe, ey... | [▁Ó, ▁dára, ▁láti, ▁, “, ya, ”, ▁owó, ▁láti, ▁inu, ▁o, ▁kò, -, o, wó, ▁rẹ, ,, ▁ṣugbọn, ▁ní, ▁gbà... | It is ok to “borrow” cash from your business, but when you do, charge yourself interest. | [It, is, ok, to, “, borrow, ”, cash, from, your, business, ,, but, when, you, do, ,, charge, you... | [▁It, ▁is, ▁ok, ▁to, ▁, “, bor, row, ”, ▁cash, ▁from, ▁your, ▁business, ,, ▁but, ▁when, ▁you, ▁d...\n",
"4960 | Ìwé àkọsílẹ̀ ìṣura ni: àṣùwọ̀n owó tí a ní sílẹ̀ àti owó tí a fi léde tí a fi sínu owó tí ó wà n... | [Ìwé, àkọsílè, ̣, ìṣura, ni, :, àṣùwò, ̣, n, owó, tí, a, ní, sílè, ̣, àti, owó, tí, a, fi, léde,... | [▁Ìwé, ▁àkọ, sí, lè, ̣, ▁ìṣ, ura, ▁ni, :, ▁à, ṣù, wò, ̣, n, ▁owó, ▁tí, ▁a, ▁ní, ▁sílè, ̣, ▁àti, ... | The journal entry is: Capital/Drawings Account Dr To Cash/Bank 15 | [The, journal, entry, is, :, Capital, /, Drawings, Account, Dr, To, Cash, /, Bank, 15] | [▁The, ▁journal, ▁entry, ▁is, :, ▁Capital, /, D, raw, ings, ▁Account, ▁Dr, ▁To, ▁Cash, /, Bank, ...\n",
"7556 | Moniepoint\\nTí a mọ tẹlẹ̀ sí TeamApt, Moniepoint tún jẹ ilé-ìfowópamọ́ ìnáwó kékeré tí ó ní ìwé ... | [Moniepoint, Tí, a, mọ, tẹlẹ, ̀, sí, TeamApt, ,, Moniepoint, tún, jẹ, ilé, -, ìfowópamọ, ́, ìnáw... | [▁Mon, ie, point, ▁Tí, ▁a, ▁mọ, ▁tẹlẹ, ̀, ▁sí, ▁Team, A, pt, ,, ▁Mon, ie, point, ▁tún, ▁jẹ, ▁ilé... | Moniepoint\\nFormerly known as TeamApt, Moniepoint is also a CBN-licensed microfinance bank deliv... | [Moniepoint, Formerly, known, as, TeamApt, ,, Moniepoint, is, also, a, CBN, -, licensed, microfi... | [▁Mon, ie, point, ▁For, mer, ly, ▁known, ▁as, ▁Team, A, pt, ,, ▁Mon, ie, point, ▁is, ▁also, ▁a, ...\n",
"10467 | Ìjọba le fún wọn ní àkókò nítorí èyí lè gbà tó ọdún mẹ́ta láti ṣe yọrí. | [Ìjọba, le, fún, wọn, ní, àkókò, nítorí, èyí, lè, gbà, tó, ọdún, mẹ, ́, ta, láti, ṣe, yọrí, .] | [▁Ìjọba, ▁le, ▁fún, ▁wọn, ▁ní, ▁àkókò, ▁nítorí, ▁èyí, ▁lè, ▁gbà, ▁tó, ▁ọdún, ▁mẹ, ́, ta, ▁láti, ... | The government can give them a timeline because this could take up to three years to accomplish. | [The, government, can, give, them, a, timeline, because, this, could, take, up, to, three, years... | [▁The, ▁government, ▁can, ▁give, ▁them, ▁a, ▁tim, eline, ▁because, ▁this, ▁could, ▁take, ▁up, ▁t...\n",
"313 | Iṣẹ́ rere náà wá ní ẹ̀yìn ọdún iye owó ńlá níbi tí àwọn okòwò àti àwọn oníbàárà wọn ti ní láti k... | [Iṣẹ, ́, rere, náà, wá, ní, ẹ, ̀, yìn, ọdún, iye, owó, ńlá, níbi, tí, àwọn, okòwò, àti, àwọn, on... | [▁I, ṣẹ, ́, ▁rere, ▁náà, ▁wá, ▁ní, ▁ẹ, ̀, yìn, ▁ọdún, ▁iye, ▁owó, ▁ńlá, ▁níbi, ▁tí, ▁àwọn, ▁ok, ... | The positive performance came on the back of an inflationary year where businesses and their con... | [The, positive, performance, came, on, the, back, of, an, inflationary, year, where, businesses,... | [▁The, ▁positive, ▁performance, ▁came, ▁on, ▁the, ▁back, ▁of, ▁an, ▁inf, lation, ary, ▁year, ▁wh...
\n", " \n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"smpl\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Yoruba\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"\\u00ccw\\u00e9 \\u00e0k\\u1ecds\\u00edl\\u00e8\\u0323 \\u00ec\\u1e63ura ni: \\u00e0\\u1e63\\u00f9w\\u00f2\\u0323n ow\\u00f3 t\\u00ed a n\\u00ed s\\u00edl\\u00e8\\u0323 \\u00e0ti ow\\u00f3 t\\u00ed a fi l\\u00e9de t\\u00ed a fi s\\u00ednu ow\\u00f3 t\\u00ed \\u00f3 w\\u00e0 n\\u00edl\\u00e9 \\u00ecfow\\u00f3pam\\u00f3\\u0323\",\n \"I\\u1e63\\u1eb9\\u0301 rere n\\u00e1\\u00e0 w\\u00e1 n\\u00ed \\u1eb9\\u0300y\\u00ecn \\u1ecdd\\u00fan iye ow\\u00f3 \\u0144l\\u00e1 n\\u00edbi t\\u00ed \\u00e0w\\u1ecdn ok\\u00f2w\\u00f2 \\u00e0ti \\u00e0w\\u1ecdn on\\u00edb\\u00e0\\u00e1r\\u00e0 w\\u1ecdn ti n\\u00ed l\\u00e1ti koj\\u00fa iye ow\\u00f3 \\u1ecdj\\u00e0 \\u00e0ti i\\u1e63\\u1eb9\\u0301 t\\u00f3 \\u0144 p\\u1ecd\\u0300 s\\u00ed i. 
\",\n \"Moniepoint\\nT\\u00ed a m\\u1ecd t\\u1eb9l\\u1eb9\\u0300 s\\u00ed TeamApt, Moniepoint t\\u00fan j\\u1eb9 il\\u00e9-\\u00ecfow\\u00f3pam\\u1ecd\\u0301 \\u00ecn\\u00e1w\\u00f3 k\\u00e9ker\\u00e9 t\\u00ed \\u00f3 n\\u00ed \\u00ecw\\u00e9 \\u00e0\\u1e63\\u1eb9 CBN t\\u00ed \\u00f3 p\\u00e8s\\u00e8 \\u00e0w\\u1ecdn i\\u1e63\\u1eb9 ow\\u00f3 n\\u00ed on\\u00ed-n\\u1ecd\\u0301m\\u0301b\\u00e0\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"yor_words\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"yor_toks\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"English\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"The journal entry is: Capital/Drawings Account Dr To Cash/Bank 15\",\n \"The positive performance came on the back of an inflationary year where businesses and their consumers have had to deal with the rising cost of goods and services.\",\n \"Moniepoint\\nFormerly known as TeamApt, Moniepoint is also a CBN-licensed microfinance bank delivering financial services digitally.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"eng_words\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"eng_toks\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 18 } ], "source": [ "smpl.sample(5)[['Yoruba', 'yor_words', 'yor_toks', 'English', 'eng_words', 'eng_toks']]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "EbgRYDlTuC9z", "outputId": "9b1f8e90-80f0-469e-ccbc-f24e31a61c66" }, "outputs": [ { "output_type": "execute_result", 
"data": { "text/plain": [ " eng_toks yor_toks eng_words yor_words\n", "count 5000.000000 5000.00000 5000.000000 5000.000000\n", "mean 26.342200 47.60500 22.071400 38.320000\n", "std 9.735443 22.63276 8.095913 24.581986\n", "min 1.000000 3.00000 1.000000 1.000000\n", "25% 20.000000 33.00000 17.000000 24.000000\n", "50% 26.000000 44.00000 21.000000 33.000000\n", "75% 32.000000 58.00000 27.000000 45.000000\n", "max 153.000000 297.00000 123.000000 413.000000" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "stats", "summary": "{\n \"name\": \"stats\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"eng_toks\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1754.8796634960947,\n \"min\": 1.0,\n \"max\": 5000.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 26.3422,\n 26.0,\n 5000.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"yor_toks\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1744.7443582140572,\n \"min\": 3.0,\n \"max\": 5000.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 47.605,\n 44.0,\n 5000.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"eng_words\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1757.1152600657247,\n \"min\": 1.0,\n \"max\": 5000.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 22.0714,\n 21.0,\n 5000.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"yor_words\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1743.7977550957332,\n \"min\": 1.0,\n \"max\": 5000.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 38.32,\n 33.0,\n 5000.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 19 } ], "source": [ "stats = smpl[['eng_toks', 'yor_toks', 'eng_words', 'yor_words']].applymap(len).describe()\n", "stats" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WUJQQzYDuEc5", "outputId": "ee702df0-e960-48d0-f869-95cdde55f9fe" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "1.1934992796107178\n", "1.2423016701461378\n" ] } ], "source": [ "print(stats.eng_toks['mean'] / stats.eng_words['mean'])\n", "print(stats.yor_toks['mean'] / stats.yor_words['mean'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": {
"base_uri": "https://localhost:8080/" }, "id": "iUXEaJlbuqJf", "outputId": "6961ff9c-190f-44a6-e2d4-d525c1dc7549" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "<unk> 3\n" ] } ], "source": [ "print(tokenizer.unk_token, tokenizer.unk_token_id)" ] }, { "cell_type": "markdown", "metadata": { "id": "RdLDZ9L6uGin" }, "source": [ "Good news: for both English and Yoruba, the NLLB tokenizer produces only about 1.2 tokens per word (more precisely, 1.19 and 1.24), which means that the translation quality of fine-tuned NLLB may be decent even without vocabulary extension." ] }, { "cell_type": "markdown", "metadata": { "id": "27BIJ7HGvKs-" }, "source": [ "One more check: how often does the `<unk>` token occur in the tokenizer output for Yoruba? If this happens too often, we need to fix it somehow." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 66, "referenced_widgets": [ "38a8a1febe7a4c15be012c046d943b7b", "363bc9e667054eb1b3697e627c87373f", "04ab7ee6804a4faab6eb88a91915ea88", "4506edf6e191456b833683bdcf99ebd5", "5ef9c560df0a46eb995d1ecba63ac8c5", "ec638390ce3a4a509ad29a8d8c71b63e", "301bff8171df4572bc8ec9f969110a69", "75ca7cf36acc4ef4ab29217dae170984", "e849b6a3c8bb4b73ab742c8db6ad2323", "5dee707988dc4a1b885c785f1c5fdaab", "aa8d19e5f70d4a5a98ac828a4c5eb55f" ] }, "id": "nAEe9lYNu6kv", "outputId": "0d6cda77-1cec-45ed-c445-61df348965ee" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " 0%| | 0/15021 [00:00 tp.Callable[[str], str]:\n", " non_printable_map = {\n", " ord(c): replace_by\n", " for c in (chr(i) for i in range(sys.maxunicode + 1))\n", " # same as \\p{C} in perl\n", " # see https://www.unicode.org/reports/tr44/#General_Category_Values\n", " if unicodedata.category(c) in {\"C\", \"Cc\", \"Cf\", \"Cs\", \"Co\", \"Cn\"}\n", " }\n", "\n", " def replace_non_printing_char(line) -> str:\n", " return line.translate(non_printable_map)\n", "\n", " return
replace_non_printing_char\n", "\n", "replace_nonprint = get_non_printing_char_replacer(\" \")\n", "\n", "def preproc(text):\n", " clean = mpn.normalize(text)\n", " clean = replace_nonprint(clean)\n", " # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca\n", " clean = unicodedata.normalize(\"NFKC\", clean)\n", " return clean" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 66, "referenced_widgets": [ "b358392d266c451694a78d1c9303b4fc", "160ccde1b33044c59b0cd27f81e3a358", "40fe8b64f2414c8eaffc672494f8e115", "1a265a8a39e5445183f1d73147715aa8", "e139fa59521b49f6908081473c20b9f8", "6e5db85fe0ca462da2a32b7bd32c7ec2", "f3f7440ec22c4c8483d4e7744395980b", "53e3a3e3444c41d4a3f5f1941bc0c19f", "1589724f361a42b0990b83d9e8d12aa5", "219fe916d2094b56a81087231bcd7604", "85d483f550da4b8481106489f44a794d" ] }, "id": "3MJp75LAv6Wo", "outputId": "9beb1aef-3291-4087-d00b-9ae63b7170d1" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " 0%| | 0/389 [00:00']\n" ] } ], "source": [ "tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')\n", "print(len(tokenizer))\n", "print(tokenizer.convert_ids_to_tokens([256202, 256203]))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "d02fbR_L-nCh" }, "outputs": [], "source": [ "def fix_tokenizer(tokenizer, new_lang='yor_imo'):\n", " \"\"\"\n", " Add a new language token to the tokenizer vocabulary\n", " (this should be done each time after its initialization)\n", " \"\"\"\n", " old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)\n", " tokenizer.lang_code_to_id[new_lang] = old_len-1\n", " tokenizer.id_to_lang_code[old_len-1] = new_lang\n", " # always move \"mask\" to the last position\n", " tokenizer.fairseq_tokens_to_ids[\"<mask>\"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset\n", "\n", " tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)\n", "
tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}\n", " if new_lang not in tokenizer._additional_special_tokens:\n", " tokenizer._additional_special_tokens.append(new_lang)\n", " # clear the added token encoder; otherwise a new token may end up there by mistake\n", " tokenizer.added_tokens_encoder = {}\n", " tokenizer.added_tokens_decoder = {}" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "jZ7YPnHQ-pDT" }, "outputs": [], "source": [ "fix_tokenizer(tokenizer)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ppwnJUrj-rLu", "outputId": "37ae072b-b4e8-4e36-bb41-eb3c1c581c2c" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['zul_Latn', 'yor_imo', '<mask>']\n", "[256202, 256203, 256204]\n" ] } ], "source": [ "print(tokenizer.convert_ids_to_tokens([256202, 256203, 256204])) # ['zul_Latn', 'yor_imo', '<mask>']\n", "print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'yor_imo', '<mask>'])) # [256202, 256203, 256204]\n", "# this is consistent now, wow!"
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ktO8outV-xws", "outputId": "c17e1501-1fd8-454f-8e9a-db7552253aa8" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "256203 256198\n" ] } ], "source": [ "added_token_id = tokenizer.convert_tokens_to_ids('yor_imo')\n", "similar_lang_id = tokenizer.convert_tokens_to_ids('yor_Latn')\n", "print(added_token_id, similar_lang_id)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 272, "referenced_widgets": [ "e95932f91bb6487585b1dac0e17af7b8", "b556be65181f436f995eb21e7f458b00", "114aaf3174eb4f838e50c19558383b04", "6bdba58bd25444f6b8a61e4958459662", "2c71903d089f478eac895388be3f789e", "1b69592b3eee4327a4aa87b9b1351eee", "14aa13da61c740b5ad53eaf1780f083b", "f66db39bad81474aaf634825fd88ee36", "eb86a7722b894e978400e65198a1c7f7", "e7dfbbced006476dae5c160f1d363a85", "3a3374913d8c440794587bb7a2859154", "18eb8063745a4566abee6f69eb7bc2b9", "752a364aaaff4cd3a54fb256bb8b5f65", "b1b5b3c9ed3e4aa08db37615af52a291", "518b3c406c2346d7ad279bcdc2243d70", "fb75663870b346ac930e68431c8810d8", "bba9a22a4b6c49a7b725d58fe8bf4414", "5407739c7a044787a388cbee927136d0", "c5333d0501d748d7bd99ca3f3358751d", "0223df94c49445b0abc6fc7986f8c749", "5941bc78a34c4cf5a9d22a811380c9c7", "2a50c218d62b4e279f4bd470c8ceb772", "5f6ba05a2caf43658644638b293e5dd4", "6e0c3d6f87774c528ba452c8a07ba3a1", "d7e06ccdd1bd4315a16392b34ea8bf52", "8487a78750d841cd9c402ba016be5b09", "7838363f24944d1184758a7745cdb056", "4dace0ae80cb47ce9879a94028826e35", "8fc2ec641a7d446ca9bac60c82d557dd", "87c2086cf6b14c198da5fe35b59b6dcc", "bee75c5460ea459e960e4c56af75c35d", "1b8f53ff04434cb59a4b4ab3adecfd38", "4b5a8dd1723d41a2b74e37eff62a7395" ] }, "id": "tLlwR3_R-tDL", "outputId": "40c60819-7a78-4c04-9025-1240a52b4a3f" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "config.json: 0%| | 0.00/846 
[00:00 0:\n", " model.save_pretrained(MODEL_SAVE_PATH)\n", " tokenizer.save_pretrained(MODEL_SAVE_PATH)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 430 }, "id": "xXXT9pcd_9Au", "outputId": "58658ffc-f3d0-4a85-8884-cdca6ba08e17" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pd.Series(losses).ewm(100).mean().plot();" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6MGVf4Vc_fS4" }, "outputs": [], "source": [ "def translate(text, src_lang='rus_Cyrl', tgt_lang='eng_Latn', a=16, b=1.5, max_input_length=1024, **kwargs):\n", " tokenizer.src_lang = src_lang\n", " tokenizer.tgt_lang = tgt_lang\n", " inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)\n", " result = model.generate(\n", " **inputs.to(model.device),\n", " forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),\n", " max_new_tokens=int(a + b * inputs.input_ids.shape[1]),\n", " **kwargs\n", " )\n", " #print(inputs.input_ids.shape[1], result.shape[1])\n", " return tokenizer.batch_decode(result, skip_special_tokens=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c69XqtpbAgjN", "outputId": "2b963659-10e1-4cfc-fe20-ef136aef75e8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['бир-ле черден соок үрүп тур']\n", "['откуда-то несёт холодом']\n", "['откуда-то дует холод']\n" ] } ], "source": [ "xx, yy, lang1, lang2 = get_batch_pairs(1, data=df_dev)\n", "print(xx)\n", "print(yy)\n", "model.eval()\n", "print(translate(xx[0], lang1, lang2, no_repeat_ngram_size=3, num_beams=5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aCZR50GxAiPJ", "outputId": "4815110a-b8eb-4bc5-9453-977cb14d146d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 2.3G\n", "1.0K -rw------- 1 root root 896 Sep 30 07:40 config.json\n", " 512 -rw------- 1 root root 184 Sep 30 07:40 generation_config.json\n", "2.3G -rw------- 1 root root 2.3G Sep 30 07:41 pytorch_model.bin\n", "4.7M -rw------- 1 root root 4.7M Sep 30 07:41 sentencepiece.bpe.model\n", 
"3.5K -rw------- 1 root root 3.5K Sep 30 07:41 special_tokens_map.json\n", "1.0K -rw------- 1 root root 570 Sep 30 07:41 tokenizer_config.json\n" ] } ], "source": [ "!ls -alsh $MODEL_SAVE_PATH" ] }, { "cell_type": "markdown", "source": [ "# 6. Using the model" ], "metadata": { "id": "0qubmjZNAxJB" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig\n", "from tqdm.auto import tqdm, trange" ], "metadata": { "id": "PKGZ8zuN2mV6" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "trans_df = pd.read_csv('/gd/MyDrive/datasets/nlp/tyvan/for_translator.csv')\n", "trans_df.dropna(subset=['ru', 'tyv'], inplace=True)\n", "df_train, df_devtest = train_test_split(trans_df, test_size=1000, random_state=1)\n", "df_dev, df_test = train_test_split(df_devtest, test_size=0.5, random_state=1)" ], "metadata": { "id": "hag683KM2qxZ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# this code is adapted from the Stopes repo of the NLLB team\n", "# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214\n", "\n", "import re\n", "import sys\n", "import typing as tp\n", "import unicodedata\n", "from sacremoses import MosesPunctNormalizer\n", "\n", "\n", "mpn = MosesPunctNormalizer(lang=\"en\")\n", "mpn.substitutions = [\n", " (re.compile(r), sub) for r, sub in mpn.substitutions\n", "]\n", "\n", "\n", "def get_non_printing_char_replacer(replace_by: str = \" \") -> tp.Callable[[str], str]:\n", " non_printable_map = {\n", " ord(c): replace_by\n", " for c in (chr(i) for i in range(sys.maxunicode + 1))\n", " # same as \\p{C} in perl\n", " # see https://www.unicode.org/reports/tr44/#General_Category_Values\n", " if unicodedata.category(c) in {\"C\", \"Cc\", \"Cf\", \"Cs\", \"Co\", \"Cn\"}\n", " }\n", "\n", " def 
replace_non_printing_char(line) -> str:\n", "     return line.translate(non_printable_map)\n", "\n", " return replace_non_printing_char\n", "\n", "replace_nonprint = get_non_printing_char_replacer(\" \")\n", "\n", "def preproc(text):\n", "    clean = mpn.normalize(text)\n", "    clean = replace_nonprint(clean)\n", "    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca\n", "    clean = unicodedata.normalize(\"NFKC\", clean)\n", "    return clean" ], "metadata": { "id": "0AXtm-Qf2wCR" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def fix_tokenizer(tokenizer, new_lang='tyv_Cyrl'):\n", "    \"\"\" Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) \"\"\"\n", "    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)\n", "    tokenizer.lang_code_to_id[new_lang] = old_len-1\n", "    tokenizer.id_to_lang_code[old_len-1] = new_lang\n", "    # always move \"mask\" to the last position\n", "    tokenizer.fairseq_tokens_to_ids[\"<mask>\"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset\n", "\n", "    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)\n", "    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}\n", "    if new_lang not in tokenizer._additional_special_tokens:\n", "        tokenizer._additional_special_tokens.append(new_lang)\n", "    # clear the added token encoder; otherwise a new token may end up there by mistake\n", "    tokenizer.added_tokens_encoder = {}\n", "    tokenizer.added_tokens_decoder = {}" ], "metadata": { "id": "Wwb6ck8P25ZQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "model_load_name = '/gd/MyDrive/models/nllb-rus-tyv-v1'\n", "model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()\n", "tokenizer = NllbTokenizer.from_pretrained(model_load_name)\n", "fix_tokenizer(tokenizer)" ], "metadata": { "id": "uY7nUGsX3NOM", "colab": { "base_uri": "https://localhost:8080/" }, 
"outputId": "84976f43-9775-443d-ba5e-7da564be2ed4" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ] }, { "cell_type": "code", "source": [ "def translate(text, src_lang='rus_Cyrl', tgt_lang='eng_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):\n", " tokenizer.src_lang = src_lang\n", " tokenizer.tgt_lang = tgt_lang\n", " inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)\n", " result = model.generate(\n", " **inputs.to(model.device),\n", " forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),\n", " max_new_tokens=int(a + b * inputs.input_ids.shape[1]),\n", " num_beams=num_beams,\n", " **kwargs\n", " )\n", " return tokenizer.batch_decode(result, skip_special_tokens=True)" ], "metadata": { "id": "ZIsPI6YT3UG0" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "t = \"мөңгүн үр чыткаш карарар\"\n", "print(translate(t, 'tyv_Cyrl', 'rus_Cyrl'))\n", "# ['серебро от времени чернеет']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "UJwLBH8M9XWW", "outputId": "8cd3007f-6b6e-4364-ca99-991efe0d719e" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['серебро от времени чернеет']\n" ] } ] }, { "cell_type": "code", "source": [ "translate(t, 'tyv_Cyrl', 'rus_Cyrl', do_sample=True, num_beams=1, temperature=1.5)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "o9JFXvBS9xY7", "outputId": "09a8e62c-d727-4f72-8915-bed8a0e4498c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['серебро от ходения цвета']" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "code", "source": [ "def batched_translate(texts, batch_size=16, **kwargs):\n", " 
\"\"\"Translate texts in batches of similar length\"\"\"\n", " idxs, texts2 = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))\n", " results = []\n", " for i in trange(0, len(texts2), batch_size):\n", "     results.extend(translate(texts2[i: i+batch_size], **kwargs))\n", " return [p for i, p in sorted(zip(idxs, results))]" ], "metadata": { "id": "JoWvizFCRngQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "rus_translated = batched_translate(df_dev.tyv, src_lang='tyv_Cyrl', tgt_lang='rus_Cyrl')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 49, "referenced_widgets": [ "ef092016e6e64903b967170f1865e82f", "363e0d43df564854af2e6caa053e56c5", "7f363c96320042788c50184e91c8f154", "eb8a113b5f2c4d1997adafe1ad44946e", "9c7ab640a7e64a9db9e912a315f8929e", "3173699ecb454b75be1d493c0681beca", "5efff913d4dc4631939a6642c2343cb7", "cf9bdd81f6574f5bad3f9f878a6ad847", "48d4597749b5459b93f2480064e8015e", "2b40ca2c5862407ea4dfff008dbb6649", "1743e3e64cf540ac902d423a05b94305" ] }, "id": "2-wPl4fTRlv2", "outputId": "279b6d2f-7786-4525-e72b-35780643785b" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " 0%| | 0/16 [00:00\n", "
 tyv \\\n", "row_id \n", "37635 ыяк сактып алыр \n", "42465 чиик үлетпүр \n", "27916 аяк бусту берген \n", "78401 оларны маш-салатче салыр \n", "74168 соус хоюг кылдыр хайынмаан шаанда \n", "61972 шиви \n", "8377 соондан арай боорда шимчээр \n", "109242 АКШ \n", "38832 бир-ле черден соок үрүп тур \n", "54814 быжыктырары \n", "\n",
" ru \\\n", "row_id \n", "37635 намотать на ус \n", "42465 лёгкая промышленность \n", "27916 чашка разбилась \n", "78401 выложить их на маш-салат \n", "74168 пока соус не заварится густо \n", "61972 ель \n", "8377 тащиться вслед \n", "109242 США \n", "38832 откуда-то несёт холодом \n", "54814 укрепление \n", "\n",
" tyv_translated \\\n", "row_id \n", "37635 аксынга өттүрер \n", "42465 чиик үлетпүр \n", "27916 аяк бусту берген \n", "78401 оларны маш-салатче салыр \n", "74168 соус хоюг хайынмаан шаанда \n", "61972 шиви \n", "8377 соондан чүткүүр \n", "109242 АКШ \n", "38832 бир-ле черден соок аппар чыдыр \n", "54814 шивээ \n", "\n",
" rus_translated \n", "row_id \n", "37635 твёрдо запомнить \n", "42465 лёгкая промышленность \n", "27916 чашка лопнула \n", "78401 выложить их в маш-салат \n", "74168 пока соус не сварится густо \n", "61972 еловая кобыла \n", "8377 еле двинуться вслед \n", "109242 США \n", "38832 откуда-то течёт холод \n", "54814 укрепление \n", "
\n", " \n" ] }, "metadata": {}, "execution_count": 64 } ] }, { "cell_type": "code", "source": [ "print((df_dev.ru == df_dev.rus_translated).mean())\n", "print((df_dev.tyv == df_dev.tyv_translated).mean())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uCEO4dSgJYKE", "outputId": "a1e96e0d-4760-4749-c54b-81bd272a205f" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0.26\n", "0.25\n" ] } ] }, { "cell_type": "code", "source": [ "!pip install editdistance" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9NEfm2fmJm1S", "outputId": "e41a7dd6-4b67-4bcc-ba9b-61b01f319631" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: editdistance in /usr/local/lib/python3.10/dist-packages (0.6.2)\n" ] } ] }, { "cell_type": "code", "source": [ "import editdistance\n", "\n", "def ed_similarity(text1, text2):\n", " return max(0, 1 - editdistance.eval(text1, text2) / min(len(text1), len(text2)))\n", "\n", "print(ed_similarity('кот', 'собака'))\n", "print(ed_similarity('кот', 'кит'))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NxxFrdC7JrS6", "outputId": "52a5e77e-34e4-4733-aead-6f83d14be049" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0\n", "0.6666666666666667\n" ] } ] }, { "cell_type": "code", "source": [ "pd.Series([ed_similarity(row.ru, row.rus_translated) for row in df_dev.itertuples()]).describe()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EjkvAyN9JyXg", "outputId": "6df153cb-e719-4bbf-d7aa-78311cfdb05c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "count 500.000000\n", "mean 0.516367\n", "std 0.392761\n", "min 0.000000\n", "25% 0.116013\n", "50% 0.507009\n", "75% 1.000000\n", "max 1.000000\n", "dtype: float64" ] }, "metadata": {}, 
"execution_count": 74 } ] }, { "cell_type": "code", "source": [ "pd.Series([ed_similarity(row.tyv, row.tyv_translated) for row in df_dev.itertuples()]).describe()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NXm7oczQKknv", "outputId": "ff5566f1-2f21-4991-8bc9-7e3940c83a05" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "count 500.000000\n", "mean 0.506007\n", "std 0.382357\n", "min 0.000000\n", "25% 0.111111\n", "50% 0.504902\n", "75% 0.979730\n", "max 1.000000\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 75 } ] }, { "cell_type": "code", "source": [ "df_dev.index.name = \"row_id\"" ], "metadata": { "id": "-OL-s6bK6UIE" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df_dev.to_csv(model_load_name + \"/dev_set_translated.tsv\", sep=\"\\t\")" ], "metadata": { "id": "DjmFAfB355Ss" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Evaluating another model (with extended vocabulary)" ], "metadata": { "id": "66sAzGnX6clC" } }, { "cell_type": "code", "source": [ "model_load_name = '/gd/MyDrive/models/nllb-rus-tyv-v2-extvoc'" ], "metadata": { "id": "CsCfoNc26fhi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "cfg = AutoConfig.from_pretrained(model_load_name)\n", "model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name + \"/pytorch_model_60k.bin\", config=cfg).cuda()" ], "metadata": { "id": "7Ic0aPUj6kS5" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "tokenizer = NllbTokenizer.from_pretrained(model_load_name)\n", "fix_tokenizer(tokenizer)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "x0QRieQi6pMA", "outputId": "7485184b-576d-4e99-d7a4-c709ce37c585" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Special tokens have been added in the vocabulary, make sure 
the associated word embeddings are fine-tuned or trained.\n" ] } ] }, { "cell_type": "code", "source": [ "df_dev['rus_translated2'] = [translate(t, 'tyv_Cyrl', 'rus_Cyrl')[0] for t in tqdm(df_dev.tyv)]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 49, "referenced_widgets": [ "fa02790d3658464d85ba4fb817598142", "009f9de0888749509335e97d828bd8b8", "98f0b0ff99004eccbabba4d9859e7191", "7204e6c548c9423aadbecc7656976f0d", "4d25116204ee4c5999dac8e2281a1962", "8e7acded176c4fefa122b50b0d3126c6", "a4ff5a65bd224cb5bc201da43cab714f", "31f1fa84b19e4e6bb3e0391290539334", "c2928804b48a423a8c5d85dc8fd0d624", "c344baeff36447e49c4d99cadf39b721", "6c5aee3f42cb467e904e6c70e1123b4d" ] }, "id": "5-s-VKmn7Lzk", "outputId": "a6bbec1b-a0e3-409e-8a11-c157674dc9ad" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ " 0%| | 0/500 [00:00" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "```\n", "                                  | tyv->rus | rus->tyv\n", "Model v1 (no vocabulary update):  |\n", "    no beam search                | 23.21    | 22.03\n", "    num_beams = 4                 | 24.14    | 23.41\n", "Model v2 (with vocabulary update):|\n", "    no beam search                | 24.08    | 22.50\n", "    num_beams = 4                 | 25.18    | 23.22\n", "```" ], "metadata": { "id": "ywUwR9KfP0yt" } }, { "cell_type": "code", "source": [ "df_dev.to_csv(model_load_name + \"/dev_set_translated.tsv\", sep=\"\\t\")" ], "metadata": { "id": "2JNisawc-hU4" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Here are some examples of how translation has changed:" ], "metadata": { "id": "7VVicYZO9P3U" } }, { "cell_type": "code", "source": [ "df_dev.sample(5, random_state=1)[['tyv', 'ru', 'rus_translated']]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "vf6rJm2PHcPZ", "outputId": "4accff57-b069-4e54-9f98-3f51473c308a" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " tyv \\\n", "row_id \n", "5442 транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң 
дүрүмнери \n", "57777 аъш-чем садыы \n", "104130 Бүгү чүве төнген, бойлаан. \n", "49344 фокуска кирбес \n", "28319 рекорд тургузар \n", "\n", " ru \\\n", "row_id \n", "5442 правила безопасности движения и эксплуатации транспортных средств \n", "57777 продовольственный магазин \n", "104130 Все было кончено, потеряно. \n", "49344 не попасть в фокус \n", "28319 установить рекорд \n", "\n", " rus_translated \n", "row_id \n", "5442 правила безопасности движения и эксплуатации транспортных средств \n", "57777 продовольственный магазин \n", "104130 Все было кончено, самостоятельно. \n", "49344 не попасть в фокус \n", "28319 поставить рекорд " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 57 } ] }, { "cell_type": "code", "source": [ "df_dev.sample(20, random_state=1)[[\n", " 'tyv', 'tyv_translated', 'tyv_translated2', 'tyv_translated3', 'tyv2eng',\n", " 'ru', 'rus_translated', 'rus_translated2', 'rus_translated3', 'rus2eng',\n", "]]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "rrsuxzq9-kWl", "outputId": "d975f6b8-0dba-4deb-db89-31c44d932046" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " tyv \\\n", "row_id \n", "5442 транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң дүрүмнери \n", "57777 аъш-чем садыы \n", "104130 Бүгү чүве төнген, бойлаан. \n", "49344 фокуска кирбес \n", "28319 рекорд тургузар \n", "43534 чурукту делгээр \n", "37159 ылап хөделир \n", "36993 колдуктап алгаш чоруур \n", "116009 Копривничко-Крижевачка \n", "113178 Лампаң \n", "106939 Ону кым-даа томаартып чадап каан чүве-дир. \n", "86499 Самбайлык \n", "20503 чалбыыш ыяштарны өрттендир чипкен \n", "1784 бүзүредилге \n", "34052 шын эвес адаар \n", "83751 Калдар аът \n", "68723 соора дыңнаар \n", "84267 Коңга эдипкен \n", "14467 мен силерге кашты төлээр ужурлуг мен? \n", "54719 кончуг берге байдалдар \n", "\n", " tyv_translated \\\n", "row_id \n", "5442 шимчээшкинниң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери \n", "57777 аъш-чем садыы \n", "104130 Шупту чүве доозулган, читкен. \n", "49344 илби-шидиге алзыр арга чок \n", "28319 рекорд тургузар \n", "43534 чурукту делгээр \n", "37159 ылап хөделир \n", "36993 шыңганның адаанга көдүрүп алгаш чоруур \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 Кым-даа ону оожургадып шыдаваан. \n", "86499 Самбайлык \n", "20503 чалбыыш ыяштарны чиртипкен \n", "1784 бүзүредилге \n", "34052 шын эвес адаар \n", "83751 Мухортай аът \n", "68723 дыңнаалаар \n", "84267 Коңгулуур кыңгырт дээн \n", "14467 силерге чеже өрелиг мен? 
\n", "54719 кончуг берге байдал \n", "\n", " tyv_translated2 \\\n", "row_id \n", "5442 транспорт аймаан шимчээшкининиң болгаш ажыглаарының айыыл чок чоруунуң дүрүмнери \n", "57777 аъш-чем садыы \n", "104130 Шупту чүве төнген, читкен. \n", "49344 илбиге алыспас \n", "28319 рекорд тургузар \n", "43534 чурук делгээр \n", "37159 бүзүрелдиг хөделир \n", "36993 колдук адаанга аппаар \n", "116009 Копривничко-Крижевачка \n", "113178 Лампаң \n", "106939 Ону кым-даа оожургадып шыдаваан. \n", "86499 Самбайлык \n", "20503 чалбыыш ыяштарны кыпкан \n", "1784 бүзүрээри \n", "34052 шын эвес адаар \n", "83751 Дыйлагар аът \n", "68723 дыңнаалаар \n", "84267 Куңгур кызаш дээн \n", "14467 силерге чеже хире өренир мен? \n", "54719 аажок кортканы \n", "\n", " tyv_translated3 \\\n", "row_id \n", "5442 транспорт аймаан шимчээшкининиң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери \n", "57777 аъш-чем садыы \n", "104130 Бүгү чүве бойлаан, читкен. \n", "49344 илби-шидиге туттурбас \n", "28319 рекорд тургузар \n", "43534 чурукту делгээр \n", "37159 ылап хөделир \n", "36993 колдуктап алгаш чоруур \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 Оглумну кым-даа оожургап шыдаваан. \n", "86499 Самбайлык \n", "20503 чалбырааш ыяштарны хөме апкан \n", "1784 бүзүрел \n", "34052 соора адаар \n", "83751 Мухортая аът \n", "68723 дыңнап алыр \n", "84267 Коңгурааш, кыңгырт дээн \n", "14467 силерге чеже хире өртектиг мен? \n", "54719 аажок коргары \n", "\n", " tyv2eng \\\n", "row_id \n", "5442 ң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери \n", "57777 садыы \n", "104130 -даа, читкен-даа. \n", "49344 гге күш четпес \n", "28319 г тургузар \n", "43534 ң чурукту делгээр \n", "37159 г хөделир \n", "36993 г шыгжаар \n", "116009 чко-Крижевачка \n", "113178 ң \n", "106939 кым-даа оожургап шыдаваан. 
\n", "86499 ң \n", "20503 г чалбырааш ыяштарны чиртипкен \n", "1784 \n", "34052 ң адаар \n", "83751 аът \n", "68723 дыңнаар \n", "84267 ң кыңгырткайны-ла берген \n", "14467 каш хире өрелиг мен силерге? \n", "54719 \n", "\n", " ru \\\n", "row_id \n", "5442 правила безопасности движения и эксплуатации транспортных средств \n", "57777 продовольственный магазин \n", "104130 Все было кончено, потеряно. \n", "49344 не попасть в фокус \n", "28319 установить рекорд \n", "43534 выставлять картину \n", "37159 действовать наверняка \n", "36993 нести под мышкой \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 И никто не в силах был укротить его. \n", "86499 Самбайлык \n", "20503 пламя опалило деревья \n", "1784 доверенность \n", "34052 неправильно произносить \n", "83751 Мухортая лошадь \n", "68723 прослушать \n", "84267 Звонок прозвенел \n", "14467 сколько я вам должен? \n", "54719 ужас \n", "\n", " rus_translated \\\n", "row_id \n", "5442 правила безопасности движения и эксплуатации транспортных средств \n", "57777 продовольственный магазин \n", "104130 Все было кончено, самостоятельно. \n", "49344 не попасть в фокус \n", "28319 поставить рекорд \n", "43534 выставлять картину \n", "37159 действовать аккуратно \n", "36993 нести под мышками \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 И никто не мог оглянуться на него. \n", "86499 Самбайлык \n", "20503 пламясъегло деревья \n", "1784 доверие \n", "34052 неправильно произнести \n", "83751 Верховая лошадь \n", "68723 ослышаться \n", "84267 Прикованный к конга \n", "14467 сколько я вам должен заплатить? \n", "54719 ужасы \n", "\n", " rus_translated2 \\\n", "row_id \n", "5442 правила эксплуатации транспортных средств и безопасности движения \n", "57777 продовольственный магазин \n", "104130 Все кончилось, разошёлся. 
\n", "49344 не попасться в фокусы \n", "28319 установить рекорд \n", "43534 развернуть картину \n", "37159 действовать наверняка \n", "36993 носить под мышкой \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 Никто не мог на это остановить его. \n", "86499 Самбайлык \n", "20503 пламя разъедало деревья \n", "1784 убеждение \n", "34052 неправильно произнести \n", "83751 Верхняя лошадь \n", "68723 прослушать \n", "84267 Конга поправленный \n", "14467 сколько я вам обязан? \n", "54719 ужасы \n", "\n", " rus_translated3 \\\n", "row_id \n", "5442 правила безопасности эксплуатации транспортных средств и движения \n", "57777 продовольственный магазин \n", "104130 Все кончено, кончено. \n", "49344 не попасть на фокус \n", "28319 установить рекорд \n", "43534 экспонировать картину \n", "37159 действовать наверняка \n", "36993 нести под мышкой \n", "116009 Копривничко-Крижевачка \n", "113178 Лампанг \n", "106939 Никто не мог его обуздать. \n", "86499 Самбайлык \n", "20503 пламя выжгло деревья \n", "1784 заверения \n", "34052 неправильно произносить \n", "83751 Рысистая лошадь \n", "68723 ослышаться \n", "84267 звонок исправил \n", "14467 сколько я вам должен платить? \n", "54719 чёрные условия \n", "\n", " rus2eng \n", "row_id \n", "5442 дүрүмнер транспорт херекселдерин ажыглаарының болгаш шимчээшкинниң айыыл чок чоруунуң дугайында \n", "57777 садыы \n", "104130 -ла, бүгү чүве кончилось. \n", "49344 г \n", "28319 г тургузар \n", "43534 чурукту делгередип чуруур \n", "37159 г хөделир \n", "36993 алгаш чоруур \n", "116009 вничко-Крижевачка \n", "113178 \n", "106939 -ла, чүге дээрге кым-даа ону таарыштырып шыдаваан. \n", "86499 г \n", "20503 г ыяштарны чиртип каапкан \n", "1784 \n", "34052 \n", "83751 аът \n", "68723 дыңнаар \n", "84267 берген \n", "14467 ң дээш каш хире төлээр ужурлуг мен силерге? \n", "54719 ть берге байдалдар " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 54 } ] }, { "cell_type": "code", "source": [ "cols = ['ind', 'tyv', 'ru']\n", "splits = {'train': df_train[df_train.index<=49_454], 'test': df_test, 'dev': df_dev}\n", "df_joint = []\n", "for k, v in splits.items():\n", " v = v[cols].copy()\n", " v.index.name = \"row_id\"\n", " v['split'] = k\n", " df_joint.append(v)\n", "df_joint = pd.concat(df_joint)\n", "df_joint.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gTHS_RygGLgV", "outputId": "640f2f2f-20d9-4a75-d3b8-ded11ab0b7a9" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(50000, 4)" ] }, "metadata": {}, "execution_count": 70 } ] }, { "cell_type": "code", "source": [ "df_joint.sample(5)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "id": "Yz6ShvR3HGk3", "outputId": "216b9e1d-90b7-40a7-a976-207fda948148" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " ind \\\n", "row_id \n", "314 328 \n", "4376 4390 \n", "13377 13392 \n", "91144 97279 \n", "307 321 \n", "\n", " tyv \\\n", "row_id \n", "314 Өг-бүле бүрүзү эвээш дээрге-ле 500-600 ивини азырап өстүрзүн. \n", "4376 Өрээл аяннаны берген \n", "13377 кым-бир кижи-биле силер деп чугаалажыр \n", "91144 Идээледир чемнениринде база эки чүү-даа чүве чок, ынчангаш арай эвээшти чиңер. \n", "307 Оларның аразында 14 суурда 500 четпес, а 8 суурда 250 хире чурттакчы бар. \n", "\n", " ru split \n", "row_id \n", "314 Пусть на каждую семью было хотя бы по 500-600 оленей. train \n", "4376 Комната приняла хороший вид train \n", "13377 быть на вы с кем-либо train \n", "91144 Ничего хорошего нет и в переедании, так что ешьте поменьше. dev \n", "307 Среди них в 14 селах менее 500, в восьми - менее 250 человек. train " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 76 } ] }, { "cell_type": "code", "source": [ "df_joint.to_csv(\"/gd/MyDrive/datasets/nlp/tyvan/rus_tyv_parallel_50k.tsv\", sep=\"\\t\")" ], "metadata": { "id": "9qfu9FPwGSQu" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Publishing the model to HF" ], "metadata": { "id": "Z1mVG4Gy9KYK" } }, { "cell_type": "code", "source": [ "!huggingface-cli login" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "r5PuJmlJ994_", "outputId": "ebfdfb8f-085a-43ad-ad86-dc1090ddddc3" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", " _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n", " _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", " _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n", " _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", " _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n", " \n", " A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.\n", " Setting a new token will erase the existing one.\n", " To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n", "Token: \n", "Add token as git credential? 
(Y/n) Y\n", "Token is valid (permission: write).\n", "Your token has been saved in your configured git credential helpers (store).\n", "Your token has been saved to /root/.cache/huggingface/token\n", "Login successful\n" ] } ] }, { "cell_type": "code", "source": [ "from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig" ], "metadata": { "id": "vtrBFlQp9hSb" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def fix_tokenizer(tokenizer, new_lang='tyv_Cyrl'):\n", "    \"\"\" Add a new language token to the tokenizer vocabulary (this should be done each time after the tokenizer is initialized) \"\"\"\n", "    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)\n", "    tokenizer.lang_code_to_id[new_lang] = old_len-1\n", "    tokenizer.id_to_lang_code[old_len-1] = new_lang\n", "    # always move \"mask\" to the last position\n", "    tokenizer.fairseq_tokens_to_ids[\"<mask>\"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset\n", "\n", "    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)\n", "    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}\n", "    if new_lang not in tokenizer._additional_special_tokens:\n", "        tokenizer._additional_special_tokens.append(new_lang)\n", "    # clear the added token encoder; otherwise a new token may end up there by mistake\n", "    tokenizer.added_tokens_encoder = {}\n", "    tokenizer.added_tokens_decoder = {}" ], "metadata": { "id": "RmGVeHIzFiuA" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "model_load_name = '/gd/MyDrive/models/nllb-rus-tyv-v1'\n", "model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name)\n", "tokenizer = NllbTokenizer.from_pretrained(model_load_name)\n", "fix_tokenizer(tokenizer)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "r6OR05X89a71", "outputId": "8b2a98f7-09c8-4461-9e8e-191c32d0d9d1" }, "execution_count": null, 
"outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n" ] } ] }, { "cell_type": "code", "source": [ "upload_repo = \"slone/nllb-rus-tyv-v1\"\n", "tokenizer.push_to_hub(upload_repo)\n", "model.push_to_hub(upload_repo)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 150, "referenced_widgets": [ "79d5f10ea79041c1bac7288656bca516", "ecb3db3792174c1a981da47b05bcf803", "ce3b6707fde344c2a716878a8cde3402", "cb9e3644d0fc4f3fa45efaa34cfc13c4", "57040992a43947ffa6ead81e184f7036", "5f244ff98c44422384025a25ad8fc779", "37bade299f834601b876cc1f200063fd", "aef8675be22c4b688533791c58d84dbb", "b9eb136229e949da91c0a03ad0f7f7af", "bc082bba84164c6e855253cc71801f3c", "6731bcd3e5104732b6c6b3917a5e2f90", "6cd6374c78f244548b79c11bcd089d16", "6902215d7f92404f99489865748286e3", "99389cae0f304a03819cd882f4ade40d", "ddc9b3ac87cb40b8b41a5d4bca98b2af", "310ac5268d70468a8ff4b53a938553af", "c687e45084444bc9a4bed6abe593f2ba", "9cf39baa725b4e0784c0113644efb60f", "b5bd0ea3f2ca4ee6b7a1f342274ff63a", "5a421731b36648168c7c417388219dad", "226a4c8a97ef4a359858d2cbbd13dcea", "bcaf9ac3a03d436b9322a14ce025611e" ] }, "id": "zf0U6Vgf9qu-", "outputId": "3e80561f-a03a-4551-e6ad-3b81ee8253f6" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "sentencepiece.bpe.model: 0%| | 0.00/4.85M [00:00']\n", "print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'tyv_Cyrl', '<mask>'])) # [256202, 256203, 256204]\n", "# this is consistent now, wow!" 
], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3zygd7HmFrLJ", "outputId": "e711c9d7-02b8-4886-911b-994915aac481" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['zul_Latn', '<mask>', 'tyv_Cyrl']\n", "[256202, 256204, 256203]\n" ] } ] }, { "cell_type": "markdown", "source": [ "Testing that it works" ], "metadata": { "id": "qUQCkKIzO7EB" } }, { "cell_type": "code", "source": [ "MODEL_URL = 'slone/nllb-rus-tyv-v1'\n", "model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)\n", "tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True)\n", "fix_tokenizer(tokenizer)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 246, "referenced_widgets": [ "50a0ef4daefd47b7b7b64009f358dc41", "24abf099591e4a899885de04b3fd8321", "6859e3c6c0984676b917f79b21b56dc8", "c8a6ec22038d400aa37d98c894dea92a", "f3e7dd3724cb4ad49a8bac488e245698", "dd4bc071514a47ef919e5590845b2a78", "73d6676d04a448ebbfc80102eb3a0e58", "1b0e75b61e574f678b9e3d36e0802ac3", "b17359c60aa24e029e112fddc7f28826", "90b1696cb54a4bcbba6c7402bba4b649", "7a6767005260467bbe539944b8b6e369", "1423cc55b5814e4c895bdee24201f0ec", "d545c961cdea43ad85d78b18691d72ab", "a67208795d664dcb9effab4a65e3b79e", "507cc289eb8a4fc19864d163d95ba5fc", "6e82afd16a3a4c9797fd8382c78e6046", "5aef36278ed942789d24bacd873c6f0b", "f3b8e4f9b9264146b2465c708351f65d", "3c460f721f424ac99c91dcac2fa4bbfa", "e48127d45b8e433e8d2ea1eae029c781", "cffce888c2604c8b9fe59c858603ddf9", "d2ecd924600a49908c9729c0903e35c4", "09961195b22648ea859f993290dd6dd3", "7b30ceacc6ac40a48ea32ff82b34428c", "08536b0168fe47c993102b0f46581842", "6f0bebdc37c64942923c577dada7d5b9", "41e13138581b4197a82e91cea90194a2", "ef47628e909d471781ce6f83ae88f2a9", "38d68291eac247cf85bfb1de50d8b120", "083bb600d64848589c02b9d772736a8c", "d062239d38e74048aadcc946bb974cb7", "3053da3f6f31407fb5ee0dc0402bfdb0", "ee2c90d133ed4fa69ffa5427e0e71f4e", "55593e579b654a4686a97175bc9372d2", 
"5313e9e56e3549068ae32738baf0b306", "d2b42ed99d254880b96d6bf5813cef56", "875fe414f74d4de989dcf249e1653aa6", "e620ef40b21640079d41d28b863eb013", "d1f84a81069042bdae75373de6536172", "eb5c4067edac458fbe71d8a324aa7408", "1617ee5bd47d403d874325bc0770c5b4", "9e9b0df4e1db4685b5fe64b0df76f522", "258e87e4744e4621acd3cbc64b5d2618", "8ab6cd9e36dc4da1b90a878b72504d0e", "43d03284a26148f89020a1001e226801", "19efa6d1430e4947888691bd06bc984f", "20a7f65a15be4081ba5b2848eb08a652", "5ebeb9328cb64491b80a8e6f956e74c9", "1ae9df2028994482a4a0c8ea90e94295", "14368928461046b8a478c6850a8c0309", "9b7cbfa1b1a0423a978fd657004ef995", "8284e1e6aea14607a8cb9d39d5fa8303", "491343173f17475b909aaa57a445b68b", "006504cc2cc34662a7686b5199b3fb3a", "7861e68bd3654ac580a4b70a5d9fa6f9", "9adc761447984736adc1d059197f389e", "f680c6102c2f4396ad9f0b6a5842a544", "c24bf754dca14ea6b2929391ecfb5e39", "5789aacf1599425e84eb25f6d0e1374a", "6f37a2f65b304e93bdb952fb76e4928a", "af0177c68cd9469c9f3229a12d5bbc00", "f742607592864f01befe3ac663f5e8b7", "4e32326f49d84389b198eb08ae529acd", "ed8ff220b68f46399040a0864a1434fc", "8e7b505016044ab18fffa81d80e12b24", "553cc9502a9a4ccc9c21efdcb08aa451" ] }, "id": "0obaABVcQ3N_", "outputId": "39b0b63d-7986-454e-aa11-5b997ad337d1" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "Downloading (…)lve/main/config.json: 0%| | 0.00/898 [00:00