{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Bookworm MTL.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Ascendance of a Bookworm MTL\n",
        "\n",
        "This notebook uses a custom machine translation model to translate the Ascendance of a Bookworm WN into English.\n",
        "\n",
        "This model is in BETA. Pronouns are not reliable yet, new characters' names may be wrong, and sentence splitting is not implemented yet, so the model tends to produce one long run-on sentence. These issues will be fixed in the future.\n",
        "\n",
        "If you encounter any poorly translated sentences and want to help improve the model, see the note at the bottom of the page.\n",
        "\n",
        "To run this notebook, make sure you are using a GPU runtime and then go to\n",
        "Runtime > Run all. Once that is done, you can change the text in the translation cell and run it multiple times by clicking the run button to the left of the cell. "
      ],
      "metadata": {
        "id": "nkp0dv1zg93C"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to set up the environment\n",
        "\n",
        "!pip install transformers\n",
        "!pip install accelerate\n",
        "!pip install unidecode\n",
        "!pip install spacy\n",
        "!python -m spacy download ja_core_news_lg"
      ],
      "metadata": {
        "cellView": "form",
        "id": "nM7cmpX4hl0q"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to import python packages\n",
        "\n",
        "from functools import partial\n",
        "import torch\n",
        "from torch.cuda.amp import autocast\n",
        "from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM, NllbTokenizerFast\n",
        "import spacy\n",
        "from tqdm.notebook import tqdm\n",
        "import re\n",
        "import unidecode\n",
        "import unicodedata"
      ],
      "metadata": {
        "cellView": "form",
        "id": "mSnruJt8r3qP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to set the output language\n",
        "#@markdown This model is multilingual! Here you can set the output language.\n",
        "#@markdown It works best with English, but it can translate into other\n",
        "#@markdown languages too. A couple are listed here, but you can enter a different\n",
        "#@markdown one if you want. See pages 13-16 in [this pdf](https://arxiv.org/pdf/2207.04672.pdf)\n",
        "#@markdown for a full list of supported languages.\n",
        "\n",
        "target_language = 'eng_Latn' #@param [\"eng_Latn\", \"spa_Latn\", \"fra_Latn\", \"deu_Latn\"] {allow-input: true}"
      ],
      "metadata": {
        "cellView": "form",
        "id": "6w_HfApfhn9j"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to initialize the model\n",
        "\n",
        "DEVICE = 'cuda:0'\n",
        "model_checkpoint = \"thefrigidliquidation/nllb-200-distilled-1.3B-bookworm\"\n",
        "\n",
        "config = AutoConfig.from_pretrained(model_checkpoint)\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, src_lang=\"jpn_Jpan\", tgt_lang=target_language)\n",
        "model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, torch_dtype=torch.float16).to(DEVICE)\n",
        "\n",
        "nlp_ja = spacy.load('ja_core_news_lg')"
      ],
      "metadata": {
        "cellView": "form",
        "id": "cGnkjUgej6Uv"
      },
      "execution_count": null,
      "outputs": []
    },
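    {
      "cell_type": "code",
      "source": [
        "#@title (Optional) Run this to smoke-test the model on one sentence\n",
        "#@markdown A small optional check (not part of the original pipeline): it translates a\n",
        "#@markdown single short sentence directly with the model, so you can confirm the model\n",
        "#@markdown download and GPU setup worked before running the rest of the notebook.\n",
        "\n",
        "smoke_inputs = tokenizer(\"本が好きだ。\", return_tensors=\"pt\")\n",
        "smoke_inputs = {k: v.to(DEVICE) for k, v in smoke_inputs.items()}\n",
        "with torch.no_grad():\n",
        "    smoke_tokens = model.generate(\n",
        "        **smoke_inputs,\n",
        "        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],\n",
        "        max_new_tokens=64,\n",
        "    ).cpu()\n",
        "print(tokenizer.batch_decode(smoke_tokens, skip_special_tokens=True)[0])"
      ],
      "metadata": {
        "cellView": "form",
        "id": "smokeTest01"
      },
      "execution_count": null,
      "outputs": []
    },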
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to set up the code to do the translating\n",
        "\n",
        "DOTS_REGEX = re.compile(r\"^(?P<dots>[.…]+)。?$\")\n",
        "\n",
        "\n",
        "def char_filter(string):\n",
        "    latin = re.compile('[a-zA-Z]+')\n",
        "    for char in unicodedata.normalize('NFC', string):\n",
        "        decoded = unidecode.unidecode(char)\n",
        "        if latin.match(decoded):\n",
        "            yield char\n",
        "        else:\n",
        "            yield decoded\n",
        "\n",
        "\n",
        "def clean_string(string):\n",
        "    s = \"\".join(char_filter(string))\n",
        "    s = \"\\n\".join((x.rstrip() for x in s.splitlines()))\n",
        "    return s\n",
        "\n",
        "\n",
        "def split_lglines_sentences(nlp, text, split_on_len=200):\n",
        "    lines = text.splitlines()\n",
        "    for line in lines:\n",
        "        if len(line) < split_on_len:\n",
        "            yield line.strip()\n",
        "            continue\n",
        "        doc = nlp(line)\n",
        "        assert doc.has_annotation(\"SENT_START\")\n",
        "        spacy_sents = [str(x).strip() for x in doc.sents]\n",
        "        if len(spacy_sents) == 1:\n",
        "            yield spacy_sents[0]\n",
        "            continue\n",
        "        # spaCy's Japanese sentence splitter over-segments; merge short fragments back\n",
        "        sents = []\n",
        "        for sent in spacy_sents:\n",
        "            if (len(sent) < 4) and (len(sents) > 0) and (len(sents[-1]) == 0 or sents[-1][-1] != '.'):\n",
        "                sents[-1] += sent\n",
        "            else:\n",
        "                sents.append(sent)\n",
        "        yield from (x for x in sents if not DOTS_REGEX.match(x))\n",
        "\n",
        "\n",
        "def translate_m2m(translator, tokenizer: NllbTokenizerFast, device, pars, verbose: bool = False):\n",
        "    en_pars = []\n",
        "    pars_it = tqdm(pars, leave=False, smoothing=0.0) if verbose else pars\n",
        "    for line in pars_it:\n",
        "        if line.strip() == \"\":\n",
        "            en_pars.append(\"\")\n",
        "            continue\n",
        "        inputs = tokenizer(f\"{line}\", return_tensors=\"pt\")\n",
        "        inputs = {k: v.to(device) for (k, v) in inputs.items()}\n",
        "        generated_tokens = translator.generate(\n",
        "            **inputs,\n",
        "            forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],\n",
        "            max_new_tokens=512,\n",
        "            no_repeat_ngram_size=4,\n",
        "        ).cpu()\n",
        "        with tokenizer.as_target_tokenizer():\n",
        "            outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)\n",
        "        en_pars.extend(outputs)\n",
        "    return en_pars\n",
        "\n",
        "\n",
        "translate = partial(translate_m2m, model, tokenizer, DEVICE)\n",
        "\n",
        "\n",
        "def translate_long_text(text: str):\n",
        "    lines = split_lglines_sentences(nlp_ja, text, split_on_len=150)\n",
        "    with torch.no_grad():\n",
        "        with autocast(dtype=torch.float16):\n",
        "            en_lines = translate([clean_string(x).strip() for x in lines], verbose=True)\n",
        "            for en_line in en_lines:\n",
        "                print(en_line)"
      ],
      "metadata": {
        "cellView": "form",
        "id": "zPFc9VP0k4_y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Run this to translate the text\n",
        "\n",
        "#@markdown Enter the Japanese text into the box on the left between the three quotation marks (\"\"\").\n",
        "#@markdown Make sure there is no text on the lines containing the three quotes.\n",
        "#@markdown See the example text for an idea of the formatting required.\n",
        "\n",
        "text = \"\"\"\n",
        "本須もとす麗乃うらのは本が好きだ。\n",
        "\n",
        "心理学、宗教、歴史、地理、教育学、民俗学、数学、物理、地学、化学、生物学、芸術、体育、言語、物語……人類の知識がぎっちり詰め込まれた本を心の底から愛している。\n",
        "\n",
        "様々な知識が一冊にまとめられている本を読むと、とても得をした気分になれるし、自分がこの目で見たことがない世界を、本屋や図書館に並ぶ写真集を通して見るのも、世界が広がっていくようで陶酔できる。\n",
        "\n",
        "外国の古い物語だって、違う時代の、違う国の風習が垣間見えて趣深いし、あらゆる分野において歴史があり、それを紐解いていけば、時間を忘れるなんていつものことである。\n",
        "\n",
        "麗乃は、図書館の古い本が集められている書庫の、古い本独特の少々黴かび臭い匂いや埃っぽい匂いが好きで、図書館に行くとわざわざ書庫に入り込む。そこでゆっくりと古い匂いのする空気を吸い込み、年を経た本を見回せば、麗乃はそれだけで嬉しくなって、興奮してしまう。\n",
        "\"\"\"[1:-1]\n",
        "\n",
        "translate_long_text(text)"
      ],
      "metadata": {
        "id": "Rwv_rO9plAsj"
      },
      "execution_count": null,
      "outputs": []
    },
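    {
      "cell_type": "code",
      "source": [
        "#@title (Optional) Run this to save the translation to a file\n",
        "#@markdown A small optional helper (not part of the original pipeline): it re-runs the\n",
        "#@markdown translation on the text above and writes the result to `translation.txt`\n",
        "#@markdown (an assumed filename), which you can download from the Files sidebar on the left.\n",
        "\n",
        "out_lines = split_lglines_sentences(nlp_ja, text, split_on_len=150)\n",
        "with torch.no_grad():\n",
        "    with autocast(dtype=torch.float16):\n",
        "        out_en = translate([clean_string(x).strip() for x in out_lines], verbose=True)\n",
        "with open('translation.txt', 'w', encoding='utf-8') as f:\n",
        "    f.write('\\n'.join(out_en))\n",
        "print('Saved translation.txt')"
      ],
      "metadata": {
        "cellView": "form",
        "id": "saveTxt01"
      },
      "execution_count": null,
      "outputs": []
    },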
    {
      "cell_type": "code",
      "source": [
        "#@title Submit corrected sentences to improve the model!\n",
        "#@markdown If you encounter poorly translated sentences with wrong names or terms, please correct them!\n",
        "#@markdown You can use other translation sites (like [DeepL](https://www.deepl.com/translator))\n",
        "#@markdown to make sure the Japanese and English sentences match.\n",
        "\n",
        "#@markdown Then run this cell and message [u/thefrigidliquidation](https://www.reddit.com/user/thefrigidliquidation/)\n",
        "#@markdown on Reddit with this cell's output.\n",
        "\n",
        "import base64\n",
        "import json\n",
        "\n",
        "\n",
        "\n",
        "ja_sent = 'The Japanese sentence.' #@param {type:\"string\"}\n",
        "en_sent = 'The corrected English sentence.' #@param {type:\"string\"}\n",
        "\n",
        "df = {'translation': {'en': en_sent, 'ja': ja_sent}}\n",
        "df_json = json.dumps(df)\n",
        "\n",
        "print(base64.b64encode(df_json.encode('ascii')).decode('ascii'))\n"
      ],
      "metadata": {
        "cellView": "form",
        "id": "0yx9hnj6yBKA"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}