tts-vie2

Sleeping

File size: 20,410 Bytes

33acd27

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/thinhlpg/vixtts-demo/blob/dev/viXTTS_Demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Yv6wKD48of3x"
      },
      "source": [
        "# 🔥🔥🔥**viXTTS Demo**🗣️🗣️🗣️\n",
        "\n",
        "- https://github.com/thinhlpg/vixtts-demo/\n",
        "- https://github.com/thinhlpg/viVoice\n",
        "\n",
        "Demo này giúp bạn chạy viXTTS miễn phí trên Google Colab!\n",
        "Xem thông tin mô hình tại [đây](https://huggingface.co/capleaf/viXTTS)\n",
        "\n",
        "Bạn có thể dùng demo này với mục đích:\n",
        "- ✅Mục đích cá nhân, học tập, nghiên cứu, thử nghiệm\n",
        "\n",
        "Bạn **KHÔNG** được dùng demo này với mục đích:\n",
        "- ❌Mục đích trái đạo đức, vi phạm pháp luật Việt Nam\n",
        "- ❌Tạo ra nội dung gây thù ghét, kỳ thị, bạo lực hoặc nội dung vi phạm bản quyền\n",
        "- ❌Giả mạo danh tính hoặc gây hiểu nhầm rằng nội dung được tạo ra bởi một cá nhân hoặc tổ chức khác"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "kIEFgM3gnEZm"
      },
      "outputs": [],
      "source": [
        "# @title 1. ⚙️ **Cài đặt**\n",
        "# @markdown 👈Nhấn nút này để cài đặt (~5 phút)\n",
        "# Change timezone to Vietnam\n",
        "!rm /etc/localtime\n",
        "!ln -s /usr/share/zoneinfo/Asia/Ho_Chi_Minh /etc/localtime\n",
        "!date\n",
        "\n",
        "print(\" > Cài đặt thư viện...\")\n",
        "!rm -rf TTS/\n",
        "!git clone --branch add-vietnamese-xtts -q https://github.com/thinhlpg/TTS.git\n",
        "!pip install --use-deprecated=legacy-resolver -q -e TTS\n",
        "!pip install deepspeed -q\n",
        "!pip install -q vinorm==2.0.7\n",
        "!pip install -q cutlet\n",
        "!pip install -q unidic==1.1.0\n",
        "!pip install -q underthesea\n",
        "!pip install -q gradio==4.35\n",
        "!pip install deepfilternet==0.5.6 -q\n",
        "\n",
        "import os\n",
        "from huggingface_hub import snapshot_download\n",
        "\n",
        "\n",
        "os.system(\"python -m unidic download\")\n",
        "print(\" > Tải mô hình...\")\n",
        "snapshot_download(repo_id=\"thinhlpg/viXTTS\",\n",
        "                  repo_type=\"model\",\n",
        "                  local_dir=\"model\")\n",
        "\n",
        "from IPython.display import clear_output\n",
        "clear_output()\n",
        "print(\" > ✅ Cài đặt hoàn tất, bạn hãy chạy tiếp các bước tiếp theo nhé!\")\n",
        "quit()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "oGSVcv3xuWWi"
      },
      "outputs": [],
      "source": [
        "# The inference code is adopted from https://github.com/coqui-ai/TTS/blob/dev/TTS/demos/xtts_ft_demo/xtts_demo.py\n",
        "# @title 2. 🤗 **Sử dụng**\n",
        "# @markdown 👈 Nhấn để chạy.\n",
        "# @markdown Nếu gặp lỗi thì cũng nhấn nút này nhé!\n",
        "\n",
        "# @markdown Lần đầu chạy sẽ hơi lâu, bạn chờ tí nhé!\n",
        "\n",
        "# @markdown Kết quả sẽ được lưu vào `/content/output`\n",
        "\n",
        "# @markdown Chọn ngôn ngữ:\n",
        "language = \"Tiếng Việt\" # @param [\"Tiếng Việt\", \"Tiếng Anh\",\"Tiếng Tây Ban Nha\", \"Tiếng Pháp\",\"Tiếng Đức\",\"Tiếng Ý\", \"Tiếng Bồ Đào Nha\", \"Tiếng Ba Lan\", \"Tiếng Thổ Nhĩ Kỳ\", \"Tiếng Nga\", \"Tiếng Hà Lan\", \"Tiếng Séc\", \"Tiếng Ả Rập\", \"Tiếng Trung (giản thể)\", \"Tiếng Nhật\", \"Tiếng Hungary\", \"Tiếng Hàn\", \"Tiếng Hindi\"]\n",
        "# @markdown Văn bản để đọc. Độ dài tối thiểu mỗi câu nên từ 10 từ để đặt kết quả tốt nhất.\n",
        "input_text =\"Xin chào, tôi là một công cụ có khả năng chuyển đổi văn bản thành giọng nói tự nhiên, được phát triển bởi nhóm Nón lá. Tôi có thể hổ trợ người khiếm thị,  đọc sách nói, làm trợ lý ảo, review phim, làm waifu để an ủi bạn, và phục vụ nhiều mục đích khác.\" # @param {type:\"string\"}\n",
        "# @markdown Chọn giọng mẫu:\n",
        "reference_audio = \"model/vi_sample.wav\" # @param [ \"model/user_sample.wav\",  \"model/vi_sample.wav\",  \"model/samples/nam-calm.wav\",  \"model/samples/nam-cham.wav\",  \"model/samples/nam-nhanh.wav\",  \"model/samples/nam-truyen-cam.wav\",  \"model/samples/nu-calm.wav\",  \"model/samples/nu-cham.wav\",  \"model/samples/nu-luu-loat.wav\",  \"model/samples/nu-nhan-nha.wav\",  \"model/samples/nu-nhe-nhang.wav\"]\n",
        "# @markdown Tự động chuẩn hóa chữ (VD: 20/11 -> hai mươi tháng mười một)\n",
        "normalize_text = True # @param {type:\"boolean\"}\n",
        "# @markdown In chi tiết xử lý\n",
        "verbose = True # @param {type:\"boolean\"}\n",
        "# @markdown Lưu từng câu thành file riêng lẻ.\n",
        "output_chunks = True # @param {type:\"boolean\"}\n",
        "\n",
        "from IPython.display import clear_output\n",
        "def cry_and_quit():\n",
        "    clear_output()\n",
        "    print(\"> Lỗi rồi huhu 😭😭, bạn hãy nhấn chạy lại phần này nhé!\")\n",
        "    quit()\n",
        "\n",
        "import os\n",
        "import string\n",
        "import unicodedata\n",
        "from datetime import datetime\n",
        "from pprint import pprint\n",
        "\n",
        "import torch\n",
        "import torchaudio\n",
        "from tqdm import tqdm\n",
        "from underthesea import sent_tokenize\n",
        "from unidecode import unidecode\n",
        "\n",
        "try:\n",
        "    from vinorm import TTSnorm\n",
        "    from TTS.tts.configs.xtts_config import XttsConfig\n",
        "    from TTS.tts.models.xtts import Xtts\n",
        "except:\n",
        "    cry_and_quit()\n",
        "\n",
        "# Load model\n",
        "def clear_gpu_cache():\n",
        "    if torch.cuda.is_available():\n",
        "        torch.cuda.empty_cache()\n",
        "\n",
        "\n",
        "def load_model(xtts_checkpoint, xtts_config, xtts_vocab):\n",
        "    clear_gpu_cache()\n",
        "    if not xtts_checkpoint or not xtts_config or not xtts_vocab:\n",
        "        return \"You need to run the previous steps or manually set the `XTTS checkpoint path`, `XTTS config path`, and `XTTS vocab path` fields !!\"\n",
        "    config = XttsConfig()\n",
        "    config.load_json(xtts_config)\n",
        "    XTTS_MODEL = Xtts.init_from_config(config)\n",
        "    print(\"Loading XTTS model! \")\n",
        "    XTTS_MODEL.load_checkpoint(config,\n",
        "                               checkpoint_path=xtts_checkpoint,\n",
        "                               vocab_path=xtts_vocab,\n",
        "                               use_deepspeed=True)\n",
        "    if torch.cuda.is_available():\n",
        "        XTTS_MODEL.cuda()\n",
        "\n",
        "    print(\"Model Loaded!\")\n",
        "    return XTTS_MODEL\n",
        "\n",
        "\n",
        "def get_file_name(text, max_char=50):\n",
        "    filename = text[:max_char]\n",
        "    filename = filename.lower()\n",
        "    filename = filename.replace(\" \", \"_\")\n",
        "    filename = filename.translate(str.maketrans(\"\", \"\", string.punctuation.replace(\"_\", \"\")))\n",
        "    filename = unidecode(filename)\n",
        "    current_datetime = datetime.now().strftime(\"%m%d%H%M%S\")\n",
        "    filename = f\"{current_datetime}_{filename}\"\n",
        "    return filename\n",
        "\n",
        "\n",
        "def calculate_keep_len(text, lang):\n",
        "    if lang in [\"ja\", \"zh-cn\"]:\n",
        "        return -1\n",
        "\n",
        "    word_count = len(text.split())\n",
        "    num_punct = (\n",
        "        text.count(\".\")\n",
        "        + text.count(\"!\")\n",
        "        + text.count(\"?\")\n",
        "        + text.count(\",\")\n",
        "    )\n",
        "\n",
        "    if word_count < 5:\n",
        "        return 15000 * word_count + 2000 * num_punct\n",
        "    elif word_count < 10:\n",
        "        return 13000 * word_count + 2000 * num_punct\n",
        "    return -1\n",
        "\n",
        "\n",
        "def normalize_vietnamese_text(text):\n",
        "    text = (\n",
        "        TTSnorm(text, unknown=False, lower=False, rule=True)\n",
        "        .replace(\"..\", \".\")\n",
        "        .replace(\"!.\", \"!\")\n",
        "        .replace(\"?.\", \"?\")\n",
        "        .replace(\" .\", \".\")\n",
        "        .replace(\" ,\", \",\")\n",
        "        .replace('\"', \"\")\n",
        "        .replace(\"'\", \"\")\n",
        "        .replace(\"AI\", \"Ây Ai\")\n",
        "        .replace(\"A.I\", \"Ây Ai\")\n",
        "    )\n",
        "    return text\n",
        "\n",
        "\n",
        "def run_tts(XTTS_MODEL, lang, tts_text, speaker_audio_file,\n",
        "            normalize_text= True,\n",
        "            verbose=False,\n",
        "            output_chunks=False):\n",
        "    \"\"\"\n",
        "    Run text-to-speech (TTS) synthesis using the provided XTTS_MODEL.\n",
        "\n",
        "    Args:\n",
        "        XTTS_MODEL: A pre-trained TTS model.\n",
        "        lang (str): The language of the input text.\n",
        "        tts_text (str): The text to be synthesized into speech.\n",
        "        speaker_audio_file (str): Path to the audio file of the speaker to condition the synthesis on.\n",
        "        normalize_text (bool, optional): Whether to normalize the input text. Defaults to True.\n",
        "        verbose (bool, optional): Whether to print verbose information. Defaults to False.\n",
        "        output_chunks (bool, optional): Whether to save synthesized speech chunks separately. Defaults to False.\n",
        "\n",
        "    Returns:\n",
        "        str: Path to the synthesized audio file.\n",
        "    \"\"\"\n",
        "\n",
        "    if XTTS_MODEL is None or not speaker_audio_file:\n",
        "        return \"You need to run the previous step to load the model !!\", None, None\n",
        "\n",
        "    output_dir = \"./output\"\n",
        "    os.makedirs(output_dir, exist_ok=True)\n",
        "\n",
        "    gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(\n",
        "        audio_path=speaker_audio_file,\n",
        "        gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,\n",
        "        max_ref_length=XTTS_MODEL.config.max_ref_len,\n",
        "        sound_norm_refs=XTTS_MODEL.config.sound_norm_refs,\n",
        "    )\n",
        "\n",
        "    if normalize_text and lang == \"vi\":\n",
        "        # Bug on google colab\n",
        "        try:\n",
        "            tts_text = normalize_vietnamese_text(tts_text)\n",
        "        except:\n",
        "            cry_and_quit()\n",
        "\n",
        "    if lang in [\"ja\", \"zh-cn\"]:\n",
        "        tts_texts = tts_text.split(\"。\")\n",
        "    else:\n",
        "        tts_texts = sent_tokenize(tts_text)\n",
        "\n",
        "    if verbose:\n",
        "        print(\"Text for TTS:\")\n",
        "        pprint(tts_texts)\n",
        "\n",
        "    wav_chunks = []\n",
        "    for text in tqdm(tts_texts):\n",
        "        if text.strip() == \"\":\n",
        "            continue\n",
        "\n",
        "        wav_chunk = XTTS_MODEL.inference(\n",
        "            text=text,\n",
        "            language=lang,\n",
        "            gpt_cond_latent=gpt_cond_latent,\n",
        "            speaker_embedding=speaker_embedding,\n",
        "            temperature=0.3,\n",
        "            length_penalty=1.0,\n",
        "            repetition_penalty=10.0,\n",
        "            top_k=30,\n",
        "            top_p=0.85,\n",
        "        )\n",
        "\n",
        "        # Quick hack for short sentences\n",
        "        keep_len = calculate_keep_len(text, lang)\n",
        "        wav_chunk[\"wav\"] = torch.tensor(wav_chunk[\"wav\"][:keep_len])\n",
        "\n",
        "        if output_chunks:\n",
        "            out_path = os.path.join(output_dir, f\"{get_file_name(text)}.wav\")\n",
        "            torchaudio.save(out_path, wav_chunk[\"wav\"].unsqueeze(0), 24000)\n",
        "            if verbose:\n",
        "                print(f\"Saved chunk to {out_path}\")\n",
        "\n",
        "        wav_chunks.append(wav_chunk[\"wav\"])\n",
        "\n",
        "    out_wav = torch.cat(wav_chunks, dim=0).unsqueeze(0)\n",
        "    out_path = os.path.join(output_dir, f\"{get_file_name(tts_text)}.wav\")\n",
        "    torchaudio.save(out_path, out_wav, 24000)\n",
        "\n",
        "    if verbose:\n",
        "        print(f\"Saved final file to {out_path}\")\n",
        "\n",
        "    return out_path\n",
        "\n",
        "\n",
        "language_code_map = {\n",
        "    \"Tiếng Việt\": \"vi\",\n",
        "    \"Tiếng Anh\": \"en\",\n",
        "    \"Tiếng Tây Ban Nha\": \"es\",\n",
        "    \"Tiếng Pháp\": \"fr\",\n",
        "    \"Tiếng Đức\": \"de\",\n",
        "    \"Tiếng Ý\": \"it\",\n",
        "    \"Tiếng Bồ Đào Nha\": \"pt\",\n",
        "    \"Tiếng Ba Lan\": \"pl\",\n",
        "    \"Tiếng Thổ Nhĩ Kỳ\": \"tr\",\n",
        "    \"Tiếng Nga\": \"ru\",\n",
        "    \"Tiếng Hà Lan\": \"nl\",\n",
        "    \"Tiếng Séc\": \"cs\",\n",
        "    \"Tiếng Ả Rập\": \"ar\",\n",
        "    \"Tiếng Trung (giản thể)\": \"zh-cn\",\n",
        "    \"Tiếng Nhật\": \"ja\",\n",
        "    \"Tiếng Hungary\": \"hu\",\n",
        "    \"Tiếng Hàn\": \"ko\",\n",
        "    \"Tiếng Hindi\": \"hi\"\n",
        "}\n",
        "\n",
        "print(\"> Đang nạp mô hình...\")\n",
        "try:\n",
        "    if not vixtts_model:\n",
        "        vixtts_model = load_model(xtts_checkpoint=\"model/model.pth\",\n",
        "                                xtts_config=\"model/config.json\",\n",
        "                                xtts_vocab=\"model/vocab.json\")\n",
        "except:\n",
        "    vixtts_model = load_model(xtts_checkpoint=\"model/model.pth\",\n",
        "                                xtts_config=\"model/config.json\",\n",
        "                                xtts_vocab=\"model/vocab.json\")\n",
        "clear_output()\n",
        "print(\"> Đã nạp mô hình\")\n",
        "\n",
        "if not os.path.exists(reference_audio):\n",
        "    print(\"⚠️⚠️⚠️Bạn chưa tải file âm thanh lên. Hãy chọn giọng khác, hoặc tải file của bạn lên ở bên dưới.⚠️⚠️⚠️\")\n",
        "    audio_file=\"/content/model/vi_sample.wav\"\n",
        "else:\n",
        "    audio_file = run_tts(vixtts_model,\n",
        "            lang=language_code_map[language],\n",
        "            tts_text=input_text,\n",
        "            speaker_audio_file=reference_audio,\n",
        "            normalize_text=normalize_text,\n",
        "            verbose=verbose,\n",
        "            output_chunks=output_chunks,)\n",
        "\n",
        "from IPython.display import Audio\n",
        "Audio(audio_file)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "C7Bm9c2iTmpQ"
      },
      "outputs": [],
      "source": [
        "# @title 🎤 **Tải file âm thanh của bạn lên**\n",
        "# @markdown Để đạt chất lượng tốt nhất, hãy tham khảo file '/content/model/vi_sample.wav'\n",
        "# @\n",
        "import os\n",
        "import locale\n",
        "from google.colab import files\n",
        "\n",
        "denoise = True # @param {type:\"boolean\"}\n",
        "\n",
        "# Upload the audio file\n",
        "uploaded = files.upload()\n",
        "for filename in uploaded.keys():\n",
        "    # Convert the audio file to WAV format using ffmpeg\n",
        "    uploaded_dir = os.path.dirname(filename)\n",
        "    if denoise:\n",
        "        !deepFilter \"{filename}\"\n",
        "        !ffmpeg -i \"{filename.replace('.wav', '_DeepFilterNet3.wav')}\" -ac 1 -ar 22050 -vn /content/model/user_sample.wav -y -hide_banner -loglevel error\n",
        "        os.remove(filename.replace('.wav', '_DeepFilterNet3.wav'))\n",
        "    else:\n",
        "        !ffmpeg -i \"{filename}\" -ac 1 -ar 22050 -vn /content/model/user_sample.wav -y -hide_banner -loglevel error\n",
        "        os.remove(filename)\n",
        "    break\n",
        "\n",
        "from IPython.display import Audio, clear_output\n",
        "clear_output()\n",
        "print(\"> Đã tải file âm thanh lên\")\n",
        "Audio(\"/content/model/user_sample.wav\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "6sC6v6pNQVt5"
      },
      "outputs": [],
      "source": [
        "# @title ⏬ **Lưu kết quả vào Drive**\n",
        "# @markdown Chạy phần này để lưu kết quả vào Google Drive của bạn\n",
        "# @markdown `/content/drive/MyDrive/vixtts-output`\n",
        "\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "# Save the output folder in \"vixtts-output\" in Google Drive, without overwriting existing files\n",
        "!cp -n -r /content/output/* /content/drive/MyDrive/vixtts-output\n",
        "print(\"> Đã lưu kết quả vào Google Drive\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "7-X619YbX0n-"
      },
      "outputs": [],
      "source": [
        "# @title ⚠️ **Dọn kết quả**\n",
        "# @markdown Chạy phần này để xóa toàn bộ file trong `/content/output`\n",
        "import shutil\n",
        "shutil.rmtree('/content/output')\n",
        "print(\"Đã xóa toàn bộ file trong /content/output\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "cvzEJ8c_PkRb"
      },
      "outputs": [],
      "source": [
        "# @title 📴 **Tắt Demo**\n",
        "# @markdown Khi chạy xong thì bạn hãy tắt demo để tiết kiệm GPU nhé!\n",
        "from google.colab import runtime\n",
        "runtime.unassign()"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "authorship_tag": "ABX9TyN7+AM4TCVxzIzD/jayBMlv",
      "gpuType": "T4",
      "include_colab_link": true,
      "machine_shape": "hm",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}