drewThomasson committed on
Commit
c39db41
1 Parent(s): a782fff

Upload 10 files

Files changed (10)
  1. README.md +73 -13
  2. install.bat +10 -0
  3. install.sh +13 -0
  4. requirements.txt +7 -0
  5. start.bat +5 -0
  6. start.sh +9 -0
  7. utils/formatter.py +198 -0
  8. utils/gpt_train.py +221 -0
  9. utils/tokenizer.py +869 -0
  10. xtts_demo.py +693 -0
README.md CHANGED
@@ -1,13 +1,73 @@
- ---
- title: Xtts Finetune Webui Other Guys Work
- emoji: 🐢
- colorFrom: blue
- colorTo: pink
- sdk: gradio
- sdk_version: 4.44.0
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # xtts-finetune-webui
+
+ This webui is a slightly modified copy of the [official webui](https://github.com/coqui-ai/TTS/pull/3296) for fine-tuning XTTS.
+
+ If you are looking for an option for regular XTTS use, look here: [https://github.com/daswer123/xtts-webui](https://github.com/daswer123/xtts-webui)
+
+ ## TODO
+ - [ ] Add the ability to use via console
+
+ ## Key features:
+
+ ### Data processing
+
+ 1. Updated faster-whisper to 0.10.0, with the ability to select the large-v3 model.
+ 2. Changed the output folder to an `output` folder inside the main folder.
+ 3. If there is already a dataset in the output folder and you want to add new data, simply add the new audio files: what is already there will not be processed again, and the new data will be added automatically.
+ 4. Turned on the VAD filter.
+ 5. After the dataset is created, a file is written that records the language of the dataset. This file is read before training so that the language always matches, which is convenient when you restart the interface.
+
+ ### Fine-tuning XTTS Encoder
+
+ 1. Added the ability to select the base model for XTTS; when you re-train, the model does not need to be downloaded again.
+ 2. Added the ability to select a custom model as the base model during training, which lets you fine-tune an already fine-tuned model.
+ 3. Added a way to get an optimized version of the model in one click (step 2.5 puts the optimized version in the output folder).
+ 4. You can choose whether to delete the training folders after you have optimized the model.
+ 5. When you optimize the model, the example reference audio is moved to the output folder.
+ 6. The specified language is checked against the dataset language.
+
+ ### Inference
+
+ 1. Added the ability to customize inference settings while checking the model.
+
+ ### Other
+
+ 1. If you accidentally restart the interface during one of the steps, you can reload the data with the additional buttons.
+ 2. Removed the log display, as it caused problems on restart.
+ 3. The finished result is copied to the `ready` folder; these are fully finished files, so you can move them anywhere and use them as a standard model.
+ 4. Added support for fine-tuning Japanese.
+
+ ## Changes in webui
+
+ ### 1 - Data processing
+
+ ![image](https://github.com/daswer123/xtts-finetune-webui/assets/22278673/8f09b829-098b-48f5-9668-832e7319403b)
+
+ ### 2 - Fine-tuning XTTS Encoder
+
+ ![image](https://github.com/daswer123/xtts-finetune-webui/assets/22278673/897540d9-3a6b-463c-abb8-261c289cc929)
+
+ ### 3 - Inference
+
+ ![image](https://github.com/daswer123/xtts-finetune-webui/assets/22278673/aa05bcd4-8642-4de4-8f2f-bc0f5571af63)
+
+ ## Install
+
+ 1. Make sure you have `CUDA` installed.
+ 2. `git clone https://github.com/daswer123/xtts-finetune-webui`
+ 3. `cd xtts-finetune-webui`
+ 4. `pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118`
+ 5. `pip install -r requirements.txt`
+
+ ### If you're using Windows
+
+ 1. First run `install.bat`.
+ 2. To start the server, run `start.bat`.
+ 3. Go to the local address `127.0.0.1:5003`.
+
+ ### On Linux
+
+ 1. Run `bash install.sh`.
+ 2. To start the server, run `bash start.sh`.
+ 3. Go to the local address `127.0.0.1:5003`.
install.bat ADDED
@@ -0,0 +1,10 @@
+ @echo off
+
+ python -m venv venv
+ call venv\Scripts\activate
+
+ pip install -r .\requirements.txt
+ pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
+
+ python xtts_demo.py
install.sh ADDED
@@ -0,0 +1,13 @@
+ #!/bin/bash
+
+ # Create a Python virtual environment
+ python -m venv venv
+ # Activate the virtual environment
+ source venv/bin/activate
+
+ # Install dependencies from requirements.txt, then the CUDA build of torch
+ pip install -r requirements.txt
+ pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118
+
+ # Launch the webui once everything is installed
+ python xtts_demo.py
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ faster_whisper==1.0.2
+ gradio==4.13.0
+ spacy==3.7.4
+ coqui-tts[languages]==0.24.1
+
+ cutlet
+ fugashi[unidic-lite]
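Worth flagging: the README above mentions faster-whisper 0.10.0, while this file pins `faster_whisper==1.0.2`; the pins here are what actually get installed. A small sketch (standard library only, package names taken from this file) to confirm what resolved into the venv:

```python
# Sketch: print the versions that actually resolved after `pip install -r requirements.txt`.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("faster_whisper", "gradio", "spacy", "coqui-tts", "cutlet", "fugashi"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```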
start.bat ADDED
@@ -0,0 +1,5 @@
+ @echo off
+
+ call venv\Scripts\activate
+
+ python xtts_demo.py
start.sh ADDED
@@ -0,0 +1,9 @@
+ #!/bin/bash
+
+ # Ensure the virtual environment exists (a no-op if install.sh already created it)
+ python -m venv venv
+ # Activate the virtual environment
+ source venv/bin/activate
+
+ python xtts_demo.py
utils/formatter.py ADDED
@@ -0,0 +1,198 @@
+ import os
+ import gc
+ import torchaudio
+ import pandas
+ from faster_whisper import WhisperModel
+ from glob import glob
+
+ from tqdm import tqdm
+
+ from TTS.tts.layers.xtts.tokenizer import multilingual_cleaners
+ # Add support for JA train
+ # from utils.tokenizer import multilingual_cleaners
+
+ import torch
+ # torch.set_num_threads(1)
+
+ torch.set_num_threads(16)
+
+ audio_types = (".wav", ".mp3", ".flac")
+
+ def find_latest_best_model(folder_path):
+     # Return the most recently created best_model.pth anywhere under folder_path
+     search_path = os.path.join(folder_path, '**', 'best_model.pth')
+     files = glob(search_path, recursive=True)
+     latest_file = max(files, key=os.path.getctime, default=None)
+     return latest_file
+
+
+ def list_audios(basePath, contains=None):
+     # return the set of files that are valid
+     return list_files(basePath, validExts=audio_types, contains=contains)
+
+ def list_files(basePath, validExts=None, contains=None):
+     # loop over the directory structure
+     for (rootDir, dirNames, filenames) in os.walk(basePath):
+         # loop over the filenames in the current directory
+         for filename in filenames:
+             # if the contains string is not none and the filename does not contain
+             # the supplied string, then ignore the file
+             if contains is not None and filename.find(contains) == -1:
+                 continue
+
+             # determine the file extension of the current file
+             ext = filename[filename.rfind("."):].lower()
+
+             # check to see if the file is an audio and should be processed
+             if validExts is None or ext.endswith(validExts):
+                 # construct the path to the audio and yield it
+                 audioPath = os.path.join(rootDir, filename)
+                 yield audioPath
+
+ def format_audio_list(audio_files, asr_model, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):
+     audio_total_size = 0
+     os.makedirs(out_path, exist_ok=True)
+
+     # Remember the dataset language across restarts so training always matches it
+     lang_file_path = os.path.join(out_path, "lang.txt")
+     current_language = None
+     if os.path.exists(lang_file_path):
+         with open(lang_file_path, 'r', encoding='utf-8') as existing_lang_file:
+             current_language = existing_lang_file.read().strip()
+
+     if current_language != target_language:
+         with open(lang_file_path, 'w', encoding='utf-8') as lang_file:
+             lang_file.write(target_language + '\n')
+         print("Warning: existing language does not match target language. Updated lang.txt with the target language.")
+     else:
+         print("Existing language matches target language")
+
+     metadata = {"audio_file": [], "text": [], "speaker_name": []}
+     train_metadata_path = os.path.join(out_path, "metadata_train.csv")
+     eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
+
+     existing_metadata = {'train': None, 'eval': None}
+     if os.path.exists(train_metadata_path):
+         existing_metadata['train'] = pandas.read_csv(train_metadata_path, sep="|")
+         print("Existing training metadata found and loaded.")
+
+     if os.path.exists(eval_metadata_path):
+         existing_metadata['eval'] = pandas.read_csv(eval_metadata_path, sep="|")
+         print("Existing evaluation metadata found and loaded.")
+
+     if gradio_progress is not None:
+         tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
+     else:
+         tqdm_object = tqdm(audio_files)
+
+     for audio_path in tqdm_object:
+         audio_file_name_without_ext, _ = os.path.splitext(os.path.basename(audio_path))
+         prefix_check = f"wavs/{audio_file_name_without_ext}_"
+
+         # Skip files whose segments already exist in the metadata from a previous run
+         skip_processing = False
+         for key in ['train', 'eval']:
+             if existing_metadata[key] is not None:
+                 mask = existing_metadata[key]['audio_file'].str.startswith(prefix_check)
+                 if mask.any():
+                     print(f"Segments from {audio_file_name_without_ext} have been previously processed; skipping...")
+                     skip_processing = True
+                     break
+
+         if skip_processing:
+             continue
+
+         wav, sr = torchaudio.load(audio_path)
+         # stereo to mono if needed
+         if wav.size(0) != 1:
+             wav = torch.mean(wav, dim=0, keepdim=True)
+
+         wav = wav.squeeze()
+         audio_total_size += (wav.size(-1) / sr)
+
+         segments, _ = asr_model.transcribe(audio_path, vad_filter=True, word_timestamps=True, language=target_language)
+         segments = list(segments)
+         i = 0
+         sentence = ""
+         sentence_start = None
+         first_word = True
+         words_list = []
+         for _, segment in enumerate(segments):
+             words = list(segment.words)
+             words_list.extend(words)
+
+         for word_idx, word in enumerate(words_list):
+             if first_word:
+                 sentence_start = word.start
+                 if word_idx == 0:
+                     sentence_start = max(sentence_start - buffer, 0)
+                 else:
+                     previous_word_end = words_list[word_idx - 1].end
+                     sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start) / 2)
+
+                 sentence = word.word
+                 first_word = False
+             else:
+                 sentence += word.word
+
+             if word.word[-1] in ["!", "。", ".", "?"]:
+                 sentence = sentence[1:]
+                 sentence = multilingual_cleaners(sentence, target_language)
+                 audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))
+                 audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"
+
+                 # Extend the clip up to halfway into the gap before the next word
+                 if word_idx + 1 < len(words_list):
+                     next_word_start = words_list[word_idx + 1].start
+                 else:
+                     next_word_start = (wav.shape[0] - 1) / sr
+
+                 word_end = min((word.end + next_word_start) / 2, word.end + buffer)
+
+                 absolute_path = os.path.join(out_path, audio_file)
+                 os.makedirs(os.path.dirname(absolute_path), exist_ok=True)
+                 i += 1
+                 first_word = True
+
+                 audio = wav[int(sr * sentence_start):int(sr * word_end)].unsqueeze(0)
+                 # drop clips shorter than a third of a second
+                 if audio.size(-1) >= sr / 3:
+                     torchaudio.save(absolute_path, audio, sr)
+                 else:
+                     continue
+
+                 metadata["audio_file"].append(audio_file)
+                 metadata["text"].append(sentence)
+                 metadata["speaker_name"].append(speaker_name)
+
+         df = pandas.DataFrame(metadata)
+
+         mode = 'w' if not os.path.exists(train_metadata_path) else 'a'
+         header = not os.path.exists(train_metadata_path)
+         df.to_csv(train_metadata_path, sep="|", index=False, mode=mode, header=header)
+
+         mode = 'w' if not os.path.exists(eval_metadata_path) else 'a'
+         header = not os.path.exists(eval_metadata_path)
+         df.to_csv(eval_metadata_path, sep="|", index=False, mode=mode, header=header)
+
+         metadata = {"audio_file": [], "text": [], "speaker_name": []}
+
+     if os.path.exists(train_metadata_path) and os.path.exists(eval_metadata_path):
+         existing_train_df = existing_metadata['train']
+         existing_eval_df = existing_metadata['eval']
+     else:
+         existing_train_df = pandas.DataFrame(columns=["audio_file", "text", "speaker_name"])
+         existing_eval_df = pandas.DataFrame(columns=["audio_file", "text", "speaker_name"])
+
+     new_data_df = pandas.read_csv(train_metadata_path, sep="|")
+
+     combined_train_df = pandas.concat([existing_train_df, new_data_df], ignore_index=True).drop_duplicates().reset_index(drop=True)
+     combined_eval_df = pandas.concat([existing_eval_df, new_data_df], ignore_index=True).drop_duplicates().reset_index(drop=True)
+
+     # Re-split the combined data into train/eval sets
+     combined_train_df_shuffled = combined_train_df.sample(frac=1)
+     num_val_samples = int(len(combined_train_df_shuffled) * eval_percentage)
+
+     final_eval_set = combined_train_df_shuffled[:num_val_samples]
+     final_training_set = combined_train_df_shuffled[num_val_samples:]
+
+     final_training_set.sort_values('audio_file').to_csv(train_metadata_path, sep='|', index=False)
+     final_eval_set.sort_values('audio_file').to_csv(eval_metadata_path, sep='|', index=False)
+
+     return train_metadata_path, eval_metadata_path, audio_total_size
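`format_audio_list()` is normally driven by the webui, but the same flow can be sketched headlessly. Everything below is illustrative: the folder names are placeholders, and `large-v3` is just one of the faster-whisper model sizes:

```python
# Hypothetical driver for format_audio_list(); paths and model size are
# placeholders, not values taken from this repo.
from faster_whisper import WhisperModel
from utils.formatter import format_audio_list, list_audios

audio_files = list(list_audios("my_voice_clips"))  # any .wav/.mp3/.flac files
asr_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

train_csv, eval_csv, total_seconds = format_audio_list(
    audio_files,
    asr_model,
    target_language="en",
    out_path="output/dataset",
)
print(f"{total_seconds:.0f}s of audio -> {train_csv}, {eval_csv}")
```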
utils/gpt_train.py ADDED
@@ -0,0 +1,221 @@
+ import logging
+ import os
+ import gc
+ import shutil
+ from pathlib import Path
+
+ from trainer import Trainer, TrainerArgs
+
+ from TTS.config.shared_configs import BaseDatasetConfig
+ from TTS.tts.datasets import load_tts_samples
+ from TTS.tts.layers.xtts.trainer.gpt_trainer import GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig
+ from TTS.utils.manage import ModelManager
+
+
+ def train_gpt(custom_model, version, language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path, max_audio_length=255995):
+     # Logging parameters
+     RUN_NAME = "GPT_XTTS_FT"
+     PROJECT_NAME = "XTTS_trainer"
+     DASHBOARD_LOGGER = "tensorboard"
+     LOGGER_URI = None
+
+     # Set here the path where the checkpoints will be saved. Default: ./run/training/
+     OUT_PATH = os.path.join(output_path, "run", "training")
+
+     # Training parameters
+     OPTIMIZER_WD_ONLY_ON_WEIGHTS = True  # for multi-GPU training, set this to False
+     START_WITH_EVAL = False  # if True, training starts with an evaluation pass
+     BATCH_SIZE = batch_size
+     GRAD_ACUMM_STEPS = grad_acumm
+
+     # Define the dataset used for fine-tuning
+     config_dataset = BaseDatasetConfig(
+         formatter="coqui",
+         dataset_name="ft_dataset",
+         path=os.path.dirname(train_csv),
+         meta_file_train=train_csv,
+         meta_file_val=eval_csv,
+         language=language,
+     )
+
+     # Add the configs of the datasets here
+     DATASETS_CONFIG_LIST = [config_dataset]
+
+     # Define the path where the XTTS base-model files will be downloaded
+     CHECKPOINTS_OUT_PATH = os.path.join(Path.cwd(), "base_models", f"{version}")
+     os.makedirs(CHECKPOINTS_OUT_PATH, exist_ok=True)
+
+     # DVAE files
+     DVAE_CHECKPOINT_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/dvae.pth"
+     MEL_NORM_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/mel_stats.pth"
+
+     # Set the path to the downloaded files
+     DVAE_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(DVAE_CHECKPOINT_LINK))
+     MEL_NORM_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(MEL_NORM_LINK))
+
+     # Download DVAE files if needed
+     if not os.path.isfile(DVAE_CHECKPOINT) or not os.path.isfile(MEL_NORM_FILE):
+         print(" > Downloading DVAE files!")
+         ModelManager._download_model_files([MEL_NORM_LINK, DVAE_CHECKPOINT_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True)
+
+     # Download the XTTS checkpoint files if needed
+     TOKENIZER_FILE_LINK = f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{version}/vocab.json"
+     XTTS_CHECKPOINT_LINK = f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{version}/model.pth"
+     XTTS_CONFIG_LINK = f"https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/{version}/config.json"
+     XTTS_SPEAKER_LINK = "https://coqui.gateway.scarf.sh/hf-coqui/XTTS-v2/main/speakers_xtts.pth"
+
+     # XTTS transfer-learning parameters: the paths of the XTTS checkpoint files to fine-tune from.
+     TOKENIZER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(TOKENIZER_FILE_LINK))  # vocab.json file
+     XTTS_CHECKPOINT = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CHECKPOINT_LINK))  # model.pth file
+     XTTS_CONFIG_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_CONFIG_LINK))  # config.json file
+     XTTS_SPEAKER_FILE = os.path.join(CHECKPOINTS_OUT_PATH, os.path.basename(XTTS_SPEAKER_LINK))  # speakers_xtts.pth file
+
+     if not os.path.isfile(TOKENIZER_FILE) or not os.path.isfile(XTTS_CHECKPOINT):
+         print(f" > Downloading XTTS v{version} files!")
+         ModelManager._download_model_files(
+             [TOKENIZER_FILE_LINK, XTTS_CHECKPOINT_LINK, XTTS_CONFIG_LINK, XTTS_SPEAKER_LINK], CHECKPOINTS_OUT_PATH, progress_bar=True
+         )
+
+     # Copy these files to the ready folder
+     READY_MODEL_PATH = os.path.join(output_path, "ready")
+     if not os.path.exists(READY_MODEL_PATH):
+         os.makedirs(READY_MODEL_PATH)
+
+     NEW_TOKENIZER_FILE = os.path.join(READY_MODEL_PATH, "vocab.json")
+     # NEW_XTTS_CHECKPOINT = os.path.join(READY_MODEL_PATH, "model.pth")
+     NEW_XTTS_CONFIG_FILE = os.path.join(READY_MODEL_PATH, "config.json")
+     NEW_XTTS_SPEAKER_FILE = os.path.join(READY_MODEL_PATH, "speakers_xtts.pth")
+
+     shutil.copy(TOKENIZER_FILE, NEW_TOKENIZER_FILE)
+     # shutil.copy(XTTS_CHECKPOINT, os.path.join(READY_MODEL_PATH, "model.pth"))
+     shutil.copy(XTTS_CONFIG_FILE, NEW_XTTS_CONFIG_FILE)
+     shutil.copy(XTTS_SPEAKER_FILE, NEW_XTTS_SPEAKER_FILE)
+
+     # Use the copies from the ready folder
+     TOKENIZER_FILE = NEW_TOKENIZER_FILE  # vocab.json file
+     # XTTS_CHECKPOINT = NEW_XTTS_CHECKPOINT  # model.pth file
+     XTTS_CONFIG_FILE = NEW_XTTS_CONFIG_FILE  # config.json file
+     XTTS_SPEAKER_FILE = NEW_XTTS_SPEAKER_FILE  # speakers_xtts.pth file
+
+     if custom_model != "":
+         if os.path.exists(custom_model) and custom_model.endswith('.pth'):
+             XTTS_CHECKPOINT = custom_model
+             print(f" > Loading custom model: {XTTS_CHECKPOINT}")
+         else:
+             print(" > Error: The specified custom model is not a valid .pth file path.")
+
+     num_workers = 8
+     if language == "ja":
+         num_workers = 0
+     # init args and config
+     model_args = GPTArgs(
+         max_conditioning_length=132300,  # 6 secs
+         min_conditioning_length=66150,  # 3 secs
+         debug_loading_failures=False,
+         max_wav_length=max_audio_length,  # ~11.6 seconds
+         max_text_length=200,
+         mel_norm_file=MEL_NORM_FILE,
+         dvae_checkpoint=DVAE_CHECKPOINT,
+         xtts_checkpoint=XTTS_CHECKPOINT,  # checkpoint path of the model to fine-tune
+         tokenizer_file=TOKENIZER_FILE,
+         gpt_num_audio_tokens=1026,
+         gpt_start_audio_token=1024,
+         gpt_stop_audio_token=1025,
+         gpt_use_masking_gt_prompt_approach=True,
+         gpt_use_perceiver_resampler=True,
+     )
+     # define audio config
+     audio_config = XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000)
+     # training parameters config
+     config = GPTTrainerConfig(
+         epochs=num_epochs,
+         output_path=OUT_PATH,
+         model_args=model_args,
+         run_name=RUN_NAME,
+         project_name=PROJECT_NAME,
+         run_description="""
+             GPT XTTS training
+             """,
+         dashboard_logger=DASHBOARD_LOGGER,
+         logger_uri=LOGGER_URI,
+         audio=audio_config,
+         batch_size=BATCH_SIZE,
+         batch_group_size=48,
+         eval_batch_size=BATCH_SIZE,
+         num_loader_workers=num_workers,
+         eval_split_max_size=256,
+         print_step=50,
+         plot_step=100,
+         log_model_step=100,
+         save_step=1000,
+         save_n_checkpoints=1,
+         save_checkpoints=True,
+         # target_loss="loss",
+         print_eval=False,
+         # Optimizer values like tortoise; PyTorch implementation with modifications so WD is not applied to non-weight parameters.
+         optimizer="AdamW",
+         optimizer_wd_only_on_weights=OPTIMIZER_WD_ONLY_ON_WEIGHTS,
+         optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
+         lr=5e-06,  # learning rate
+         lr_scheduler="MultiStepLR",
+         # it was adjusted accordingly for the new step scheme
+         lr_scheduler_params={"milestones": [50000 * 18, 150000 * 18, 300000 * 18], "gamma": 0.5, "last_epoch": -1},
+         test_sentences=[],
+     )
+
+     # init the model from config
+     model = GPTTrainer.init_from_config(config)
+
+     # load training samples
+     train_samples, eval_samples = load_tts_samples(
+         DATASETS_CONFIG_LIST,
+         eval_split=True,
+         eval_split_max_size=config.eval_split_max_size,
+         eval_split_size=config.eval_split_size,
+     )
+
+     # init the trainer and 🚀
+     trainer = Trainer(
+         TrainerArgs(
+             restore_path=None,  # the XTTS checkpoint is restored via the xtts_checkpoint key, so there is no need to restore it via Trainer's restore_path
+             skip_train_epoch=False,
+             start_with_eval=START_WITH_EVAL,
+             grad_accum_steps=GRAD_ACUMM_STEPS,
+         ),
+         config,
+         output_path=OUT_PATH,
+         model=model,
+         train_samples=train_samples,
+         eval_samples=eval_samples,
+     )
+     trainer.fit()
+
+     # use the audio with the longest transcript as the speaker reference
+     samples_len = [len(item["text"].split(" ")) for item in train_samples]
+     longest_text_idx = samples_len.index(max(samples_len))
+     speaker_ref = train_samples[longest_text_idx]["audio_file"]
+
+     trainer_out_path = trainer.output_path
+
+     # close file handlers and remove them from the logger
+     for handler in logging.getLogger('trainer').handlers:
+         if isinstance(handler, logging.FileHandler):
+             handler.close()
+             logging.getLogger('trainer').removeHandler(handler)
+
+     # now the log file can be deleted
+     log_file = os.path.join(trainer.output_path, f"trainer_{trainer.args.rank}_log.txt")
+     os.remove(log_file)
+
+     # deallocate VRAM and RAM
+     del model, trainer, train_samples, eval_samples
+     gc.collect()
+
+     return XTTS_SPEAKER_FILE, XTTS_CONFIG_FILE, XTTS_CHECKPOINT, TOKENIZER_FILE, trainer_out_path, speaker_ref
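A hedged sketch of calling `train_gpt()` directly, outside the webui. All values below are placeholders: the CSVs match what `utils/formatter.py` produces, and `"v2.0.2"` stands in for whatever base-model version string the UI would pass:

```python
# Hypothetical call into train_gpt(); paths and values are placeholders, not repo defaults.
from utils.gpt_train import train_gpt

speaker_file, config_file, checkpoint, vocab_file, out_dir, speaker_ref = train_gpt(
    custom_model="",                 # "" = start from the downloaded base model
    version="v2.0.2",                # base-model version folder / download tag (assumed)
    language="en",
    num_epochs=6,
    batch_size=2,
    grad_acumm=1,
    train_csv="output/dataset/metadata_train.csv",
    eval_csv="output/dataset/metadata_eval.csv",
    output_path="output",
)
print("checkpoints in:", out_dir, "| speaker reference:", speaker_ref)
```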
utils/tokenizer.py ADDED
@@ -0,0 +1,869 @@
+ import os
+ import re
+ import textwrap
+ from functools import cached_property
+
+ import pypinyin
+ import torch
+ from hangul_romanize import Transliter
+ from hangul_romanize.rule import academic
+ from num2words import num2words
+ from spacy.lang.ar import Arabic
+ from spacy.lang.en import English
+ from spacy.lang.es import Spanish
+ from spacy.lang.ja import Japanese
+ from spacy.lang.zh import Chinese
+ from tokenizers import Tokenizer
+
+ from TTS.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
+
+
+ def get_spacy_lang(lang):
+     if lang == "zh":
+         return Chinese()
+     elif lang == "ja":
+         return Japanese()
+     elif lang == "ar":
+         return Arabic()
+     elif lang == "es":
+         return Spanish()
+     else:
+         # For most languages, English does the job
+         return English()
+
+
+ def split_sentence(text, lang, text_split_length=250):
+     """Split the input text into chunks of at most text_split_length characters."""
+     text_splits = []
+     if text_split_length is not None and len(text) >= text_split_length:
+         text_splits.append("")
+         nlp = get_spacy_lang(lang)
+         nlp.add_pipe("sentencizer")
+         doc = nlp(text)
+         for sentence in doc.sents:
+             if len(text_splits[-1]) + len(str(sentence)) <= text_split_length:
+                 # if the last chunk + the current sentence is within text_split_length,
+                 # append the current sentence to the last chunk
+                 text_splits[-1] += " " + str(sentence)
+                 text_splits[-1] = text_splits[-1].lstrip()
+             elif len(str(sentence)) > text_split_length:
+                 # if the current sentence alone exceeds text_split_length, hard-wrap it
+                 for line in textwrap.wrap(
+                     str(sentence),
+                     width=text_split_length,
+                     drop_whitespace=True,
+                     break_on_hyphens=False,
+                     tabsize=1,
+                 ):
+                     text_splits.append(str(line))
+             else:
+                 text_splits.append(str(sentence))
+
+         if len(text_splits) > 1:
+             if text_splits[0] == "":
+                 del text_splits[0]
+     else:
+         text_splits = [text.lstrip()]
+
+     return text_splits
+
+
+ _whitespace_re = re.compile(r"\s+")
+
+ # List of (regular expression, replacement) pairs for abbreviations:
+ _abbreviations = {
+     "en": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("mrs", "misess"),
+             ("mr", "mister"),
+             ("dr", "doctor"),
+             ("st", "saint"),
+             ("co", "company"),
+             ("jr", "junior"),
+             ("maj", "major"),
+             ("gen", "general"),
+             ("drs", "doctors"),
+             ("rev", "reverend"),
+             ("lt", "lieutenant"),
+             ("hon", "honorable"),
+             ("sgt", "sergeant"),
+             ("capt", "captain"),
+             ("esq", "esquire"),
+             ("ltd", "limited"),
+             ("col", "colonel"),
+             ("ft", "fort"),
+         ]
+     ],
+     "es": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("sra", "señora"),
+             ("sr", "señor"),
+             ("dr", "doctor"),
+             ("dra", "doctora"),
+             ("st", "santo"),
+             ("co", "compañía"),
+             ("jr", "junior"),
+             ("ltd", "limitada"),
+         ]
+     ],
+     "fr": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("mme", "madame"),
+             ("mr", "monsieur"),
+             ("dr", "docteur"),
+             ("st", "saint"),
+             ("co", "compagnie"),
+             ("jr", "junior"),
+             ("ltd", "limitée"),
+         ]
+     ],
+     "de": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("fr", "frau"),
+             ("dr", "doktor"),
+             ("st", "sankt"),
+             ("co", "firma"),
+             ("jr", "junior"),
+         ]
+     ],
+     "pt": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("sra", "senhora"),
+             ("sr", "senhor"),
+             ("dr", "doutor"),
+             ("dra", "doutora"),
+             ("st", "santo"),
+             ("co", "companhia"),
+             ("jr", "júnior"),
+             ("ltd", "limitada"),
+         ]
+     ],
+     "it": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             # ("sig.ra", "signora"),
+             ("sig", "signore"),
+             ("dr", "dottore"),
+             ("st", "santo"),
+             ("co", "compagnia"),
+             ("jr", "junior"),
+             ("ltd", "limitata"),
+         ]
+     ],
+     "pl": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("p", "pani"),
+             ("m", "pan"),
+             ("dr", "doktor"),
+             ("sw", "święty"),
+             ("jr", "junior"),
+         ]
+     ],
+     "ar": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             # There are not many common abbreviations in Arabic as in English.
+         ]
+     ],
+     "zh": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             # Chinese doesn't typically use abbreviations in the same way as Latin-based scripts.
+         ]
+     ],
+     "cs": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("dr", "doktor"),  # doctor
+             ("ing", "inženýr"),  # engineer
+             ("p", "pan"),  # Could also map to "pani" for a woman, but there is no easy way to do it
+             # Other abbreviations would be specialized and not as common.
+         ]
+     ],
+     "ru": [
+         (re.compile("\\b%s\\b" % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("г-жа", "госпожа"),  # Mrs.
+             ("г-н", "господин"),  # Mr.
+             ("д-р", "доктор"),  # doctor
+             # Other abbreviations are less common or specialized.
+         ]
+     ],
+     "nl": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("dhr", "de heer"),  # Mr.
+             ("mevr", "mevrouw"),  # Mrs.
+             ("dr", "dokter"),  # doctor
+             ("jhr", "jonkheer"),  # young lord or nobleman
+             # Dutch uses more abbreviations, but these are the most common ones.
+         ]
+     ],
+     "tr": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("b", "bay"),  # Mr.
+             ("byk", "büyük"),  # büyük
+             ("dr", "doktor"),  # doctor
+             # Add other Turkish abbreviations here if needed.
+         ]
+     ],
+     "hu": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             ("dr", "doktor"),  # doctor
+             ("b", "bácsi"),  # Mr.
+             ("nőv", "nővér"),  # nurse
+             # Add other Hungarian abbreviations here if needed.
+         ]
+     ],
+     "ko": [
+         (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
+         for x in [
+             # Korean doesn't typically use abbreviations in the same way as Latin-based scripts.
+         ]
+     ],
+     "ja": [
+         (re.compile("\\b%s\\b" % x[0]), x[1])
+         for x in [
+             ("氏", "さん"),  # Mr.
+             ("夫人", "おんなのひと"),  # Mrs.
+             ("博士", "はかせ"),  # Doctor or PhD
+             ("株", "株式会社"),  # Corporation
+             ("有", "有限会社"),  # Limited company
+             ("大学", "だいがく"),  # University
+             ("先生", "せんせい"),  # Teacher/Professor/Master
+             ("君", "くん"),  # Used at the end of boys' names to express familiarity or affection.
+         ]
+     ],
+ }
+
+
+ def expand_abbreviations_multilingual(text, lang="en"):
+     for regex, replacement in _abbreviations[lang]:
+         text = re.sub(regex, replacement, text)
+     return text
+
+
+ _symbols_multilingual = {
+     "en": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " and "),
+             ("@", " at "),
+             ("%", " percent "),
+             ("#", " hash "),
+             ("$", " dollar "),
+             ("£", " pound "),
+             ("°", " degree "),
+         ]
+     ],
+     "es": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " y "),
+             ("@", " arroba "),
+             ("%", " por ciento "),
+             ("#", " numeral "),
+             ("$", " dolar "),
+             ("£", " libra "),
+             ("°", " grados "),
+         ]
+     ],
+     "fr": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " et "),
+             ("@", " arobase "),
+             ("%", " pour cent "),
+             ("#", " dièse "),
+             ("$", " dollar "),
+             ("£", " livre "),
+             ("°", " degrés "),
+         ]
+     ],
+     "de": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " und "),
+             ("@", " at "),
+             ("%", " prozent "),
+             ("#", " raute "),
+             ("$", " dollar "),
+             ("£", " pfund "),
+             ("°", " grad "),
+         ]
+     ],
+     "pt": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " e "),
+             ("@", " arroba "),
+             ("%", " por cento "),
+             ("#", " cardinal "),
+             ("$", " dólar "),
+             ("£", " libra "),
+             ("°", " graus "),
+         ]
+     ],
+     "it": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " e "),
+             ("@", " chiocciola "),
+             ("%", " per cento "),
+             ("#", " cancelletto "),
+             ("$", " dollaro "),
+             ("£", " sterlina "),
+             ("°", " gradi "),
+         ]
+     ],
+     "pl": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " i "),
+             ("@", " małpa "),
+             ("%", " procent "),
+             ("#", " krzyżyk "),
+             ("$", " dolar "),
+             ("£", " funt "),
+             ("°", " stopnie "),
+         ]
+     ],
+     "ar": [
+         # Arabic
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " و "),
+             ("@", " على "),
+             ("%", " في المئة "),
+             ("#", " رقم "),
+             ("$", " دولار "),
+             ("£", " جنيه "),
+             ("°", " درجة "),
+         ]
+     ],
+     "zh": [
+         # Chinese
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " 和 "),
+             ("@", " 在 "),
+             ("%", " 百分之 "),
+             ("#", " 号 "),
+             ("$", " 美元 "),
+             ("£", " 英镑 "),
+             ("°", " 度 "),
+         ]
+     ],
+     "cs": [
+         # Czech
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " a "),
+             ("@", " na "),
+             ("%", " procento "),
+             ("#", " křížek "),
+             ("$", " dolar "),
+             ("£", " libra "),
+             ("°", " stupně "),
+         ]
+     ],
+     "ru": [
+         # Russian
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " и "),
+             ("@", " собака "),
+             ("%", " процентов "),
+             ("#", " номер "),
+             ("$", " доллар "),
+             ("£", " фунт "),
+             ("°", " градус "),
+         ]
+     ],
+     "nl": [
+         # Dutch
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " en "),
+             ("@", " bij "),
+             ("%", " procent "),
+             ("#", " hekje "),
+             ("$", " dollar "),
+             ("£", " pond "),
+             ("°", " graden "),
+         ]
+     ],
+     "tr": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " ve "),
+             ("@", " at "),
+             ("%", " yüzde "),
+             ("#", " diyez "),
+             ("$", " dolar "),
+             ("£", " sterlin "),
+             ("°", " derece "),
+         ]
+     ],
+     "hu": [
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " és "),
+             ("@", " kukac "),
+             ("%", " százalék "),
+             ("#", " kettőskereszt "),
+             ("$", " dollár "),
+             ("£", " font "),
+             ("°", " fok "),
+         ]
+     ],
+     "ko": [
+         # Korean
+         (re.compile(r"%s" % re.escape(x[0]), re.IGNORECASE), x[1])
+         for x in [
+             ("&", " 그리고 "),
+             ("@", " 에 "),
+             ("%", " 퍼센트 "),
+             ("#", " 번호 "),
+             ("$", " 달러 "),
+             ("£", " 파운드 "),
+             ("°", " 도 "),
+         ]
+     ],
+     "ja": [
+         (re.compile(r"%s" % re.escape(x[0])), x[1])
+         for x in [
+             ("&", " と "),
+             ("@", " アットマーク "),
+             ("%", " パーセント "),
+             ("#", " ナンバー "),
+             ("$", " ドル "),
+             ("£", " ポンド "),
+             ("°", " 度"),
+         ]
+     ],
+ }
+
+
+ def expand_symbols_multilingual(text, lang="en"):
+     for regex, replacement in _symbols_multilingual[lang]:
+         text = re.sub(regex, replacement, text)
+         text = text.replace("  ", " ")  # Ensure there are no double spaces
+     return text.strip()
+
+
+ _ordinal_re = {
+     "en": re.compile(r"([0-9]+)(st|nd|rd|th)"),
+     "es": re.compile(r"([0-9]+)(º|ª|er|o|a|os|as)"),
+     "fr": re.compile(r"([0-9]+)(º|ª|er|re|e|ème)"),
+     "de": re.compile(r"([0-9]+)(st|nd|rd|th|º|ª|\.(?=\s|$))"),
+     "pt": re.compile(r"([0-9]+)(º|ª|o|a|os|as)"),
+     "it": re.compile(r"([0-9]+)(º|°|ª|o|a|i|e)"),
+     "pl": re.compile(r"([0-9]+)(º|ª|st|nd|rd|th)"),
+     "ar": re.compile(r"([0-9]+)(ون|ين|ث|ر|ى)"),
+     "cs": re.compile(r"([0-9]+)\.(?=\s|$)"),  # In Czech, a dot is often used after the number to indicate ordinals.
+     "ru": re.compile(r"([0-9]+)(-й|-я|-е|-ое|-ье|-го)"),
+     "nl": re.compile(r"([0-9]+)(de|ste|e)"),
+     "tr": re.compile(r"([0-9]+)(\.|inci|nci|uncu|üncü|\.)"),
+     "hu": re.compile(r"([0-9]+)(\.|adik|edik|odik|edik|ödik|ödike|ik)"),
+     "ko": re.compile(r"([0-9]+)(번째|번|차|째)"),
+     "ja": re.compile(r"([0-9]+)(番|回|つ|目|等|位)"),
+ }
+ _number_re = re.compile(r"[0-9]+")
+ _currency_re = {
+     "USD": re.compile(r"((\$[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+\$))"),
+     "GBP": re.compile(r"((£[0-9\.\,]*[0-9]+)|([0-9\.\,]*[0-9]+£))"),
+     "EUR": re.compile(r"(([0-9\.\,]*[0-9]+€)|((€[0-9\.\,]*[0-9]+)))"),
+ }
+
+ _comma_number_re = re.compile(r"\b\d{1,3}(,\d{3})*(\.\d+)?\b")
+ _dot_number_re = re.compile(r"\b\d{1,3}(\.\d{3})*(\,\d+)?\b")
+ _decimal_number_re = re.compile(r"([0-9]+[.,][0-9]+)")
+
+
+ def _remove_commas(m):
+     text = m.group(0)
+     if "," in text:
+         text = text.replace(",", "")
+     return text
+
+
+ def _remove_dots(m):
+     text = m.group(0)
+     if "." in text:
+         text = text.replace(".", "")
+     return text
+
+
+ def _expand_decimal_point(m, lang="en"):
+     amount = m.group(1).replace(",", ".")
+     return num2words(float(amount), lang=lang if lang != "cs" else "cz")
+
+
+ def _expand_currency(m, lang="en", currency="USD"):
+     amount = float((re.sub(r"[^\d.]", "", m.group(0).replace(",", "."))))
+     full_amount = num2words(amount, to="currency", currency=currency, lang=lang if lang != "cs" else "cz")
+
+     and_equivalents = {
+         "en": ", ",
+         "es": " con ",
+         "fr": " et ",
+         "de": " und ",
+         "pt": " e ",
+         "it": " e ",
+         "pl": ", ",
+         "cs": ", ",
+         "ru": ", ",
+         "nl": ", ",
+         "ar": ", ",
+         "tr": ", ",
+         "hu": ", ",
+         "ko": ", ",
+     }
+
+     if amount.is_integer():
+         last_and = full_amount.rfind(and_equivalents[lang])
+         if last_and != -1:
+             full_amount = full_amount[:last_and]
+
+     return full_amount
+
+
+ def _expand_ordinal(m, lang="en"):
+     return num2words(int(m.group(1)), ordinal=True, lang=lang if lang != "cs" else "cz")
+
+
+ def _expand_number(m, lang="en"):
+     return num2words(int(m.group(0)), lang=lang if lang != "cs" else "cz")
+
+
+ def expand_numbers_multilingual(text, lang="en"):
+     if lang == "zh":
+         text = zh_num2words()(text)
+     else:
+         if lang in ["en", "ru"]:
+             text = re.sub(_comma_number_re, _remove_commas, text)
+         else:
+             text = re.sub(_dot_number_re, _remove_dots, text)
+         try:
+             text = re.sub(_currency_re["GBP"], lambda m: _expand_currency(m, lang, "GBP"), text)
+             text = re.sub(_currency_re["USD"], lambda m: _expand_currency(m, lang, "USD"), text)
+             text = re.sub(_currency_re["EUR"], lambda m: _expand_currency(m, lang, "EUR"), text)
+         except Exception:
+             pass
+         if lang != "tr":
+             text = re.sub(_decimal_number_re, lambda m: _expand_decimal_point(m, lang), text)
+         text = re.sub(_ordinal_re[lang], lambda m: _expand_ordinal(m, lang), text)
+         text = re.sub(_number_re, lambda m: _expand_number(m, lang), text)
+     return text
+
+
+ def lowercase(text):
+     return text.lower()
+
+
+ def collapse_whitespace(text):
+     return re.sub(_whitespace_re, " ", text)
+
+
+ def multilingual_cleaners(text, lang):
+     text = text.replace('"', "")
+     if lang == "tr":
+         text = text.replace("İ", "i")
+         text = text.replace("Ö", "ö")
+         text = text.replace("Ü", "ü")
+     text = lowercase(text)
+     text = expand_numbers_multilingual(text, lang)
+     text = expand_abbreviations_multilingual(text, lang)
+     text = expand_symbols_multilingual(text, lang=lang)
+     text = collapse_whitespace(text)
+     return text
+
+
+ def basic_cleaners(text):
+     """Basic pipeline that lowercases and collapses whitespace without transliteration."""
+     text = lowercase(text)
+     text = collapse_whitespace(text)
+     return text
+
+
+ def chinese_transliterate(text):
+     return "".join(
+         [p[0] for p in pypinyin.pinyin(text, style=pypinyin.Style.TONE3, heteronym=False, neutral_tone_with_five=True)]
+     )
+
+
+ def japanese_cleaners(text, katsu):
+     text = katsu.romaji(text)
+     text = lowercase(text)
+     return text
+
+
+ def korean_transliterate(text):
+     r = Transliter(academic)
+     return r.translit(text)
+
+
+ DEFAULT_VOCAB_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../data/tokenizer.json")
+
+
+ class VoiceBpeTokenizer:
+     def __init__(self, vocab_file=None):
+         self.tokenizer = None
+         if vocab_file is not None:
+             self.tokenizer = Tokenizer.from_file(vocab_file)
+         self.char_limits = {
+             "en": 250,
+             "de": 253,
+             "fr": 273,
+             "es": 239,
+             "it": 213,
+             "pt": 203,
+             "pl": 224,
+             "zh": 82,
+             "ar": 166,
+             "cs": 186,
+             "ru": 182,
+             "nl": 251,
+             "tr": 226,
+             "ja": 71,
+             "hu": 224,
+             "ko": 95,
+         }
+
+     @cached_property
+     def katsu(self):
+         import cutlet
+
+         return cutlet.Cutlet()
+
+     def check_input_length(self, txt, lang):
+         lang = lang.split("-")[0]  # remove the region
+         limit = self.char_limits.get(lang, 250)
+         if len(txt) > limit:
+             print(
+                 f"[!] Warning: The text length exceeds the character limit of {limit} for language '{lang}', this might cause truncated audio."
+             )
+
+     def preprocess_text(self, txt, lang):
+         if lang in {"ar", "cs", "de", "en", "es", "fr", "hu", "it", "nl", "pl", "pt", "ru", "tr", "zh", "ko"}:
+             txt = multilingual_cleaners(txt, lang)
+             if lang == "zh":
+                 txt = chinese_transliterate(txt)
+             if lang == "ko":
+                 txt = korean_transliterate(txt)
+         elif lang == "ja":
+             txt = japanese_cleaners(txt, self.katsu)
+         elif lang == "hi":
+             # @manmay will implement this
+             txt = basic_cleaners(txt)
+         else:
+             raise NotImplementedError(f"Language '{lang}' is not supported.")
+         return txt
+
+     def encode(self, txt, lang):
+         lang = lang.split("-")[0]  # remove the region
+         self.check_input_length(txt, lang)
+         txt = self.preprocess_text(txt, lang)
+         lang = "zh-cn" if lang == "zh" else lang
+         txt = f"[{lang}]{txt}"
+         txt = txt.replace(" ", "[SPACE]")
+         return self.tokenizer.encode(txt).ids
+
+     def decode(self, seq):
+         if isinstance(seq, torch.Tensor):
+             seq = seq.cpu().numpy()
+         txt = self.tokenizer.decode(seq, skip_special_tokens=False).replace(" ", "")
+         txt = txt.replace("[SPACE]", " ")
+         txt = txt.replace("[STOP]", "")
+         txt = txt.replace("[UNK]", "")
+         return txt
+
+     def __len__(self):
+         return self.tokenizer.get_vocab_size()
+
+     def get_number_tokens(self):
+         return max(self.tokenizer.get_vocab().values()) + 1
+
+
+ def test_expand_numbers_multilingual():
+     test_cases = [
+         # English
+         ("In 12.5 seconds.", "In twelve point five seconds.", "en"),
+         ("There were 50 soldiers.", "There were fifty soldiers.", "en"),
+         ("This is a 1st test", "This is a first test", "en"),
+         ("That will be $20 sir.", "That will be twenty dollars sir.", "en"),
+         ("That will be 20€ sir.", "That will be twenty euro sir.", "en"),
+         ("That will be 20.15€ sir.", "That will be twenty euro, fifteen cents sir.", "en"),
+         ("That's 100,000.5.", "That's one hundred thousand point five.", "en"),
+         # French
+         ("En 12,5 secondes.", "En douze virgule cinq secondes.", "fr"),
+         ("Il y avait 50 soldats.", "Il y avait cinquante soldats.", "fr"),
+         ("Ceci est un 1er test", "Ceci est un premier test", "fr"),
+         ("Cela vous fera $20 monsieur.", "Cela vous fera vingt dollars monsieur.", "fr"),
+         ("Cela vous fera 20€ monsieur.", "Cela vous fera vingt euros monsieur.", "fr"),
+         ("Cela vous fera 20,15€ monsieur.", "Cela vous fera vingt euros et quinze centimes monsieur.", "fr"),
+         ("Ce sera 100.000,5.", "Ce sera cent mille virgule cinq.", "fr"),
+         # German
+         ("In 12,5 Sekunden.", "In zwölf Komma fünf Sekunden.", "de"),
+         ("Es gab 50 Soldaten.", "Es gab fünfzig Soldaten.", "de"),
+         ("Dies ist ein 1. Test", "Dies ist ein erste Test", "de"),  # Issue with gender
+         ("Das macht $20 Herr.", "Das macht zwanzig Dollar Herr.", "de"),
+         ("Das macht 20€ Herr.", "Das macht zwanzig Euro Herr.", "de"),
+         ("Das macht 20,15€ Herr.", "Das macht zwanzig Euro und fünfzehn Cent Herr.", "de"),
+         # Spanish
+         ("En 12,5 segundos.", "En doce punto cinco segundos.", "es"),
+         ("Había 50 soldados.", "Había cincuenta soldados.", "es"),
+         ("Este es un 1er test", "Este es un primero test", "es"),
+         ("Eso le costará $20 señor.", "Eso le costará veinte dólares señor.", "es"),
+         ("Eso le costará 20€ señor.", "Eso le costará veinte euros señor.", "es"),
+         ("Eso le costará 20,15€ señor.", "Eso le costará veinte euros con quince céntimos señor.", "es"),
+         # Italian
+         ("In 12,5 secondi.", "In dodici virgola cinque secondi.", "it"),
+         ("C'erano 50 soldati.", "C'erano cinquanta soldati.", "it"),
+         ("Questo è un 1° test", "Questo è un primo test", "it"),
+         ("Ti costerà $20 signore.", "Ti costerà venti dollari signore.", "it"),
+         ("Ti costerà 20€ signore.", "Ti costerà venti euro signore.", "it"),
+         ("Ti costerà 20,15€ signore.", "Ti costerà venti euro e quindici centesimi signore.", "it"),
+         # Portuguese
+         ("Em 12,5 segundos.", "Em doze vírgula cinco segundos.", "pt"),
+         ("Havia 50 soldados.", "Havia cinquenta soldados.", "pt"),
+         ("Este é um 1º teste", "Este é um primeiro teste", "pt"),
+         ("Isso custará $20 senhor.", "Isso custará vinte dólares senhor.", "pt"),
+         ("Isso custará 20€ senhor.", "Isso custará vinte euros senhor.", "pt"),
+         (
+             "Isso custará 20,15€ senhor.",
+             "Isso custará vinte euros e quinze cêntimos senhor.",
+             "pt",
+         ),  # "cêntimos" should be "centavos"; num2words issue
+         # Polish
+         ("W 12,5 sekundy.", "W dwanaście przecinek pięć sekundy.", "pl"),
+         ("Było 50 żołnierzy.", "Było pięćdziesiąt żołnierzy.", "pl"),
+         ("To będzie kosztować 20€ panie.", "To będzie kosztować dwadzieścia euro panie.", "pl"),
+         ("To będzie kosztować 20,15€ panie.", "To będzie kosztować dwadzieścia euro, piętnaście centów panie.", "pl"),
+         # Arabic
+         ("في الـ 12,5 ثانية.", "في الـ اثنا عشر , خمسون ثانية.", "ar"),
+         ("كان هناك 50 جنديًا.", "كان هناك خمسون جنديًا.", "ar"),
+         # ("ستكون النتيجة $20 يا سيد.", 'ستكون النتيجة عشرون دولار يا سيد.', 'ar'), # $ and € are missing from num2words
+         # ("ستكون النتيجة 20€ يا سيد.", 'ستكون النتيجة عشرون يورو يا سيد.', 'ar'),
+         # Czech
+         ("Za 12,5 vteřiny.", "Za dvanáct celá pět vteřiny.", "cs"),
+         ("Bylo tam 50 vojáků.", "Bylo tam padesát vojáků.", "cs"),
+         ("To bude stát 20€ pane.", "To bude stát dvacet euro pane.", "cs"),
+         ("To bude 20.15€ pane.", "To bude dvacet euro, patnáct centů pane.", "cs"),
+         # Russian
+         ("Через 12.5 секунды.", "Через двенадцать запятая пять секунды.", "ru"),
+         ("Там было 50 солдат.", "Там было пятьдесят солдат.", "ru"),
+         ("Это будет 20.15€ сэр.", "Это будет двадцать евро, пятнадцать центов сэр.", "ru"),
+         ("Это будет стоить 20€ господин.", "Это будет стоить двадцать евро господин.", "ru"),
+         # Dutch
+         ("In 12,5 seconden.", "In twaalf komma vijf seconden.", "nl"),
+         ("Er waren 50 soldaten.", "Er waren vijftig soldaten.", "nl"),
+         ("Dat wordt dan $20 meneer.", "Dat wordt dan twintig dollar meneer.", "nl"),
+         ("Dat wordt dan 20€ meneer.", "Dat wordt dan twintig euro meneer.", "nl"),
+         # Chinese (Simplified)
+         ("在12.5秒内", "在十二点五秒内", "zh"),
+         ("有50名士兵", "有五十名士兵", "zh"),
+         # ("那将是$20先生", '那将是二十美元先生', 'zh'), currency doesn't work
+         # ("那将是20€先生", '那将是二十欧元先生', 'zh'),
+         # Turkish
+         # ("12,5 saniye içinde.", 'On iki virgül beş saniye içinde.', 'tr'), # decimal doesn't work for TR
+         ("50 asker vardı.", "elli asker vardı.", "tr"),
+         ("Bu 1. test", "Bu birinci test", "tr"),
+         # ("Bu 100.000,5.", 'Bu yüz bin virgül beş.', 'tr'),
+         # Hungarian
+         ("12,5 másodperc alatt.", "tizenkettő egész öt tized másodperc alatt.", "hu"),
+         ("50 katona volt.", "ötven katona volt.", "hu"),
+         ("Ez az 1. teszt", "Ez az első teszt", "hu"),
+         # Korean
+         ("12.5 초 안에.", "십이 점 다섯 초 안에.", "ko"),
+         ("50 명의 병사가 있었다.", "오십 명의 병사가 있었다.", "ko"),
+         ("이것은 1 번째 테스트입니다", "이것은 첫 번째 테스트입니다", "ko"),
+     ]
+     for a, b, lang in test_cases:
+         out = expand_numbers_multilingual(a, lang=lang)
+         assert out == b, f"'{out}' vs '{b}'"
+
+
+ def test_abbreviations_multilingual():
+     test_cases = [
+         # English
+         ("Hello Mr. Smith.", "Hello mister Smith.", "en"),
+         ("Dr. Jones is here.", "doctor Jones is here.", "en"),
+         # Spanish
+         ("Hola Sr. Garcia.", "Hola señor Garcia.", "es"),
+         ("La Dra. Martinez es muy buena.", "La doctora Martinez es muy buena.", "es"),
+         # French
+         ("Bonjour Mr. Dupond.", "Bonjour monsieur Dupond.", "fr"),
+         ("Mme. Moreau est absente aujourd'hui.", "madame Moreau est absente aujourd'hui.", "fr"),
+         # German
+         ("Frau Dr. Müller ist sehr klug.", "Frau doktor Müller ist sehr klug.", "de"),
+         # Portuguese
+         ("Olá Sr. Silva.", "Olá senhor Silva.", "pt"),
+         ("Dra. Costa, você está disponível?", "doutora Costa, você está disponível?", "pt"),
+         # Italian
+         ("Buongiorno, Sig. Rossi.", "Buongiorno, signore Rossi.", "it"),
+         # ("Sig.ra Bianchi, posso aiutarti?", 'signora Bianchi, posso aiutarti?', 'it'), # Issue with matching that pattern
+         # Polish
+         ("Dzień dobry, P. Kowalski.", "Dzień dobry, pani Kowalski.", "pl"),
+         ("M. Nowak, czy mogę zadać pytanie?", "pan Nowak, czy mogę zadać pytanie?", "pl"),
+         # Czech
+         ("P. Novák", "pan Novák", "cs"),
+         ("Dr. Vojtěch", "doktor Vojtěch", "cs"),
+         # Dutch
+         ("Dhr. Jansen", "de heer Jansen", "nl"),
+         ("Mevr. de Vries", "mevrouw de Vries", "nl"),
+         # Russian
+         ("Здравствуйте Г-н Иванов.", "Здравствуйте господин Иванов.", "ru"),
+         ("Д-р Смирнов здесь, чтобы увидеть вас.", "доктор Смирнов здесь, чтобы увидеть вас.", "ru"),
+         # Turkish
+         ("Merhaba B. Yılmaz.", "Merhaba bay Yılmaz.", "tr"),
+         ("Dr. Ayşe burada.", "doktor Ayşe burada.", "tr"),
+         # Hungarian
+         ("Dr. Szabó itt van.", "doktor Szabó itt van.", "hu"),
+     ]
+
+     for a, b, lang in test_cases:
+         out = expand_abbreviations_multilingual(a, lang=lang)
+         assert out == b, f"'{out}' vs '{b}'"
+
+
+ def test_symbols_multilingual():
+     test_cases = [
+         ("I have 14% battery", "I have 14 percent battery", "en"),
+         ("Te veo @ la fiesta", "Te veo arroba la fiesta", "es"),
+         ("J'ai 14° de fièvre", "J'ai 14 degrés de fièvre", "fr"),
+         ("Die Rechnung beträgt £ 20", "Die Rechnung beträgt pfund 20", "de"),
+         ("O meu email é ana&joao@gmail.com", "O meu email é ana e joao arroba gmail.com", "pt"),
+         ("linguaggio di programmazione C#", "linguaggio di programmazione C cancelletto", "it"),
+         ("Moja temperatura to 36.6°", "Moja temperatura to 36.6 stopnie", "pl"),
+         ("Mám 14% baterie", "Mám 14 procento baterie", "cs"),
+         ("Těším se na tebe @ party", "Těším se na tebe na party", "cs"),
+         ("У меня 14% заряда", "У меня 14 процентов заряда", "ru"),
+         ("Я буду @ дома", "Я буду собака дома", "ru"),
+         ("Ik heb 14% batterij", "Ik heb 14 procent batterij", "nl"),
+         ("Ik zie je @ het feest", "Ik zie je bij het feest", "nl"),
+         ("لدي 14% في البطارية", "لدي 14 في المئة في البطارية", "ar"),
+         ("我的电量为 14%", "我的电量为 14 百分之", "zh"),
+         ("Pilim %14 dolu.", "Pilim yüzde 14 dolu.", "tr"),
+         ("Az akkumulátorom töltöttsége 14%", "Az akkumulátorom töltöttsége 14 százalék", "hu"),
+         ("배터리 잔량이 14%입니다.", "배터리 잔량이 14 퍼센트입니다.", "ko"),
+     ]
+
+     for a, b, lang in test_cases:
+         out = expand_symbols_multilingual(a, lang=lang)
+         assert out == b, f"'{out}' vs '{b}'"
+
+
+ if __name__ == "__main__":
+     test_expand_numbers_multilingual()
+     test_abbreviations_multilingual()
+     test_symbols_multilingual()
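The test functions at the bottom of this file double as documentation. A tiny illustration of the two cleaning entry points; the expected string below follows directly from the English abbreviation and currency test cases above:

```python
# Illustration only; strings are examples, not fixtures from the file.
from utils.tokenizer import multilingual_cleaners, split_sentence

print(multilingual_cleaners("Dr. Smith owes me $20.", "en"))
# -> doctor smith owes me twenty dollars.

# split_sentence() chunks long inputs at sentence boundaries (~250 chars per chunk here)
chunks = split_sentence("This is one sentence. " * 30, "en", text_split_length=250)
print(len(chunks), [len(c) for c in chunks])
```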
xtts_demo.py ADDED
@@ -0,0 +1,693 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import os
3
+ import sys
4
+ import tempfile
5
+ from pathlib import Path
6
+
7
+ import os
8
+ import shutil
9
+ import glob
10
+
11
+ import gradio as gr
12
+ import librosa.display
13
+ import numpy as np
14
+
15
+ import torch
16
+ import torchaudio
17
+ import traceback
18
+ from utils.formatter import format_audio_list,find_latest_best_model, list_audios
19
+ from utils.gpt_train import train_gpt
20
+
21
+ from faster_whisper import WhisperModel
22
+
23
+ from TTS.tts.configs.xtts_config import XttsConfig
24
+ from TTS.tts.models.xtts import Xtts
25
+
26
+ from TTS.tts.configs.xtts_config import XttsConfig
27
+ from TTS.tts.models.xtts import Xtts
28
+
# Clear logs
def remove_log_file(file_path):
    log_file = Path(file_path)

    if log_file.exists() and log_file.is_file():
        log_file.unlink()

# remove_log_file(str(Path.cwd() / "log.out"))

def clear_gpu_cache():
    # clear the GPU cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

XTTS_MODEL = None
def load_model(xtts_checkpoint, xtts_config, xtts_vocab, xtts_speaker):
    global XTTS_MODEL
    clear_gpu_cache()
    if not xtts_checkpoint or not xtts_config or not xtts_vocab:
        return "You need to run the previous steps or manually set the `XTTS checkpoint path`, `XTTS config path`, and `XTTS vocab path` fields!"
    config = XttsConfig()
    config.load_json(xtts_config)
    XTTS_MODEL = Xtts.init_from_config(config)
    print("Loading XTTS model!")
    XTTS_MODEL.load_checkpoint(config, checkpoint_path=xtts_checkpoint, vocab_path=xtts_vocab, speaker_file_path=xtts_speaker, use_deepspeed=False)
    if torch.cuda.is_available():
        XTTS_MODEL.cuda()

    print("Model Loaded!")
    return "Model Loaded!"

def run_tts(lang, tts_text, speaker_audio_file, temperature, length_penalty, repetition_penalty, top_k, top_p, sentence_split, use_config):
    if XTTS_MODEL is None or not speaker_audio_file:
        return "You need to run the previous step to load the model!", None, None

    gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(audio_path=speaker_audio_file, gpt_cond_len=XTTS_MODEL.config.gpt_cond_len, max_ref_length=XTTS_MODEL.config.max_ref_len, sound_norm_refs=XTTS_MODEL.config.sound_norm_refs)

    if use_config:
        # Use the inference settings stored in the model config
        out = XTTS_MODEL.inference(
            text=tts_text,
            language=lang,
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=XTTS_MODEL.config.temperature,
            length_penalty=XTTS_MODEL.config.length_penalty,
            repetition_penalty=XTTS_MODEL.config.repetition_penalty,
            top_k=XTTS_MODEL.config.top_k,
            top_p=XTTS_MODEL.config.top_p,
            enable_text_splitting=True,
        )
    else:
        # Use the custom parameters set in the "Advanced settings" accordion
        out = XTTS_MODEL.inference(
            text=tts_text,
            language=lang,
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=temperature,
            length_penalty=length_penalty,
            repetition_penalty=float(repetition_penalty),
            top_k=top_k,
            top_p=top_p,
            enable_text_splitting=sentence_split,
        )

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
        out["wav"] = torch.tensor(out["wav"]).unsqueeze(0)
        out_path = fp.name
        torchaudio.save(out_path, out["wav"], 24000)

    return "Speech generated!", out_path, speaker_audio_file


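# Note: get_conditioning_latents() above is recomputed on every call. If you
# generate many utterances from the same reference file, a minimal caching
# sketch (hypothetical helper, not part of this webui) could look like:
#
#   _latent_cache = {}
#   def get_latents_cached(path):
#       if path not in _latent_cache:
#           _latent_cache[path] = XTTS_MODEL.get_conditioning_latents(audio_path=path)
#       return _latent_cache[path]
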
def load_params_tts(out_path, version):

    out_path = Path(out_path)

    # base_model_path = Path.cwd() / "models" / version

    # if not base_model_path.exists():
    #     return "Base model not found !","","",""

    ready_model_path = out_path / "ready"

    vocab_path = ready_model_path / "vocab.json"
    config_path = ready_model_path / "config.json"
    speaker_path = ready_model_path / "speakers_xtts.pth"
    reference_path = ready_model_path / "reference.wav"

    model_path = ready_model_path / "model.pth"

    if not model_path.exists():
        model_path = ready_model_path / "unoptimize_model.pth"
        if not model_path.exists():
            # Return one value per output component so Gradio does not error out
            return "Params for TTS not found", "", "", "", "", ""

    return "Params for TTS loaded", model_path, config_path, vocab_path, speaker_path, reference_path

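# For reference, load_params_tts() expects the layout that the training and
# optimization steps produce under <out_path>/ready/:
#
#   ready/
#     config.json          # model config used at inference time
#     vocab.json           # tokenizer vocabulary
#     speakers_xtts.pth    # speaker file
#     reference.wav        # example reference audio for voice cloning
#     model.pth            # optimized checkpoint (or unoptimize_model.pth)
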
if __name__ == "__main__":

    parser = argparse.ArgumentParser(
        description="""XTTS fine-tuning demo\n\n"""
        """
        Example runs:
        python3 TTS/demos/xtts_ft_demo/xtts_demo.py --port
        """,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "--port",
        type=int,
        help="Port to run the gradio demo. Default: 5003",
        default=5003,
    )
    parser.add_argument(
        "--out_path",
        type=str,
        help="Output path (where data and checkpoints will be saved). Default: finetune_models/",
        default=str(Path.cwd() / "finetune_models"),
    )

    parser.add_argument(
        "--num_epochs",
        type=int,
        help="Number of epochs to train. Default: 6",
        default=6,
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        help="Batch size. Default: 2",
        default=2,
    )
    parser.add_argument(
        "--grad_acumm",
        type=int,
        help="Grad accumulation steps. Default: 1",
        default=1,
    )
    parser.add_argument(
        "--max_audio_length",
        type=int,
        help="Max permitted audio size in seconds. Default: 11",
        default=11,
    )

    args = parser.parse_args()

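    # A typical invocation (flags map to the arguments defined above; all are
    # optional and fall back to their defaults):
    #
    #   python xtts_demo.py --port 5003 --out_path finetune_models \
    #       --num_epochs 6 --batch_size 2 --grad_acumm 1 --max_audio_length 11
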
    with gr.Blocks() as demo:
        with gr.Tab("1 - Data processing"):
            out_path = gr.Textbox(
                label="Output path (where data and checkpoints will be saved):",
                value=args.out_path,
            )
            # upload_file = gr.Audio(
            #     sources="upload",
            #     label="Select here the audio files that you want to use for XTTS training!",
            #     type="filepath",
            # )
            upload_file = gr.File(
                file_count="multiple",
                label="Select here the audio files that you want to use for XTTS training (supported formats: wav, mp3, and flac)",
            )

            audio_folder_path = gr.Textbox(
                label="Path to the folder with audio files (optional):",
                value="",
            )

            whisper_model = gr.Dropdown(
                label="Whisper Model",
                value="large-v3",
                choices=[
                    "large-v3",
                    "large-v2",
                    "large",
                    "medium",
                    "small"
                ],
            )

            lang = gr.Dropdown(
                label="Dataset Language",
                value="en",
                choices=[
                    "en",
                    "es",
                    "fr",
                    "de",
                    "it",
                    "pt",
                    "pl",
                    "tr",
                    "ru",
                    "nl",
                    "cs",
                    "ar",
                    "zh",
                    "hu",
                    "ko",
                    "ja"
                ],
            )
            progress_data = gr.Label(
                label="Progress:"
            )
            # demo.load(read_logs, None, logs, every=1)

            prompt_compute_btn = gr.Button(value="Step 1 - Create dataset")

            def preprocess_dataset(audio_path, audio_folder_path, language, whisper_model, out_path, train_csv, eval_csv, progress=gr.Progress(track_tqdm=True)):
                clear_gpu_cache()

                train_csv = ""
                eval_csv = ""

                out_path = os.path.join(out_path, "dataset")
                os.makedirs(out_path, exist_ok=True)

                if audio_folder_path:
                    audio_files = list(list_audios(audio_folder_path))
                else:
                    audio_files = audio_path

                if not audio_files:
                    return "No audio files found! Please provide files via Gradio or specify a folder path.", "", ""
                else:
                    try:
                        # Load Whisper for transcription
                        device = "cuda" if torch.cuda.is_available() else "cpu"

                        # Detect compute type: half precision on GPU, full precision on CPU
                        if torch.cuda.is_available():
                            compute_type = "float16"
                        else:
                            compute_type = "float32"

                        asr_model = WhisperModel(whisper_model, device=device, compute_type=compute_type)
                        train_meta, eval_meta, audio_total_size = format_audio_list(audio_files, asr_model=asr_model, target_language=language, out_path=out_path, gradio_progress=progress)
                    except:
                        traceback.print_exc()
                        error = traceback.format_exc()
                        return f"Data processing was interrupted due to an error! Please check the console for the full error message. \n Error summary: {error}", "", ""

                # clear_gpu_cache()

                # if the total audio length is less than 2 minutes, raise an error
                if audio_total_size < 120:
                    message = "The total duration of the audio files you provided should be at least 2 minutes!"
                    print(message)
                    return message, "", ""

                print("Dataset Processed!")
                return "Dataset Processed!", train_meta, eval_meta

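            # On success, format_audio_list() leaves the dataset under
            # <out_path>/dataset/: the processed wavs plus metadata_train.csv,
            # metadata_eval.csv, and lang.txt (the language marker read back
            # before training). The returned train_meta and eval_meta paths
            # point at those two CSV files.
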
        with gr.Tab("2 - Fine-tuning XTTS Encoder"):
            load_params_btn = gr.Button(value="Load Params from output folder")
            version = gr.Dropdown(
                label="XTTS base version",
                value="v2.0.2",
                choices=[
                    "v2.0.3",
                    "v2.0.2",
                    "v2.0.1",
                    "v2.0.0",
                    "main"
                ],
            )
            train_csv = gr.Textbox(
                label="Train CSV:",
            )
            eval_csv = gr.Textbox(
                label="Eval CSV:",
            )
            custom_model = gr.Textbox(
                label="(Optional) Custom model.pth file; leave blank to use the base file.",
                value="",
            )
            num_epochs = gr.Slider(
                label="Number of epochs:",
                minimum=1,
                maximum=100,
                step=1,
                value=args.num_epochs,
            )
            batch_size = gr.Slider(
                label="Batch size:",
                minimum=2,
                maximum=512,
                step=1,
                value=args.batch_size,
            )
            grad_acumm = gr.Slider(
                label="Grad accumulation steps:",
                minimum=2,
                maximum=128,
                step=1,
                value=args.grad_acumm,
            )
            max_audio_length = gr.Slider(
                label="Max permitted audio size in seconds:",
                minimum=2,
                maximum=20,
                step=1,
                value=args.max_audio_length,
            )
            clear_train_data = gr.Dropdown(
                label="Clear train data: the selected folder will be deleted after optimizing",
                value="none",
                choices=[
                    "none",
                    "run",
                    "dataset",
                    "all"
                ])

            progress_train = gr.Label(
                label="Progress:"
            )

            # demo.load(read_logs, None, logs_tts_train, every=1)
            train_btn = gr.Button(value="Step 2 - Run the training")
            optimize_model_btn = gr.Button(value="Step 2.5 - Optimize the model")

            def train_model(custom_model, version, language, train_csv, eval_csv, num_epochs, batch_size, grad_acumm, output_path, max_audio_length):
                clear_gpu_cache()

                run_dir = Path(output_path) / "run"

                # Remove the previous train dir (use rmtree: os.remove cannot delete a directory)
                if run_dir.exists():
                    shutil.rmtree(run_dir)

                # Check if the dataset language matches the language you specified
                lang_file_path = Path(output_path) / "dataset" / "lang.txt"

                # Check if lang.txt already exists and contains a different language
                current_language = None
                if lang_file_path.exists():
                    with open(lang_file_path, 'r', encoding='utf-8') as existing_lang_file:
                        current_language = existing_lang_file.read().strip()
                        if current_language != language:
                            print("The language prepared for the dataset does not match the specified language. Switching to the language recorded in the dataset.")
                            language = current_language

                if not train_csv or not eval_csv:
                    return "You need to run the data processing step or manually set the `Train CSV` and `Eval CSV` fields!", "", "", "", "", ""
                try:
                    # convert seconds to waveform frames (training audio is sampled at 22,050 Hz)
                    max_audio_length = int(max_audio_length * 22050)
                    speaker_xtts_path, config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(custom_model, version, language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
                except:
                    traceback.print_exc()
                    error = traceback.format_exc()
                    return f"The training was interrupted due to an error! Please check the console for the full error message. \n Error summary: {error}", "", "", "", "", ""

                # copy original files to avoid parameter change issues
                # os.system(f"cp {config_path} {exp_path}")
                # os.system(f"cp {vocab_file} {exp_path}")

                ready_dir = Path(output_path) / "ready"

                ft_xtts_checkpoint = os.path.join(exp_path, "best_model.pth")

                shutil.copy(ft_xtts_checkpoint, ready_dir / "unoptimize_model.pth")
                # os.remove(ft_xtts_checkpoint)

                ft_xtts_checkpoint = os.path.join(ready_dir, "unoptimize_model.pth")

                # Reference
                # Copy the reference audio to the output folder and rename it
                speaker_reference_path = Path(speaker_wav)
                speaker_reference_new_path = ready_dir / "reference.wav"
                shutil.copy(speaker_reference_path, speaker_reference_new_path)

                print("Model training done!")
                # clear_gpu_cache()
                return "Model training done!", config_path, vocab_file, ft_xtts_checkpoint, speaker_xtts_path, speaker_reference_new_path

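            # Worked example of the frame conversion above: with the default
            # max_audio_length of 11 seconds, 11 * 22050 = 242550 frames is the
            # longest sample the trainer will accept.
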
            def optimize_model(out_path, clear_train_data):
                # print(out_path)
                out_path = Path(out_path)  # Ensure that out_path is a Path object.

                ready_dir = out_path / "ready"
                run_dir = out_path / "run"
                dataset_dir = out_path / "dataset"

                # Clear specified training data directories.
                if clear_train_data in {"run", "all"} and run_dir.exists():
                    try:
                        shutil.rmtree(run_dir)
                    except PermissionError as e:
                        print(f"An error occurred while deleting {run_dir}: {e}")

                if clear_train_data in {"dataset", "all"} and dataset_dir.exists():
                    try:
                        shutil.rmtree(dataset_dir)
                    except PermissionError as e:
                        print(f"An error occurred while deleting {dataset_dir}: {e}")

                # Get full path to model
                model_path = ready_dir / "unoptimize_model.pth"

                if not model_path.is_file():
                    return "Unoptimized model not found in ready folder", ""

                # Load the checkpoint and remove the parts that are not needed for
                # inference: the optimizer state and the DVAE weights.
                checkpoint = torch.load(model_path, map_location=torch.device("cpu"))
                del checkpoint["optimizer"]

                for key in list(checkpoint["model"].keys()):
                    if "dvae" in key:
                        del checkpoint["model"][key]

                # Remove the unoptimized checkpoint now that it is loaded in memory
                os.remove(model_path)

                # Save the optimized model.
                optimized_model_file_name = "model.pth"
                optimized_model = ready_dir / optimized_model_file_name

                torch.save(checkpoint, optimized_model)
                ft_xtts_checkpoint = str(optimized_model)

                clear_gpu_cache()

                return f"Model optimized and saved at {ft_xtts_checkpoint}!", ft_xtts_checkpoint

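            # A quick way to verify what the optimization step stripped, assuming
            # you run this from a Python shell inside the output folder:
            #
            #   import torch
            #   ckpt = torch.load("ready/model.pth", map_location="cpu")
            #   print("optimizer" in ckpt)                      # False
            #   print(any("dvae" in k for k in ckpt["model"]))  # False
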
            def load_params(out_path):
                path_output = Path(out_path)

                dataset_path = path_output / "dataset"

                if not dataset_path.exists():
                    # Return one value per output component so Gradio does not error out
                    return "The output folder does not exist!", "", "", ""

                eval_train = dataset_path / "metadata_train.csv"
                eval_csv = dataset_path / "metadata_eval.csv"

                # Read the dataset language back from lang.txt in the output directory
                lang_file_path = dataset_path / "lang.txt"

                current_language = None
                if os.path.exists(lang_file_path):
                    with open(lang_file_path, 'r', encoding='utf-8') as existing_lang_file:
                        current_language = existing_lang_file.read().strip()

                clear_gpu_cache()

                print(current_language)
                return "The data has been updated", eval_train, eval_csv, current_language

        with gr.Tab("3 - Inference"):
            with gr.Row():
                with gr.Column() as col1:
                    load_params_tts_btn = gr.Button(value="Load params for TTS from output folder")
                    xtts_checkpoint = gr.Textbox(
                        label="XTTS checkpoint path:",
                        value="",
                    )
                    xtts_config = gr.Textbox(
                        label="XTTS config path:",
                        value="",
                    )

                    xtts_vocab = gr.Textbox(
                        label="XTTS vocab path:",
                        value="",
                    )
                    xtts_speaker = gr.Textbox(
                        label="XTTS speaker path:",
                        value="",
                    )
                    progress_load = gr.Label(
                        label="Progress:"
                    )
                    load_btn = gr.Button(value="Step 3 - Load Fine-tuned XTTS model")

                with gr.Column() as col2:
                    speaker_reference_audio = gr.Textbox(
                        label="Speaker reference audio:",
                        value="",
                    )
                    tts_language = gr.Dropdown(
                        label="Language",
                        value="en",
                        choices=[
                            "en",
                            "es",
                            "fr",
                            "de",
                            "it",
                            "pt",
                            "pl",
                            "tr",
                            "ru",
                            "nl",
                            "cs",
                            "ar",
                            "zh",
                            "hu",
                            "ko",
                            "ja",
                        ]
                    )
                    tts_text = gr.Textbox(
                        label="Input Text:",
                        value="This model sounds really good and above all, it's reasonably fast.",
                    )
                    with gr.Accordion("Advanced settings", open=False) as acr:
                        temperature = gr.Slider(
                            label="temperature",
                            minimum=0,
                            maximum=1,
                            step=0.05,
                            value=0.75,
                        )
                        length_penalty = gr.Slider(
                            label="length_penalty",
                            minimum=-10.0,
                            maximum=10.0,
                            step=0.5,
                            value=1,
                        )
                        repetition_penalty = gr.Slider(
                            label="repetition penalty",
                            minimum=1,
                            maximum=10,
                            step=0.5,
                            value=5,
                        )
                        top_k = gr.Slider(
                            label="top_k",
                            minimum=1,
                            maximum=100,
                            step=1,
                            value=50,
                        )
                        top_p = gr.Slider(
                            label="top_p",
                            minimum=0,
                            maximum=1,
                            step=0.05,
                            value=0.85,
                        )
                        sentence_split = gr.Checkbox(
                            label="Enable text splitting",
                            value=True,
                        )
                        use_config = gr.Checkbox(
                            label="Use inference settings from config; if disabled, use the settings above",
                            value=False,
                        )
                    tts_btn = gr.Button(value="Step 4 - Inference")

                with gr.Column() as col3:
                    progress_gen = gr.Label(
                        label="Progress:"
                    )
                    tts_output_audio = gr.Audio(label="Generated Audio.")
                    reference_audio = gr.Audio(label="Reference audio used.")

        prompt_compute_btn.click(
            fn=preprocess_dataset,
            inputs=[
                upload_file,
                audio_folder_path,
                lang,
                whisper_model,
                out_path,
                train_csv,
                eval_csv,
            ],
            outputs=[
                progress_data,
                train_csv,
                eval_csv,
            ],
        )

        load_params_btn.click(
            fn=load_params,
            inputs=[out_path],
            outputs=[
                progress_train,
                train_csv,
                eval_csv,
                lang,
            ],
        )

        train_btn.click(
            fn=train_model,
            inputs=[
                custom_model,
                version,
                lang,
                train_csv,
                eval_csv,
                num_epochs,
                batch_size,
                grad_acumm,
                out_path,
                max_audio_length,
            ],
            outputs=[progress_train, xtts_config, xtts_vocab, xtts_checkpoint, xtts_speaker, speaker_reference_audio],
        )

        optimize_model_btn.click(
            fn=optimize_model,
            inputs=[
                out_path,
                clear_train_data,
            ],
            outputs=[progress_train, xtts_checkpoint],
        )

        load_btn.click(
            fn=load_model,
            inputs=[
                xtts_checkpoint,
                xtts_config,
                xtts_vocab,
                xtts_speaker,
            ],
            outputs=[progress_load],
        )

        tts_btn.click(
            fn=run_tts,
            inputs=[
                tts_language,
                tts_text,
                speaker_reference_audio,
                temperature,
                length_penalty,
                repetition_penalty,
                top_k,
                top_p,
                sentence_split,
                use_config,
            ],
            outputs=[progress_gen, tts_output_audio, reference_audio],
        )

        load_params_tts_btn.click(
            fn=load_params_tts,
            inputs=[
                out_path,
                version,
            ],
            outputs=[progress_load, xtts_checkpoint, xtts_config, xtts_vocab, xtts_speaker, speaker_reference_audio],
        )

    demo.launch(
        share=False,
        debug=False,
        server_port=args.port,
        # inweb=True,
        # server_name="localhost"
    )