# Features
- [x] VITS text-to-speech, voice conversion
- [x] HuBert-soft VITS
- [x] [vits_chinese](https://github.com/PlayVoice/vits_chinese)
- [x] [Bert-VITS2](https://github.com/Stardust-minus/Bert-VITS2)
- [x] W2V2 VITS / [emotional-vits](https://github.com/innnky/emotional-vits) dimensional emotion model
- [x] Support for loading multiple models
- [x] Automatic language recognition and processing, with the language detection scope set according to the model's cleaner; custom language ranges are also supported
- [x] Customize default parameters
- [x] Long text batch processing
- [x] GPU accelerated inference
- [x] SSML (Speech Synthesis Markup Language) support (work in progress)
## Demo
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Artrajz/vits-simple-api)
Please note that different IDs may support different languages. See the [speakers list](https://artrajz-vits-simple-api.hf.space/voice/speakers).
- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=你好,こんにちは&id=164`
- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=Difficult the first time, easy the second.&id=4`
- excited: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=111`
- whispered: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=2077`
https://user-images.githubusercontent.com/73542220/237995061-c1f25b4e-dd86-438a-9363-4bb1fe65b425.mov
# Deploy
## Docker (Recommended for Linux)
### Docker image pull script
```bash
bash -c "$(wget -O- https://raw.githubusercontent.com/Artrajz/vits-simple-api/main/vits-simple-api-installer-latest.sh)"
```
- The platforms currently supported by the Docker images are `linux/amd64` and `linux/arm64` (the arm64 image is CPU-only).
- After a successful pull, the VITS models need to be imported before use. Please follow the steps below to import them.
### Download VITS model
Put the model into `/usr/local/vits-simple-api/Model`
Folder structure:
```
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│      config.json
│      G_953000.pth
│
├─louise
│      360_epochs.pth
│      config.json
│
├─Nene_Nanami_Rong_Tang
│      1374_epochs.pth
│      config.json
│
├─Zero_no_tsukaima
│      1158_epochs.pth
│      config.json
│
└─npy
       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
       all_emotions.npy
```
### Modify model path
Modify in `/usr/local/vits-simple-api/config.py`
```python
# Fill in the model paths here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (requires HUBERT_SOFT_MODEL to be configured)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (requires DIMENSIONAL_EMOTION_NPY to be configured)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert-soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file(s)
# load a single npy:          ABS_PATH + "/all_emotions.npy"
# load multiple npy:          [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load all npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same directory.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
```
### Startup
`docker compose up -d`
Or execute the pull script again
### Image update
Run the docker image pull script again
## Virtual environment deployment
### Clone
`git clone https://github.com/Artrajz/vits-simple-api.git`
### Download Python dependencies
A Python virtual environment is recommended.
`pip install -r requirements.txt`
`fasttext` may fail to install on Windows; you can install it with the following command, or download a wheel [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#fasttext):
```bash
# python3.10 win_amd64
pip install https://github.com/Artrajz/archived/raw/main/fasttext/fasttext-0.9.2-cp310-cp310-win_amd64.whl
```
### Download VITS model
Put the model into `/path/to/vits-simple-api/Model`
Folder structure:
```
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│      config.json
│      G_953000.pth
│
├─louise
│      360_epochs.pth
│      config.json
│
├─Nene_Nanami_Rong_Tang
│      1374_epochs.pth
│      config.json
│
├─Zero_no_tsukaima
│      1158_epochs.pth
│      config.json
│
└─npy
       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
       all_emotions.npy
```
### Modify model path
Modify in `/path/to/vits-simple-api/config.py`
```python
# Fill in the model paths here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (requires HUBERT_SOFT_MODEL to be configured)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (requires DIMENSIONAL_EMOTION_NPY to be configured)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert-soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file(s)
# load a single npy:          ABS_PATH + "/all_emotions.npy"
# load multiple npy:          [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load all npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same directory.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
```
### Startup
`python app.py`
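Once the server is running, a quick smoke test against the speaker list confirms the models loaded (a minimal sketch using `requests`; the host and port are the defaults shown in the API section below):
```python
import requests

# Query the speaker list; a successful response means the server is up.
# 127.0.0.1:23456 is the default address used in the API examples below.
res = requests.get("http://127.0.0.1:23456/voice/speakers")
res.raise_for_status()
print(res.json())  # mapping of role IDs to speaker names, per model type
```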
# GPU acceleration
## Windows
### Install CUDA
Check the highest version of CUDA supported by your graphics card:
```bash
nvidia-smi
```
Taking CUDA 11.7 as an example, download it from the [official website](https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local).
### Install GPU version of PyTorch
1.13.1+cu117 is recommended; other versions may have memory instability issues.
```bash
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
```
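To confirm that the GPU build of PyTorch is actually being used, you can run a quick check:
```python
import torch

print(torch.__version__)          # should end with +cu117
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU
```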
## Linux
The installation process is similar, but I don't have the environment to test it.
# Dependency Installation Issues
Since pypi.org does not provide a `pyopenjtalk` wheel, it usually has to be built from source, which some people may find troublesome. You can instead install a wheel I built:
```bash
pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
```
# API
## GET
#### speakers list
- GET http://127.0.0.1:23456/voice/speakers
Returns the mapping table of role IDs to speaker names.
#### voice vits
- GET http://127.0.0.1:23456/voice/vits?text=text
Default values are used when other parameters are not specified.
- GET http://127.0.0.1:23456/voice/vits?text=[ZH]text[ZH][JA]text[JA]&lang=mix
When lang=mix, the text needs to be annotated with language tags, e.g. [ZH]…[ZH] and [JA]…[JA].
- GET http://127.0.0.1:23456/voice/vits?text=text&id=142&format=wav&lang=zh&length=1.4
Synthesizes "text" with speaker ID 142, wav output format, language zh, and length 1.4 (slower speech); all other parameters use their defaults.
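The same request from Python (a minimal sketch using `requests`; the parameter names match the VITS table in the Parameters section below):
```python
import requests

# Synthesize speech via the GET endpoint and save the audio to a file.
params = {
    "text": "你好",
    "id": 142,         # speaker ID
    "format": "wav",   # one of wav/ogg/silk/mp3/flac
    "lang": "zh",
    "length": 1.4,     # >1 slows the speech down
}
res = requests.get("http://127.0.0.1:23456/voice/vits", params=params)
res.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(res.content)
```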
#### check
- GET http://127.0.0.1:23456/voice/check?id=0&model=vits
## POST
- See `api_test.py`
## API KEY
Set `API_KEY_ENABLED = True` in `config.py` to enable API key authentication, and set the key itself with `API_KEY = "api-key"`.
Once enabled, you need to add the `api_key` parameter to GET requests and the `X-API-KEY` header to POST requests.
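For example (a sketch; the GET endpoint is the `/voice/vits` route shown above, while the exact POST payload format can be found in `api_test.py`):
```python
import requests

API_KEY = "api-key"  # must match API_KEY in config.py

# GET: pass the key as the api_key query parameter
res = requests.get(
    "http://127.0.0.1:23456/voice/vits",
    params={"text": "你好", "api_key": API_KEY},
)

# POST: pass the key as the X-API-KEY header
res = requests.post(
    "http://127.0.0.1:23456/voice/vits",
    json={"text": "你好"},  # illustrative body; see api_test.py for the real format
    headers={"X-API-KEY": API_KEY},
)
```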
# Parameters
## VITS
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text must be wrapped in [ZH] or [JA] tags. The default mode, auto, automatically detects the language of the text. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| Streaming response     | streaming | false    | false            | bool  | Streams the synthesized speech for a faster initial response. |
## VITS voice conversion
| Name              | Parameter   | Required | Default | Type | Instruction                                                  |
| ----------------- | ----------- | -------- | ------- | ---- | ------------------------------------------------------------ |
| Uploaded audio    | upload      | true     |         | file | The audio file to upload. It should be in wav or ogg format. |
| Source speaker ID | original_id | true     |         | int  | The speaker ID of the voice in the uploaded audio.           |
| Target speaker ID | target_id   | true     |         | int  | The speaker ID of the voice to convert the audio to.         |
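As a sketch of how these parameters fit together in a multipart POST (the `/voice/conversion` route below is an assumption; check `api_test.py` for the exact path):
```python
import requests

# Hypothetical conversion route; confirm the real path in api_test.py.
url = "http://127.0.0.1:23456/voice/conversion"

with open("input.wav", "rb") as f:
    res = requests.post(
        url,
        files={"upload": f},                        # wav or ogg audio
        data={"original_id": 4, "target_id": 164},  # source and target speaker IDs
    )
res.raise_for_status()
with open("converted.wav", "wb") as f:
    f.write(res.content)
```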
## HuBert-VITS
| Name              | Parameter | Required | Default | Type  | Instruction                                                  |
| ----------------- | --------- | -------- | ------- | ----- | ------------------------------------------------------------ |
| Uploaded audio    | upload    | true     |         | file  | The audio file to upload. It should be in wav or ogg format. |
| Target speaker ID | id        | true     |         | int   | The target speaker ID.                                       |
| Audio format      | format    | true     |         | str   | Supports wav, ogg, silk.                                     |
| Audio length      | length    | true     |         | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise             | noise     | true     |         | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise         | noisew    | true     |         | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
## W2V2-VITS
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text must be wrapped in [ZH] or [JA] tags. The default mode, auto, automatically detects the language of the text. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| Dimensional emotion    | emotion   | false    | 0                | int   | The valid range depends on the npy emotion reference file; for example, [innnky](https://huggingface.co/spaces/innnky/nene-emotion/tree/main)'s all_emotions.npy covers 0-5457. |
## Dimensional emotion
| Name           | Parameter | Required | Default | Type | Instruction                                                  |
| -------------- | --------- | -------- | ------- | ---- | ------------------------------------------------------------ |
| Uploaded audio | upload    | true     |         | file | The audio file to upload. The API returns an npy file storing the dimensional emotion vector. |
## Bert-VITS2
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | auto is the default mode and detects the language automatically, but it can currently only detect the language of the passage as a whole, not sentence by sentence. The other available options are zh and ja. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| SDP/DP mix ratio       | sdp_ratio | false    | From `config.py` | int   | The theoretical proportion of the SDP during synthesis; the higher the ratio, the larger the variance in the tone of the synthesized speech. |
## SSML (Speech Synthesis Markup Language)
Supported Elements and Attributes
`speak` Element
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id        | Default value is retrieved from `config.py`                  | false    |
| lang      | Default value is retrieved from `config.py`                  | false    |
| length    | Default value is retrieved from `config.py`                  | false    |
| noise     | Default value is retrieved from `config.py`                  | false    |
| noisew    | Default value is retrieved from `config.py`                  | false    |
| max       | Splits text into segments at punctuation marks, merging segments into one until the combined length exceeds `max`. `max<=0` means no segmentation. The default value is 0. | false    |
| model     | Default is `vits`. Options: `w2v2-vits`, `emotion-vits`      | false    |
| emotion   | Only effective when using `w2v2-vits` or `emotion-vits`. The range depends on the npy emotion reference file. | false    |
`voice` Element
Higher priority than `speak`.
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id        | Default value is retrieved from `config.py`                  | false    |
| lang      | Default value is retrieved from `config.py`                  | false    |
| length    | Default value is retrieved from `config.py`                  | false    |
| noise     | Default value is retrieved from `config.py`                  | false    |
| noisew    | Default value is retrieved from `config.py`                  | false    |
| max       | Splits text into segments at punctuation marks, merging segments into one until the combined length exceeds `max`. `max<=0` means no segmentation. The default value is 0. | false    |
| model     | Default is `vits`. Options: `w2v2-vits`, `emotion-vits`      | false    |
| emotion   | Only effective when using `w2v2-vits` or `emotion-vits`      | false    |
`break` Element
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| strength  | x-weak, weak, medium (default), strong, x-strong             | false    |
| time      | The absolute duration of a pause, in seconds (e.g. `2s`) or milliseconds (e.g. `500ms`). Valid values range from 0 to 5000 milliseconds. If you set a value greater than the supported maximum, the service uses `5000ms`. If the `time` attribute is set, the `strength` attribute is ignored. | false    |

| Strength | Relative Duration |
| :------- | :---------------- |
| x-weak | 250 ms |
| weak | 500 ms |
| medium | 750 ms |
| strong | 1000 ms |
| x-strong | 1250 ms |
Example (the tags below follow the elements documented above; the speaker IDs are placeholders, so substitute IDs from your own models):
```xml
<speak lang="zh" format="mp3" length="1.2">
    <voice id="92">这几天心里颇不宁静。</voice>
    <voice id="125">今晚在院子里坐着乘凉,忽然想起日日走过的荷塘,在这满月的光里,总该另有一番样子吧。</voice>
    <voice id="142">月亮渐渐地升高了,墙外马路上孩子们的欢笑,已经听不见了;</voice>
    <voice id="98">妻在屋里拍着闰儿,迷迷糊糊地哼着眠歌。</voice>
    <voice id="120">我悄悄地披了大衫,带上门出去。</voice>
    <break time="2s"/>
    <voice id="121">沿着荷塘,是一条曲折的小煤屑路。</voice>
    <voice id="122">这是一条幽僻的路;白天也少人走,夜晚更加寂寞。</voice>
    <voice id="123">荷塘四面,长着许多树,蓊蓊郁郁的。</voice>
    <voice id="124">路的一旁,是些杨柳,和一些不知道名字的树。</voice>
    <voice id="125">没有月光的晚上,这路上阴森森的,有些怕人。</voice>
    <voice id="126">今晚却很好,虽然月光也还是淡淡的。</voice>
    <break time="2s"/>
    <voice id="127">路上只我一个人,背着手踱着。</voice>
    <voice id="128">这一片天地好像是我的;我也像超出了平常的自己,到了另一个世界里。</voice>
    <voice id="129">我爱热闹,也爱冷静;爱群居,也爱独处。</voice>
    <voice id="130">像今晚上,一个人在这苍茫的月下,什么都可以想,什么都可以不想,便觉是个自由的人。</voice>
    <voice id="131">白天里一定要做的事,一定要说的话,现在都可不理。</voice>
    <voice id="132">这是独处的妙处,我且受用这无边的荷香月色好了。</voice>
</speak>
```
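To synthesize speech from an SSML document, you would POST it to the server (a sketch; the `/voice/ssml` route and the payload key are assumptions — check `app.py` and `api_test.py` for the exact endpoint and format):
```python
import requests

# A short SSML document using the elements documented above.
ssml = """<speak lang="zh" format="mp3" length="1.2">
    <voice id="0">这几天心里颇不宁静。</voice>
    <break time="500ms"/>
    <voice id="0">今晚在院子里坐着乘凉。</voice>
</speak>"""

# /voice/ssml and the "ssml" key are assumptions; confirm them in app.py.
res = requests.post("http://127.0.0.1:23456/voice/ssml", json={"ssml": ssml})
res.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(res.content)
```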
# Communication
For learning and communication; currently there is only a Chinese [QQ group](https://qm.qq.com/cgi-bin/qm/qr?k=-1GknIe4uXrkmbDKBGKa1aAUteq40qs_&jump_from=webapi&authKey=x5YYt6Dggs1ZqWxvZqvj3fV8VUnxRyXm5S5Kzntc78+Nv3iXOIawplGip9LWuNR/).
# Acknowledgements
- vits: https://github.com/jaywalnut310/vits
- MoeGoe: https://github.com/CjangCjengh/MoeGoe
- emotional-vits: https://github.com/innnky/emotional-vits
- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
- vits_chinese: https://github.com/PlayVoice/vits_chinese
- Bert-VITS2: https://github.com/fishaudio/Bert-VITS2