# Features
- [x] VITS text-to-speech, voice conversion
- [x] HuBert-soft VITS
- [x] [vits_chinese](https://github.com/PlayVoice/vits_chinese)
- [x] [Bert-VITS2](https://github.com/Stardust-minus/Bert-VITS2)
- [x] W2V2 VITS / [emotional-vits](https://github.com/innnky/emotional-vits) dimensional emotion model
- [x] Support for loading multiple models
- [x] Automatic language recognition and processing, with the language detection scope set according to the model's cleaner; custom language ranges are also supported
- [x] Customize default parameters
- [x] Long text batch processing
- [x] GPU accelerated inference
- [x] SSML (Speech Synthesis Markup Language) support (work in progress)
## Demo
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Artrajz/vits-simple-api)
Please note that different IDs may support different languages. See the [speakers list](https://artrajz-vits-simple-api.hf.space/voice/speakers).
- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=你好,こんにちは&id=164`
- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=Difficult the first time, easy the second.&id=4`
- excited: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=111`
- whispered: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=2077`
https://user-images.githubusercontent.com/73542220/237995061-c1f25b4e-dd86-438a-9363-4bb1fe65b425.mov
# Deploy
## Docker (Recommended for Linux)
### Docker image pull script
```bash
bash -c "$(wget -O- https://raw.githubusercontent.com/Artrajz/vits-simple-api/main/vits-simple-api-installer-latest.sh)"
```
- The platforms currently supported by the Docker images are `linux/amd64` and `linux/arm64` (the arm64 image is CPU-only).
- After a successful pull, the VITS models need to be imported before use. Please follow the steps below to import them.
### Download VITS model
Put the model into `/usr/local/vits-simple-api/Model`
Folder structure:
```
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│      config.json
│      G_953000.pth
│
├─louise
│      360_epochs.pth
│      config.json
│
├─Nene_Nanami_Rong_Tang
│      1374_epochs.pth
│      config.json
│
├─Zero_no_tsukaima
│      1158_epochs.pth
│      config.json
│
└─npy
       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
       all_emotions.npy
```
### Modify model path
Modify in `/usr/local/vits-simple-api/config.py`
```python
# Fill in the model paths here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (requires HUBERT_SOFT_MODEL to be configured)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (requires DIMENSIONAL_EMOTION_NPY to be configured)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert-soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file(s)
# load a single npy:          ABS_PATH + "/all_emotions.npy"
# load multiple npy:          [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load all npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same directory.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
```
### Startup
`docker compose up -d`
Or execute the pull script again
### Image update
Run the docker image pull script again
## Virtual environment deployment
### Clone
`git clone https://github.com/Artrajz/vits-simple-api.git`
### Download Python dependencies
A Python virtual environment is recommended.
`pip install -r requirements.txt`
`fasttext` may fail to install on Windows; you can install it with the following command, or download a wheel [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#fasttext):
```bash
# python3.10 win_amd64
pip install https://github.com/Artrajz/archived/raw/main/fasttext/fasttext-0.9.2-cp310-cp310-win_amd64.whl
```
### Download VITS model
Put the model into `/path/to/vits-simple-api/Model`
Folder structure:
```
│  hubert-soft-0d54a1f4.pt
│  model.onnx
│  model.yaml
│
├─g
│      config.json
│      G_953000.pth
│
├─louise
│      360_epochs.pth
│      config.json
│
├─Nene_Nanami_Rong_Tang
│      1374_epochs.pth
│      config.json
│
├─Zero_no_tsukaima
│      1158_epochs.pth
│      config.json
│
└─npy
       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
       all_emotions.npy
```
### Modify model path
Modify in `/path/to/vits-simple-api/config.py`
```python
# Fill in the model paths here
MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS (requires HUBERT_SOFT_MODEL to be configured)
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
    # W2V2-VITS (requires DIMENSIONAL_EMOTION_NPY to be configured)
    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
]
# hubert-vits: hubert-soft model
HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
# w2v2-vits: dimensional emotion npy file(s)
# load a single npy:          ABS_PATH + "/all_emotions.npy"
# load multiple npy:          [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
# load all npy from a folder: ABS_PATH + "/Model/npy"
DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same directory.
DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
```
### Startup
`python app.py`
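Once the server is running, a quick smoke test against the speaker list confirms the models loaded (a minimal sketch using `requests`; the host and port are the defaults shown in the API section below):
```python
import requests

# Query the speaker list; a successful response means the server is up.
# 127.0.0.1:23456 is the default address used in the API examples below.
res = requests.get("http://127.0.0.1:23456/voice/speakers")
res.raise_for_status()
print(res.json())  # mapping of role IDs to speaker names, per model type
```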
# GPU acceleration
## Windows
### Install CUDA
Check the highest version of CUDA supported by your graphics card:
```bash
nvidia-smi
```
Taking CUDA 11.7 as an example, download it from the [official website](https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local).
### Install GPU version of PyTorch
1.13.1+cu117 is recommended; other versions may have memory instability issues.
```bash
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
```
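To confirm that the GPU build of PyTorch is actually being used, you can run a quick check:
```python
import torch

print(torch.__version__)          # should end with +cu117
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU
```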
## Linux
The installation process is similar, but I don't have the environment to test it.
# Dependency Installation Issues
Since pypi.org does not provide a `pyopenjtalk` wheel, it usually has to be built from source, which some people may find troublesome. You can instead install a wheel I built:
```bash
pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
```
# API
## GET
#### speakers list
- GET http://127.0.0.1:23456/voice/speakers
Returns the mapping table of role IDs to speaker names.
#### voice vits
- GET http://127.0.0.1:23456/voice/vits?text=text
Default values are used when other parameters are not specified.
- GET http://127.0.0.1:23456/voice/vits?text=[ZH]text[ZH][JA]text[JA]&lang=mix
When lang=mix, the text needs to be annotated with language tags, e.g. [ZH]…[ZH] and [JA]…[JA].
- GET http://127.0.0.1:23456/voice/vits?text=text&id=142&format=wav&lang=zh&length=1.4
Synthesizes "text" with speaker ID 142, wav output format, language zh, and length 1.4 (slower speech); all other parameters use their defaults.
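The same request from Python (a minimal sketch using `requests`; the parameter names match the VITS table in the Parameters section below):
```python
import requests

# Synthesize speech via the GET endpoint and save the audio to a file.
params = {
    "text": "你好",
    "id": 142,         # speaker ID
    "format": "wav",   # one of wav/ogg/silk/mp3/flac
    "lang": "zh",
    "length": 1.4,     # >1 slows the speech down
}
res = requests.get("http://127.0.0.1:23456/voice/vits", params=params)
res.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(res.content)
```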
#### check
- GET http://127.0.0.1:23456/voice/check?id=0&model=vits
## POST
- See `api_test.py`
## API KEY
Set `API_KEY_ENABLED = True` in `config.py` to enable API key authentication, and set the key itself with `API_KEY = "api-key"`.
Once enabled, you need to add the `api_key` parameter to GET requests and the `X-API-KEY` header to POST requests.
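For example (a sketch; the GET endpoint is the `/voice/vits` route shown above, while the exact POST payload format can be found in `api_test.py`):
```python
import requests

API_KEY = "api-key"  # must match API_KEY in config.py

# GET: pass the key as the api_key query parameter
res = requests.get(
    "http://127.0.0.1:23456/voice/vits",
    params={"text": "你好", "api_key": API_KEY},
)

# POST: pass the key as the X-API-KEY header
res = requests.post(
    "http://127.0.0.1:23456/voice/vits",
    json={"text": "你好"},  # illustrative body; see api_test.py for the real format
    headers={"X-API-KEY": API_KEY},
)
```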
# Parameters
## VITS
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text must be wrapped in [ZH] or [JA] tags. The default mode, auto, automatically detects the language of the text. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| Streaming response     | streaming | false    | false            | bool  | Streams the synthesized speech for a faster initial response. |
## VITS voice conversion
| Name              | Parameter   | Required | Default | Type | Instruction                                                  |
| ----------------- | ----------- | -------- | ------- | ---- | ------------------------------------------------------------ |
| Uploaded audio    | upload      | true     |         | file | The audio file to upload. It should be in wav or ogg format. |
| Source speaker ID | original_id | true     |         | int  | The speaker ID of the voice in the uploaded audio.           |
| Target speaker ID | target_id   | true     |         | int  | The speaker ID of the voice to convert the audio to.         |
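As a sketch of how these parameters fit together in a multipart POST (the `/voice/conversion` route below is an assumption; check `api_test.py` for the exact path):
```python
import requests

# Hypothetical conversion route; confirm the real path in api_test.py.
url = "http://127.0.0.1:23456/voice/conversion"

with open("input.wav", "rb") as f:
    res = requests.post(
        url,
        files={"upload": f},                        # wav or ogg audio
        data={"original_id": 4, "target_id": 164},  # source and target speaker IDs
    )
res.raise_for_status()
with open("converted.wav", "wb") as f:
    f.write(res.content)
```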
## HuBert-VITS
| Name              | Parameter | Required | Default | Type  | Instruction                                                  |
| ----------------- | --------- | -------- | ------- | ----- | ------------------------------------------------------------ |
| Uploaded audio    | upload    | true     |         | file  | The audio file to upload. It should be in wav or ogg format. |
| Target speaker ID | id        | true     |         | int   | The target speaker ID.                                       |
| Audio format      | format    | true     |         | str   | Supports wav, ogg, silk.                                     |
| Audio length      | length    | true     |         | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise             | noise     | true     |         | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise         | noisew    | true     |         | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
## W2V2-VITS
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | The language of the text to be synthesized. Available options are auto, zh, ja, and mix. When lang=mix, the text must be wrapped in [ZH] or [JA] tags. The default mode, auto, automatically detects the language of the text. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| Dimensional emotion    | emotion   | false    | 0                | int   | The valid range depends on the npy emotion reference file; for example, [innnky](https://huggingface.co/spaces/innnky/nene-emotion/tree/main)'s all_emotions.npy covers 0-5457. |
## Dimensional emotion
| Name           | Parameter | Required | Default | Type | Instruction                                                  |
| -------------- | --------- | -------- | ------- | ---- | ------------------------------------------------------------ |
| Uploaded audio | upload    | true     |         | file | The audio file to upload. The API returns an npy file storing the dimensional emotion vector. |
## Bert-VITS2
| Name                   | Parameter | Required | Default          | Type  | Instruction                                                  |
| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
| Synthesized text       | text      | true     |                  | str   | The text to synthesize.                                      |
| Speaker ID             | id        | false    | From `config.py` | int   | The speaker ID.                                              |
| Audio format           | format    | false    | From `config.py` | str   | Supports wav, ogg, silk, mp3, flac.                          |
| Text language          | lang      | false    | From `config.py` | str   | auto is the default mode and detects the language automatically, but it can currently only detect the language of the passage as a whole, not sentence by sentence. The other available options are zh and ja. |
| Audio length           | length    | false    | From `config.py` | float | Adjusts the length of the synthesized speech, which is equivalent to adjusting the speaking speed. The larger the value, the slower the speech. |
| Noise                  | noise     | false    | From `config.py` | float | Sample noise, controlling the randomness of the synthesis.   |
| SDP noise              | noisew    | false    | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
| Segmentation threshold | max       | false    | From `config.py` | int   | Splits the text into segments at punctuation marks, combining segments into one batch until the combined length exceeds max. If max<=0, the text is not segmented. |
| SDP/DP mix ratio       | sdp_ratio | false    | From `config.py` | int   | The theoretical proportion of the SDP during synthesis; the higher the ratio, the larger the variance in the tone of the synthesized speech. |
## SSML (Speech Synthesis Markup Language)
Supported Elements and Attributes
`speak` Element
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id        | Default value is retrieved from `config.py`                  | false    |
| lang      | Default value is retrieved from `config.py`                  | false    |
| length    | Default value is retrieved from `config.py`                  | false    |
| noise     | Default value is retrieved from `config.py`                  | false    |
| noisew    | Default value is retrieved from `config.py`                  | false    |
| max       | Splits text into segments at punctuation marks, merging segments into one until the combined length exceeds `max`. `max<=0` means no segmentation. The default value is 0. | false    |
| model     | Default is `vits`. Options: `w2v2-vits`, `emotion-vits`      | false    |
| emotion   | Only effective when using `w2v2-vits` or `emotion-vits`. The range depends on the npy emotion reference file. | false    |
`voice` Element
Higher priority than `speak`.
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| id        | Default value is retrieved from `config.py`                  | false    |
| lang      | Default value is retrieved from `config.py`                  | false    |
| length    | Default value is retrieved from `config.py`                  | false    |
| noise     | Default value is retrieved from `config.py`                  | false    |
| noisew    | Default value is retrieved from `config.py`                  | false    |
| max       | Splits text into segments at punctuation marks, merging segments into one until the combined length exceeds `max`. `max<=0` means no segmentation. The default value is 0. | false    |
| model     | Default is `vits`. Options: `w2v2-vits`, `emotion-vits`      | false    |
| emotion   | Only effective when using `w2v2-vits` or `emotion-vits`      | false    |
`break` Element
| Attribute | Instruction                                                  | Required |
| --------- | ------------------------------------------------------------ | -------- |
| strength  | x-weak, weak, medium (default), strong, x-strong             | false    |
| time      | The absolute duration of a pause, in seconds (e.g. `2s`) or milliseconds (e.g. `500ms`). Valid values range from 0 to 5000 milliseconds. If you set a value greater than the supported maximum, the service uses `5000ms`. If the `time` attribute is set, the `strength` attribute is ignored. | false    |

| Strength | Relative Duration |
| :------- | :---------------- |
| x-weak | 250 ms |
| weak | 500 ms |
| medium | 750 ms |
| strong | 1000 ms |
| x-strong | 1250 ms |
Example (the tags below follow the elements documented above; the speaker IDs are placeholders, so substitute IDs from your own models):
```xml
<speak lang="zh" format="mp3" length="1.2">
    <voice id="92">这几天心里颇不宁静。</voice>
    <voice id="125">今晚在院子里坐着乘凉,忽然想起日日走过的荷塘,在这满月的光里,总该另有一番样子吧。</voice>
    <voice id="142">月亮渐渐地升高了,墙外马路上孩子们的欢笑,已经听不见了;</voice>
    <voice id="98">妻在屋里拍着闰儿,迷迷糊糊地哼着眠歌。</voice>
    <voice id="120">我悄悄地披了大衫,带上门出去。</voice>
    <break time="2s"/>
    <voice id="121">沿着荷塘,是一条曲折的小煤屑路。</voice>
    <voice id="122">这是一条幽僻的路;白天也少人走,夜晚更加寂寞。</voice>
    <voice id="123">荷塘四面,长着许多树,蓊蓊郁郁的。</voice>
    <voice id="124">路的一旁,是些杨柳,和一些不知道名字的树。</voice>
    <voice id="125">没有月光的晚上,这路上阴森森的,有些怕人。</voice>
    <voice id="126">今晚却很好,虽然月光也还是淡淡的。</voice>
    <break time="2s"/>
    <voice id="127">路上只我一个人,背着手踱着。</voice>
    <voice id="128">这一片天地好像是我的;我也像超出了平常的自己,到了另一个世界里。</voice>
    <voice id="129">我爱热闹,也爱冷静;爱群居,也爱独处。</voice>
    <voice id="130">像今晚上,一个人在这苍茫的月下,什么都可以想,什么都可以不想,便觉是个自由的人。</voice>
    <voice id="131">白天里一定要做的事,一定要说的话,现在都可不理。</voice>
    <voice id="132">这是独处的妙处,我且受用这无边的荷香月色好了。</voice>
</speak>
```
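To synthesize speech from an SSML document, you would POST it to the server (a sketch; the `/voice/ssml` route and the payload key are assumptions — check `app.py` and `api_test.py` for the exact endpoint and format):
```python
import requests

# A short SSML document using the elements documented above.
ssml = """<speak lang="zh" format="mp3" length="1.2">
    <voice id="0">这几天心里颇不宁静。</voice>
    <break time="500ms"/>
    <voice id="0">今晚在院子里坐着乘凉。</voice>
</speak>"""

# /voice/ssml and the "ssml" key are assumptions; confirm them in app.py.
res = requests.post("http://127.0.0.1:23456/voice/ssml", json={"ssml": ssml})
res.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(res.content)
```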
# Communication
For learning and communication; currently there is only a Chinese [QQ group](https://qm.qq.com/cgi-bin/qm/qr?k=-1GknIe4uXrkmbDKBGKa1aAUteq40qs_&jump_from=webapi&authKey=x5YYt6Dggs1ZqWxvZqvj3fV8VUnxRyXm5S5Kzntc78+Nv3iXOIawplGip9LWuNR/).
# Acknowledgements
- vits: https://github.com/jaywalnut310/vits
- MoeGoe: https://github.com/CjangCjengh/MoeGoe
- emotional-vits: https://github.com/innnky/emotional-vits
- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
- vits_chinese: https://github.com/PlayVoice/vits_chinese
- Bert-VITS2: https://github.com/fishaudio/Bert-VITS2