Spaces: Runtime error

aoxiang1221 committed · Commit c395ff4 · 1 Parent(s): 7613654

update

Browse files:
- README.md +7 -436
- bert_vits2/Model/Azuma/G_17400.pth +3 -0
- bert_vits2/Model/Azuma/config.json +95 -0
- bert_vits2/bert/pytorch_model.bin +3 -0
- config.py +2 -0

README.md CHANGED
@@ -1,436 +1,7 @@
-
-  <img src="https://img.shields.io/badge/python-3.10-green">
-  <a href="https://hub.docker.com/r/artrajz/vits-simple-api">
-  <img src="https://img.shields.io/docker/pulls/artrajz/vits-simple-api"></a>
-  </p>
-  <a href="https://github.com/Artrajz/vits-simple-api/blob/main/README.md">English</a>|<a href="https://github.com/Artrajz/vits-simple-api/blob/main/README_zh.md">中文文档</a>
-  <br/>
-</div>
-
-# Feature
-
-- [x] VITS text-to-speech and voice conversion
-- [x] HuBert-soft VITS
-- [x] [vits_chinese](https://github.com/PlayVoice/vits_chinese)
-- [x] [Bert-VITS2](https://github.com/Stardust-minus/Bert-VITS2)
-- [x] W2V2 VITS / [emotional-vits](https://github.com/innnky/emotional-vits) dimensional emotion model
-- [x] Support for loading multiple models
-- [x] Automatic language recognition and processing; the language-detection scope is set according to the model's cleaner, and a custom language range is also supported
-- [x] Customizable default parameters
-- [x] Long-text batch processing
-- [x] GPU-accelerated inference
-- [x] SSML (Speech Synthesis Markup Language) work in progress...
-
-## demo
-
-[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Artrajz/vits-simple-api)
-
-Please note that different IDs may support different languages. [speakers](https://artrajz-vits-simple-api.hf.space/voice/speakers)
-
-- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=你好,こんにちは&id=164`
-- `https://artrajz-vits-simple-api.hf.space/voice/vits?text=Difficult the first time, easy the second.&id=4`
-- excited: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=111`
-- whispered: `https://artrajz-vits-simple-api.hf.space/voice/w2v2-vits?text=こんにちは&id=3&emotion=2077`
-
-https://user-images.githubusercontent.com/73542220/237995061-c1f25b4e-dd86-438a-9363-4bb1fe65b425.mov
-
-# Deploy
-
-## Docker (recommended for Linux)
-
-### Docker image pull script
-
-```
-bash -c "$(wget -O- https://raw.githubusercontent.com/Artrajz/vits-simple-api/main/vits-simple-api-installer-latest.sh)"
-```
-
-- The platforms currently supported by the Docker images are `linux/amd64` and `linux/arm64` (arm64 has a CPU-only version).
-- After a successful pull, a VITS model needs to be imported before use. Follow the steps below to import a model.
-
-### Download VITS model
-
-Put the model into `/usr/local/vits-simple-api/Model`
-
-<details><summary>Folder structure</summary><pre><code>
-│  hubert-soft-0d54a1f4.pt
-│  model.onnx
-│  model.yaml
-│
-├─g
-│      config.json
-│      G_953000.pth
-│
-├─louise
-│      360_epochs.pth
-│      config.json
-│
-├─Nene_Nanami_Rong_Tang
-│      1374_epochs.pth
-│      config.json
-│
-├─Zero_no_tsukaima
-│      1158_epochs.pth
-│      config.json
-│
-└─npy
-       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
-       all_emotions.npy
-</code></pre></details>
-
-### Modify model path
-
-Modify in `/usr/local/vits-simple-api/config.py`
-
-<details><summary>config.py</summary><pre><code>
-# Fill in the model path here
-MODEL_LIST = [
-    # VITS
-    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
-    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
-    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
-    # HuBert-VITS (needs HUBERT_SOFT_MODEL to be configured)
-    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
-    # W2V2-VITS (needs DIMENSIONAL_EMOTION_NPY to be configured)
-    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
-]
-# hubert-vits: hubert soft model
-HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
-# w2v2-vits: dimensional emotion npy file
-# load a single npy: ABS_PATH + "/all_emotions.npy"
-# load multiple npy: [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
-# load multiple npy from a folder: ABS_PATH + "/Model/npy"
-DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
-# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same path.
-DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
-</code></pre></details>
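
Each `MODEL_LIST` entry above is a `[checkpoint, config]` pair. As an illustrative sketch (the root path is a placeholder and the actual model-loading code lives in the project), iterating such a list looks like:

```python
import os

# Placeholder project root; the real config.py computes ABS_PATH from its own location.
ABS_PATH = "/usr/local/vits-simple-api"

MODEL_LIST = [
    # VITS
    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
    # HuBert-VITS
    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
]

for checkpoint, config in MODEL_LIST:
    # Checkpoint and config sit in the same model folder.
    assert os.path.dirname(checkpoint) == os.path.dirname(config)
    print(os.path.basename(checkpoint), "+", os.path.basename(config))
```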
-
-### Startup
-
-`docker compose up -d`
-
-Or execute the pull script again.
-
-### Image update
-
-Run the Docker image pull script again.
-
-## Virtual environment deployment
-
-### Clone
-
-`git clone https://github.com/Artrajz/vits-simple-api.git`
-
-### Download Python dependencies
-
-A Python virtual environment is recommended.
-
-`pip install -r requirements.txt`
-
-Fasttext may fail to install on Windows. You can install it with the following command, or download a wheel [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#fasttext).
-
-```
-# python3.10 win_amd64
-pip install https://github.com/Artrajz/archived/raw/main/fasttext/fasttext-0.9.2-cp310-cp310-win_amd64.whl
-```
-
-### Download VITS model
-
-Put the model into `/path/to/vits-simple-api/Model`
-
-<details><summary>Folder structure</summary><pre><code>
-│  hubert-soft-0d54a1f4.pt
-│  model.onnx
-│  model.yaml
-│
-├─g
-│      config.json
-│      G_953000.pth
-│
-├─louise
-│      360_epochs.pth
-│      config.json
-│
-├─Nene_Nanami_Rong_Tang
-│      1374_epochs.pth
-│      config.json
-│
-├─Zero_no_tsukaima
-│      1158_epochs.pth
-│      config.json
-│
-└─npy
-       25ecb3f6-f968-11ed-b094-e0d4e84af078.npy
-       all_emotions.npy
-</code></pre></details>
-
-### Modify model path
-
-Modify in `/path/to/vits-simple-api/config.py`
-
-<details><summary>config.py</summary><pre><code>
-# Fill in the model path here
-MODEL_LIST = [
-    # VITS
-    [ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/1374_epochs.pth", ABS_PATH + "/Model/Nene_Nanami_Rong_Tang/config.json"],
-    [ABS_PATH + "/Model/Zero_no_tsukaima/1158_epochs.pth", ABS_PATH + "/Model/Zero_no_tsukaima/config.json"],
-    [ABS_PATH + "/Model/g/G_953000.pth", ABS_PATH + "/Model/g/config.json"],
-    # HuBert-VITS (needs HUBERT_SOFT_MODEL to be configured)
-    [ABS_PATH + "/Model/louise/360_epochs.pth", ABS_PATH + "/Model/louise/config.json"],
-    # W2V2-VITS (needs DIMENSIONAL_EMOTION_NPY to be configured)
-    [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
-]
-# hubert-vits: hubert soft model
-HUBERT_SOFT_MODEL = ABS_PATH + "/Model/hubert-soft-0d54a1f4.pt"
-# w2v2-vits: dimensional emotion npy file
-# load a single npy: ABS_PATH + "/all_emotions.npy"
-# load multiple npy: [ABS_PATH + "/emotions1.npy", ABS_PATH + "/emotions2.npy"]
-# load multiple npy from a folder: ABS_PATH + "/Model/npy"
-DIMENSIONAL_EMOTION_NPY = ABS_PATH + "/Model/npy"
-# w2v2-vits: both `model.onnx` and `model.yaml` must be in the same path.
-DIMENSIONAL_EMOTION_MODEL = ABS_PATH + "/Model/model.yaml"
-</code></pre></details>
-
-### Startup
-
-`python app.py`
-
-# GPU accelerated
-
-## Windows
-
-### Install CUDA
-
-Check the highest version of CUDA supported by your graphics card:
-
-```
-nvidia-smi
-```
-
-Taking CUDA 11.7 as an example, download it from the [official website](https://developer.nvidia.com/cuda-11-7-0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exe_local)
-
-### Install the GPU version of PyTorch
-
-1.13.1+cu117 is recommended; other versions may have memory instability issues.
-
-```
-pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
-```
-
-## Linux
-
-The installation process is similar, but I don't have the environment to test it.
-
-# Dependency Installation Issues
-
-Since pypi.org does not provide a `pyopenjtalk` wheel, it usually has to be built from source, which can be troublesome. You can instead install a prebuilt wheel:
-
-```
-pip install pyopenjtalk -i https://pypi.artrajz.cn/simple
-```
-
-# API
-
-## GET
-
-#### speakers list
-
-- GET http://127.0.0.1:23456/voice/speakers
-
-  Returns the mapping table of speaker IDs to speaker names.
-
-#### voice vits
-
-- GET http://127.0.0.1:23456/voice/vits?text=text
-
-  Default values are used when other parameters are not specified.
-
-- GET http://127.0.0.1:23456/voice/vits?text=[ZH]text[ZH][JA]text[JA]&lang=mix
-
-  When lang=mix, the text needs to be annotated with language tags.
-
-- GET http://127.0.0.1:23456/voice/vits?text=text&id=142&format=wav&lang=zh&length=1.4
-
-  The text is "text", the speaker ID is 142, the audio format is wav, the text language is zh, the speech length is 1.4, and the other parameters use their defaults.
-
-#### check
-
-- GET http://127.0.0.1:23456/voice/check?id=0&model=vits
-
-## POST
-
-- See `api_test.py`
-
-## API KEY
-
-Set `API_KEY_ENABLED = True` in `config.py` to enable API key authentication. The API key is `API_KEY = "api-key"`.
-After enabling it, add the `api_key` parameter to GET requests and the `X-API-KEY` header to POST requests.
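-
-A minimal client-side sketch of the GET calls above, built with the standard library (host, port, and key are the defaults quoted in this README; a client would then fetch the URL, e.g. with `requests.get(url)`):
-
-```python
-from urllib.parse import urlencode
-
-BASE = "http://127.0.0.1:23456"
-
-# GET request: parameters go in the query string,
-# including api_key when API_KEY_ENABLED = True.
-params = {"text": "text", "id": 142, "format": "wav", "lang": "zh",
-          "length": 1.4, "api_key": "api-key"}
-url = f"{BASE}/voice/vits?{urlencode(params)}"
-print(url)
-
-# POST request: the key travels in the X-API-KEY header instead.
-headers = {"X-API-KEY": "api-key"}
-```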
-
-# Parameter
-
-## VITS
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
-| Synthesized text | text | true | | str | Text to synthesize. |
-| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
-| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
-| Text language | lang | false | From `config.py` | str | Available options are auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode, auto, detects the language of the text automatically. |
-| Audio length | length | false | From `config.py` | float | Adjusts the length (and thus the speed) of the synthesized speech. The larger the value, the slower the speed. |
-| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
-| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
-| Segmentation threshold | max | false | From `config.py` | int | Splits the text into segments on punctuation marks, merging them into one segment when the combined length exceeds max. If max<=0, the text is not segmented. |
-| Streaming response | streaming | false | false | bool | Streams the synthesized speech for a faster initial response. |
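-
-The `max` segmentation behavior described above can be sketched as follows (an illustrative reimplementation, not the project's actual code):
-
-```python
-import re
-
-def segment(text, max_len):
-    """Split on punctuation, then merge consecutive pieces until a
-    merged piece exceeds max_len. max_len <= 0 disables segmentation."""
-    if max_len <= 0:
-        return [text]
-    pieces = [p for p in re.split(r"(?<=[。.!?!?,,])", text) if p]
-    out, buf = [], ""
-    for p in pieces:
-        buf += p
-        if len(buf) > max_len:
-            out.append(buf)
-            buf = ""
-    if buf:
-        out.append(buf)
-    return out
-
-print(segment("你好。今天天气不错。出去走走吧。", 5))
-```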
-
-## VITS voice conversion
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| ----------------- | ----------- | -------- | ------- | ---- | --------------------------------------------------------- |
-| Uploaded audio | upload | true | | file | The audio file to upload, in wav or ogg format. |
-| Source speaker ID | original_id | true | | int | The ID of the speaker of the uploaded audio. |
-| Target speaker ID | target_id | true | | int | The ID of the target speaker to convert the audio to. |
-
-## HuBert-VITS
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| ----------------- | --------- | -------- | ------- | ----- | ------------------------------------------------------------ |
-| Uploaded audio | upload | true | | file | The audio file to upload, in wav or ogg format. |
-| Target speaker ID | id | true | | int | The target speaker ID. |
-| Audio format | format | true | | str | wav, ogg, silk |
-| Audio length | length | true | | float | Adjusts the length (and thus the speed) of the synthesized speech. The larger the value, the slower the speed. |
-| Noise | noise | true | | float | Sample noise, controlling the randomness of the synthesis. |
-| SDP noise | noisew | true | | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
-
-## W2V2-VITS
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
-| Synthesized text | text | true | | str | Text to synthesize. |
-| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
-| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
-| Text language | lang | false | From `config.py` | str | Available options are auto, zh, ja, and mix. When lang=mix, the text should be wrapped in [ZH] or [JA]. The default mode, auto, detects the language of the text automatically. |
-| Audio length | length | false | From `config.py` | float | Adjusts the length (and thus the speed) of the synthesized speech. The larger the value, the slower the speed. |
-| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
-| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
-| Segmentation threshold | max | false | From `config.py` | int | Splits the text into segments on punctuation marks, merging them into one segment when the combined length exceeds max. If max<=0, the text is not segmented. |
-| Dimensional emotion | emotion | false | 0 | int | The range depends on the emotion reference file in npy format; for example, the range of [innnky](https://huggingface.co/spaces/innnky/nene-emotion/tree/main)'s all_emotions.npy is 0-5457. |
-
-## Dimensional emotion
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| -------------- | --------- | -------- | ------- | ---- | ------------------------------------------------------------ |
-| Uploaded audio | upload | true | | file | Returns the npy file that stores the dimensional emotion vector. |
-
-## Bert-VITS2
-
-| Name | Parameter | Required | Default | Type | Instruction |
-| ---------------------- | --------- | -------- | ---------------- | ----- | ------------------------------------------------------------ |
-| Synthesized text | text | true | | str | Text to synthesize. |
-| Speaker ID | id | false | From `config.py` | int | The speaker ID. |
-| Audio format | format | false | From `config.py` | str | Supports wav, ogg, silk, mp3, flac |
-| Text language | lang | false | From `config.py` | str | "auto" is the default mode and detects the language automatically; however, it currently only detects the language of the passage as a whole and cannot distinguish languages per sentence. The other available options are "zh" and "ja". |
-| Audio length | length | false | From `config.py` | float | Adjusts the length (and thus the speed) of the synthesized speech. The larger the value, the slower the speed. |
-| Noise | noise | false | From `config.py` | float | Sample noise, controlling the randomness of the synthesis. |
-| SDP noise | noisew | false | From `config.py` | float | Stochastic Duration Predictor noise, controlling the length of phoneme pronunciation. |
-| Segmentation threshold | max | false | From `config.py` | int | Splits the text into segments on punctuation marks, merging them into one segment when the combined length exceeds max. If max<=0, the text is not segmented. |
-| SDP/DP mix ratio | sdp_ratio | false | From `config.py` | int | The theoretical proportion of SDP during synthesis; the higher the ratio, the larger the variance in synthesized voice tone. |
-
-## SSML (Speech Synthesis Markup Language)
-
-Supported elements and attributes:
-
-`speak` element
-
-| Attribute | Instruction | Required |
-| --------- | ------------------------------------------------------------ | -------- |
-| id | Default value is retrieved from `config.py` | false |
-| lang | Default value is retrieved from `config.py` | false |
-| length | Default value is retrieved from `config.py` | false |
-| noise | Default value is retrieved from `config.py` | false |
-| noisew | Default value is retrieved from `config.py` | false |
-| max | Splits text into segments on punctuation marks; when the sum of segment lengths exceeds `max`, they are treated as one segment. `max<=0` means no segmentation. Default is 0. | false |
-| model | Default is `vits`. Options: `w2v2-vits`, `emotion-vits` | false |
-| emotion | Only effective with `w2v2-vits` or `emotion-vits`. The range depends on the npy emotion reference file. | false |
-
-`voice` element
-
-Takes priority over `speak`.
-
-| Attribute | Instruction | Required |
-| --------- | ------------------------------------------------------------ | -------- |
-| id | Default value is retrieved from `config.py` | false |
-| lang | Default value is retrieved from `config.py` | false |
-| length | Default value is retrieved from `config.py` | false |
-| noise | Default value is retrieved from `config.py` | false |
-| noisew | Default value is retrieved from `config.py` | false |
-| max | Splits text into segments on punctuation marks; when the sum of segment lengths exceeds `max`, they are treated as one segment. `max<=0` means no segmentation. Default is 0. | false |
-| model | Default is `vits`. Options: `w2v2-vits`, `emotion-vits` | false |
-| emotion | Only effective with `w2v2-vits` or `emotion-vits` | false |
-
-`break` element
-
-| Attribute | Instruction | Required |
-| --------- | ------------------------------------------------------------ | -------- |
-| strength | x-weak, weak, medium (default), strong, x-strong | false |
-| time | The absolute duration of a pause in seconds (such as `2s`) or milliseconds (such as `500ms`). Valid values range from 0 to 5000 milliseconds. If you set a value greater than the supported maximum, the service uses `5000ms`. If the `time` attribute is set, the `strength` attribute is ignored. | false |
-
-| Strength | Relative Duration |
-| :------- | :---------------- |
-| x-weak | 250 ms |
-| weak | 500 ms |
-| medium | 750 ms |
-| strong | 1000 ms |
-| x-strong | 1250 ms |
-
-Example
-
-```xml
-<speak lang="zh" format="mp3" length="1.2">
-  <voice id="92">这几天心里颇不宁静。</voice>
-  <voice id="125">今晚在院子里坐着乘凉,忽然想起日日走过的荷塘,在这满月的光里,总该另有一番样子吧。</voice>
-  <voice id="142">月亮渐渐地升高了,墙外马路上孩子们的欢笑,已经听不见了;</voice>
-  <voice id="98">妻在屋里拍着闰儿,迷迷糊糊地哼着眠歌。</voice>
-  <voice id="120">我悄悄地披了大衫,带上门出去。</voice><break time="2s"/>
-  <voice id="121">沿着荷塘,是一条曲折的小煤屑路。</voice>
-  <voice id="122">这是一条幽僻的路;白天也少人走,夜晚更加寂寞。</voice>
-  <voice id="123">荷塘四面,长着许多树,蓊蓊郁郁的。</voice>
-  <voice id="124">路的一旁,是些杨柳,和一些不知道名字的树。</voice>
-  <voice id="125">没有月光的晚上,这路上阴森森的,有些怕人。</voice>
-  <voice id="126">今晚却很好,虽然月光也还是淡淡的。</voice><break time="2s"/>
-  <voice id="127">路上只我一个人,背着手踱着。</voice>
-  <voice id="128">这一片天地好像是我的;我也像超出了平常的自己,到了另一个世界里。</voice>
-  <voice id="129">我爱热闹,也爱冷静;<break strength="x-weak"/>爱群居,也爱独处。</voice>
-  <voice id="130">像今晚上,一个人在这苍茫的月下,什么都可以想,什么都可以不想,便觉是个自由的人。</voice>
-  <voice id="131">白天里一定要做的事,一定要说的话,现在都可不理。</voice>
-  <voice id="132">这是独处的妙处,我且受用这无边的荷香月色好了。</voice>
-</speak>
-```
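-
-A client could assemble SSML like the example above programmatically; a minimal standard-library sketch (attribute values are placeholders taken from the tables above):
-
-```python
-import xml.etree.ElementTree as ET
-
-# Root element with defaults that apply to every voice inside it.
-speak = ET.Element("speak", lang="zh", format="mp3", length="1.2")
-
-line = ET.SubElement(speak, "voice", id="92")
-line.text = "这几天心里颇不宁静。"
-ET.SubElement(speak, "break", time="2s")  # absolute 2-second pause
-
-ssml = ET.tostring(speak, encoding="unicode")
-print(ssml)
-```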
-
-# Communication
-
-For learning and communication there is currently only a Chinese [QQ group](https://qm.qq.com/cgi-bin/qm/qr?k=-1GknIe4uXrkmbDKBGKa1aAUteq40qs_&jump_from=webapi&authKey=x5YYt6Dggs1ZqWxvZqvj3fV8VUnxRyXm5S5Kzntc78+Nv3iXOIawplGip9LWuNR/)
-
-# Acknowledgements
-
-- vits: https://github.com/jaywalnut310/vits
-- MoeGoe: https://github.com/CjangCjengh/MoeGoe
-- emotional-vits: https://github.com/innnky/emotional-vits
-- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
-- vits_chinese: https://github.com/PlayVoice/vits_chinese
-- Bert_VITS2: https://github.com/fishaudio/Bert-VITS2
+license: mit
+title: vits-simple-api
+sdk: gradio
+pinned: true
+python_version: 3.10.11
+emoji: 👀
+app_file: app.py
bert_vits2/Model/Azuma/G_17400.pth ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:184324a03109748e68f7e6587433ef2889e1c57aab84ebbe7825bf3bf0fbfc63
+size 629537628
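The three added lines are a Git LFS pointer, not the model weights themselves; parsing one into its fields is straightforward:

```python
# Parse the Git LFS pointer added above into a dict of its three fields.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:184324a03109748e68f7e6587433ef2889e1c57aab84ebbe7825bf3bf0fbfc63
size 629537628
"""
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
print(fields["size"])  # size in bytes of the real .pth payload
print(fields["oid"])   # sha256 of the payload, used by LFS to fetch it
```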
bert_vits2/Model/Azuma/config.json ADDED
@@ -0,0 +1,95 @@
+{
+  "train": {
+    "log_interval": 10,
+    "eval_interval": 100,
+    "seed": 52,
+    "epochs": 10000,
+    "learning_rate": 0.0003,
+    "betas": [0.8, 0.99],
+    "eps": 1e-09,
+    "batch_size": 18,
+    "fp16_run": false,
+    "lr_decay": 0.999875,
+    "segment_size": 16384,
+    "init_lr_ratio": 1,
+    "warmup_epochs": 0,
+    "c_mel": 45,
+    "c_kl": 1.0
+  },
+  "data": {
+    "use_mel_posterior_encoder": false,
+    "training_files": "filelists/train.list",
+    "validation_files": "filelists/val.list",
+    "max_wav_value": 32768.0,
+    "sampling_rate": 44100,
+    "filter_length": 2048,
+    "hop_length": 512,
+    "win_length": 2048,
+    "n_mel_channels": 128,
+    "mel_fmin": 0.0,
+    "mel_fmax": null,
+    "add_blank": true,
+    "n_speakers": 1,
+    "cleaned_text": true,
+    "spk2id": {"Azuma": 0}
+  },
+  "model": {
+    "use_spk_conditioned_encoder": true,
+    "use_noise_scaled_mas": true,
+    "use_mel_posterior_encoder": false,
+    "use_duration_discriminator": true,
+    "inter_channels": 192,
+    "hidden_channels": 192,
+    "filter_channels": 768,
+    "n_heads": 2,
+    "n_layers": 6,
+    "kernel_size": 3,
+    "p_dropout": 0.1,
+    "resblock": "1",
+    "resblock_kernel_sizes": [3, 7, 11],
+    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+    "upsample_rates": [8, 8, 2, 2, 2],
+    "upsample_initial_channel": 512,
+    "upsample_kernel_sizes": [16, 16, 8, 2, 2],
+    "n_layers_q": 3,
+    "use_spectral_norm": false,
+    "gin_channels": 256
+  }
+}
bert_vits2/bert/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4ac62d49144d770c5ca9a5d1d3039c4995665a080febe63198189857c6bd11cd
+size 1306484351
config.py CHANGED
@@ -67,6 +67,8 @@ MODEL_LIST = [
     # [ABS_PATH + "/Model/w2v2-vits/1026_epochs.pth", ABS_PATH + "/Model/w2v2-vits/config.json"],
     # Bert-VITS2
     # [ABS_PATH + "/Model/bert_vits2/G_9000.pth", ABS_PATH + "/Model/bert_vits2/config.json"],
+
+    [ABS_PATH + "/bert_vits2/Model/Azuma/G_17400.pth", ABS_PATH + "/bert_vits2/Model/Azuma/config.json"]
 ]
 
 # hubert-vits: hubert soft model
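
The two added lines register the Azuma checkpoint and its config as one more `MODEL_LIST` pair; a quick sanity check of the pairing (the project root below is a placeholder):

```python
import os.path

ABS_PATH = "/usr/local/vits-simple-api"  # placeholder root

entry = [ABS_PATH + "/bert_vits2/Model/Azuma/G_17400.pth",
         ABS_PATH + "/bert_vits2/Model/Azuma/config.json"]

checkpoint, config = entry
# Both files must live in the same model directory for loading to work.
assert os.path.dirname(checkpoint) == os.path.dirname(config)
print(os.path.dirname(checkpoint))
```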