ZzWater committed (verified) · Commit 0368eb5 · Parent(s): 6d189ed

Update README.md

Files changed (1): README.md (+180 −3)
---
license: cc-by-nc-sa-4.0
datasets:
- amphion/Emilia-Dataset
- parler-tts/libritts_r_filtered
- simon3000/genshin-voice
- simon3000/starrail-voice
language:
- zh
- en
base_model:
- Qwen/Qwen2-0.5B
tags:
- text_to_speech
- TTS
---
# <center>ViiTor-Voice</center>
### <center>An LLM-based TTS Engine</center>

<p align="center">
    <img src="asserts/post_1.png" alt="Viitor-Voice Cover">
</p>

## Update
- **2024.12.14**:
  - Adjusted the model input by removing speaker embeddings (we found that existing open-source speaker models struggle to capture speaker characteristics effectively and generalize poorly).
  - Added support for zero-shot voice cloning.
  - Added support for both Chinese and English.
## Features

- **Lightweight Design**

  The model is simple, efficient, and compatible with most LLM inference engines. With only 0.5B parameters it makes highly economical use of compute while maintaining strong performance, so it can be deployed not only on servers but also on mobile devices and in edge-computing environments, covering a wide range of deployment needs.

- **Real-time Streaming Output, Low-Latency Experience**

  The model supports real-time speech generation and suits applications that demand low latency. On the Tesla T4 platform it achieves an industry-leading first-frame latency of 200 ms, giving users nearly imperceptible, instant feedback, ideal for interactive applications that require quick responses.

- **Rich Voice Library**

  More than 300 voice options let you choose the speech style that best fits your needs and preferences, whether for a formal business presentation or casual entertainment content.

- **Flexible Speech Rate Adjustment**

  The model supports natural variation in speech rate, so users can easily adjust it to the content and audience: speeding up for efficient information delivery or slowing down for emotional depth, while keeping the speech fluent and natural.

- **Zero-shot Voice Cloning (Under Research)**

  The decoder-only architecture naturally supports zero-shot cloning, with future support planned for rapid voice cloning from minimal voice samples.
---

## Output Samples

Below are examples of speech generated by this project:

### English Female Voice 1:

https://github.com/user-attachments/assets/395bcdeb-1899-43b2-aff9-358bdc5f1c29

### English Male Voice 1:

https://github.com/user-attachments/assets/d373f2fd-4b35-4b42-983f-3a5f0c25779d

### Chinese Female Voice 1:

https://github.com/user-attachments/assets/94d6da03-bc71-4f7c-8453-9312a1eb6d1e

### Chinese Male Voice 1:

https://github.com/user-attachments/assets/8a03785b-8100-48fe-8d64-fd98406aab1d

---
## Environment Setup

```commandline
conda create -n viitor_voice python=3.10
conda activate viitor_voice
pip install -r requirements.txt

### Due to an issue in vLLM's tokenizer length calculation, the token limit does not
### take effect; work around it by overwriting the affected module with the bundled patch:
python_package_path=`pip show pip | egrep Location | awk -F ' ' '{print $2}'`
cp viitor_voice/utils/patch.py $python_package_path/vllm/entrypoints/openai/logits_processors.py
```
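To confirm the patch landed where vLLM will load it, a quick sanity check (a sketch, not part of the official setup; it just prints the target file's path and modification time):

```python
# Sketch: locate the logits_processors.py that vLLM actually imports and check
# its modification time, which should match the moment you copied the patch.
import os
import vllm

target = os.path.join(os.path.dirname(vllm.__file__),
                      "entrypoints", "openai", "logits_processors.py")
print(target, os.path.getmtime(target))
```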

---

## Inference

### Pretrained Models

- ~~[English](https://huggingface.co/ZzWater/viitor-voice-en)~~ (deprecated)
- ~~[Chinese](https://huggingface.co/ZzWater/viitor-voice-chs)~~ (deprecated)
- [Chinese & English](https://huggingface.co/ZzWater/viitor-voice-mix)
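If you prefer to fetch the weights ahead of time, a minimal sketch using `huggingface_hub`'s standard `snapshot_download` (an assumption here: the engines below forward `model_path` to the underlying loader, so the returned local path should work in place of the repo id):

```python
# Sketch: pre-download the mixed Chinese & English checkpoint into the local HF cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ZzWater/viitor-voice-mix")
print(local_dir)  # candidate value for model_path in the examples below
```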
### Offline Inference

**For GPU users**

```python
from viitor_voice.inference.vllm_engine import VllmEngine
import torchaudio

tts_engine = VllmEngine(model_path="ZzWater/viitor-voice-mix")

# Chinese example
ref_audio = "reference_samples/reference_samples/chinese_female.wav"
ref_text = "博士,您工作辛苦了!"  # "Doctor, you've been working hard!"
text_list = ["我觉得我还能抢救一下的!", "我…我才不要和你一起!"]
audios = tts_engine.batch_infer(text_list, ref_audio, ref_text)
for i, audio in enumerate(audios):
    torchaudio.save('test_chinese_{}.wav'.format(i), audio, 24000)  # one 24 kHz clip per input text

# English example
ref_audio = "reference_samples/reference_samples/english_female.wav"
ref_text = "At dinner, he informed me that he was a trouble shooter for a huge international organization."
text_list = ["Working overtime feels like running a marathon with no finish line in sight—just endless tasks and a growing sense that my life is being lived in the office instead of the real world."]
audios = tts_engine.batch_infer(text_list, ref_audio, ref_text)
for i, audio in enumerate(audios):
    torchaudio.save('test_english_{}.wav'.format(i), audio, 24000)
```
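The examples write audio at 24 kHz. If a downstream consumer expects another rate (say, a 16 kHz ASR pipeline), a minimal sketch using torchaudio's built-in resampler, with file names taken from the example above:

```python
import torchaudio
import torchaudio.functional as F

wav, sr = torchaudio.load('test_english_0.wav')          # saved at 24 kHz above
wav_16k = F.resample(wav, orig_freq=sr, new_freq=16000)  # sinc-interpolation resampling
torchaudio.save('test_english_0_16k.wav', wav_16k, 16000)
```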
**For CPU users**

```python
from viitor_voice.inference.transformers_engine import TransformersEngine
import torchaudio

tts_engine = TransformersEngine(model_path="ZzWater/viitor-voice-mix", device='cpu')

# Chinese example
ref_audio = "reference_samples/reference_samples/chinese_female.wav"
ref_text = "博士,您工作辛苦了!"  # "Doctor, you've been working hard!"
text_list = ["我觉得我还能抢救一下的!", "我…我才不要和你一起!"]
audios = tts_engine.batch_infer(text_list, ref_audio, ref_text)
for i, audio in enumerate(audios):
    torchaudio.save('test_chinese_{}.wav'.format(i), audio, 24000)

# English example
ref_audio = "reference_samples/reference_samples/english_female.wav"
ref_text = "At dinner, he informed me that he was a trouble shooter for a huge international organization."
text_list = ["Working overtime feels like running a marathon with no finish line in sight",
             "Just endless tasks and a growing sense that my life is being lived in the office instead of the real world."]
audios = tts_engine.batch_infer(text_list, ref_audio, ref_text)
for i, audio in enumerate(audios):
    torchaudio.save('test_english_{}.wav'.format(i), audio, 24000)
```
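CPU generation will be far slower than the T4 first-frame latency quoted above. To gauge throughput on your own hardware, a small sketch reusing the names from the CPU example (assuming each returned clip is a (channels, samples) tensor, as the `torchaudio.save` calls imply):

```python
import time

start = time.perf_counter()
audios = tts_engine.batch_infer(text_list, ref_audio, ref_text)
elapsed = time.perf_counter() - start

# Real-time factor: seconds of compute per second of generated 24 kHz audio.
total_audio_sec = sum(a.shape[-1] for a in audios) / 24000
print(f"{total_audio_sec:.1f}s of audio in {elapsed:.1f}s (RTF {elapsed / total_audio_sec:.2f})")
```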
### Gradio Demo

```bash
python gradio_demo.py
```

### Demo Inference

- [ViiTor AI](https://www.viitor.io/text-to-speech)

### Streaming Inference (TODO)

---

## Training

- [example](./train_example.md)

## Join Our Community

[![Join Discord](https://img.shields.io/discord/your-discord-id?logo=discord&style=for-the-badge)](https://discord.gg/MbxgFn7BN8)

Have questions about the project? Want to discuss new features, report bugs, or just chat with other contributors? Join our Discord community!
## References

- [SNAC](https://github.com/hubertsiuzdak/snac)
- [mini-omni](https://github.com/gpt-omni/mini-omni)
- [open-gpt-4-o](https://laion.ai/notes/open-gpt-4-o/)
- [Qwen](https://huggingface.co/Qwen/Qwen2-0.5B)
- [CosyVoice](https://huggingface.co/FunAudioLLM/CosyVoice-300M)
## License

This project is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
You are free to share and modify the code of this project for non-commercial purposes, under the following conditions:

1. **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
2. **Non-Commercial**: You may not use the material for commercial purposes.

**Copyright Notice:**
© 2024 Livedata. All Rights Reserved.