mthsk committed
Commit d4b2b22 · 1 Parent(s): 6844b9e

Add TTS from upstream

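In short, this commit swaps the CC BY-NC 4.0 LICENSE for MIT and pulls the upstream text-to-speech path into both Gradio apps: an optional tts_mode that synthesizes speech with edge-tts and feeds the result to the so-vits-svc model instead of an uploaded recording. The sketch below summarizes that added input path; it is a minimal illustration rather than code from the diff, and the helper name tts_to_wav and the example voice are placeholders.

import asyncio

import edge_tts   # Microsoft Edge text-to-speech client used by the new tts_mode
import librosa    # decodes the generated MP3 into a float waveform
import soundfile  # re-encodes the waveform as WAV for the voice-conversion model

def tts_to_wav(text: str, voice: str = "en-US-AriaNeural", sr: int = 24000) -> str:
    # Communicate(...).save(...) is a coroutine, so the apps drive it with asyncio.run.
    asyncio.run(edge_tts.Communicate(text, voice).save("tts.mp3"))
    # Decode the MP3 (resampled to sr) and write a plain WAV the Svc class can read.
    audio, _ = librosa.load("tts.mp3", sr=sr, mono=True)
    soundfile.write("tts.wav", audio, sr, format="wav")
    return "tts.wav"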
LICENSE CHANGED
@@ -1,407 +1,21 @@
1
- Attribution-NonCommercial 4.0 International
2
-
3
- =======================================================================
4
-
5
- Creative Commons Corporation ("Creative Commons") is not a law firm and
6
- does not provide legal services or legal advice. Distribution of
7
- Creative Commons public licenses does not create a lawyer-client or
8
- other relationship. Creative Commons makes its licenses and related
9
- information available on an "as-is" basis. Creative Commons gives no
10
- warranties regarding its licenses, any material licensed under their
11
- terms and conditions, or any related information. Creative Commons
12
- disclaims all liability for damages resulting from their use to the
13
- fullest extent possible.
14
-
15
- Using Creative Commons Public Licenses
16
-
17
- Creative Commons public licenses provide a standard set of terms and
18
- conditions that creators and other rights holders may use to share
19
- original works of authorship and other material subject to copyright
20
- and certain other rights specified in the public license below. The
21
- following considerations are for informational purposes only, are not
22
- exhaustive, and do not form part of our licenses.
23
-
24
- Considerations for licensors: Our public licenses are
25
- intended for use by those authorized to give the public
26
- permission to use material in ways otherwise restricted by
27
- copyright and certain other rights. Our licenses are
28
- irrevocable. Licensors should read and understand the terms
29
- and conditions of the license they choose before applying it.
30
- Licensors should also secure all rights necessary before
31
- applying our licenses so that the public can reuse the
32
- material as expected. Licensors should clearly mark any
33
- material not subject to the license. This includes other CC-
34
- licensed material, or material used under an exception or
35
- limitation to copyright. More considerations for licensors:
36
- wiki.creativecommons.org/Considerations_for_licensors
37
-
38
- Considerations for the public: By using one of our public
39
- licenses, a licensor grants the public permission to use the
40
- licensed material under specified terms and conditions. If
41
- the licensor's permission is not necessary for any reason--for
42
- example, because of any applicable exception or limitation to
43
- copyright--then that use is not regulated by the license. Our
44
- licenses grant only permissions under copyright and certain
45
- other rights that a licensor has authority to grant. Use of
46
- the licensed material may still be restricted for other
47
- reasons, including because others have copyright or other
48
- rights in the material. A licensor may make special requests,
49
- such as asking that all changes be marked or described.
50
- Although not required by our licenses, you are encouraged to
51
- respect those requests where reasonable. More considerations
52
- for the public:
53
- wiki.creativecommons.org/Considerations_for_licensees
54
-
55
- =======================================================================
56
-
57
- Creative Commons Attribution-NonCommercial 4.0 International Public
58
- License
59
-
60
- By exercising the Licensed Rights (defined below), You accept and agree
61
- to be bound by the terms and conditions of this Creative Commons
62
- Attribution-NonCommercial 4.0 International Public License ("Public
63
- License"). To the extent this Public License may be interpreted as a
64
- contract, You are granted the Licensed Rights in consideration of Your
65
- acceptance of these terms and conditions, and the Licensor grants You
66
- such rights in consideration of benefits the Licensor receives from
67
- making the Licensed Material available under these terms and
68
- conditions.
69
-
70
-
71
- Section 1 -- Definitions.
72
-
73
- a. Adapted Material means material subject to Copyright and Similar
74
- Rights that is derived from or based upon the Licensed Material
75
- and in which the Licensed Material is translated, altered,
76
- arranged, transformed, or otherwise modified in a manner requiring
77
- permission under the Copyright and Similar Rights held by the
78
- Licensor. For purposes of this Public License, where the Licensed
79
- Material is a musical work, performance, or sound recording,
80
- Adapted Material is always produced where the Licensed Material is
81
- synched in timed relation with a moving image.
82
-
83
- b. Adapter's License means the license You apply to Your Copyright
84
- and Similar Rights in Your contributions to Adapted Material in
85
- accordance with the terms and conditions of this Public License.
86
-
87
- c. Copyright and Similar Rights means copyright and/or similar rights
88
- closely related to copyright including, without limitation,
89
- performance, broadcast, sound recording, and Sui Generis Database
90
- Rights, without regard to how the rights are labeled or
91
- categorized. For purposes of this Public License, the rights
92
- specified in Section 2(b)(1)-(2) are not Copyright and Similar
93
- Rights.
94
- d. Effective Technological Measures means those measures that, in the
95
- absence of proper authority, may not be circumvented under laws
96
- fulfilling obligations under Article 11 of the WIPO Copyright
97
- Treaty adopted on December 20, 1996, and/or similar international
98
- agreements.
99
-
100
- e. Exceptions and Limitations means fair use, fair dealing, and/or
101
- any other exception or limitation to Copyright and Similar Rights
102
- that applies to Your use of the Licensed Material.
103
-
104
- f. Licensed Material means the artistic or literary work, database,
105
- or other material to which the Licensor applied this Public
106
- License.
107
-
108
- g. Licensed Rights means the rights granted to You subject to the
109
- terms and conditions of this Public License, which are limited to
110
- all Copyright and Similar Rights that apply to Your use of the
111
- Licensed Material and that the Licensor has authority to license.
112
-
113
- h. Licensor means the individual(s) or entity(ies) granting rights
114
- under this Public License.
115
-
116
- i. NonCommercial means not primarily intended for or directed towards
117
- commercial advantage or monetary compensation. For purposes of
118
- this Public License, the exchange of the Licensed Material for
119
- other material subject to Copyright and Similar Rights by digital
120
- file-sharing or similar means is NonCommercial provided there is
121
- no payment of monetary compensation in connection with the
122
- exchange.
123
-
124
- j. Share means to provide material to the public by any means or
125
- process that requires permission under the Licensed Rights, such
126
- as reproduction, public display, public performance, distribution,
127
- dissemination, communication, or importation, and to make material
128
- available to the public including in ways that members of the
129
- public may access the material from a place and at a time
130
- individually chosen by them.
131
-
132
- k. Sui Generis Database Rights means rights other than copyright
133
- resulting from Directive 96/9/EC of the European Parliament and of
134
- the Council of 11 March 1996 on the legal protection of databases,
135
- as amended and/or succeeded, as well as other essentially
136
- equivalent rights anywhere in the world.
137
-
138
- l. You means the individual or entity exercising the Licensed Rights
139
- under this Public License. Your has a corresponding meaning.
140
-
141
-
142
- Section 2 -- Scope.
143
-
144
- a. License grant.
145
-
146
- 1. Subject to the terms and conditions of this Public License,
147
- the Licensor hereby grants You a worldwide, royalty-free,
148
- non-sublicensable, non-exclusive, irrevocable license to
149
- exercise the Licensed Rights in the Licensed Material to:
150
-
151
- a. reproduce and Share the Licensed Material, in whole or
152
- in part, for NonCommercial purposes only; and
153
-
154
- b. produce, reproduce, and Share Adapted Material for
155
- NonCommercial purposes only.
156
-
157
- 2. Exceptions and Limitations. For the avoidance of doubt, where
158
- Exceptions and Limitations apply to Your use, this Public
159
- License does not apply, and You do not need to comply with
160
- its terms and conditions.
161
-
162
- 3. Term. The term of this Public License is specified in Section
163
- 6(a).
164
-
165
- 4. Media and formats; technical modifications allowed. The
166
- Licensor authorizes You to exercise the Licensed Rights in
167
- all media and formats whether now known or hereafter created,
168
- and to make technical modifications necessary to do so. The
169
- Licensor waives and/or agrees not to assert any right or
170
- authority to forbid You from making technical modifications
171
- necessary to exercise the Licensed Rights, including
172
- technical modifications necessary to circumvent Effective
173
- Technological Measures. For purposes of this Public License,
174
- simply making modifications authorized by this Section 2(a)
175
- (4) never produces Adapted Material.
176
-
177
- 5. Downstream recipients.
178
-
179
- a. Offer from the Licensor -- Licensed Material. Every
180
- recipient of the Licensed Material automatically
181
- receives an offer from the Licensor to exercise the
182
- Licensed Rights under the terms and conditions of this
183
- Public License.
184
-
185
- b. No downstream restrictions. You may not offer or impose
186
- any additional or different terms or conditions on, or
187
- apply any Effective Technological Measures to, the
188
- Licensed Material if doing so restricts exercise of the
189
- Licensed Rights by any recipient of the Licensed
190
- Material.
191
-
192
- 6. No endorsement. Nothing in this Public License constitutes or
193
- may be construed as permission to assert or imply that You
194
- are, or that Your use of the Licensed Material is, connected
195
- with, or sponsored, endorsed, or granted official status by,
196
- the Licensor or others designated to receive attribution as
197
- provided in Section 3(a)(1)(A)(i).
198
-
199
- b. Other rights.
200
-
201
- 1. Moral rights, such as the right of integrity, are not
202
- licensed under this Public License, nor are publicity,
203
- privacy, and/or other similar personality rights; however, to
204
- the extent possible, the Licensor waives and/or agrees not to
205
- assert any such rights held by the Licensor to the limited
206
- extent necessary to allow You to exercise the Licensed
207
- Rights, but not otherwise.
208
-
209
- 2. Patent and trademark rights are not licensed under this
210
- Public License.
211
-
212
- 3. To the extent possible, the Licensor waives any right to
213
- collect royalties from You for the exercise of the Licensed
214
- Rights, whether directly or through a collecting society
215
- under any voluntary or waivable statutory or compulsory
216
- licensing scheme. In all other cases the Licensor expressly
217
- reserves any right to collect such royalties, including when
218
- the Licensed Material is used other than for NonCommercial
219
- purposes.
220
-
221
-
222
- Section 3 -- License Conditions.
223
-
224
- Your exercise of the Licensed Rights is expressly made subject to the
225
- following conditions.
226
-
227
- a. Attribution.
228
-
229
- 1. If You Share the Licensed Material (including in modified
230
- form), You must:
231
-
232
- a. retain the following if it is supplied by the Licensor
233
- with the Licensed Material:
234
-
235
- i. identification of the creator(s) of the Licensed
236
- Material and any others designated to receive
237
- attribution, in any reasonable manner requested by
238
- the Licensor (including by pseudonym if
239
- designated);
240
-
241
- ii. a copyright notice;
242
-
243
- iii. a notice that refers to this Public License;
244
-
245
- iv. a notice that refers to the disclaimer of
246
- warranties;
247
-
248
- v. a URI or hyperlink to the Licensed Material to the
249
- extent reasonably practicable;
250
-
251
- b. indicate if You modified the Licensed Material and
252
- retain an indication of any previous modifications; and
253
-
254
- c. indicate the Licensed Material is licensed under this
255
- Public License, and include the text of, or the URI or
256
- hyperlink to, this Public License.
257
-
258
- 2. You may satisfy the conditions in Section 3(a)(1) in any
259
- reasonable manner based on the medium, means, and context in
260
- which You Share the Licensed Material. For example, it may be
261
- reasonable to satisfy the conditions by providing a URI or
262
- hyperlink to a resource that includes the required
263
- information.
264
-
265
- 3. If requested by the Licensor, You must remove any of the
266
- information required by Section 3(a)(1)(A) to the extent
267
- reasonably practicable.
268
-
269
- 4. If You Share Adapted Material You produce, the Adapter's
270
- License You apply must not prevent recipients of the Adapted
271
- Material from complying with this Public License.
272
-
273
-
274
- Section 4 -- Sui Generis Database Rights.
275
-
276
- Where the Licensed Rights include Sui Generis Database Rights that
277
- apply to Your use of the Licensed Material:
278
-
279
- a. for the avoidance of doubt, Section 2(a)(1) grants You the right
280
- to extract, reuse, reproduce, and Share all or a substantial
281
- portion of the contents of the database for NonCommercial purposes
282
- only;
283
-
284
- b. if You include all or a substantial portion of the database
285
- contents in a database in which You have Sui Generis Database
286
- Rights, then the database in which You have Sui Generis Database
287
- Rights (but not its individual contents) is Adapted Material; and
288
-
289
- c. You must comply with the conditions in Section 3(a) if You Share
290
- all or a substantial portion of the contents of the database.
291
-
292
- For the avoidance of doubt, this Section 4 supplements and does not
293
- replace Your obligations under this Public License where the Licensed
294
- Rights include other Copyright and Similar Rights.
295
-
296
-
297
- Section 5 -- Disclaimer of Warranties and Limitation of Liability.
298
-
299
- a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
300
- EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
301
- AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
302
- ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
303
- IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
304
- WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
305
- PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
306
- ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
307
- KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
308
- ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
309
-
310
- b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
311
- TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
312
- NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
313
- INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
314
- COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
315
- USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
316
- ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
317
- DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
318
- IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
319
-
320
- c. The disclaimer of warranties and limitation of liability provided
321
- above shall be interpreted in a manner that, to the extent
322
- possible, most closely approximates an absolute disclaimer and
323
- waiver of all liability.
324
-
325
-
326
- Section 6 -- Term and Termination.
327
-
328
- a. This Public License applies for the term of the Copyright and
329
- Similar Rights licensed here. However, if You fail to comply with
330
- this Public License, then Your rights under this Public License
331
- terminate automatically.
332
-
333
- b. Where Your right to use the Licensed Material has terminated under
334
- Section 6(a), it reinstates:
335
-
336
- 1. automatically as of the date the violation is cured, provided
337
- it is cured within 30 days of Your discovery of the
338
- violation; or
339
-
340
- 2. upon express reinstatement by the Licensor.
341
-
342
- For the avoidance of doubt, this Section 6(b) does not affect any
343
- right the Licensor may have to seek remedies for Your violations
344
- of this Public License.
345
-
346
- c. For the avoidance of doubt, the Licensor may also offer the
347
- Licensed Material under separate terms or conditions or stop
348
- distributing the Licensed Material at any time; however, doing so
349
- will not terminate this Public License.
350
-
351
- d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
352
- License.
353
-
354
-
355
- Section 7 -- Other Terms and Conditions.
356
-
357
- a. The Licensor shall not be bound by any additional or different
358
- terms or conditions communicated by You unless expressly agreed.
359
-
360
- b. Any arrangements, understandings, or agreements regarding the
361
- Licensed Material not stated herein are separate from and
362
- independent of the terms and conditions of this Public License.
363
-
364
-
365
- Section 8 -- Interpretation.
366
-
367
- a. For the avoidance of doubt, this Public License does not, and
368
- shall not be interpreted to, reduce, limit, restrict, or impose
369
- conditions on any use of the Licensed Material that could lawfully
370
- be made without permission under this Public License.
371
-
372
- b. To the extent possible, if any provision of this Public License is
373
- deemed unenforceable, it shall be automatically reformed to the
374
- minimum extent necessary to make it enforceable. If the provision
375
- cannot be reformed, it shall be severed from this Public License
376
- without affecting the enforceability of the remaining terms and
377
- conditions.
378
-
379
- c. No term or condition of this Public License will be waived and no
380
- failure to comply consented to unless expressly agreed to by the
381
- Licensor.
382
-
383
- d. Nothing in this Public License constitutes or may be interpreted
384
- as a limitation upon, or waiver of, any privileges and immunities
385
- that apply to the Licensor or You, including from the legal
386
- processes of any jurisdiction or authority.
387
-
388
- =======================================================================
389
-
390
- Creative Commons is not a party to its public
391
- licenses. Notwithstanding, Creative Commons may elect to apply one of
392
- its public licenses to material it publishes and in those instances
393
- will be considered the “Licensor.” The text of the Creative Commons
394
- public licenses is dedicated to the public domain under the CC0 Public
395
- Domain Dedication. Except for the limited purpose of indicating that
396
- material is shared under a Creative Commons public license or as
397
- otherwise permitted by the Creative Commons policies published at
398
- creativecommons.org/policies, Creative Commons does not authorize the
399
- use of the trademark "Creative Commons" or any other trademark or logo
400
- of Creative Commons without its prior written consent including,
401
- without limitation, in connection with any unauthorized modifications
402
- to any of its public licenses or any other arrangements,
403
- understandings, or agreements concerning use of licensed material. For
404
- the avoidance of doubt, this paragraph does not form part of the
405
- public licenses.
406
-
407
- Creative Commons may be contacted at creativecommons.org.
 
1
+ MIT License
2
+
3
+ Copyright (c) 2021 Jingyi Li
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
app-slice.py CHANGED
@@ -1,7 +1,6 @@
1
  import os
2
  import gradio as gr
3
- import librosa
4
- import numpy as np
5
  from pathlib import Path
6
  import inference.infer_tool as infer_tool
7
  import utils
@@ -9,6 +8,8 @@ from inference.infer_tool import Svc
9
  import logging
10
  import webbrowser
11
  import argparse
 
 
12
  import soundfile
13
  import gradio.processing_utils as gr_processing_utils
14
  logging.getLogger('numba').setLevel(logging.WARNING)
@@ -29,14 +30,24 @@ def audio_postprocess(self, y):
29
 
30
  gr.Audio.postprocess = audio_postprocess
31
  def create_vc_fn(model, sid):
32
- def vc_fn(input_audio, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds):
33
- if input_audio is None:
34
- return "You need to select an audio", None
35
- raw_audio_path = f"raw/{input_audio}"
36
- if "." not in raw_audio_path:
37
- raw_audio_path += ".wav"
38
- infer_tool.format_wav(raw_audio_path)
39
- wav_path = Path(raw_audio_path).with_suffix('.wav')
40
  _audio = model.slice_inference(
41
  wav_path, sid, vc_transform, slice_db,
42
  cluster_infer_ratio=0,
@@ -50,6 +61,11 @@ def create_vc_fn(model, sid):
50
  def refresh_raw_wav():
51
  return gr.Dropdown.update(choices=os.listdir("raw"))
52
 
53
 
54
  if __name__ == '__main__':
55
  parser = argparse.ArgumentParser()
@@ -60,10 +76,14 @@ if __name__ == '__main__':
60
  args = parser.parse_args()
61
  hubert_model = utils.get_hubert_model().to(args.device)
62
  models = []
63
  raw = os.listdir("raw")
64
  for f in os.listdir("models"):
65
  name = f
66
- model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device, hubert_model=hubert_model)
67
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
68
  models.append((name, cover, create_vc_fn(model, name)))
69
  with gr.Blocks() as app:
@@ -100,12 +120,16 @@ if __name__ == '__main__':
100
  noise_scale = gr.Number(label="noise_scale", value=0.4)
101
  pad_seconds = gr.Number(label="pad_seconds", value=0.5)
102
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
103
  vc_submit = gr.Button("Generate", variant="primary")
104
  with gr.Column():
105
  vc_output1 = gr.Textbox(label="Output Message")
106
  vc_output2 = gr.Audio(label="Output Audio")
107
- vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds], [vc_output1, vc_output2])
108
  vc_refresh.click(refresh_raw_wav, [], [vc_input])
 
109
  if args.colab:
110
  webbrowser.open("http://127.0.0.1:7860")
111
  app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
 
1
  import os
2
  import gradio as gr
3
+ import edge_tts
 
4
  from pathlib import Path
5
  import inference.infer_tool as infer_tool
6
  import utils
 
8
  import logging
9
  import webbrowser
10
  import argparse
11
+ import asyncio
12
+ import librosa
13
  import soundfile
14
  import gradio.processing_utils as gr_processing_utils
15
  logging.getLogger('numba').setLevel(logging.WARNING)
 
30
 
31
  gr.Audio.postprocess = audio_postprocess
32
  def create_vc_fn(model, sid):
33
+ def vc_fn(input_audio, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds, tts_text, tts_voice, tts_mode):
34
+ if tts_mode:
35
+ if len(tts_text) > 100 and limitation:
36
+ return "Text is too long", None
37
+ if tts_text is None or tts_voice is None:
38
+ return "You need to enter text and select a voice", None
39
+ asyncio.run(edge_tts.Communicate(tts_text, "-".join(tts_voice.split('-')[:-1])).save("tts.mp3"))
40
+ audio, sr = librosa.load("tts.mp3")
41
+ soundfile.write("tts.wav", audio, 24000, format="wav")
42
+ wav_path = "tts.wav"
43
+ else:
44
+ if input_audio is None:
45
+ return "You need to select an audio", None
46
+ raw_audio_path = f"raw/{input_audio}"
47
+ if "." not in raw_audio_path:
48
+ raw_audio_path += ".wav"
49
+ infer_tool.format_wav(raw_audio_path)
50
+ wav_path = Path(raw_audio_path).with_suffix('.wav')
51
  _audio = model.slice_inference(
52
  wav_path, sid, vc_transform, slice_db,
53
  cluster_infer_ratio=0,
 
61
  def refresh_raw_wav():
62
  return gr.Dropdown.update(choices=os.listdir("raw"))
63
 
64
+ def change_to_tts_mode(tts_mode):
65
+ if tts_mode:
66
+ return gr.Audio.update(visible=False), gr.Button.update(visible=False), gr.Textbox.update(visible=True), gr.Dropdown.update(visible=True)
67
+ else:
68
+ return gr.Audio.update(visible=True), gr.Button.update(visible=True), gr.Textbox.update(visible=False), gr.Dropdown.update(visible=False)
69
 
70
  if __name__ == '__main__':
71
  parser = argparse.ArgumentParser()
 
76
  args = parser.parse_args()
77
  hubert_model = utils.get_hubert_model().to(args.device)
78
  models = []
79
+ voices = []
80
+ tts_voice_list = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
81
+ for r in tts_voice_list:
82
+ voices.append(f"{r['ShortName']}-{r['Gender']}")
83
  raw = os.listdir("raw")
84
  for f in os.listdir("models"):
85
  name = f
86
+ model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device)
87
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
88
  models.append((name, cover, create_vc_fn(model, name)))
89
  with gr.Blocks() as app:
 
120
  noise_scale = gr.Number(label="noise_scale", value=0.4)
121
  pad_seconds = gr.Number(label="pad_seconds", value=0.5)
122
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
123
+ tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
124
+ tts_text = gr.Textbox(visible=False,label="TTS text (100 words limitation)" if limitation else "TTS text")
125
+ tts_voice = gr.Dropdown(choices=voices, visible=False)
126
  vc_submit = gr.Button("Generate", variant="primary")
127
  with gr.Column():
128
  vc_output1 = gr.Textbox(label="Output Message")
129
  vc_output2 = gr.Audio(label="Output Audio")
130
+ vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds, tts_text, tts_voice, tts_mode], [vc_output1, vc_output2])
131
  vc_refresh.click(refresh_raw_wav, [], [vc_input])
132
+ tts_mode.change(change_to_tts_mode, [tts_mode], [vc_input, vc_refresh, tts_text, tts_voice])
133
  if args.colab:
134
  webbrowser.open("http://127.0.0.1:7860")
135
  app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
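A note on the new voice dropdown added above: it is filled from edge-tts's voice catalogue at startup. A standalone sketch of that lookup (outside the Gradio app, using the same API the diff calls):

import asyncio
import edge_tts

# list_voices() is a coroutine returning dicts with ShortName and Gender keys;
# the app joins them into dropdown entries like "en-US-AriaNeural-Female" and
# later strips the trailing gender before passing the ShortName to Communicate.
voices = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
choices = [f"{v['ShortName']}-{v['Gender']}" for v in voices]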
app.py CHANGED
@@ -7,7 +7,9 @@ import utils
7
  from inference.infer_tool import Svc
8
  import logging
9
  import soundfile
 
10
  import argparse
 
11
  import gradio.processing_utils as gr_processing_utils
12
  logging.getLogger('numba').setLevel(logging.WARNING)
13
  logging.getLogger('markdown_it').setLevel(logging.WARNING)
@@ -27,7 +29,21 @@ def audio_postprocess(self, y):
27
 
28
  gr.Audio.postprocess = audio_postprocess
29
  def create_vc_fn(model, sid):
30
- def vc_fn(input_audio, vc_transform, auto_f0):
31
  if input_audio is None:
32
  return "You need to upload an audio", None
33
  sampling_rate, audio = input_audio
@@ -48,6 +64,12 @@ def create_vc_fn(model, sid):
48
  return "Success", (44100, out_audio.cpu().numpy())
49
  return vc_fn
50
 
51
  if __name__ == '__main__':
52
  parser = argparse.ArgumentParser()
53
  parser.add_argument('--device', type=str, default='cpu')
@@ -56,16 +78,27 @@ if __name__ == '__main__':
56
  args = parser.parse_args()
57
  hubert_model = utils.get_hubert_model().to(args.device)
58
  models = []
59
  for f in os.listdir("models"):
60
  name = f
61
- model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device, hubert_model=hubert_model)
62
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
63
  models.append((name, cover, create_vc_fn(model, name)))
64
  with gr.Blocks() as app:
65
  gr.Markdown(
66
  "# <center> Sovits Models\n"
67
- "## <center> The input audio should be clean and pure voice without background music.\n\n"
68
- "[Original Repo](https://github.com/svc-develop-team/so-vits-svc)\n\n"
 
69
 
70
  )
71
  with gr.Tabs():
@@ -82,9 +115,25 @@ if __name__ == '__main__':
82
  vc_input = gr.Audio(label="Input audio"+' (less than 20 seconds)' if limitation else '')
83
  vc_transform = gr.Number(label="vc_transform", value=0)
84
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
85
  vc_submit = gr.Button("Generate", variant="primary")
86
  with gr.Column():
87
  vc_output1 = gr.Textbox(label="Output Message")
88
  vc_output2 = gr.Audio(label="Output Audio")
89
- vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0], [vc_output1, vc_output2])
90
- app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
7
  from inference.infer_tool import Svc
8
  import logging
9
  import soundfile
10
+ import asyncio
11
  import argparse
12
+ import edge_tts
13
  import gradio.processing_utils as gr_processing_utils
14
  logging.getLogger('numba').setLevel(logging.WARNING)
15
  logging.getLogger('markdown_it').setLevel(logging.WARNING)
 
29
 
30
  gr.Audio.postprocess = audio_postprocess
31
  def create_vc_fn(model, sid):
32
+ def vc_fn(input_audio, vc_transform, auto_f0, tts_text, tts_voice, tts_mode):
33
+ if tts_mode:
34
+ if len(tts_text) > 100 and limitation:
35
+ return "Text is too long", None
36
+ if tts_text is None or tts_voice is None:
37
+ return "You need to enter text and select a voice", None
38
+ asyncio.run(edge_tts.Communicate(tts_text, "-".join(tts_voice.split('-')[:-1])).save("tts.mp3"))
39
+ audio, sr = librosa.load("tts.mp3", sr=16000, mono=True)
40
+ raw_path = io.BytesIO()
41
+ soundfile.write(raw_path, audio, 16000, format="wav")
42
+ raw_path.seek(0)
43
+ out_audio, out_sr = model.infer(sid, vc_transform, raw_path,
44
+ auto_predict_f0=auto_f0,
45
+ )
46
+ return "Success", (44100, out_audio.cpu().numpy())
47
  if input_audio is None:
48
  return "You need to upload an audio", None
49
  sampling_rate, audio = input_audio
 
64
  return "Success", (44100, out_audio.cpu().numpy())
65
  return vc_fn
66
 
67
+ def change_to_tts_mode(tts_mode):
68
+ if tts_mode:
69
+ return gr.Audio.update(visible=False), gr.Textbox.update(visible=True), gr.Dropdown.update(visible=True), gr.Checkbox.update(value=True)
70
+ else:
71
+ return gr.Audio.update(visible=True), gr.Textbox.update(visible=False), gr.Dropdown.update(visible=False), gr.Checkbox.update(value=False)
72
+
73
  if __name__ == '__main__':
74
  parser = argparse.ArgumentParser()
75
  parser.add_argument('--device', type=str, default='cpu')
 
78
  args = parser.parse_args()
79
  hubert_model = utils.get_hubert_model().to(args.device)
80
  models = []
81
+ others = {
82
+ "rudolf": "https://huggingface.co/spaces/sayashi/sovits-rudolf",
83
+ "teio": "https://huggingface.co/spaces/sayashi/sovits-teio",
84
+ "goldship": "https://huggingface.co/spaces/sayashi/sovits-goldship",
85
+ "tannhauser": "https://huggingface.co/spaces/sayashi/sovits-tannhauser"
86
+ }
87
+ voices = []
88
+ tts_voice_list = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
89
+ for r in tts_voice_list:
90
+ voices.append(f"{r['ShortName']}-{r['Gender']}")
91
  for f in os.listdir("models"):
92
  name = f
93
+ model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device)
94
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
95
  models.append((name, cover, create_vc_fn(model, name)))
96
  with gr.Blocks() as app:
97
  gr.Markdown(
98
  "# <center> Sovits Models\n"
99
+ "## <center> The input audio should be clean and pure voice without background music.\n"
100
+ "![visitor badge](https://visitor-badge.glitch.me/badge?page_id=mthsk.sovits-models)\n\n"
101
+ "[![Original Repo](https://badgen.net/badge/icon/github?icon=github&label=Original%20Repo)](https://github.com/svc-develop-team/so-vits-svc)\n\n"
102
 
103
  )
104
  with gr.Tabs():
 
115
  vc_input = gr.Audio(label="Input audio"+' (less than 20 seconds)' if limitation else '')
116
  vc_transform = gr.Number(label="vc_transform", value=0)
117
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
118
+ tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
119
+ tts_text = gr.Textbox(visible=False, label="TTS text (100 words limitation)" if limitation else "TTS text")
120
+ tts_voice = gr.Dropdown(choices=voices, visible=False)
121
  vc_submit = gr.Button("Generate", variant="primary")
122
  with gr.Column():
123
  vc_output1 = gr.Textbox(label="Output Message")
124
  vc_output2 = gr.Audio(label="Output Audio")
125
+ vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, tts_text, tts_voice, tts_mode], [vc_output1, vc_output2])
126
+ tts_mode.change(change_to_tts_mode, [tts_mode], [vc_input, tts_text, tts_voice, auto_f0])
127
+ for category, link in others.items():
128
+ with gr.TabItem(category):
129
+ gr.Markdown(
130
+ f'''
131
+ <center>
132
+ <h2>Click to Go</h2>
133
+ <a href="{link}">
134
+ <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-xl-dark.svg"
135
+ </a>
136
+ </center>
137
+ '''
138
+ )
139
+ app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
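For context on the UI change in both apps: the tts_mode checkbox only toggles component visibility and reroutes what vc_fn reads as input. A minimal standalone Gradio 3.x sketch of the same pattern (component names here are illustrative, not the app's):

import gradio as gr

def toggle_tts(tts_mode: bool):
    # Mirror change_to_tts_mode: hide the audio input when TTS is the source,
    # show the text box and voice dropdown instead.
    return (gr.Audio.update(visible=not tts_mode),
            gr.Textbox.update(visible=tts_mode),
            gr.Dropdown.update(visible=tts_mode))

with gr.Blocks() as demo:
    tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
    audio_in = gr.Audio(label="Input audio")
    tts_text = gr.Textbox(label="TTS text", visible=False)
    tts_voice = gr.Dropdown(choices=["en-US-AriaNeural-Female"], visible=False)
    tts_mode.change(toggle_tts, [tts_mode], [audio_in, tts_text, tts_voice])

demo.launch()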
cluster/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/cluster/__pycache__/__init__.cpython-38.pyc and b/cluster/__pycache__/__init__.cpython-38.pyc differ
 
hubert/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/hubert/__pycache__/__init__.cpython-38.pyc and b/hubert/__pycache__/__init__.cpython-38.pyc differ
 
hubert/__pycache__/hubert_model.cpython-38.pyc CHANGED
Binary files a/hubert/__pycache__/hubert_model.cpython-38.pyc and b/hubert/__pycache__/hubert_model.cpython-38.pyc differ
 
inference/__pycache__/infer_tool.cpython-38.pyc CHANGED
Binary files a/inference/__pycache__/infer_tool.cpython-38.pyc and b/inference/__pycache__/infer_tool.cpython-38.pyc differ
 
inference/infer_tool.py CHANGED
@@ -108,8 +108,11 @@ def split_list_by_n(list_collection, n, pre=0):
108
  yield list_collection[i-pre if i-pre>=0 else i: i + n]
109
 
110
 
111
  class Svc(object):
112
- def __init__(self, net_g_path, config_path, hubert_model,
113
  device=None,
114
  cluster_model_path="logs/44k/kmeans_10000.pt"):
115
  self.net_g_path = net_g_path
@@ -123,7 +126,7 @@ class Svc(object):
123
  self.hop_size = self.hps_ms.data.hop_length
124
  self.spk2id = self.hps_ms.spk
125
  # 加载hubert
126
- self.hubert_model = hubert_model
127
  self.load_model()
128
  if os.path.exists(cluster_model_path):
129
  self.cluster_model = cluster.get_cluster_model(cluster_model_path)
@@ -142,12 +145,24 @@ class Svc(object):
142
 
143
 
144
 
145
- def get_unit_f0(self, in_path, tran, cluster_infer_ratio, speaker):
 
146
  wav, sr = librosa.load(in_path, sr=self.target_sample)
147
- f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
148
- f0, uv = utils.interpolate_f0(f0)
149
- f0 = torch.FloatTensor(f0)
150
- uv = torch.FloatTensor(uv)
151
  f0 = f0 * 2 ** (tran / 12)
152
  f0 = f0.unsqueeze(0).to(self.dev)
153
  uv = uv.unsqueeze(0).to(self.dev)
@@ -157,7 +172,7 @@ class Svc(object):
157
  c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
158
  c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
159
 
160
- if cluster_infer_ratio != 0:
161
  cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
162
  cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
163
  c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
@@ -168,13 +183,17 @@ class Svc(object):
168
  def infer(self, speaker, tran, raw_path,
169
  cluster_infer_ratio=0,
170
  auto_predict_f0=False,
171
- noice_scale=0.4):
172
  speaker_id = self.spk2id.__dict__.get(speaker)
173
  if not speaker_id and type(speaker) is int:
174
  if len(self.spk2id.__dict__) >= speaker:
175
  speaker_id = speaker
176
  sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
177
- c, f0, uv = self.get_unit_f0(raw_path, tran, cluster_infer_ratio, speaker)
178
  if "half" in self.net_g_path and torch.cuda.is_available():
179
  c = c.half()
180
  with torch.no_grad():
@@ -183,23 +202,35 @@ class Svc(object):
183
  use_time = time.time() - start
184
  print("vits use time:{}".format(use_time))
185
  return audio, audio.shape[-1]
186
-
187
  def clear_empty(self):
188
  # 清理显存
189
  torch.cuda.empty_cache()
190
 
191
- def slice_inference(self, raw_audio_path, spk, tran, slice_db, cluster_infer_ratio, auto_predict_f0, noice_scale,
192
- pad_seconds=0.5, clip_seconds=0, lg_num=0, lgr_num=0.75):
193
  wav_path = raw_audio_path
194
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
195
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
196
- per_size = int(clip_seconds * audio_sr)
197
- lg_size = int(lg_num * audio_sr)
198
- lg_size_r = int(lg_size * lgr_num)
199
- lg_size_c_l = (lg_size - lg_size_r) // 2
200
- lg_size_c_r = lg_size - lg_size_r - lg_size_c_l
201
- lg = np.linspace(0, 1, lg_size_r) if lg_size != 0 else 0
202
-
203
  audio = []
204
  for (slice_tag, data) in audio_data:
205
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
@@ -211,12 +242,12 @@ class Svc(object):
211
  audio.extend(list(pad_array(_audio, length)))
212
  continue
213
  if per_size != 0:
214
- datas = split_list_by_n(data, per_size, lg_size)
215
  else:
216
  datas = [data]
217
- for k, dat in enumerate(datas):
218
- per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds != 0 else length
219
- if clip_seconds != 0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
220
  # padd
221
  pad_len = int(audio_sr * pad_seconds)
222
  dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
@@ -224,25 +255,25 @@ class Svc(object):
224
  soundfile.write(raw_path, dat, audio_sr, format="wav")
225
  raw_path.seek(0)
226
  out_audio, out_sr = self.infer(spk, tran, raw_path,
227
- cluster_infer_ratio=cluster_infer_ratio,
228
- auto_predict_f0=auto_predict_f0,
229
- noice_scale=noice_scale
230
- )
 
231
  _audio = out_audio.cpu().numpy()
232
  pad_len = int(self.target_sample * pad_seconds)
233
  _audio = _audio[pad_len:-pad_len]
234
  _audio = pad_array(_audio, per_length)
235
- if lg_size != 0 and k != 0:
236
- lg1 = audio[-(lg_size_r + lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
237
- lg2 = _audio[lg_size_c_l:lg_size_c_l + lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
238
- lg_pre = lg1 * (1 - lg) + lg2 * lg
239
- audio = audio[0:-(lg_size_r + lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
240
  audio.extend(lg_pre)
241
- _audio = _audio[lg_size_c_l + lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
242
  audio.extend(list(_audio))
243
  return np.array(audio)
244
 
245
-
246
  class RealTimeVC:
247
  def __init__(self):
248
  self.last_chunk = None
@@ -252,14 +283,25 @@ class RealTimeVC:
252
 
253
  """输入输出都是1维numpy 音频波形数组"""
254
 
255
- def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path):
256
  import maad
257
  audio, sr = torchaudio.load(input_wav_path)
258
  audio = audio.cpu().numpy()[0]
259
  temp_wav = io.BytesIO()
260
  if self.last_chunk is None:
261
  input_wav_path.seek(0)
262
- audio, sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path)
263
  audio = audio.cpu().numpy()
264
  self.last_chunk = audio[-self.pre_len:]
265
  self.last_o = audio
@@ -268,7 +310,13 @@ class RealTimeVC:
268
  audio = np.concatenate([self.last_chunk, audio])
269
  soundfile.write(temp_wav, audio, sr, format="wav")
270
  temp_wav.seek(0)
271
- audio, sr = svc_model.infer(speaker_id, f_pitch_change, temp_wav)
272
  audio = audio.cpu().numpy()
273
  ret = maad.util.crossfade(self.last_o, audio, self.pre_len)
274
  self.last_chunk = audio[-self.pre_len:]
 
108
  yield list_collection[i-pre if i-pre>=0 else i: i + n]
109
 
110
 
111
+ class F0FilterException(Exception):
112
+ pass
113
+
114
  class Svc(object):
115
+ def __init__(self, net_g_path, config_path,
116
  device=None,
117
  cluster_model_path="logs/44k/kmeans_10000.pt"):
118
  self.net_g_path = net_g_path
 
126
  self.hop_size = self.hps_ms.data.hop_length
127
  self.spk2id = self.hps_ms.spk
128
  # 加载hubert
129
+ self.hubert_model = utils.get_hubert_model().to(self.dev)
130
  self.load_model()
131
  if os.path.exists(cluster_model_path):
132
  self.cluster_model = cluster.get_cluster_model(cluster_model_path)
 
145
 
146
 
147
 
148
+ def get_unit_f0(self, in_path, tran, cluster_infer_ratio, speaker, f0_filter ,F0_mean_pooling):
149
+
150
  wav, sr = librosa.load(in_path, sr=self.target_sample)
151
+
152
+ if F0_mean_pooling == True:
153
+ f0, uv = utils.compute_f0_uv_torchcrepe(torch.FloatTensor(wav), sampling_rate=self.target_sample, hop_length=self.hop_size,device=self.dev)
154
+ if f0_filter and sum(f0) == 0:
155
+ raise F0FilterException("未检测到人声")
156
+ f0 = torch.FloatTensor(list(f0))
157
+ uv = torch.FloatTensor(list(uv))
158
+ if F0_mean_pooling == False:
159
+ f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
160
+ if f0_filter and sum(f0) == 0:
161
+ raise F0FilterException("未检测到人声")
162
+ f0, uv = utils.interpolate_f0(f0)
163
+ f0 = torch.FloatTensor(f0)
164
+ uv = torch.FloatTensor(uv)
165
+
166
  f0 = f0 * 2 ** (tran / 12)
167
  f0 = f0.unsqueeze(0).to(self.dev)
168
  uv = uv.unsqueeze(0).to(self.dev)
 
172
  c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
173
  c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
174
 
175
+ if cluster_infer_ratio !=0:
176
  cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
177
  cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
178
  c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
 
183
  def infer(self, speaker, tran, raw_path,
184
  cluster_infer_ratio=0,
185
  auto_predict_f0=False,
186
+ noice_scale=0.4,
187
+ f0_filter=False,
188
+ F0_mean_pooling=False
189
+ ):
190
+
191
  speaker_id = self.spk2id.__dict__.get(speaker)
192
  if not speaker_id and type(speaker) is int:
193
  if len(self.spk2id.__dict__) >= speaker:
194
  speaker_id = speaker
195
  sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
196
+ c, f0, uv = self.get_unit_f0(raw_path, tran, cluster_infer_ratio, speaker, f0_filter,F0_mean_pooling)
197
  if "half" in self.net_g_path and torch.cuda.is_available():
198
  c = c.half()
199
  with torch.no_grad():
 
202
  use_time = time.time() - start
203
  print("vits use time:{}".format(use_time))
204
  return audio, audio.shape[-1]
205
+
206
  def clear_empty(self):
207
  # 清理显存
208
  torch.cuda.empty_cache()
209
 
210
+ def slice_inference(self,
211
+ raw_audio_path,
212
+ spk,
213
+ tran,
214
+ slice_db,
215
+ cluster_infer_ratio,
216
+ auto_predict_f0,
217
+ noice_scale,
218
+ pad_seconds=0.5,
219
+ clip_seconds=0,
220
+ lg_num=0,
221
+ lgr_num =0.75,
222
+ F0_mean_pooling = False
223
+ ):
224
  wav_path = raw_audio_path
225
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
226
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
227
+ per_size = int(clip_seconds*audio_sr)
228
+ lg_size = int(lg_num*audio_sr)
229
+ lg_size_r = int(lg_size*lgr_num)
230
+ lg_size_c_l = (lg_size-lg_size_r)//2
231
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
232
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
233
+
234
  audio = []
235
  for (slice_tag, data) in audio_data:
236
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
 
242
  audio.extend(list(pad_array(_audio, length)))
243
  continue
244
  if per_size != 0:
245
+ datas = split_list_by_n(data, per_size,lg_size)
246
  else:
247
  datas = [data]
248
+ for k,dat in enumerate(datas):
249
+ per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
250
+ if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
251
  # padd
252
  pad_len = int(audio_sr * pad_seconds)
253
  dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
 
255
  soundfile.write(raw_path, dat, audio_sr, format="wav")
256
  raw_path.seek(0)
257
  out_audio, out_sr = self.infer(spk, tran, raw_path,
258
+ cluster_infer_ratio=cluster_infer_ratio,
259
+ auto_predict_f0=auto_predict_f0,
260
+ noice_scale=noice_scale,
261
+ F0_mean_pooling = F0_mean_pooling
262
+ )
263
  _audio = out_audio.cpu().numpy()
264
  pad_len = int(self.target_sample * pad_seconds)
265
  _audio = _audio[pad_len:-pad_len]
266
  _audio = pad_array(_audio, per_length)
267
+ if lg_size!=0 and k!=0:
268
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
269
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
270
+ lg_pre = lg1*(1-lg)+lg2*lg
271
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
272
  audio.extend(lg_pre)
273
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
274
  audio.extend(list(_audio))
275
  return np.array(audio)
276
 
 
277
  class RealTimeVC:
278
  def __init__(self):
279
  self.last_chunk = None
 
283
 
284
  """输入输出都是1维numpy 音频波形数组"""
285
 
286
+ def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path,
287
+ cluster_infer_ratio=0,
288
+ auto_predict_f0=False,
289
+ noice_scale=0.4,
290
+ f0_filter=False):
291
+
292
  import maad
293
  audio, sr = torchaudio.load(input_wav_path)
294
  audio = audio.cpu().numpy()[0]
295
  temp_wav = io.BytesIO()
296
  if self.last_chunk is None:
297
  input_wav_path.seek(0)
298
+
299
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path,
300
+ cluster_infer_ratio=cluster_infer_ratio,
301
+ auto_predict_f0=auto_predict_f0,
302
+ noice_scale=noice_scale,
303
+ f0_filter=f0_filter)
304
+
305
  audio = audio.cpu().numpy()
306
  self.last_chunk = audio[-self.pre_len:]
307
  self.last_o = audio
 
310
  audio = np.concatenate([self.last_chunk, audio])
311
  soundfile.write(temp_wav, audio, sr, format="wav")
312
  temp_wav.seek(0)
313
+
314
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, temp_wav,
315
+ cluster_infer_ratio=cluster_infer_ratio,
316
+ auto_predict_f0=auto_predict_f0,
317
+ noice_scale=noice_scale,
318
+ f0_filter=f0_filter)
319
+
320
  audio = audio.cpu().numpy()
321
  ret = maad.util.crossfade(self.last_o, audio, self.pre_len)
322
  self.last_chunk = audio[-self.pre_len:]
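To summarize the infer_tool.py refactor above from a caller's point of view: Svc now builds its own HuBERT encoder (the hubert_model constructor argument is gone), and slice_inference/infer gained an F0_mean_pooling switch for torchcrepe-based F0 with mean pooling, plus an f0_filter option on infer that raises F0FilterException when no voiced frames are detected. A hedged usage sketch with placeholder paths and speaker name:

import soundfile
from inference.infer_tool import Svc

# Placeholder model directory; Svc loads HuBERT internally now.
model = Svc("models/example/example.pth", "models/example/config.json", device="cpu")

audio = model.slice_inference(
    "raw/input.wav", "example_speaker", 0, -40,   # path, speaker, transpose (semitones), slice_db
    cluster_infer_ratio=0, auto_predict_f0=False, noice_scale=0.4,
    pad_seconds=0.5, clip_seconds=0, lg_num=0, lgr_num=0.75,
    F0_mean_pooling=True,   # new flag: torchcrepe F0 with mean pooling
)
soundfile.write("results/example.flac", audio, model.target_sample, format="flac")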
inference_main.py CHANGED
@@ -23,17 +23,19 @@ def main():
23
  parser = argparse.ArgumentParser(description='sovits4 inference')
24
 
25
  # 一定要设置的部分
26
- parser.add_argument('-m', '--model_path', type=str, default="/Volumes/Extend/下载/G_20800.pth", help='模型路径')
27
  parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
28
- parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src"], help='wav文件名列表,放在raw文件夹下')
 
29
  parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
30
- parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nyaru'], help='合成目标说话人名称')
31
 
32
  # 可选项部分
33
- parser.add_argument('-a', '--auto_predict_f0', action='store_true', default=False,
34
- help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
35
- parser.add_argument('-cm', '--cluster_model_path', type=str, default="/Volumes/Extend/下载/so-vits-svc-4.0/logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
36
- parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=1, help='聚类方案占比,范围0-1,若没有训练聚类模型则填0即可')
 
37
 
38
  # 不用动的部分
39
  parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40,嘈杂的音频可以-30,干声保留呼吸可以-50')
@@ -41,6 +43,7 @@ def main():
41
  parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
42
  parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数,由于未知原因开头结尾会有异响,pad一小段静音段后就不会出现')
43
  parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
 
44
 
45
  args = parser.parse_args()
46
 
@@ -55,6 +58,10 @@ def main():
55
  cluster_infer_ratio = args.cluster_infer_ratio
56
  noice_scale = args.noice_scale
57
  pad_seconds = args.pad_seconds
58
 
59
  infer_tool.fill_a_to_b(trans, clean_names)
60
  for clean_name, tran in zip(clean_names, trans):
@@ -65,35 +72,58 @@ def main():
65
  wav_path = Path(raw_audio_path).with_suffix('.wav')
66
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
67
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
68
 
69
  for spk in spk_list:
70
  audio = []
71
  for (slice_tag, data) in audio_data:
72
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
73
- # padd
74
- pad_len = int(audio_sr * pad_seconds)
75
- data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
76
  length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
77
- raw_path = io.BytesIO()
78
- soundfile.write(raw_path, data, audio_sr, format="wav")
79
- raw_path.seek(0)
80
  if slice_tag:
81
  print('jump empty segment')
82
  _audio = np.zeros(length)
83
  else:
84
  out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
85
  cluster_infer_ratio=cluster_infer_ratio,
86
  auto_predict_f0=auto_predict_f0,
87
- noice_scale=noice_scale
 
88
  )
89
  _audio = out_audio.cpu().numpy()
90
-
91
- pad_len = int(svc_model.target_sample * pad_seconds)
92
- _audio = _audio[pad_len:-pad_len]
93
- audio.extend(list(_audio))
  key = "auto" if auto_predict_f0 else f"{tran}key"
95
  cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
96
- res_path = f'./results/old——{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'
97
  soundfile.write(res_path, audio, svc_model.target_sample, format=wav_format)
98
 
99
  if __name__ == '__main__':
 
23
  parser = argparse.ArgumentParser(description='sovits4 inference')
24
 
25
  # 一定要设置的部分
26
+ parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='模型路径')
27
  parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
28
+ parser.add_argument('-cl', '--clip', type=float, default=0, help='音频强制切片,默认0为自动切片,单位为秒/s')
29
+ parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src.wav"], help='wav文件名列表,放在raw文件夹下')
30
  parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
31
+ parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nen'], help='合成目标说话人名称')
32
 
33
  # 可选项部分
34
+ parser.add_argument('-a', '--auto_predict_f0', action='store_true', default=False,help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
35
+ parser.add_argument('-cm', '--cluster_model_path', type=str, default="logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
36
+ parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0, help='聚类方案占比,范围0-1,若没有训练聚类模型则默认0即可')
37
+ parser.add_argument('-lg', '--linear_gradient', type=float, default=0, help='两段音频切片的交叉淡入长度,如果强制切片后出现人声不连贯可调整该数值,如果连贯建议采用默认值0,单位为秒')
38
+ parser.add_argument('-fmp', '--f0_mean_pooling', type=bool, default=False, help='是否对F0使用均值滤波器(池化),对部分哑音有改善。注意,启动该选项会导致推理速度下降,默认关闭')
39
 
40
  # 不用动的部分
41
  parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40,嘈杂的音频可以-30,干声保留呼吸可以-50')
 
43
  parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
44
  parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数,由于未知原因开头结尾会有异响,pad一小段静音段后就不会出现')
45
  parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
46
+ parser.add_argument('-lgr', '--linear_gradient_retain', type=float, default=0.75, help='自动音频切片后,需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例,范围0-1,左开右闭')
47
 
48
  args = parser.parse_args()
49
 
 
58
  cluster_infer_ratio = args.cluster_infer_ratio
59
  noice_scale = args.noice_scale
60
  pad_seconds = args.pad_seconds
61
+ clip = args.clip
62
+ lg = args.linear_gradient
63
+ lgr = args.linear_gradient_retain
64
+ F0_mean_pooling = args.f0_mean_pooling
65
 
66
  infer_tool.fill_a_to_b(trans, clean_names)
67
  for clean_name, tran in zip(clean_names, trans):
 
72
  wav_path = Path(raw_audio_path).with_suffix('.wav')
73
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
74
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
75
+ per_size = int(clip*audio_sr)
76
+ lg_size = int(lg*audio_sr)
77
+ lg_size_r = int(lg_size*lgr)
78
+ lg_size_c_l = (lg_size-lg_size_r)//2
79
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
80
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
81
 
82
  for spk in spk_list:
83
  audio = []
84
  for (slice_tag, data) in audio_data:
85
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
86
+
 
 
87
  length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
 
  if slice_tag:
89
  print('jump empty segment')
90
  _audio = np.zeros(length)
91
+ audio.extend(list(infer_tool.pad_array(_audio, length)))
92
+ continue
93
+ if per_size != 0:
94
+ datas = infer_tool.split_list_by_n(data, per_size,lg_size)
95
  else:
96
+ datas = [data]
97
+ for k,dat in enumerate(datas):
98
+ per_length = int(np.ceil(len(dat) / audio_sr * svc_model.target_sample)) if clip!=0 else length
99
+ if clip!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
100
+ # padd
101
+ pad_len = int(audio_sr * pad_seconds)
102
+ dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
103
+ raw_path = io.BytesIO()
104
+ soundfile.write(raw_path, dat, audio_sr, format="wav")
105
+ raw_path.seek(0)
106
  out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
107
  cluster_infer_ratio=cluster_infer_ratio,
108
  auto_predict_f0=auto_predict_f0,
109
+ noice_scale=noice_scale,
110
+ F0_mean_pooling = F0_mean_pooling
111
  )
112
  _audio = out_audio.cpu().numpy()
113
+ pad_len = int(svc_model.target_sample * pad_seconds)
114
+ _audio = _audio[pad_len:-pad_len]
115
+ _audio = infer_tool.pad_array(_audio, per_length)
116
+ if lg_size!=0 and k!=0:
117
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr != 1 else audio[-lg_size:]
118
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr != 1 else _audio[0:lg_size]
119
+ lg_pre = lg1*(1-lg)+lg2*lg
120
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr != 1 else audio[0:-lg_size]
121
+ audio.extend(lg_pre)
122
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr != 1 else _audio[lg_size:]
123
+ audio.extend(list(_audio))
124
  key = "auto" if auto_predict_f0 else f"{tran}key"
125
  cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
126
+ res_path = f'./results/{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'
127
  soundfile.write(res_path, audio, svc_model.target_sample, format=wav_format)
128
 
129
  if __name__ == '__main__':
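
The added linear_gradient / linear_gradient_retain logic above blends the overlapping region of consecutive clips with a linear ramp (lg_pre = lg1*(1-lg)+lg2*lg). A minimal standalone sketch of that crossfade, with illustrative names that are not taken from the repository:

    import numpy as np

    def linear_crossfade(prev_tail, next_head):
        # prev_tail / next_head: equally long 1-D slices taken from the overlap
        # region of two consecutive clips; the blend weight ramps from 0 to 1.
        w = np.linspace(0, 1, len(prev_tail))
        return prev_tail * (1 - w) + next_head * w

    # usage: blended = linear_crossfade(audio[-overlap:], _audio[:overlap])
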
onnx/model_onnx.py DELETED
@@ -1,328 +0,0 @@
- import copy
- import math
- import torch
- from torch import nn
- from torch.nn import functional as F
-
- import modules.attentions as attentions
- import modules.commons as commons
- import modules.modules as modules
-
- from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
- from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
- from modules.commons import init_weights, get_padding
- from vdecoder.hifigan.models import Generator
- from utils import f0_to_coarse
-
- class ResidualCouplingBlock(nn.Module):
-     def __init__(self,
-                  channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  n_flows=4,
-                  gin_channels=0):
-         super().__init__()
-         self.channels = channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.n_flows = n_flows
-         self.gin_channels = gin_channels
-
-         self.flows = nn.ModuleList()
-         for i in range(n_flows):
-             self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
-             self.flows.append(modules.Flip())
-
-     def forward(self, x, x_mask, g=None, reverse=False):
-         if not reverse:
-             for flow in self.flows:
-                 x, _ = flow(x, x_mask, g=g, reverse=reverse)
-         else:
-             for flow in reversed(self.flows):
-                 x = flow(x, x_mask, g=g, reverse=reverse)
-         return x
-
-
- class Encoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-
-     def forward(self, x, x_lengths, g=None):
-         # print(x.shape,x_lengths.shape)
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = self.enc(x, x_mask, g=g)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-         return z, m, logs, x_mask
-
-
- class TextEncoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0,
-                  filter_channels=None,
-                  n_heads=None,
-                  p_dropout=None):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-         self.f0_emb = nn.Embedding(256, hidden_channels)
-
-         self.enc_ = attentions.Encoder(
-             hidden_channels,
-             filter_channels,
-             n_heads,
-             n_layers,
-             kernel_size,
-             p_dropout)
-
-     def forward(self, x, x_lengths, f0=None):
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = x + self.f0_emb(f0.long()).transpose(1,2)
-         x = self.enc_(x * x_mask, x_mask)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-
-         return z, m, logs, x_mask
-
-
- class DiscriminatorP(torch.nn.Module):
-     def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
-         super(DiscriminatorP, self).__init__()
-         self.period = period
-         self.use_spectral_norm = use_spectral_norm
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
-         ])
-         self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
-
-     def forward(self, x):
-         fmap = []
-
-         # 1d to 2d
-         b, c, t = x.shape
-         if t % self.period != 0: # pad first
-             n_pad = self.period - (t % self.period)
-             x = F.pad(x, (0, n_pad), "reflect")
-             t = t + n_pad
-         x = x.view(b, c, t // self.period, self.period)
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class DiscriminatorS(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(DiscriminatorS, self).__init__()
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv1d(1, 16, 15, 1, padding=7)),
-             norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
-             norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
-             norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
-             norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
-             norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
-         ])
-         self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
-
-     def forward(self, x):
-         fmap = []
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class MultiPeriodDiscriminator(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(MultiPeriodDiscriminator, self).__init__()
-         periods = [2,3,5,7,11]
-
-         discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
-         discs = discs + [DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods]
-         self.discriminators = nn.ModuleList(discs)
-
-     def forward(self, y, y_hat):
-         y_d_rs = []
-         y_d_gs = []
-         fmap_rs = []
-         fmap_gs = []
-         for i, d in enumerate(self.discriminators):
-             y_d_r, fmap_r = d(y)
-             y_d_g, fmap_g = d(y_hat)
-             y_d_rs.append(y_d_r)
-             y_d_gs.append(y_d_g)
-             fmap_rs.append(fmap_r)
-             fmap_gs.append(fmap_g)
-
-         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
- class SpeakerEncoder(torch.nn.Module):
-     def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
-         super(SpeakerEncoder, self).__init__()
-         self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)
-         self.linear = nn.Linear(model_hidden_size, model_embedding_size)
-         self.relu = nn.ReLU()
-
-     def forward(self, mels):
-         self.lstm.flatten_parameters()
-         _, (hidden, _) = self.lstm(mels)
-         embeds_raw = self.relu(self.linear(hidden[-1]))
-         return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
-
-     def compute_partial_slices(self, total_frames, partial_frames, partial_hop):
-         mel_slices = []
-         for i in range(0, total_frames-partial_frames, partial_hop):
-             mel_range = torch.arange(i, i+partial_frames)
-             mel_slices.append(mel_range)
-
-         return mel_slices
-
-     def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
-         mel_len = mel.size(1)
-         last_mel = mel[:,-partial_frames:]
-
-         if mel_len > partial_frames:
-             mel_slices = self.compute_partial_slices(mel_len, partial_frames, partial_hop)
-             mels = list(mel[:,s] for s in mel_slices)
-             mels.append(last_mel)
-             mels = torch.stack(tuple(mels), 0).squeeze(1)
-
-             with torch.no_grad():
-                 partial_embeds = self(mels)
-             embed = torch.mean(partial_embeds, axis=0).unsqueeze(0)
-             #embed = embed / torch.linalg.norm(embed, 2)
-         else:
-             with torch.no_grad():
-                 embed = self(last_mel)
-
-         return embed
-
-
- class SynthesizerTrn(nn.Module):
-     """
-     Synthesizer for Training
-     """
-
-     def __init__(self,
-                  spec_channels,
-                  segment_size,
-                  inter_channels,
-                  hidden_channels,
-                  filter_channels,
-                  n_heads,
-                  n_layers,
-                  kernel_size,
-                  p_dropout,
-                  resblock,
-                  resblock_kernel_sizes,
-                  resblock_dilation_sizes,
-                  upsample_rates,
-                  upsample_initial_channel,
-                  upsample_kernel_sizes,
-                  gin_channels,
-                  ssl_dim,
-                  n_speakers,
-                  **kwargs):
-
-         super().__init__()
-         self.spec_channels = spec_channels
-         self.inter_channels = inter_channels
-         self.hidden_channels = hidden_channels
-         self.filter_channels = filter_channels
-         self.n_heads = n_heads
-         self.n_layers = n_layers
-         self.kernel_size = kernel_size
-         self.p_dropout = p_dropout
-         self.resblock = resblock
-         self.resblock_kernel_sizes = resblock_kernel_sizes
-         self.resblock_dilation_sizes = resblock_dilation_sizes
-         self.upsample_rates = upsample_rates
-         self.upsample_initial_channel = upsample_initial_channel
-         self.upsample_kernel_sizes = upsample_kernel_sizes
-         self.segment_size = segment_size
-         self.gin_channels = gin_channels
-         self.ssl_dim = ssl_dim
-         self.emb_g = nn.Embedding(n_speakers, gin_channels)
-
-         self.enc_p_ = TextEncoder(ssl_dim, inter_channels, hidden_channels, 5, 1, 16, 0, filter_channels, n_heads, p_dropout)
-         hps = {
-             "sampling_rate": 32000,
-             "inter_channels": 192,
-             "resblock": "1",
-             "resblock_kernel_sizes": [3, 7, 11],
-             "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-             "upsample_rates": [10, 8, 2, 2],
-             "upsample_initial_channel": 512,
-             "upsample_kernel_sizes": [16, 16, 4, 4],
-             "gin_channels": 256,
-         }
-         self.dec = Generator(h=hps)
-         self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
-         self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
-
-     def forward(self, c, c_lengths, f0, g=None):
-         g = self.emb_g(g.unsqueeze(0)).transpose(1,2)
-         z_p, m_p, logs_p, c_mask = self.enc_p_(c.transpose(1,2), c_lengths, f0=f0_to_coarse(f0))
-         z = self.flow(z_p, c_mask, g=g, reverse=True)
-         o = self.dec(z * c_mask, g=g, f0=f0.float())
-         return o
-
 
onnx/model_onnx_48k.py DELETED
@@ -1,328 +0,0 @@
- import copy
- import math
- import torch
- from torch import nn
- from torch.nn import functional as F
-
- import modules.attentions as attentions
- import modules.commons as commons
- import modules.modules as modules
-
- from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
- from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
- from modules.commons import init_weights, get_padding
- from vdecoder.hifigan.models import Generator
- from utils import f0_to_coarse
-
- class ResidualCouplingBlock(nn.Module):
-     def __init__(self,
-                  channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  n_flows=4,
-                  gin_channels=0):
-         super().__init__()
-         self.channels = channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.n_flows = n_flows
-         self.gin_channels = gin_channels
-
-         self.flows = nn.ModuleList()
-         for i in range(n_flows):
-             self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
-             self.flows.append(modules.Flip())
-
-     def forward(self, x, x_mask, g=None, reverse=False):
-         if not reverse:
-             for flow in self.flows:
-                 x, _ = flow(x, x_mask, g=g, reverse=reverse)
-         else:
-             for flow in reversed(self.flows):
-                 x = flow(x, x_mask, g=g, reverse=reverse)
-         return x
-
-
- class Encoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-
-     def forward(self, x, x_lengths, g=None):
-         # print(x.shape,x_lengths.shape)
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = self.enc(x, x_mask, g=g)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-         return z, m, logs, x_mask
-
-
- class TextEncoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0,
-                  filter_channels=None,
-                  n_heads=None,
-                  p_dropout=None):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-         self.f0_emb = nn.Embedding(256, hidden_channels)
-
-         self.enc_ = attentions.Encoder(
-             hidden_channels,
-             filter_channels,
-             n_heads,
-             n_layers,
-             kernel_size,
-             p_dropout)
-
-     def forward(self, x, x_lengths, f0=None):
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = x + self.f0_emb(f0.long()).transpose(1,2)
-         x = self.enc_(x * x_mask, x_mask)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-
-         return z, m, logs, x_mask
-
-
- class DiscriminatorP(torch.nn.Module):
-     def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
-         super(DiscriminatorP, self).__init__()
-         self.period = period
-         self.use_spectral_norm = use_spectral_norm
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
-         ])
-         self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
-
-     def forward(self, x):
-         fmap = []
-
-         # 1d to 2d
-         b, c, t = x.shape
-         if t % self.period != 0: # pad first
-             n_pad = self.period - (t % self.period)
-             x = F.pad(x, (0, n_pad), "reflect")
-             t = t + n_pad
-         x = x.view(b, c, t // self.period, self.period)
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class DiscriminatorS(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(DiscriminatorS, self).__init__()
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv1d(1, 16, 15, 1, padding=7)),
-             norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
-             norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
-             norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
-             norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
-             norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
-         ])
-         self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
-
-     def forward(self, x):
-         fmap = []
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class MultiPeriodDiscriminator(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(MultiPeriodDiscriminator, self).__init__()
-         periods = [2,3,5,7,11]
-
-         discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
-         discs = discs + [DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods]
-         self.discriminators = nn.ModuleList(discs)
-
-     def forward(self, y, y_hat):
-         y_d_rs = []
-         y_d_gs = []
-         fmap_rs = []
-         fmap_gs = []
-         for i, d in enumerate(self.discriminators):
-             y_d_r, fmap_r = d(y)
-             y_d_g, fmap_g = d(y_hat)
-             y_d_rs.append(y_d_r)
-             y_d_gs.append(y_d_g)
-             fmap_rs.append(fmap_r)
-             fmap_gs.append(fmap_g)
-
-         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
- class SpeakerEncoder(torch.nn.Module):
-     def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
-         super(SpeakerEncoder, self).__init__()
-         self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)
-         self.linear = nn.Linear(model_hidden_size, model_embedding_size)
-         self.relu = nn.ReLU()
-
-     def forward(self, mels):
-         self.lstm.flatten_parameters()
-         _, (hidden, _) = self.lstm(mels)
-         embeds_raw = self.relu(self.linear(hidden[-1]))
-         return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
-
-     def compute_partial_slices(self, total_frames, partial_frames, partial_hop):
-         mel_slices = []
-         for i in range(0, total_frames-partial_frames, partial_hop):
-             mel_range = torch.arange(i, i+partial_frames)
-             mel_slices.append(mel_range)
-
-         return mel_slices
-
-     def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
-         mel_len = mel.size(1)
-         last_mel = mel[:,-partial_frames:]
-
-         if mel_len > partial_frames:
-             mel_slices = self.compute_partial_slices(mel_len, partial_frames, partial_hop)
-             mels = list(mel[:,s] for s in mel_slices)
-             mels.append(last_mel)
-             mels = torch.stack(tuple(mels), 0).squeeze(1)
-
-             with torch.no_grad():
-                 partial_embeds = self(mels)
-             embed = torch.mean(partial_embeds, axis=0).unsqueeze(0)
-             #embed = embed / torch.linalg.norm(embed, 2)
-         else:
-             with torch.no_grad():
-                 embed = self(last_mel)
-
-         return embed
-
-
- class SynthesizerTrn(nn.Module):
-     """
-     Synthesizer for Training
-     """
-
-     def __init__(self,
-                  spec_channels,
-                  segment_size,
-                  inter_channels,
-                  hidden_channels,
-                  filter_channels,
-                  n_heads,
-                  n_layers,
-                  kernel_size,
-                  p_dropout,
-                  resblock,
-                  resblock_kernel_sizes,
-                  resblock_dilation_sizes,
-                  upsample_rates,
-                  upsample_initial_channel,
-                  upsample_kernel_sizes,
-                  gin_channels,
-                  ssl_dim,
-                  n_speakers,
-                  **kwargs):
-
-         super().__init__()
-         self.spec_channels = spec_channels
-         self.inter_channels = inter_channels
-         self.hidden_channels = hidden_channels
-         self.filter_channels = filter_channels
-         self.n_heads = n_heads
-         self.n_layers = n_layers
-         self.kernel_size = kernel_size
-         self.p_dropout = p_dropout
-         self.resblock = resblock
-         self.resblock_kernel_sizes = resblock_kernel_sizes
-         self.resblock_dilation_sizes = resblock_dilation_sizes
-         self.upsample_rates = upsample_rates
-         self.upsample_initial_channel = upsample_initial_channel
-         self.upsample_kernel_sizes = upsample_kernel_sizes
-         self.segment_size = segment_size
-         self.gin_channels = gin_channels
-         self.ssl_dim = ssl_dim
-         self.emb_g = nn.Embedding(n_speakers, gin_channels)
-
-         self.enc_p_ = TextEncoder(ssl_dim, inter_channels, hidden_channels, 5, 1, 16, 0, filter_channels, n_heads, p_dropout)
-         hps = {
-             "sampling_rate": 48000,
-             "inter_channels": 192,
-             "resblock": "1",
-             "resblock_kernel_sizes": [3, 7, 11],
-             "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-             "upsample_rates": [10, 8, 2, 2],
-             "upsample_initial_channel": 512,
-             "upsample_kernel_sizes": [16, 16, 4, 4],
-             "gin_channels": 256,
-         }
-         self.dec = Generator(h=hps)
-         self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
-         self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
-
-     def forward(self, c, c_lengths, f0, g=None):
-         g = self.emb_g(g.unsqueeze(0)).transpose(1,2)
-         z_p, m_p, logs_p, c_mask = self.enc_p_(c.transpose(1,2), c_lengths, f0=f0_to_coarse(f0))
-         z = self.flow(z_p, c_mask, g=g, reverse=True)
-         o = self.dec(z * c_mask, g=g, f0=f0.float())
-         return o
-
 
onnx/onnx_export.py DELETED
@@ -1,73 +0,0 @@
- import argparse
- import time
- import numpy as np
- import onnx
- from onnxsim import simplify
- import onnxruntime as ort
- import onnxoptimizer
- import torch
- from model_onnx import SynthesizerTrn
- import utils
- from hubert import hubert_model_onnx
-
- def main(HubertExport,NetExport):
-
-     path = "NyaruTaffy"
-
-     if(HubertExport):
-         device = torch.device("cuda")
-         hubert_soft = utils.get_hubert_model()
-         test_input = torch.rand(1, 1, 16000)
-         input_names = ["source"]
-         output_names = ["embed"]
-         torch.onnx.export(hubert_soft.to(device),
-                           test_input.to(device),
-                           "hubert3.0.onnx",
-                           dynamic_axes={
-                               "source": {
-                                   2: "sample_length"
-                               }
-                           },
-                           verbose=False,
-                           opset_version=13,
-                           input_names=input_names,
-                           output_names=output_names)
-     if(NetExport):
-         device = torch.device("cuda")
-         hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
-         SVCVITS = SynthesizerTrn(
-             hps.data.filter_length // 2 + 1,
-             hps.train.segment_size // hps.data.hop_length,
-             **hps.model)
-         _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", SVCVITS, None)
-         _ = SVCVITS.eval().to(device)
-         for i in SVCVITS.parameters():
-             i.requires_grad = False
-         test_hidden_unit = torch.rand(1, 50, 256)
-         test_lengths = torch.LongTensor([50])
-         test_pitch = torch.rand(1, 50)
-         test_sid = torch.LongTensor([0])
-         input_names = ["hidden_unit", "lengths", "pitch", "sid"]
-         output_names = ["audio", ]
-         SVCVITS.eval()
-         torch.onnx.export(SVCVITS,
-                           (
-                               test_hidden_unit.to(device),
-                               test_lengths.to(device),
-                               test_pitch.to(device),
-                               test_sid.to(device)
-                           ),
-                           f"checkpoints/{path}/model.onnx",
-                           dynamic_axes={
-                               "hidden_unit": [0, 1],
-                               "pitch": [1]
-                           },
-                           do_constant_folding=False,
-                           opset_version=16,
-                           verbose=False,
-                           input_names=input_names,
-                           output_names=output_names)
-
-
- if __name__ == '__main__':
-     main(False,True)
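
The removed script drives torch.onnx.export with dynamic_axes so that the hidden-unit and pitch lengths stay dynamic in the exported graph. A minimal sketch of the same mechanism on a stand-in module (the toy model and output file name are illustrative only, not part of the repository):

    import torch
    from torch import nn

    toy = nn.Linear(256, 1).eval()      # stand-in for the real SynthesizerTrn
    dummy = torch.rand(1, 50, 256)      # (batch, frames, ssl_dim)
    torch.onnx.export(toy, (dummy,), "toy.onnx",
                      input_names=["hidden_unit"],
                      output_names=["audio"],
                      dynamic_axes={"hidden_unit": [0, 1]},  # batch and frame axes stay dynamic
                      opset_version=16)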
 
onnx/onnx_export_48k.py DELETED
@@ -1,73 +0,0 @@
- import argparse
- import time
- import numpy as np
- import onnx
- from onnxsim import simplify
- import onnxruntime as ort
- import onnxoptimizer
- import torch
- from model_onnx_48k import SynthesizerTrn
- import utils
- from hubert import hubert_model_onnx
-
- def main(HubertExport,NetExport):
-
-     path = "NyaruTaffy"
-
-     if(HubertExport):
-         device = torch.device("cuda")
-         hubert_soft = hubert_model_onnx.hubert_soft("hubert/model.pt")
-         test_input = torch.rand(1, 1, 16000)
-         input_names = ["source"]
-         output_names = ["embed"]
-         torch.onnx.export(hubert_soft.to(device),
-                           test_input.to(device),
-                           "hubert3.0.onnx",
-                           dynamic_axes={
-                               "source": {
-                                   2: "sample_length"
-                               }
-                           },
-                           verbose=False,
-                           opset_version=13,
-                           input_names=input_names,
-                           output_names=output_names)
-     if(NetExport):
-         device = torch.device("cuda")
-         hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
-         SVCVITS = SynthesizerTrn(
-             hps.data.filter_length // 2 + 1,
-             hps.train.segment_size // hps.data.hop_length,
-             **hps.model)
-         _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", SVCVITS, None)
-         _ = SVCVITS.eval().to(device)
-         for i in SVCVITS.parameters():
-             i.requires_grad = False
-         test_hidden_unit = torch.rand(1, 50, 256)
-         test_lengths = torch.LongTensor([50])
-         test_pitch = torch.rand(1, 50)
-         test_sid = torch.LongTensor([0])
-         input_names = ["hidden_unit", "lengths", "pitch", "sid"]
-         output_names = ["audio", ]
-         SVCVITS.eval()
-         torch.onnx.export(SVCVITS,
-                           (
-                               test_hidden_unit.to(device),
-                               test_lengths.to(device),
-                               test_pitch.to(device),
-                               test_sid.to(device)
-                           ),
-                           f"checkpoints/{path}/model.onnx",
-                           dynamic_axes={
-                               "hidden_unit": [0, 1],
-                               "pitch": [1]
-                           },
-                           do_constant_folding=False,
-                           opset_version=16,
-                           verbose=False,
-                           input_names=input_names,
-                           output_names=output_names)
-
-
- if __name__ == '__main__':
-     main(False,True)
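
A model exported by either script could then be run with onnxruntime using the same input names; a rough sketch, with shapes and the checkpoint path purely illustrative:

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("checkpoints/NyaruTaffy/model.onnx")
    feeds = {
        "hidden_unit": np.random.rand(1, 50, 256).astype(np.float32),
        "lengths": np.array([50], dtype=np.int64),
        "pitch": np.random.rand(1, 50).astype(np.float32),
        "sid": np.array([0], dtype=np.int64),
    }
    audio = sess.run(["audio"], feeds)[0]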
 
requirements.txt CHANGED
@@ -19,3 +19,4 @@ onnxsim
  onnxoptimizer
  fairseq
  librosa
+ edge-tts
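
The new edge-tts dependency backs the TTS feature this commit pulls in from upstream; a minimal sketch of generating speech with it (the voice name and output path are illustrative):

    import asyncio
    import edge_tts

    async def synth(text, out_path="tts_out.mp3"):
        # Synthesize with a Microsoft Edge neural voice and save to disk.
        await edge_tts.Communicate(text, "en-US-AriaNeural").save(out_path)

    asyncio.run(synth("hello world"))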
vdecoder/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/vdecoder/__pycache__/__init__.cpython-38.pyc and b/vdecoder/__pycache__/__init__.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/env.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/env.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/env.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/models.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/models.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/models.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/utils.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc differ