mthsk committed
Commit d4b2b22 · 1 Parent(s): 6844b9e

Add TTS from upstream

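In short, this commit swaps the CC BY-NC 4.0 LICENSE for MIT and pulls the upstream text-to-speech path into both Gradio apps: an optional tts_mode that synthesizes speech with edge-tts and feeds the result to the so-vits-svc model instead of an uploaded recording. The sketch below summarizes that added input path; it is a minimal illustration rather than code from the diff, and the helper name tts_to_wav and the example voice are placeholders.

import asyncio

import edge_tts   # Microsoft Edge text-to-speech client used by the new tts_mode
import librosa    # decodes the generated MP3 into a float waveform
import soundfile  # re-encodes the waveform as WAV for the voice-conversion model

def tts_to_wav(text: str, voice: str = "en-US-AriaNeural", sr: int = 24000) -> str:
    # Communicate(...).save(...) is a coroutine, so the apps drive it with asyncio.run.
    asyncio.run(edge_tts.Communicate(text, voice).save("tts.mp3"))
    # Decode the MP3 (resampled to sr) and write a plain WAV the Svc class can read.
    audio, _ = librosa.load("tts.mp3", sr=sr, mono=True)
    soundfile.write("tts.wav", audio, sr, format="wav")
    return "tts.wav"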
LICENSE CHANGED
@@ -1,407 +1,21 @@
1
- Attribution-NonCommercial 4.0 International
2
-
3
- =======================================================================
4
-
5
- Creative Commons Corporation ("Creative Commons") is not a law firm and
6
- does not provide legal services or legal advice. Distribution of
7
- Creative Commons public licenses does not create a lawyer-client or
8
- other relationship. Creative Commons makes its licenses and related
9
- information available on an "as-is" basis. Creative Commons gives no
10
- warranties regarding its licenses, any material licensed under their
11
- terms and conditions, or any related information. Creative Commons
12
- disclaims all liability for damages resulting from their use to the
13
- fullest extent possible.
14
-
15
- Using Creative Commons Public Licenses
16
-
17
- Creative Commons public licenses provide a standard set of terms and
18
- conditions that creators and other rights holders may use to share
19
- original works of authorship and other material subject to copyright
20
- and certain other rights specified in the public license below. The
21
- following considerations are for informational purposes only, are not
22
- exhaustive, and do not form part of our licenses.
23
-
24
- Considerations for licensors: Our public licenses are
25
- intended for use by those authorized to give the public
26
- permission to use material in ways otherwise restricted by
27
- copyright and certain other rights. Our licenses are
28
- irrevocable. Licensors should read and understand the terms
29
- and conditions of the license they choose before applying it.
30
- Licensors should also secure all rights necessary before
31
- applying our licenses so that the public can reuse the
32
- material as expected. Licensors should clearly mark any
33
- material not subject to the license. This includes other CC-
34
- licensed material, or material used under an exception or
35
- limitation to copyright. More considerations for licensors:
36
- wiki.creativecommons.org/Considerations_for_licensors
37
-
38
- Considerations for the public: By using one of our public
39
- licenses, a licensor grants the public permission to use the
40
- licensed material under specified terms and conditions. If
41
- the licensor's permission is not necessary for any reason--for
42
- example, because of any applicable exception or limitation to
43
- copyright--then that use is not regulated by the license. Our
44
- licenses grant only permissions under copyright and certain
45
- other rights that a licensor has authority to grant. Use of
46
- the licensed material may still be restricted for other
47
- reasons, including because others have copyright or other
48
- rights in the material. A licensor may make special requests,
49
- such as asking that all changes be marked or described.
50
- Although not required by our licenses, you are encouraged to
51
- respect those requests where reasonable. More considerations
52
- for the public:
53
- wiki.creativecommons.org/Considerations_for_licensees
54
-
55
- =======================================================================
56
-
57
- Creative Commons Attribution-NonCommercial 4.0 International Public
58
- License
59
-
60
- By exercising the Licensed Rights (defined below), You accept and agree
61
- to be bound by the terms and conditions of this Creative Commons
62
- Attribution-NonCommercial 4.0 International Public License ("Public
63
- License"). To the extent this Public License may be interpreted as a
64
- contract, You are granted the Licensed Rights in consideration of Your
65
- acceptance of these terms and conditions, and the Licensor grants You
66
- such rights in consideration of benefits the Licensor receives from
67
- making the Licensed Material available under these terms and
68
- conditions.
69
-
70
-
71
- Section 1 -- Definitions.
72
-
73
- a. Adapted Material means material subject to Copyright and Similar
74
- Rights that is derived from or based upon the Licensed Material
75
- and in which the Licensed Material is translated, altered,
76
- arranged, transformed, or otherwise modified in a manner requiring
77
- permission under the Copyright and Similar Rights held by the
78
- Licensor. For purposes of this Public License, where the Licensed
79
- Material is a musical work, performance, or sound recording,
80
- Adapted Material is always produced where the Licensed Material is
81
- synched in timed relation with a moving image.
82
-
83
- b. Adapter's License means the license You apply to Your Copyright
84
- and Similar Rights in Your contributions to Adapted Material in
85
- accordance with the terms and conditions of this Public License.
86
-
87
- c. Copyright and Similar Rights means copyright and/or similar rights
88
- closely related to copyright including, without limitation,
89
- performance, broadcast, sound recording, and Sui Generis Database
90
- Rights, without regard to how the rights are labeled or
91
- categorized. For purposes of this Public License, the rights
92
- specified in Section 2(b)(1)-(2) are not Copyright and Similar
93
- Rights.
94
- d. Effective Technological Measures means those measures that, in the
95
- absence of proper authority, may not be circumvented under laws
96
- fulfilling obligations under Article 11 of the WIPO Copyright
97
- Treaty adopted on December 20, 1996, and/or similar international
98
- agreements.
99
-
100
- e. Exceptions and Limitations means fair use, fair dealing, and/or
101
- any other exception or limitation to Copyright and Similar Rights
102
- that applies to Your use of the Licensed Material.
103
-
104
- f. Licensed Material means the artistic or literary work, database,
105
- or other material to which the Licensor applied this Public
106
- License.
107
-
108
- g. Licensed Rights means the rights granted to You subject to the
109
- terms and conditions of this Public License, which are limited to
110
- all Copyright and Similar Rights that apply to Your use of the
111
- Licensed Material and that the Licensor has authority to license.
112
-
113
- h. Licensor means the individual(s) or entity(ies) granting rights
114
- under this Public License.
115
-
116
- i. NonCommercial means not primarily intended for or directed towards
117
- commercial advantage or monetary compensation. For purposes of
118
- this Public License, the exchange of the Licensed Material for
119
- other material subject to Copyright and Similar Rights by digital
120
- file-sharing or similar means is NonCommercial provided there is
121
- no payment of monetary compensation in connection with the
122
- exchange.
123
-
124
- j. Share means to provide material to the public by any means or
125
- process that requires permission under the Licensed Rights, such
126
- as reproduction, public display, public performance, distribution,
127
- dissemination, communication, or importation, and to make material
128
- available to the public including in ways that members of the
129
- public may access the material from a place and at a time
130
- individually chosen by them.
131
-
132
- k. Sui Generis Database Rights means rights other than copyright
133
- resulting from Directive 96/9/EC of the European Parliament and of
134
- the Council of 11 March 1996 on the legal protection of databases,
135
- as amended and/or succeeded, as well as other essentially
136
- equivalent rights anywhere in the world.
137
-
138
- l. You means the individual or entity exercising the Licensed Rights
139
- under this Public License. Your has a corresponding meaning.
140
-
141
-
142
- Section 2 -- Scope.
143
-
144
- a. License grant.
145
-
146
- 1. Subject to the terms and conditions of this Public License,
147
- the Licensor hereby grants You a worldwide, royalty-free,
148
- non-sublicensable, non-exclusive, irrevocable license to
149
- exercise the Licensed Rights in the Licensed Material to:
150
-
151
- a. reproduce and Share the Licensed Material, in whole or
152
- in part, for NonCommercial purposes only; and
153
-
154
- b. produce, reproduce, and Share Adapted Material for
155
- NonCommercial purposes only.
156
-
157
- 2. Exceptions and Limitations. For the avoidance of doubt, where
158
- Exceptions and Limitations apply to Your use, this Public
159
- License does not apply, and You do not need to comply with
160
- its terms and conditions.
161
-
162
- 3. Term. The term of this Public License is specified in Section
163
- 6(a).
164
-
165
- 4. Media and formats; technical modifications allowed. The
166
- Licensor authorizes You to exercise the Licensed Rights in
167
- all media and formats whether now known or hereafter created,
168
- and to make technical modifications necessary to do so. The
169
- Licensor waives and/or agrees not to assert any right or
170
- authority to forbid You from making technical modifications
171
- necessary to exercise the Licensed Rights, including
172
- technical modifications necessary to circumvent Effective
173
- Technological Measures. For purposes of this Public License,
174
- simply making modifications authorized by this Section 2(a)
175
- (4) never produces Adapted Material.
176
-
177
- 5. Downstream recipients.
178
-
179
- a. Offer from the Licensor -- Licensed Material. Every
180
- recipient of the Licensed Material automatically
181
- receives an offer from the Licensor to exercise the
182
- Licensed Rights under the terms and conditions of this
183
- Public License.
184
-
185
- b. No downstream restrictions. You may not offer or impose
186
- any additional or different terms or conditions on, or
187
- apply any Effective Technological Measures to, the
188
- Licensed Material if doing so restricts exercise of the
189
- Licensed Rights by any recipient of the Licensed
190
- Material.
191
-
192
- 6. No endorsement. Nothing in this Public License constitutes or
193
- may be construed as permission to assert or imply that You
194
- are, or that Your use of the Licensed Material is, connected
195
- with, or sponsored, endorsed, or granted official status by,
196
- the Licensor or others designated to receive attribution as
197
- provided in Section 3(a)(1)(A)(i).
198
-
199
- b. Other rights.
200
-
201
- 1. Moral rights, such as the right of integrity, are not
202
- licensed under this Public License, nor are publicity,
203
- privacy, and/or other similar personality rights; however, to
204
- the extent possible, the Licensor waives and/or agrees not to
205
- assert any such rights held by the Licensor to the limited
206
- extent necessary to allow You to exercise the Licensed
207
- Rights, but not otherwise.
208
-
209
- 2. Patent and trademark rights are not licensed under this
210
- Public License.
211
-
212
- 3. To the extent possible, the Licensor waives any right to
213
- collect royalties from You for the exercise of the Licensed
214
- Rights, whether directly or through a collecting society
215
- under any voluntary or waivable statutory or compulsory
216
- licensing scheme. In all other cases the Licensor expressly
217
- reserves any right to collect such royalties, including when
218
- the Licensed Material is used other than for NonCommercial
219
- purposes.
220
-
221
-
222
- Section 3 -- License Conditions.
223
-
224
- Your exercise of the Licensed Rights is expressly made subject to the
225
- following conditions.
226
-
227
- a. Attribution.
228
-
229
- 1. If You Share the Licensed Material (including in modified
230
- form), You must:
231
-
232
- a. retain the following if it is supplied by the Licensor
233
- with the Licensed Material:
234
-
235
- i. identification of the creator(s) of the Licensed
236
- Material and any others designated to receive
237
- attribution, in any reasonable manner requested by
238
- the Licensor (including by pseudonym if
239
- designated);
240
-
241
- ii. a copyright notice;
242
-
243
- iii. a notice that refers to this Public License;
244
-
245
- iv. a notice that refers to the disclaimer of
246
- warranties;
247
-
248
- v. a URI or hyperlink to the Licensed Material to the
249
- extent reasonably practicable;
250
-
251
- b. indicate if You modified the Licensed Material and
252
- retain an indication of any previous modifications; and
253
-
254
- c. indicate the Licensed Material is licensed under this
255
- Public License, and include the text of, or the URI or
256
- hyperlink to, this Public License.
257
-
258
- 2. You may satisfy the conditions in Section 3(a)(1) in any
259
- reasonable manner based on the medium, means, and context in
260
- which You Share the Licensed Material. For example, it may be
261
- reasonable to satisfy the conditions by providing a URI or
262
- hyperlink to a resource that includes the required
263
- information.
264
-
265
- 3. If requested by the Licensor, You must remove any of the
266
- information required by Section 3(a)(1)(A) to the extent
267
- reasonably practicable.
268
-
269
- 4. If You Share Adapted Material You produce, the Adapter's
270
- License You apply must not prevent recipients of the Adapted
271
- Material from complying with this Public License.
272
-
273
-
274
- Section 4 -- Sui Generis Database Rights.
275
-
276
- Where the Licensed Rights include Sui Generis Database Rights that
277
- apply to Your use of the Licensed Material:
278
-
279
- a. for the avoidance of doubt, Section 2(a)(1) grants You the right
280
- to extract, reuse, reproduce, and Share all or a substantial
281
- portion of the contents of the database for NonCommercial purposes
282
- only;
283
-
284
- b. if You include all or a substantial portion of the database
285
- contents in a database in which You have Sui Generis Database
286
- Rights, then the database in which You have Sui Generis Database
287
- Rights (but not its individual contents) is Adapted Material; and
288
-
289
- c. You must comply with the conditions in Section 3(a) if You Share
290
- all or a substantial portion of the contents of the database.
291
-
292
- For the avoidance of doubt, this Section 4 supplements and does not
293
- replace Your obligations under this Public License where the Licensed
294
- Rights include other Copyright and Similar Rights.
295
-
296
-
297
- Section 5 -- Disclaimer of Warranties and Limitation of Liability.
298
-
299
- a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
300
- EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
301
- AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
302
- ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
303
- IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
304
- WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
305
- PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
306
- ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
307
- KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
308
- ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
309
-
310
- b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
311
- TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
312
- NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
313
- INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
314
- COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
315
- USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
316
- ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
317
- DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
318
- IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
319
-
320
- c. The disclaimer of warranties and limitation of liability provided
321
- above shall be interpreted in a manner that, to the extent
322
- possible, most closely approximates an absolute disclaimer and
323
- waiver of all liability.
324
-
325
-
326
- Section 6 -- Term and Termination.
327
-
328
- a. This Public License applies for the term of the Copyright and
329
- Similar Rights licensed here. However, if You fail to comply with
330
- this Public License, then Your rights under this Public License
331
- terminate automatically.
332
-
333
- b. Where Your right to use the Licensed Material has terminated under
334
- Section 6(a), it reinstates:
335
-
336
- 1. automatically as of the date the violation is cured, provided
337
- it is cured within 30 days of Your discovery of the
338
- violation; or
339
-
340
- 2. upon express reinstatement by the Licensor.
341
-
342
- For the avoidance of doubt, this Section 6(b) does not affect any
343
- right the Licensor may have to seek remedies for Your violations
344
- of this Public License.
345
-
346
- c. For the avoidance of doubt, the Licensor may also offer the
347
- Licensed Material under separate terms or conditions or stop
348
- distributing the Licensed Material at any time; however, doing so
349
- will not terminate this Public License.
350
-
351
- d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
352
- License.
353
-
354
-
355
- Section 7 -- Other Terms and Conditions.
356
-
357
- a. The Licensor shall not be bound by any additional or different
358
- terms or conditions communicated by You unless expressly agreed.
359
-
360
- b. Any arrangements, understandings, or agreements regarding the
361
- Licensed Material not stated herein are separate from and
362
- independent of the terms and conditions of this Public License.
363
-
364
-
365
- Section 8 -- Interpretation.
366
-
367
- a. For the avoidance of doubt, this Public License does not, and
368
- shall not be interpreted to, reduce, limit, restrict, or impose
369
- conditions on any use of the Licensed Material that could lawfully
370
- be made without permission under this Public License.
371
-
372
- b. To the extent possible, if any provision of this Public License is
373
- deemed unenforceable, it shall be automatically reformed to the
374
- minimum extent necessary to make it enforceable. If the provision
375
- cannot be reformed, it shall be severed from this Public License
376
- without affecting the enforceability of the remaining terms and
377
- conditions.
378
-
379
- c. No term or condition of this Public License will be waived and no
380
- failure to comply consented to unless expressly agreed to by the
381
- Licensor.
382
-
383
- d. Nothing in this Public License constitutes or may be interpreted
384
- as a limitation upon, or waiver of, any privileges and immunities
385
- that apply to the Licensor or You, including from the legal
386
- processes of any jurisdiction or authority.
387
-
388
- =======================================================================
389
-
390
- Creative Commons is not a party to its public
391
- licenses. Notwithstanding, Creative Commons may elect to apply one of
392
- its public licenses to material it publishes and in those instances
393
- will be considered the “Licensor.” The text of the Creative Commons
394
- public licenses is dedicated to the public domain under the CC0 Public
395
- Domain Dedication. Except for the limited purpose of indicating that
396
- material is shared under a Creative Commons public license or as
397
- otherwise permitted by the Creative Commons policies published at
398
- creativecommons.org/policies, Creative Commons does not authorize the
399
- use of the trademark "Creative Commons" or any other trademark or logo
400
- of Creative Commons without its prior written consent including,
401
- without limitation, in connection with any unauthorized modifications
402
- to any of its public licenses or any other arrangements,
403
- understandings, or agreements concerning use of licensed material. For
404
- the avoidance of doubt, this paragraph does not form part of the
405
- public licenses.
406
-
407
- Creative Commons may be contacted at creativecommons.org.
 
1
+ MIT License
2
+
3
+ Copyright (c) 2021 Jingyi Li
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
app-slice.py CHANGED
@@ -1,7 +1,6 @@
1
  import os
2
  import gradio as gr
3
- import librosa
4
- import numpy as np
5
  from pathlib import Path
6
  import inference.infer_tool as infer_tool
7
  import utils
@@ -9,6 +8,8 @@ from inference.infer_tool import Svc
9
  import logging
10
  import webbrowser
11
  import argparse
 
 
12
  import soundfile
13
  import gradio.processing_utils as gr_processing_utils
14
  logging.getLogger('numba').setLevel(logging.WARNING)
@@ -29,14 +30,24 @@ def audio_postprocess(self, y):
29
 
30
  gr.Audio.postprocess = audio_postprocess
31
  def create_vc_fn(model, sid):
32
- def vc_fn(input_audio, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds):
33
- if input_audio is None:
34
- return "You need to select an audio", None
35
- raw_audio_path = f"raw/{input_audio}"
36
- if "." not in raw_audio_path:
37
- raw_audio_path += ".wav"
38
- infer_tool.format_wav(raw_audio_path)
39
- wav_path = Path(raw_audio_path).with_suffix('.wav')
40
  _audio = model.slice_inference(
41
  wav_path, sid, vc_transform, slice_db,
42
  cluster_infer_ratio=0,
@@ -50,6 +61,11 @@ def create_vc_fn(model, sid):
50
  def refresh_raw_wav():
51
  return gr.Dropdown.update(choices=os.listdir("raw"))
52
 
53
 
54
  if __name__ == '__main__':
55
  parser = argparse.ArgumentParser()
@@ -60,10 +76,14 @@ if __name__ == '__main__':
60
  args = parser.parse_args()
61
  hubert_model = utils.get_hubert_model().to(args.device)
62
  models = []
63
  raw = os.listdir("raw")
64
  for f in os.listdir("models"):
65
  name = f
66
- model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device, hubert_model=hubert_model)
67
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
68
  models.append((name, cover, create_vc_fn(model, name)))
69
  with gr.Blocks() as app:
@@ -100,12 +120,16 @@ if __name__ == '__main__':
100
  noise_scale = gr.Number(label="noise_scale", value=0.4)
101
  pad_seconds = gr.Number(label="pad_seconds", value=0.5)
102
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
103
  vc_submit = gr.Button("Generate", variant="primary")
104
  with gr.Column():
105
  vc_output1 = gr.Textbox(label="Output Message")
106
  vc_output2 = gr.Audio(label="Output Audio")
107
- vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds], [vc_output1, vc_output2])
108
  vc_refresh.click(refresh_raw_wav, [], [vc_input])
 
109
  if args.colab:
110
  webbrowser.open("http://127.0.0.1:7860")
111
  app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
 
1
  import os
2
  import gradio as gr
3
+ import edge_tts
 
4
  from pathlib import Path
5
  import inference.infer_tool as infer_tool
6
  import utils
 
8
  import logging
9
  import webbrowser
10
  import argparse
11
+ import asyncio
12
+ import librosa
13
  import soundfile
14
  import gradio.processing_utils as gr_processing_utils
15
  logging.getLogger('numba').setLevel(logging.WARNING)
 
30
 
31
  gr.Audio.postprocess = audio_postprocess
32
  def create_vc_fn(model, sid):
33
+ def vc_fn(input_audio, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds, tts_text, tts_voice, tts_mode):
34
+ if tts_mode:
35
+ if len(tts_text) > 100 and limitation:
36
+ return "Text is too long", None
37
+ if tts_text is None or tts_voice is None:
38
+ return "You need to enter text and select a voice", None
39
+ asyncio.run(edge_tts.Communicate(tts_text, "-".join(tts_voice.split('-')[:-1])).save("tts.mp3"))
40
+ audio, sr = librosa.load("tts.mp3")
41
+ soundfile.write("tts.wav", audio, 24000, format="wav")
42
+ wav_path = "tts.wav"
43
+ else:
44
+ if input_audio is None:
45
+ return "You need to select an audio", None
46
+ raw_audio_path = f"raw/{input_audio}"
47
+ if "." not in raw_audio_path:
48
+ raw_audio_path += ".wav"
49
+ infer_tool.format_wav(raw_audio_path)
50
+ wav_path = Path(raw_audio_path).with_suffix('.wav')
51
  _audio = model.slice_inference(
52
  wav_path, sid, vc_transform, slice_db,
53
  cluster_infer_ratio=0,
 
61
  def refresh_raw_wav():
62
  return gr.Dropdown.update(choices=os.listdir("raw"))
63
 
64
+ def change_to_tts_mode(tts_mode):
65
+ if tts_mode:
66
+ return gr.Audio.update(visible=False), gr.Button.update(visible=False), gr.Textbox.update(visible=True), gr.Dropdown.update(visible=True)
67
+ else:
68
+ return gr.Audio.update(visible=True), gr.Button.update(visible=True), gr.Textbox.update(visible=False), gr.Dropdown.update(visible=False)
69
 
70
  if __name__ == '__main__':
71
  parser = argparse.ArgumentParser()
 
76
  args = parser.parse_args()
77
  hubert_model = utils.get_hubert_model().to(args.device)
78
  models = []
79
+ voices = []
80
+ tts_voice_list = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
81
+ for r in tts_voice_list:
82
+ voices.append(f"{r['ShortName']}-{r['Gender']}")
83
  raw = os.listdir("raw")
84
  for f in os.listdir("models"):
85
  name = f
86
+ model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device)
87
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
88
  models.append((name, cover, create_vc_fn(model, name)))
89
  with gr.Blocks() as app:
 
120
  noise_scale = gr.Number(label="noise_scale", value=0.4)
121
  pad_seconds = gr.Number(label="pad_seconds", value=0.5)
122
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
123
+ tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
124
+ tts_text = gr.Textbox(visible=False,label="TTS text (100 words limitation)" if limitation else "TTS text")
125
+ tts_voice = gr.Dropdown(choices=voices, visible=False)
126
  vc_submit = gr.Button("Generate", variant="primary")
127
  with gr.Column():
128
  vc_output1 = gr.Textbox(label="Output Message")
129
  vc_output2 = gr.Audio(label="Output Audio")
130
+ vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, slice_db, noise_scale, pad_seconds, tts_text, tts_voice, tts_mode], [vc_output1, vc_output2])
131
  vc_refresh.click(refresh_raw_wav, [], [vc_input])
132
+ tts_mode.change(change_to_tts_mode, [tts_mode], [vc_input, vc_refresh, tts_text, tts_voice])
133
  if args.colab:
134
  webbrowser.open("http://127.0.0.1:7860")
135
  app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
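A note on the new voice dropdown added above: it is filled from edge-tts's voice catalogue at startup. A standalone sketch of that lookup (outside the Gradio app, using the same API the diff calls):

import asyncio
import edge_tts

# list_voices() is a coroutine returning dicts with ShortName and Gender keys;
# the app joins them into dropdown entries like "en-US-AriaNeural-Female" and
# later strips the trailing gender before passing the ShortName to Communicate.
voices = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
choices = [f"{v['ShortName']}-{v['Gender']}" for v in voices]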
app.py CHANGED
@@ -7,7 +7,9 @@ import utils
7
  from inference.infer_tool import Svc
8
  import logging
9
  import soundfile
 
10
  import argparse
 
11
  import gradio.processing_utils as gr_processing_utils
12
  logging.getLogger('numba').setLevel(logging.WARNING)
13
  logging.getLogger('markdown_it').setLevel(logging.WARNING)
@@ -27,7 +29,21 @@ def audio_postprocess(self, y):
27
 
28
  gr.Audio.postprocess = audio_postprocess
29
  def create_vc_fn(model, sid):
30
- def vc_fn(input_audio, vc_transform, auto_f0):
31
  if input_audio is None:
32
  return "You need to upload an audio", None
33
  sampling_rate, audio = input_audio
@@ -48,6 +64,12 @@ def create_vc_fn(model, sid):
48
  return "Success", (44100, out_audio.cpu().numpy())
49
  return vc_fn
50
 
51
  if __name__ == '__main__':
52
  parser = argparse.ArgumentParser()
53
  parser.add_argument('--device', type=str, default='cpu')
@@ -56,16 +78,27 @@ if __name__ == '__main__':
56
  args = parser.parse_args()
57
  hubert_model = utils.get_hubert_model().to(args.device)
58
  models = []
59
  for f in os.listdir("models"):
60
  name = f
61
- model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device, hubert_model=hubert_model)
62
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
63
  models.append((name, cover, create_vc_fn(model, name)))
64
  with gr.Blocks() as app:
65
  gr.Markdown(
66
  "# <center> Sovits Models\n"
67
- "## <center> The input audio should be clean and pure voice without background music.\n\n"
68
- "[Original Repo](https://github.com/svc-develop-team/so-vits-svc)\n\n"
 
69
 
70
  )
71
  with gr.Tabs():
@@ -82,9 +115,25 @@ if __name__ == '__main__':
82
  vc_input = gr.Audio(label="Input audio"+' (less than 20 seconds)' if limitation else '')
83
  vc_transform = gr.Number(label="vc_transform", value=0)
84
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
85
  vc_submit = gr.Button("Generate", variant="primary")
86
  with gr.Column():
87
  vc_output1 = gr.Textbox(label="Output Message")
88
  vc_output2 = gr.Audio(label="Output Audio")
89
- vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0], [vc_output1, vc_output2])
90
- app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
7
  from inference.infer_tool import Svc
8
  import logging
9
  import soundfile
10
+ import asyncio
11
  import argparse
12
+ import edge_tts
13
  import gradio.processing_utils as gr_processing_utils
14
  logging.getLogger('numba').setLevel(logging.WARNING)
15
  logging.getLogger('markdown_it').setLevel(logging.WARNING)
 
29
 
30
  gr.Audio.postprocess = audio_postprocess
31
  def create_vc_fn(model, sid):
32
+ def vc_fn(input_audio, vc_transform, auto_f0, tts_text, tts_voice, tts_mode):
33
+ if tts_mode:
34
+ if len(tts_text) > 100 and limitation:
35
+ return "Text is too long", None
36
+ if tts_text is None or tts_voice is None:
37
+ return "You need to enter text and select a voice", None
38
+ asyncio.run(edge_tts.Communicate(tts_text, "-".join(tts_voice.split('-')[:-1])).save("tts.mp3"))
39
+ audio, sr = librosa.load("tts.mp3", sr=16000, mono=True)
40
+ raw_path = io.BytesIO()
41
+ soundfile.write(raw_path, audio, 16000, format="wav")
42
+ raw_path.seek(0)
43
+ out_audio, out_sr = model.infer(sid, vc_transform, raw_path,
44
+ auto_predict_f0=auto_f0,
45
+ )
46
+ return "Success", (44100, out_audio.cpu().numpy())
47
  if input_audio is None:
48
  return "You need to upload an audio", None
49
  sampling_rate, audio = input_audio
 
64
  return "Success", (44100, out_audio.cpu().numpy())
65
  return vc_fn
66
 
67
+ def change_to_tts_mode(tts_mode):
68
+ if tts_mode:
69
+ return gr.Audio.update(visible=False), gr.Textbox.update(visible=True), gr.Dropdown.update(visible=True), gr.Checkbox.update(value=True)
70
+ else:
71
+ return gr.Audio.update(visible=True), gr.Textbox.update(visible=False), gr.Dropdown.update(visible=False), gr.Checkbox.update(value=False)
72
+
73
  if __name__ == '__main__':
74
  parser = argparse.ArgumentParser()
75
  parser.add_argument('--device', type=str, default='cpu')
 
78
  args = parser.parse_args()
79
  hubert_model = utils.get_hubert_model().to(args.device)
80
  models = []
81
+ others = {
82
+ "rudolf": "https://huggingface.co/spaces/sayashi/sovits-rudolf",
83
+ "teio": "https://huggingface.co/spaces/sayashi/sovits-teio",
84
+ "goldship": "https://huggingface.co/spaces/sayashi/sovits-goldship",
85
+ "tannhauser": "https://huggingface.co/spaces/sayashi/sovits-tannhauser"
86
+ }
87
+ voices = []
88
+ tts_voice_list = asyncio.get_event_loop().run_until_complete(edge_tts.list_voices())
89
+ for r in tts_voice_list:
90
+ voices.append(f"{r['ShortName']}-{r['Gender']}")
91
  for f in os.listdir("models"):
92
  name = f
93
+ model = Svc(fr"models/{f}/{f}.pth", f"models/{f}/config.json", device=args.device)
94
  cover = f"models/{f}/cover.png" if os.path.exists(f"models/{f}/cover.png") else None
95
  models.append((name, cover, create_vc_fn(model, name)))
96
  with gr.Blocks() as app:
97
  gr.Markdown(
98
  "# <center> Sovits Models\n"
99
+ "## <center> The input audio should be clean and pure voice without background music.\n"
100
+ "![visitor badge](https://visitor-badge.glitch.me/badge?page_id=mthsk.sovits-models)\n\n"
101
+ "[![Original Repo](https://badgen.net/badge/icon/github?icon=github&label=Original%20Repo)](https://github.com/svc-develop-team/so-vits-svc)\n\n"
102
 
103
  )
104
  with gr.Tabs():
 
115
  vc_input = gr.Audio(label="Input audio"+' (less than 20 seconds)' if limitation else '')
116
  vc_transform = gr.Number(label="vc_transform", value=0)
117
  auto_f0 = gr.Checkbox(label="auto_f0", value=False)
118
+ tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
119
+ tts_text = gr.Textbox(visible=False, label="TTS text (100 words limitation)" if limitation else "TTS text")
120
+ tts_voice = gr.Dropdown(choices=voices, visible=False)
121
  vc_submit = gr.Button("Generate", variant="primary")
122
  with gr.Column():
123
  vc_output1 = gr.Textbox(label="Output Message")
124
  vc_output2 = gr.Audio(label="Output Audio")
125
+ vc_submit.click(vc_fn, [vc_input, vc_transform, auto_f0, tts_text, tts_voice, tts_mode], [vc_output1, vc_output2])
126
+ tts_mode.change(change_to_tts_mode, [tts_mode], [vc_input, tts_text, tts_voice, auto_f0])
127
+ for category, link in others.items():
128
+ with gr.TabItem(category):
129
+ gr.Markdown(
130
+ f'''
131
+ <center>
132
+ <h2>Click to Go</h2>
133
+ <a href="{link}">
134
+ <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-xl-dark.svg"
135
+ </a>
136
+ </center>
137
+ '''
138
+ )
139
+ app.queue(concurrency_count=1, api_open=args.api).launch(share=args.share)
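For context on the UI change in both apps: the tts_mode checkbox only toggles component visibility and reroutes what vc_fn reads as input. A minimal standalone Gradio 3.x sketch of the same pattern (component names here are illustrative, not the app's):

import gradio as gr

def toggle_tts(tts_mode: bool):
    # Mirror change_to_tts_mode: hide the audio input when TTS is the source,
    # show the text box and voice dropdown instead.
    return (gr.Audio.update(visible=not tts_mode),
            gr.Textbox.update(visible=tts_mode),
            gr.Dropdown.update(visible=tts_mode))

with gr.Blocks() as demo:
    tts_mode = gr.Checkbox(label="tts (use edge-tts as input)", value=False)
    audio_in = gr.Audio(label="Input audio")
    tts_text = gr.Textbox(label="TTS text", visible=False)
    tts_voice = gr.Dropdown(choices=["en-US-AriaNeural-Female"], visible=False)
    tts_mode.change(toggle_tts, [tts_mode], [audio_in, tts_text, tts_voice])

demo.launch()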
cluster/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/cluster/__pycache__/__init__.cpython-38.pyc and b/cluster/__pycache__/__init__.cpython-38.pyc differ
 
hubert/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/hubert/__pycache__/__init__.cpython-38.pyc and b/hubert/__pycache__/__init__.cpython-38.pyc differ
 
hubert/__pycache__/hubert_model.cpython-38.pyc CHANGED
Binary files a/hubert/__pycache__/hubert_model.cpython-38.pyc and b/hubert/__pycache__/hubert_model.cpython-38.pyc differ
 
inference/__pycache__/infer_tool.cpython-38.pyc CHANGED
Binary files a/inference/__pycache__/infer_tool.cpython-38.pyc and b/inference/__pycache__/infer_tool.cpython-38.pyc differ
 
inference/infer_tool.py CHANGED
@@ -108,8 +108,11 @@ def split_list_by_n(list_collection, n, pre=0):
108
  yield list_collection[i-pre if i-pre>=0 else i: i + n]
109
 
110
 
111
  class Svc(object):
112
- def __init__(self, net_g_path, config_path, hubert_model,
113
  device=None,
114
  cluster_model_path="logs/44k/kmeans_10000.pt"):
115
  self.net_g_path = net_g_path
@@ -123,7 +126,7 @@ class Svc(object):
123
  self.hop_size = self.hps_ms.data.hop_length
124
  self.spk2id = self.hps_ms.spk
125
  # 加载hubert
126
- self.hubert_model = hubert_model
127
  self.load_model()
128
  if os.path.exists(cluster_model_path):
129
  self.cluster_model = cluster.get_cluster_model(cluster_model_path)
@@ -142,12 +145,24 @@ class Svc(object):
142
 
143
 
144
 
145
- def get_unit_f0(self, in_path, tran, cluster_infer_ratio, speaker):
 
146
  wav, sr = librosa.load(in_path, sr=self.target_sample)
147
- f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
148
- f0, uv = utils.interpolate_f0(f0)
149
- f0 = torch.FloatTensor(f0)
150
- uv = torch.FloatTensor(uv)
151
  f0 = f0 * 2 ** (tran / 12)
152
  f0 = f0.unsqueeze(0).to(self.dev)
153
  uv = uv.unsqueeze(0).to(self.dev)
@@ -157,7 +172,7 @@ class Svc(object):
157
  c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
158
  c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
159
 
160
- if cluster_infer_ratio != 0:
161
  cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
162
  cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
163
  c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
@@ -168,13 +183,17 @@ class Svc(object):
168
  def infer(self, speaker, tran, raw_path,
169
  cluster_infer_ratio=0,
170
  auto_predict_f0=False,
171
- noice_scale=0.4):
172
  speaker_id = self.spk2id.__dict__.get(speaker)
173
  if not speaker_id and type(speaker) is int:
174
  if len(self.spk2id.__dict__) >= speaker:
175
  speaker_id = speaker
176
  sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
177
- c, f0, uv = self.get_unit_f0(raw_path, tran, cluster_infer_ratio, speaker)
178
  if "half" in self.net_g_path and torch.cuda.is_available():
179
  c = c.half()
180
  with torch.no_grad():
@@ -183,23 +202,35 @@ class Svc(object):
183
  use_time = time.time() - start
184
  print("vits use time:{}".format(use_time))
185
  return audio, audio.shape[-1]
186
-
187
  def clear_empty(self):
188
  # 清理显存
189
  torch.cuda.empty_cache()
190
 
191
- def slice_inference(self, raw_audio_path, spk, tran, slice_db, cluster_infer_ratio, auto_predict_f0, noice_scale,
192
- pad_seconds=0.5, clip_seconds=0, lg_num=0, lgr_num=0.75):
193
  wav_path = raw_audio_path
194
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
195
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
196
- per_size = int(clip_seconds * audio_sr)
197
- lg_size = int(lg_num * audio_sr)
198
- lg_size_r = int(lg_size * lgr_num)
199
- lg_size_c_l = (lg_size - lg_size_r) // 2
200
- lg_size_c_r = lg_size - lg_size_r - lg_size_c_l
201
- lg = np.linspace(0, 1, lg_size_r) if lg_size != 0 else 0
202
-
203
  audio = []
204
  for (slice_tag, data) in audio_data:
205
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
@@ -211,12 +242,12 @@ class Svc(object):
211
  audio.extend(list(pad_array(_audio, length)))
212
  continue
213
  if per_size != 0:
214
- datas = split_list_by_n(data, per_size, lg_size)
215
  else:
216
  datas = [data]
217
- for k, dat in enumerate(datas):
218
- per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds != 0 else length
219
- if clip_seconds != 0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
220
  # padd
221
  pad_len = int(audio_sr * pad_seconds)
222
  dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
@@ -224,25 +255,25 @@ class Svc(object):
224
  soundfile.write(raw_path, dat, audio_sr, format="wav")
225
  raw_path.seek(0)
226
  out_audio, out_sr = self.infer(spk, tran, raw_path,
227
- cluster_infer_ratio=cluster_infer_ratio,
228
- auto_predict_f0=auto_predict_f0,
229
- noice_scale=noice_scale
230
- )
 
231
  _audio = out_audio.cpu().numpy()
232
  pad_len = int(self.target_sample * pad_seconds)
233
  _audio = _audio[pad_len:-pad_len]
234
  _audio = pad_array(_audio, per_length)
235
- if lg_size != 0 and k != 0:
236
- lg1 = audio[-(lg_size_r + lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
237
- lg2 = _audio[lg_size_c_l:lg_size_c_l + lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
238
- lg_pre = lg1 * (1 - lg) + lg2 * lg
239
- audio = audio[0:-(lg_size_r + lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
240
  audio.extend(lg_pre)
241
- _audio = _audio[lg_size_c_l + lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
242
  audio.extend(list(_audio))
243
  return np.array(audio)
244
 
245
-
246
  class RealTimeVC:
247
  def __init__(self):
248
  self.last_chunk = None
@@ -252,14 +283,25 @@ class RealTimeVC:
252
 
253
  """输入输出都是1维numpy 音频波形数组"""
254
 
255
- def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path):
256
  import maad
257
  audio, sr = torchaudio.load(input_wav_path)
258
  audio = audio.cpu().numpy()[0]
259
  temp_wav = io.BytesIO()
260
  if self.last_chunk is None:
261
  input_wav_path.seek(0)
262
- audio, sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path)
263
  audio = audio.cpu().numpy()
264
  self.last_chunk = audio[-self.pre_len:]
265
  self.last_o = audio
@@ -268,7 +310,13 @@ class RealTimeVC:
268
  audio = np.concatenate([self.last_chunk, audio])
269
  soundfile.write(temp_wav, audio, sr, format="wav")
270
  temp_wav.seek(0)
271
- audio, sr = svc_model.infer(speaker_id, f_pitch_change, temp_wav)
272
  audio = audio.cpu().numpy()
273
  ret = maad.util.crossfade(self.last_o, audio, self.pre_len)
274
  self.last_chunk = audio[-self.pre_len:]
 
108
  yield list_collection[i-pre if i-pre>=0 else i: i + n]
109
 
110
 
111
+ class F0FilterException(Exception):
112
+ pass
113
+
114
  class Svc(object):
115
+ def __init__(self, net_g_path, config_path,
116
  device=None,
117
  cluster_model_path="logs/44k/kmeans_10000.pt"):
118
  self.net_g_path = net_g_path
 
126
  self.hop_size = self.hps_ms.data.hop_length
127
  self.spk2id = self.hps_ms.spk
128
  # 加载hubert
129
+ self.hubert_model = utils.get_hubert_model().to(self.dev)
130
  self.load_model()
131
  if os.path.exists(cluster_model_path):
132
  self.cluster_model = cluster.get_cluster_model(cluster_model_path)
 
145
 
146
 
147
 
148
+ def get_unit_f0(self, in_path, tran, cluster_infer_ratio, speaker, f0_filter ,F0_mean_pooling):
149
+
150
  wav, sr = librosa.load(in_path, sr=self.target_sample)
151
+
152
+ if F0_mean_pooling == True:
153
+ f0, uv = utils.compute_f0_uv_torchcrepe(torch.FloatTensor(wav), sampling_rate=self.target_sample, hop_length=self.hop_size,device=self.dev)
154
+ if f0_filter and sum(f0) == 0:
155
+ raise F0FilterException("未检测到人声")
156
+ f0 = torch.FloatTensor(list(f0))
157
+ uv = torch.FloatTensor(list(uv))
158
+ if F0_mean_pooling == False:
159
+ f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
160
+ if f0_filter and sum(f0) == 0:
161
+ raise F0FilterException("未检测到人声")
162
+ f0, uv = utils.interpolate_f0(f0)
163
+ f0 = torch.FloatTensor(f0)
164
+ uv = torch.FloatTensor(uv)
165
+
166
  f0 = f0 * 2 ** (tran / 12)
167
  f0 = f0.unsqueeze(0).to(self.dev)
168
  uv = uv.unsqueeze(0).to(self.dev)
 
172
  c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
173
  c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
174
 
175
+ if cluster_infer_ratio !=0:
176
  cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
177
  cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
178
  c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
 
183
  def infer(self, speaker, tran, raw_path,
184
  cluster_infer_ratio=0,
185
  auto_predict_f0=False,
186
+ noice_scale=0.4,
187
+ f0_filter=False,
188
+ F0_mean_pooling=False
189
+ ):
190
+
191
  speaker_id = self.spk2id.__dict__.get(speaker)
192
  if not speaker_id and type(speaker) is int:
193
  if len(self.spk2id.__dict__) >= speaker:
194
  speaker_id = speaker
195
  sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
196
+ c, f0, uv = self.get_unit_f0(raw_path, tran, cluster_infer_ratio, speaker, f0_filter,F0_mean_pooling)
197
  if "half" in self.net_g_path and torch.cuda.is_available():
198
  c = c.half()
199
  with torch.no_grad():
 
202
  use_time = time.time() - start
203
  print("vits use time:{}".format(use_time))
204
  return audio, audio.shape[-1]
205
+
206
  def clear_empty(self):
207
  # 清理显存
208
  torch.cuda.empty_cache()
209
 
210
+ def slice_inference(self,
211
+ raw_audio_path,
212
+ spk,
213
+ tran,
214
+ slice_db,
215
+ cluster_infer_ratio,
216
+ auto_predict_f0,
217
+ noice_scale,
218
+ pad_seconds=0.5,
219
+ clip_seconds=0,
220
+ lg_num=0,
221
+ lgr_num =0.75,
222
+ F0_mean_pooling = False
223
+ ):
224
  wav_path = raw_audio_path
225
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
226
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
227
+ per_size = int(clip_seconds*audio_sr)
228
+ lg_size = int(lg_num*audio_sr)
229
+ lg_size_r = int(lg_size*lgr_num)
230
+ lg_size_c_l = (lg_size-lg_size_r)//2
231
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
232
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
233
+
234
  audio = []
235
  for (slice_tag, data) in audio_data:
236
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
 
242
  audio.extend(list(pad_array(_audio, length)))
243
  continue
244
  if per_size != 0:
245
+ datas = split_list_by_n(data, per_size,lg_size)
246
  else:
247
  datas = [data]
248
+ for k,dat in enumerate(datas):
249
+ per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
250
+ if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
251
  # padd
252
  pad_len = int(audio_sr * pad_seconds)
253
  dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
 
255
  soundfile.write(raw_path, dat, audio_sr, format="wav")
256
  raw_path.seek(0)
257
  out_audio, out_sr = self.infer(spk, tran, raw_path,
258
+ cluster_infer_ratio=cluster_infer_ratio,
259
+ auto_predict_f0=auto_predict_f0,
260
+ noice_scale=noice_scale,
261
+ F0_mean_pooling = F0_mean_pooling
262
+ )
263
  _audio = out_audio.cpu().numpy()
264
  pad_len = int(self.target_sample * pad_seconds)
265
  _audio = _audio[pad_len:-pad_len]
266
  _audio = pad_array(_audio, per_length)
267
+ if lg_size!=0 and k!=0:
268
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
269
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
270
+ lg_pre = lg1*(1-lg)+lg2*lg
271
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
272
  audio.extend(lg_pre)
273
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
274
  audio.extend(list(_audio))
275
  return np.array(audio)
276
 
 
277
  class RealTimeVC:
278
  def __init__(self):
279
  self.last_chunk = None
 
283
 
284
  """输入输出都是1维numpy 音频波形数组"""
285
 
286
+ def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path,
287
+ cluster_infer_ratio=0,
288
+ auto_predict_f0=False,
289
+ noice_scale=0.4,
290
+ f0_filter=False):
291
+
292
  import maad
293
  audio, sr = torchaudio.load(input_wav_path)
294
  audio = audio.cpu().numpy()[0]
295
  temp_wav = io.BytesIO()
296
  if self.last_chunk is None:
297
  input_wav_path.seek(0)
298
+
299
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path,
300
+ cluster_infer_ratio=cluster_infer_ratio,
301
+ auto_predict_f0=auto_predict_f0,
302
+ noice_scale=noice_scale,
303
+ f0_filter=f0_filter)
304
+
305
  audio = audio.cpu().numpy()
306
  self.last_chunk = audio[-self.pre_len:]
307
  self.last_o = audio
 
310
  audio = np.concatenate([self.last_chunk, audio])
311
  soundfile.write(temp_wav, audio, sr, format="wav")
312
  temp_wav.seek(0)
313
+
314
+ audio, sr = svc_model.infer(speaker_id, f_pitch_change, temp_wav,
315
+ cluster_infer_ratio=cluster_infer_ratio,
316
+ auto_predict_f0=auto_predict_f0,
317
+ noice_scale=noice_scale,
318
+ f0_filter=f0_filter)
319
+
320
  audio = audio.cpu().numpy()
321
  ret = maad.util.crossfade(self.last_o, audio, self.pre_len)
322
  self.last_chunk = audio[-self.pre_len:]
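To summarize the infer_tool.py refactor above from a caller's point of view: Svc now builds its own HuBERT encoder (the hubert_model constructor argument is gone), and slice_inference/infer gained an F0_mean_pooling switch for torchcrepe-based F0 with mean pooling, plus an f0_filter option on infer that raises F0FilterException when no voiced frames are detected. A hedged usage sketch with placeholder paths and speaker name:

import soundfile
from inference.infer_tool import Svc

# Placeholder model directory; Svc loads HuBERT internally now.
model = Svc("models/example/example.pth", "models/example/config.json", device="cpu")

audio = model.slice_inference(
    "raw/input.wav", "example_speaker", 0, -40,   # path, speaker, transpose (semitones), slice_db
    cluster_infer_ratio=0, auto_predict_f0=False, noice_scale=0.4,
    pad_seconds=0.5, clip_seconds=0, lg_num=0, lgr_num=0.75,
    F0_mean_pooling=True,   # new flag: torchcrepe F0 with mean pooling
)
soundfile.write("results/example.flac", audio, model.target_sample, format="flac")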
inference_main.py CHANGED
@@ -23,17 +23,19 @@ def main():
23
  parser = argparse.ArgumentParser(description='sovits4 inference')
24
 
25
  # 一定要设置的部分
26
- parser.add_argument('-m', '--model_path', type=str, default="/Volumes/Extend/下载/G_20800.pth", help='模型路径')
27
  parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
28
- parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src"], help='wav文件名列表,放在raw文件夹下')
 
29
  parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
30
- parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nyaru'], help='合成目标说话人名称')
31
 
32
  # 可选项部分
33
- parser.add_argument('-a', '--auto_predict_f0', action='store_true', default=False,
34
- help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
35
- parser.add_argument('-cm', '--cluster_model_path', type=str, default="/Volumes/Extend/下载/so-vits-svc-4.0/logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
36
- parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=1, help='聚类方案占比,范围0-1,若没有训练聚类模型则填0即可')
 
37
 
38
  # 不用动的部分
39
  parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40,嘈杂的音频可以-30,干声保留呼吸可以-50')
@@ -41,6 +43,7 @@ def main():
41
  parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
42
  parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数,由于未知原因开头结尾会有异响,pad一小段静音段后就不会出现')
43
  parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
 
44
 
45
  args = parser.parse_args()
46
 
@@ -55,6 +58,10 @@ def main():
55
  cluster_infer_ratio = args.cluster_infer_ratio
56
  noice_scale = args.noice_scale
57
  pad_seconds = args.pad_seconds
58
 
59
  infer_tool.fill_a_to_b(trans, clean_names)
60
  for clean_name, tran in zip(clean_names, trans):
@@ -65,35 +72,58 @@ def main():
65
  wav_path = Path(raw_audio_path).with_suffix('.wav')
66
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
67
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
68
 
69
  for spk in spk_list:
70
  audio = []
71
  for (slice_tag, data) in audio_data:
72
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
73
- # padd
74
- pad_len = int(audio_sr * pad_seconds)
75
- data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
76
  length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
77
- raw_path = io.BytesIO()
78
- soundfile.write(raw_path, data, audio_sr, format="wav")
79
- raw_path.seek(0)
80
  if slice_tag:
81
  print('jump empty segment')
82
  _audio = np.zeros(length)
83
  else:
84
  out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
85
  cluster_infer_ratio=cluster_infer_ratio,
86
  auto_predict_f0=auto_predict_f0,
87
- noice_scale=noice_scale
 
88
  )
89
  _audio = out_audio.cpu().numpy()
90
-
91
- pad_len = int(svc_model.target_sample * pad_seconds)
92
- _audio = _audio[pad_len:-pad_len]
93
- audio.extend(list(_audio))
  key = "auto" if auto_predict_f0 else f"{tran}key"
95
  cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
96
- res_path = f'./results/old——{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'
97
  soundfile.write(res_path, audio, svc_model.target_sample, format=wav_format)
98
 
99
  if __name__ == '__main__':
 
23
  parser = argparse.ArgumentParser(description='sovits4 inference')
24
 
25
  # 一定要设置的部分
26
+ parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='模型路径')
27
  parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='配置文件路径')
28
+ parser.add_argument('-cl', '--clip', type=float, default=0, help='音频强制切片,默认0为自动切片,单位为秒/s')
29
+ parser.add_argument('-n', '--clean_names', type=str, nargs='+', default=["君の知らない物語-src.wav"], help='wav文件名列表,放在raw文件夹下')
30
  parser.add_argument('-t', '--trans', type=int, nargs='+', default=[0], help='音高调整,支持正负(半音)')
31
+ parser.add_argument('-s', '--spk_list', type=str, nargs='+', default=['nen'], help='合成目标说话人名称')
32
 
33
  # 可选项部分
34
+ parser.add_argument('-a', '--auto_predict_f0', action='store_true', default=False,help='语音转换自动预测音高,转换歌声时不要打开这个会严重跑调')
35
+ parser.add_argument('-cm', '--cluster_model_path', type=str, default="logs/44k/kmeans_10000.pt", help='聚类模型路径,如果没有训练聚类则随便填')
36
+ parser.add_argument('-cr', '--cluster_infer_ratio', type=float, default=0, help='聚类方案占比,范围0-1,若没有训练聚类模型则默认0即可')
37
+ parser.add_argument('-lg', '--linear_gradient', type=float, default=0, help='两段音频切片的交叉淡入长度,如果强制切片后出现人声不连贯可调整该数值,如果连贯建议采用默认值0,单位为秒')
38
+ parser.add_argument('-fmp', '--f0_mean_pooling', type=bool, default=False, help='是否对F0使用均值滤波器(池化),对部分哑音有改善。注意,启动该选项会导致推理速度下降,默认关闭')
39
 
40
  # 不用动的部分
41
  parser.add_argument('-sd', '--slice_db', type=int, default=-40, help='默认-40,嘈杂的音频可以-30,干声保留呼吸可以-50')
 
43
  parser.add_argument('-ns', '--noice_scale', type=float, default=0.4, help='噪音级别,会影响咬字和音质,较为玄学')
44
  parser.add_argument('-p', '--pad_seconds', type=float, default=0.5, help='推理音频pad秒数,由于未知原因开头结尾会有异响,pad一小段静音段后就不会出现')
45
  parser.add_argument('-wf', '--wav_format', type=str, default='flac', help='音频输出格式')
46
+ parser.add_argument('-lgr', '--linear_gradient_retain', type=float, default=0.75, help='自动音频切片后,需要舍弃每段切片的头尾。该参数设置交叉长度保留的比例,范围0-1,左开右闭')
47
 
48
  args = parser.parse_args()
49
 
 
58
  cluster_infer_ratio = args.cluster_infer_ratio
59
  noice_scale = args.noice_scale
60
  pad_seconds = args.pad_seconds
61
+ clip = args.clip
62
+ lg = args.linear_gradient
63
+ lgr = args.linear_gradient_retain
64
+ F0_mean_pooling = args.f0_mean_pooling
65
 
66
  infer_tool.fill_a_to_b(trans, clean_names)
67
  for clean_name, tran in zip(clean_names, trans):
 
72
  wav_path = Path(raw_audio_path).with_suffix('.wav')
73
  chunks = slicer.cut(wav_path, db_thresh=slice_db)
74
  audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
75
+ per_size = int(clip*audio_sr)
76
+ lg_size = int(lg*audio_sr)
77
+ lg_size_r = int(lg_size*lgr)
78
+ lg_size_c_l = (lg_size-lg_size_r)//2
79
+ lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
80
+ lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
81
 
82
  for spk in spk_list:
83
  audio = []
84
  for (slice_tag, data) in audio_data:
85
  print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
86
+
 
 
87
  length = int(np.ceil(len(data) / audio_sr * svc_model.target_sample))
 
  if slice_tag:
89
  print('jump empty segment')
90
  _audio = np.zeros(length)
91
+ audio.extend(list(infer_tool.pad_array(_audio, length)))
92
+ continue
93
+ if per_size != 0:
94
+ datas = infer_tool.split_list_by_n(data, per_size,lg_size)
95
  else:
96
+ datas = [data]
97
+ for k,dat in enumerate(datas):
98
+ per_length = int(np.ceil(len(dat) / audio_sr * svc_model.target_sample)) if clip!=0 else length
99
+ if clip!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
100
+ # padd
101
+ pad_len = int(audio_sr * pad_seconds)
102
+ dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
103
+ raw_path = io.BytesIO()
104
+ soundfile.write(raw_path, dat, audio_sr, format="wav")
105
+ raw_path.seek(0)
106
  out_audio, out_sr = svc_model.infer(spk, tran, raw_path,
107
  cluster_infer_ratio=cluster_infer_ratio,
108
  auto_predict_f0=auto_predict_f0,
109
+ noice_scale=noice_scale,
110
+ F0_mean_pooling = F0_mean_pooling
111
  )
112
  _audio = out_audio.cpu().numpy()
113
+ pad_len = int(svc_model.target_sample * pad_seconds)
114
+ _audio = _audio[pad_len:-pad_len]
115
+ _audio = infer_tool.pad_array(_audio, per_length)
116
+ if lg_size!=0 and k!=0:
117
+ lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr != 1 else audio[-lg_size:]
118
+ lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr != 1 else _audio[0:lg_size]
119
+ lg_pre = lg1*(1-lg)+lg2*lg
120
+ audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr != 1 else audio[0:-lg_size]
121
+ audio.extend(lg_pre)
122
+ _audio = _audio[lg_size_c_l+lg_size_r:] if lgr != 1 else _audio[lg_size:]
123
+ audio.extend(list(_audio))
124
  key = "auto" if auto_predict_f0 else f"{tran}key"
125
  cluster_name = "" if cluster_infer_ratio == 0 else f"_{cluster_infer_ratio}"
126
+ res_path = f'./results/{clean_name}_{key}_{spk}{cluster_name}.{wav_format}'
127
  soundfile.write(res_path, audio, svc_model.target_sample, format=wav_format)
128
 
129
  if __name__ == '__main__':
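
The added linear_gradient / linear_gradient_retain logic above blends the overlapping region of consecutive clips with a linear ramp (lg_pre = lg1*(1-lg)+lg2*lg). A minimal standalone sketch of that crossfade, with illustrative names that are not taken from the repository:

    import numpy as np

    def linear_crossfade(prev_tail, next_head):
        # prev_tail / next_head: equally long 1-D slices taken from the overlap
        # region of two consecutive clips; the blend weight ramps from 0 to 1.
        w = np.linspace(0, 1, len(prev_tail))
        return prev_tail * (1 - w) + next_head * w

    # usage: blended = linear_crossfade(audio[-overlap:], _audio[:overlap])
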
onnx/model_onnx.py DELETED
@@ -1,328 +0,0 @@
- import copy
- import math
- import torch
- from torch import nn
- from torch.nn import functional as F
-
- import modules.attentions as attentions
- import modules.commons as commons
- import modules.modules as modules
-
- from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
- from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
- from modules.commons import init_weights, get_padding
- from vdecoder.hifigan.models import Generator
- from utils import f0_to_coarse
-
- class ResidualCouplingBlock(nn.Module):
-     def __init__(self,
-                  channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  n_flows=4,
-                  gin_channels=0):
-         super().__init__()
-         self.channels = channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.n_flows = n_flows
-         self.gin_channels = gin_channels
-
-         self.flows = nn.ModuleList()
-         for i in range(n_flows):
-             self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
-             self.flows.append(modules.Flip())
-
-     def forward(self, x, x_mask, g=None, reverse=False):
-         if not reverse:
-             for flow in self.flows:
-                 x, _ = flow(x, x_mask, g=g, reverse=reverse)
-         else:
-             for flow in reversed(self.flows):
-                 x = flow(x, x_mask, g=g, reverse=reverse)
-         return x
-
-
- class Encoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-
-     def forward(self, x, x_lengths, g=None):
-         # print(x.shape,x_lengths.shape)
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = self.enc(x, x_mask, g=g)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-         return z, m, logs, x_mask
-
-
- class TextEncoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0,
-                  filter_channels=None,
-                  n_heads=None,
-                  p_dropout=None):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-         self.f0_emb = nn.Embedding(256, hidden_channels)
-
-         self.enc_ = attentions.Encoder(
-             hidden_channels,
-             filter_channels,
-             n_heads,
-             n_layers,
-             kernel_size,
-             p_dropout)
-
-     def forward(self, x, x_lengths, f0=None):
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = x + self.f0_emb(f0.long()).transpose(1,2)
-         x = self.enc_(x * x_mask, x_mask)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-
-         return z, m, logs, x_mask
-
-
- class DiscriminatorP(torch.nn.Module):
-     def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
-         super(DiscriminatorP, self).__init__()
-         self.period = period
-         self.use_spectral_norm = use_spectral_norm
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
-         ])
-         self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
-
-     def forward(self, x):
-         fmap = []
-
-         # 1d to 2d
-         b, c, t = x.shape
-         if t % self.period != 0: # pad first
-             n_pad = self.period - (t % self.period)
-             x = F.pad(x, (0, n_pad), "reflect")
-             t = t + n_pad
-         x = x.view(b, c, t // self.period, self.period)
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class DiscriminatorS(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(DiscriminatorS, self).__init__()
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv1d(1, 16, 15, 1, padding=7)),
-             norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
-             norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
-             norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
-             norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
-             norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
-         ])
-         self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
-
-     def forward(self, x):
-         fmap = []
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class MultiPeriodDiscriminator(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(MultiPeriodDiscriminator, self).__init__()
-         periods = [2,3,5,7,11]
-
-         discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
-         discs = discs + [DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods]
-         self.discriminators = nn.ModuleList(discs)
-
-     def forward(self, y, y_hat):
-         y_d_rs = []
-         y_d_gs = []
-         fmap_rs = []
-         fmap_gs = []
-         for i, d in enumerate(self.discriminators):
-             y_d_r, fmap_r = d(y)
-             y_d_g, fmap_g = d(y_hat)
-             y_d_rs.append(y_d_r)
-             y_d_gs.append(y_d_g)
-             fmap_rs.append(fmap_r)
-             fmap_gs.append(fmap_g)
-
-         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
- class SpeakerEncoder(torch.nn.Module):
-     def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
-         super(SpeakerEncoder, self).__init__()
-         self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)
-         self.linear = nn.Linear(model_hidden_size, model_embedding_size)
-         self.relu = nn.ReLU()
-
-     def forward(self, mels):
-         self.lstm.flatten_parameters()
-         _, (hidden, _) = self.lstm(mels)
-         embeds_raw = self.relu(self.linear(hidden[-1]))
-         return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
-
-     def compute_partial_slices(self, total_frames, partial_frames, partial_hop):
-         mel_slices = []
-         for i in range(0, total_frames-partial_frames, partial_hop):
-             mel_range = torch.arange(i, i+partial_frames)
-             mel_slices.append(mel_range)
-
-         return mel_slices
-
-     def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
-         mel_len = mel.size(1)
-         last_mel = mel[:,-partial_frames:]
-
-         if mel_len > partial_frames:
-             mel_slices = self.compute_partial_slices(mel_len, partial_frames, partial_hop)
-             mels = list(mel[:,s] for s in mel_slices)
-             mels.append(last_mel)
-             mels = torch.stack(tuple(mels), 0).squeeze(1)
-
-             with torch.no_grad():
-                 partial_embeds = self(mels)
-             embed = torch.mean(partial_embeds, axis=0).unsqueeze(0)
-             #embed = embed / torch.linalg.norm(embed, 2)
-         else:
-             with torch.no_grad():
-                 embed = self(last_mel)
-
-         return embed
-
-
- class SynthesizerTrn(nn.Module):
-     """
-     Synthesizer for Training
-     """
-
-     def __init__(self,
-                  spec_channels,
-                  segment_size,
-                  inter_channels,
-                  hidden_channels,
-                  filter_channels,
-                  n_heads,
-                  n_layers,
-                  kernel_size,
-                  p_dropout,
-                  resblock,
-                  resblock_kernel_sizes,
-                  resblock_dilation_sizes,
-                  upsample_rates,
-                  upsample_initial_channel,
-                  upsample_kernel_sizes,
-                  gin_channels,
-                  ssl_dim,
-                  n_speakers,
-                  **kwargs):
-
-         super().__init__()
-         self.spec_channels = spec_channels
-         self.inter_channels = inter_channels
-         self.hidden_channels = hidden_channels
-         self.filter_channels = filter_channels
-         self.n_heads = n_heads
-         self.n_layers = n_layers
-         self.kernel_size = kernel_size
-         self.p_dropout = p_dropout
-         self.resblock = resblock
-         self.resblock_kernel_sizes = resblock_kernel_sizes
-         self.resblock_dilation_sizes = resblock_dilation_sizes
-         self.upsample_rates = upsample_rates
-         self.upsample_initial_channel = upsample_initial_channel
-         self.upsample_kernel_sizes = upsample_kernel_sizes
-         self.segment_size = segment_size
-         self.gin_channels = gin_channels
-         self.ssl_dim = ssl_dim
-         self.emb_g = nn.Embedding(n_speakers, gin_channels)
-
-         self.enc_p_ = TextEncoder(ssl_dim, inter_channels, hidden_channels, 5, 1, 16, 0, filter_channels, n_heads, p_dropout)
-         hps = {
-             "sampling_rate": 32000,
-             "inter_channels": 192,
-             "resblock": "1",
-             "resblock_kernel_sizes": [3, 7, 11],
-             "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-             "upsample_rates": [10, 8, 2, 2],
-             "upsample_initial_channel": 512,
-             "upsample_kernel_sizes": [16, 16, 4, 4],
-             "gin_channels": 256,
-         }
-         self.dec = Generator(h=hps)
-         self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
-         self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
-
-     def forward(self, c, c_lengths, f0, g=None):
-         g = self.emb_g(g.unsqueeze(0)).transpose(1,2)
-         z_p, m_p, logs_p, c_mask = self.enc_p_(c.transpose(1,2), c_lengths, f0=f0_to_coarse(f0))
-         z = self.flow(z_p, c_mask, g=g, reverse=True)
-         o = self.dec(z * c_mask, g=g, f0=f0.float())
-         return o
-
 
onnx/model_onnx_48k.py DELETED
@@ -1,328 +0,0 @@
- import copy
- import math
- import torch
- from torch import nn
- from torch.nn import functional as F
-
- import modules.attentions as attentions
- import modules.commons as commons
- import modules.modules as modules
-
- from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
- from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
- from modules.commons import init_weights, get_padding
- from vdecoder.hifigan.models import Generator
- from utils import f0_to_coarse
-
- class ResidualCouplingBlock(nn.Module):
-     def __init__(self,
-                  channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  n_flows=4,
-                  gin_channels=0):
-         super().__init__()
-         self.channels = channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.n_flows = n_flows
-         self.gin_channels = gin_channels
-
-         self.flows = nn.ModuleList()
-         for i in range(n_flows):
-             self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
-             self.flows.append(modules.Flip())
-
-     def forward(self, x, x_mask, g=None, reverse=False):
-         if not reverse:
-             for flow in self.flows:
-                 x, _ = flow(x, x_mask, g=g, reverse=reverse)
-         else:
-             for flow in reversed(self.flows):
-                 x = flow(x, x_mask, g=g, reverse=reverse)
-         return x
-
-
- class Encoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-
-     def forward(self, x, x_lengths, g=None):
-         # print(x.shape,x_lengths.shape)
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = self.enc(x, x_mask, g=g)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-         return z, m, logs, x_mask
-
-
- class TextEncoder(nn.Module):
-     def __init__(self,
-                  in_channels,
-                  out_channels,
-                  hidden_channels,
-                  kernel_size,
-                  dilation_rate,
-                  n_layers,
-                  gin_channels=0,
-                  filter_channels=None,
-                  n_heads=None,
-                  p_dropout=None):
-         super().__init__()
-         self.in_channels = in_channels
-         self.out_channels = out_channels
-         self.hidden_channels = hidden_channels
-         self.kernel_size = kernel_size
-         self.dilation_rate = dilation_rate
-         self.n_layers = n_layers
-         self.gin_channels = gin_channels
-         self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
-         self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
-         self.f0_emb = nn.Embedding(256, hidden_channels)
-
-         self.enc_ = attentions.Encoder(
-             hidden_channels,
-             filter_channels,
-             n_heads,
-             n_layers,
-             kernel_size,
-             p_dropout)
-
-     def forward(self, x, x_lengths, f0=None):
-         x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
-         x = self.pre(x) * x_mask
-         x = x + self.f0_emb(f0.long()).transpose(1,2)
-         x = self.enc_(x * x_mask, x_mask)
-         stats = self.proj(x) * x_mask
-         m, logs = torch.split(stats, self.out_channels, dim=1)
-         z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
-
-         return z, m, logs, x_mask
-
-
- class DiscriminatorP(torch.nn.Module):
-     def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
-         super(DiscriminatorP, self).__init__()
-         self.period = period
-         self.use_spectral_norm = use_spectral_norm
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv2d(1, 32, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(32, 128, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(128, 512, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(512, 1024, (kernel_size, 1), (stride, 1), padding=(get_padding(kernel_size, 1), 0))),
-             norm_f(Conv2d(1024, 1024, (kernel_size, 1), 1, padding=(get_padding(kernel_size, 1), 0))),
-         ])
-         self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
-
-     def forward(self, x):
-         fmap = []
-
-         # 1d to 2d
-         b, c, t = x.shape
-         if t % self.period != 0: # pad first
-             n_pad = self.period - (t % self.period)
-             x = F.pad(x, (0, n_pad), "reflect")
-             t = t + n_pad
-         x = x.view(b, c, t // self.period, self.period)
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class DiscriminatorS(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(DiscriminatorS, self).__init__()
-         norm_f = weight_norm if use_spectral_norm == False else spectral_norm
-         self.convs = nn.ModuleList([
-             norm_f(Conv1d(1, 16, 15, 1, padding=7)),
-             norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
-             norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
-             norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
-             norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
-             norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
-         ])
-         self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
-
-     def forward(self, x):
-         fmap = []
-
-         for l in self.convs:
-             x = l(x)
-             x = F.leaky_relu(x, modules.LRELU_SLOPE)
-             fmap.append(x)
-         x = self.conv_post(x)
-         fmap.append(x)
-         x = torch.flatten(x, 1, -1)
-
-         return x, fmap
-
-
- class MultiPeriodDiscriminator(torch.nn.Module):
-     def __init__(self, use_spectral_norm=False):
-         super(MultiPeriodDiscriminator, self).__init__()
-         periods = [2,3,5,7,11]
-
-         discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
-         discs = discs + [DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods]
-         self.discriminators = nn.ModuleList(discs)
-
-     def forward(self, y, y_hat):
-         y_d_rs = []
-         y_d_gs = []
-         fmap_rs = []
-         fmap_gs = []
-         for i, d in enumerate(self.discriminators):
-             y_d_r, fmap_r = d(y)
-             y_d_g, fmap_g = d(y_hat)
-             y_d_rs.append(y_d_r)
-             y_d_gs.append(y_d_g)
-             fmap_rs.append(fmap_r)
-             fmap_gs.append(fmap_g)
-
-         return y_d_rs, y_d_gs, fmap_rs, fmap_gs
-
-
- class SpeakerEncoder(torch.nn.Module):
-     def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
-         super(SpeakerEncoder, self).__init__()
-         self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)
-         self.linear = nn.Linear(model_hidden_size, model_embedding_size)
-         self.relu = nn.ReLU()
-
-     def forward(self, mels):
-         self.lstm.flatten_parameters()
-         _, (hidden, _) = self.lstm(mels)
-         embeds_raw = self.relu(self.linear(hidden[-1]))
-         return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
-
-     def compute_partial_slices(self, total_frames, partial_frames, partial_hop):
-         mel_slices = []
-         for i in range(0, total_frames-partial_frames, partial_hop):
-             mel_range = torch.arange(i, i+partial_frames)
-             mel_slices.append(mel_range)
-
-         return mel_slices
-
-     def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
-         mel_len = mel.size(1)
-         last_mel = mel[:,-partial_frames:]
-
-         if mel_len > partial_frames:
-             mel_slices = self.compute_partial_slices(mel_len, partial_frames, partial_hop)
-             mels = list(mel[:,s] for s in mel_slices)
-             mels.append(last_mel)
-             mels = torch.stack(tuple(mels), 0).squeeze(1)
-
-             with torch.no_grad():
-                 partial_embeds = self(mels)
-             embed = torch.mean(partial_embeds, axis=0).unsqueeze(0)
-             #embed = embed / torch.linalg.norm(embed, 2)
-         else:
-             with torch.no_grad():
-                 embed = self(last_mel)
-
-         return embed
-
-
- class SynthesizerTrn(nn.Module):
-     """
-     Synthesizer for Training
-     """
-
-     def __init__(self,
-                  spec_channels,
-                  segment_size,
-                  inter_channels,
-                  hidden_channels,
-                  filter_channels,
-                  n_heads,
-                  n_layers,
-                  kernel_size,
-                  p_dropout,
-                  resblock,
-                  resblock_kernel_sizes,
-                  resblock_dilation_sizes,
-                  upsample_rates,
-                  upsample_initial_channel,
-                  upsample_kernel_sizes,
-                  gin_channels,
-                  ssl_dim,
-                  n_speakers,
-                  **kwargs):
-
-         super().__init__()
-         self.spec_channels = spec_channels
-         self.inter_channels = inter_channels
-         self.hidden_channels = hidden_channels
-         self.filter_channels = filter_channels
-         self.n_heads = n_heads
-         self.n_layers = n_layers
-         self.kernel_size = kernel_size
-         self.p_dropout = p_dropout
-         self.resblock = resblock
-         self.resblock_kernel_sizes = resblock_kernel_sizes
-         self.resblock_dilation_sizes = resblock_dilation_sizes
-         self.upsample_rates = upsample_rates
-         self.upsample_initial_channel = upsample_initial_channel
-         self.upsample_kernel_sizes = upsample_kernel_sizes
-         self.segment_size = segment_size
-         self.gin_channels = gin_channels
-         self.ssl_dim = ssl_dim
-         self.emb_g = nn.Embedding(n_speakers, gin_channels)
-
-         self.enc_p_ = TextEncoder(ssl_dim, inter_channels, hidden_channels, 5, 1, 16, 0, filter_channels, n_heads, p_dropout)
-         hps = {
-             "sampling_rate": 48000,
-             "inter_channels": 192,
-             "resblock": "1",
-             "resblock_kernel_sizes": [3, 7, 11],
-             "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
-             "upsample_rates": [10, 8, 2, 2],
-             "upsample_initial_channel": 512,
-             "upsample_kernel_sizes": [16, 16, 4, 4],
-             "gin_channels": 256,
-         }
-         self.dec = Generator(h=hps)
-         self.enc_q = Encoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
-         self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
-
-     def forward(self, c, c_lengths, f0, g=None):
-         g = self.emb_g(g.unsqueeze(0)).transpose(1,2)
-         z_p, m_p, logs_p, c_mask = self.enc_p_(c.transpose(1,2), c_lengths, f0=f0_to_coarse(f0))
-         z = self.flow(z_p, c_mask, g=g, reverse=True)
-         o = self.dec(z * c_mask, g=g, f0=f0.float())
-         return o
-
 
onnx/onnx_export.py DELETED
@@ -1,73 +0,0 @@
- import argparse
- import time
- import numpy as np
- import onnx
- from onnxsim import simplify
- import onnxruntime as ort
- import onnxoptimizer
- import torch
- from model_onnx import SynthesizerTrn
- import utils
- from hubert import hubert_model_onnx
-
- def main(HubertExport,NetExport):
-
-     path = "NyaruTaffy"
-
-     if(HubertExport):
-         device = torch.device("cuda")
-         hubert_soft = utils.get_hubert_model()
-         test_input = torch.rand(1, 1, 16000)
-         input_names = ["source"]
-         output_names = ["embed"]
-         torch.onnx.export(hubert_soft.to(device),
-                           test_input.to(device),
-                           "hubert3.0.onnx",
-                           dynamic_axes={
-                               "source": {
-                                   2: "sample_length"
-                               }
-                           },
-                           verbose=False,
-                           opset_version=13,
-                           input_names=input_names,
-                           output_names=output_names)
-     if(NetExport):
-         device = torch.device("cuda")
-         hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
-         SVCVITS = SynthesizerTrn(
-             hps.data.filter_length // 2 + 1,
-             hps.train.segment_size // hps.data.hop_length,
-             **hps.model)
-         _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", SVCVITS, None)
-         _ = SVCVITS.eval().to(device)
-         for i in SVCVITS.parameters():
-             i.requires_grad = False
-         test_hidden_unit = torch.rand(1, 50, 256)
-         test_lengths = torch.LongTensor([50])
-         test_pitch = torch.rand(1, 50)
-         test_sid = torch.LongTensor([0])
-         input_names = ["hidden_unit", "lengths", "pitch", "sid"]
-         output_names = ["audio", ]
-         SVCVITS.eval()
-         torch.onnx.export(SVCVITS,
-                           (
-                               test_hidden_unit.to(device),
-                               test_lengths.to(device),
-                               test_pitch.to(device),
-                               test_sid.to(device)
-                           ),
-                           f"checkpoints/{path}/model.onnx",
-                           dynamic_axes={
-                               "hidden_unit": [0, 1],
-                               "pitch": [1]
-                           },
-                           do_constant_folding=False,
-                           opset_version=16,
-                           verbose=False,
-                           input_names=input_names,
-                           output_names=output_names)
-
-
- if __name__ == '__main__':
-     main(False,True)
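
The removed script drives torch.onnx.export with dynamic_axes so that the hidden-unit and pitch lengths stay dynamic in the exported graph. A minimal sketch of the same mechanism on a stand-in module (the toy model and output file name are illustrative only, not part of the repository):

    import torch
    from torch import nn

    toy = nn.Linear(256, 1).eval()      # stand-in for the real SynthesizerTrn
    dummy = torch.rand(1, 50, 256)      # (batch, frames, ssl_dim)
    torch.onnx.export(toy, (dummy,), "toy.onnx",
                      input_names=["hidden_unit"],
                      output_names=["audio"],
                      dynamic_axes={"hidden_unit": [0, 1]},  # batch and frame axes stay dynamic
                      opset_version=16)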
 
onnx/onnx_export_48k.py DELETED
@@ -1,73 +0,0 @@
- import argparse
- import time
- import numpy as np
- import onnx
- from onnxsim import simplify
- import onnxruntime as ort
- import onnxoptimizer
- import torch
- from model_onnx_48k import SynthesizerTrn
- import utils
- from hubert import hubert_model_onnx
-
- def main(HubertExport,NetExport):
-
-     path = "NyaruTaffy"
-
-     if(HubertExport):
-         device = torch.device("cuda")
-         hubert_soft = hubert_model_onnx.hubert_soft("hubert/model.pt")
-         test_input = torch.rand(1, 1, 16000)
-         input_names = ["source"]
-         output_names = ["embed"]
-         torch.onnx.export(hubert_soft.to(device),
-                           test_input.to(device),
-                           "hubert3.0.onnx",
-                           dynamic_axes={
-                               "source": {
-                                   2: "sample_length"
-                               }
-                           },
-                           verbose=False,
-                           opset_version=13,
-                           input_names=input_names,
-                           output_names=output_names)
-     if(NetExport):
-         device = torch.device("cuda")
-         hps = utils.get_hparams_from_file(f"checkpoints/{path}/config.json")
-         SVCVITS = SynthesizerTrn(
-             hps.data.filter_length // 2 + 1,
-             hps.train.segment_size // hps.data.hop_length,
-             **hps.model)
-         _ = utils.load_checkpoint(f"checkpoints/{path}/model.pth", SVCVITS, None)
-         _ = SVCVITS.eval().to(device)
-         for i in SVCVITS.parameters():
-             i.requires_grad = False
-         test_hidden_unit = torch.rand(1, 50, 256)
-         test_lengths = torch.LongTensor([50])
-         test_pitch = torch.rand(1, 50)
-         test_sid = torch.LongTensor([0])
-         input_names = ["hidden_unit", "lengths", "pitch", "sid"]
-         output_names = ["audio", ]
-         SVCVITS.eval()
-         torch.onnx.export(SVCVITS,
-                           (
-                               test_hidden_unit.to(device),
-                               test_lengths.to(device),
-                               test_pitch.to(device),
-                               test_sid.to(device)
-                           ),
-                           f"checkpoints/{path}/model.onnx",
-                           dynamic_axes={
-                               "hidden_unit": [0, 1],
-                               "pitch": [1]
-                           },
-                           do_constant_folding=False,
-                           opset_version=16,
-                           verbose=False,
-                           input_names=input_names,
-                           output_names=output_names)
-
-
- if __name__ == '__main__':
-     main(False,True)
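
A model exported by either script could then be run with onnxruntime using the same input names; a rough sketch, with shapes and the checkpoint path purely illustrative:

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("checkpoints/NyaruTaffy/model.onnx")
    feeds = {
        "hidden_unit": np.random.rand(1, 50, 256).astype(np.float32),
        "lengths": np.array([50], dtype=np.int64),
        "pitch": np.random.rand(1, 50).astype(np.float32),
        "sid": np.array([0], dtype=np.int64),
    }
    audio = sess.run(["audio"], feeds)[0]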
 
requirements.txt CHANGED
@@ -19,3 +19,4 @@ onnxsim
  onnxoptimizer
  fairseq
  librosa
+ edge-tts
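
The new edge-tts dependency backs the TTS feature this commit pulls in from upstream; a minimal sketch of generating speech with it (the voice name and output path are illustrative):

    import asyncio
    import edge_tts

    async def synth(text, out_path="tts_out.mp3"):
        # Synthesize with a Microsoft Edge neural voice and save to disk.
        await edge_tts.Communicate(text, "en-US-AriaNeural").save(out_path)

    asyncio.run(synth("hello world"))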
vdecoder/__pycache__/__init__.cpython-38.pyc CHANGED
Binary files a/vdecoder/__pycache__/__init__.cpython-38.pyc and b/vdecoder/__pycache__/__init__.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/env.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/env.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/env.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/models.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/models.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/models.cpython-38.pyc differ
 
vdecoder/hifigan/__pycache__/utils.cpython-38.pyc CHANGED
Binary files a/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc and b/vdecoder/hifigan/__pycache__/utils.cpython-38.pyc differ