# Umamusume DeBERTA-VITS2 TTS
---------------
📅 2023.10.19 📅
- Updated the Generator to the 180K-step checkpoint
------------------
👌 **Currently, ONLY Japanese is supported.** 👌
💪 **Based on [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2), this work closely follows [Akito/umamusume_bert_vits2](https://huggingface.co/spaces/AkitoP/umamusume_bert_vits2), which provides the Japanese text preprocessor.** ❤
---------------
## Instruction for use | ไฝฟ็”จ่ฏดๆ˜Ž | ไฝฟ็”จใซใคใ„ใฆใฎ่ชฌๆ˜Ž
✋ **Please do NOT enter extremely long text on a single row; the model treats each row as one sentence for inference. Where it does not break semantic continuity, split your input into multiple rows so that each row is inferred separately, reducing inference time. Please remove completely empty rows, which produce odd sounds at the corresponding positions in the generated audio.** ✋
โœ‹ **่ฏทไธ่ฆๅœจไธ€่กŒๅ†…่พ“ๅ…ฅ่ถ…้•ฟๆ–‡ๆœฌ๏ผŒๆจกๅž‹ไผšๅฐ†ๆฏ่กŒ็š„่พ“ๅ…ฅ่ง†ไธบไธ€ๅฅ่ฏ่ฟ›่กŒๆŽจ็†ใ€‚ๅœจไธๅฝฑๅ“่ฏญๆ„่ฟž่ดฏ็š„ๆƒ…ๅ†ตไธ‹๏ผŒ่ฏทๅฐ†ๅคšๅฅ่ฏๅˆ†ๅˆซๆ”พๅ…ฅไธๅŒ็š„่กŒไธญๆฅๅ‡ๅฐ‘ๆŽจ็†ๆ—ถ้—ดใ€‚่ฏทๅˆ ้™ค่พ“ๅ…ฅไธญ็š„็ฉบ็™ฝ่กŒ๏ผŒ่ฟ™ไผšๅฏผ่‡ดๅœจ็”Ÿๆˆ็š„่ฏญ้Ÿณ็š„ๅฏนๅบ”ไฝ็ฝฎไธญไบง็”Ÿๅฅ‡ๆ€ช็š„ๅฃฐ้Ÿณใ€‚** โœ‹
โœ‹ **้•ทใ™ใŽใ‚‹ใƒ†ใ‚ญใ‚นใƒˆใ‚’ไธ€่กŒใซๅ…ฅๅŠ›ใ—ใชใ„ใงใใ ใ•ใ„ใ€‚ใƒขใƒ‡ใƒซใฏๅ„่กŒใ‚’ไธ€ใคใฎๆ–‡ใจใ—ใฆๆŽจ็†ใ—ใพใ™ใ€‚ๆ„ๅ‘ณใŒ็น‹ใŒใ‚‹็ฏ„ๅ›ฒใงใ€่ค‡ๆ•ฐใฎๆ–‡ใ‚’็•ฐใชใ‚‹่กŒใซๅˆ†ใ‘ใฆๆŽจ็†ๆ™‚้–“ใ‚’็Ÿญ็ธฎใ—ใฆใใ ใ•ใ„ใ€‚็ฉบ็™ฝ่กŒใฏๅ‰Š้™คใ—ใฆใใ ใ•ใ„ใ€‚ใ“ใ‚ŒใŒ็”Ÿๆˆใ•ใ‚ŒใŸ้Ÿณๅฃฐใฎๅฏพๅฟœ้ƒจๅˆ†ใงๅฅ‡ๅฆ™ใช้Ÿณใ‚’็”Ÿใ˜ใ‚‹ๅŽŸๅ› ใจใชใ‚Šใพใ™ใ€‚** โœ‹
-------------------------
๐Ÿ‘ **When encountering situations where an error occurs, please check if there's rare and difficult CHINISE CHARACTERS in your inputs, and replace them with Hiragana or Katakana.** ๐Ÿ‘
๐Ÿ‘ **ๅฆ‚ๆžœ็”Ÿๆˆๅ‡บ็Žฐไบ†้”™่ฏฏ๏ผŒ่ฏท้ฆ–ๅ…ˆๆฃ€ๆŸฅ่พ“ๅ…ฅไธญๆ˜ฏๅฆๅญ˜ๅœจ้žๅธธๅฐ‘่ง็š„็”Ÿๅƒปๆฑ‰ๅญ—๏ผŒๅฆ‚ๆžœๆœ‰๏ผŒ่ฏทๅฐ†ๅ…ถๆ›ฟๆขไธบๅนณๅ‡ๅๆˆ–่€…็‰‡ๅ‡ๅใ€‚** ๐Ÿ‘
๐Ÿ‘ **็”Ÿๆˆใซ่ชคใ‚ŠใŒใ‚ใ‚‹ๅ ดๅˆใฏใ€ใพใšๅ…ฅๅŠ›ใซ้žๅธธใซ็ใ—ใ„้›ฃ่งฃใชๆผขๅญ—ใŒใชใ„ใ‹็ขบ่ชใ—ใฆใใ ใ•ใ„ใ€‚ใ‚‚ใ—ๅญ˜ๅœจใ™ใ‚‹ๅ ดๅˆใ€ใใ‚Œใ‚’ๅนณไปฎๅใพใŸใฏ็‰‡ไปฎๅใซ็ฝฎใๆ›ใˆใฆใใ ใ•ใ„ใ€‚** ๐Ÿ‘
------------------------
🎈 **Please make good use of punctuation marks.** 🎈
🎈 **请善用标点符号的神奇力量。** 🎈
🎈 **句読点の魔法の力をうまく活用してください。** 🎈
---------------------
📚 **Looking for a character's Chinese name? Please refer to the [Umamusume Bilibili Wiki](https://wiki.biligame.com/umamusume/%E8%B5%9B%E9%A9%AC%E5%A8%98%E4%B8%80%E8%A7%88).** 📚
📚 **キャラの中国語名は何ですか？ここにご覧ください：[ウマ娘ビリビリWiki](https://wiki.biligame.com/umamusume/%E8%B5%9B%E9%A9%AC%E5%A8%98%E4%B8%80%E8%A7%88).** 📚
## Training Details - For those who may be interested
🎈 **This work swaps [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) for [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese), expecting potentially better performance, and, just for fun.** 🥰
❤ Thanks to the **SUSTech Center for Computational Science and Engineering**. ❤ This model is trained on 2× A100 (40 GB) GPUs with a total **batch size of 32**.
💪 This model has currently been trained for **1 cycle, 180K steps (= 120 epochs)**. 💪
📕 This work uses a linear LR scheduler with warmup (**7.5% of total steps**) and `max_lr=1e-4`. 📕
✂ This work **clips gradient values to 10**. ✂
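As an illustration of the schedule above, here is a pure-Python sketch: the warmup fraction (7.5%) and `max_lr=1e-4` come from this README, while the linear decay back to zero after warmup is an assumed convention, not a confirmed detail of the training code.

```python
def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.075):
    """Linear warmup to max_lr over warmup_frac of total steps,
    then (assumed) linear decay from max_lr down to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup phase: LR rises linearly from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    # Decay phase: LR falls linearly from max_lr to 0 at total_steps.
    return max_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With `total_steps=180_000`, warmup covers the first 13,500 steps, after which the rate peaks at `1e-4` and decays.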
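Gradient value clipping, as mentioned above, clamps each gradient element to a fixed range. A minimal stand-alone sketch (the real training code presumably uses PyTorch's `torch.nn.utils.clip_grad_value_`; this helper just shows the operation):

```python
def clip_gradient_values(grads, clip_value=10.0):
    """Clamp each gradient element to [-clip_value, clip_value],
    mirroring what value-based gradient clipping does per element."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]
```

Values already inside the range pass through unchanged; only outliers are clamped to ±10.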
⚠ Fine-tuning the model on **single-speaker datasets separately** yields better results than training on **one huge dataset comprising many speakers**: sharing the same model leads to unexpected mixing of the speakers' voice lines. ⚠
### TODO:
📅 Train one more cycle using the text preprocessor provided by [AkitoP](https://huggingface.co/AkitoP), with cleaner text inputs and training data for Mejiro Ramonu. 📝