# Umamusume DeBERTA-VITS2 TTS 

---------------

📅 2023.10.24 📅

- Updated the current generator to the 270K-step checkpoint

------------------

👌 **Currently, ONLY Japanese is supported.** 👌

💪 **Based on [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2), this work closely follows [Akito/umamusume_bert_vits2](https://huggingface.co/spaces/AkitoP/umamusume_bert_vits2), which provides the Japanese text preprocessor.** ❤

---------------

## Instructions for use

✋ **Please do NOT enter one very long sentence, or several sentences, in a single row. The model treats each row as one sentence and runs inference on it separately, so wherever it does not break the flow of meaning, put each sentence on its own row to reduce inference time. Please also delete completely empty rows, which produce weird sounds at the corresponding positions in the generated audio.** ✋
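If you prepare your text in a script before pasting it in, the row rules above can be applied automatically. The following is only a sketch: `split_into_rows` is a hypothetical helper, not part of this Space, and it assumes you have already placed one sentence per line.

```python
def split_into_rows(text: str) -> list[str]:
    """Return the non-empty rows of `text`, stripped of surrounding whitespace."""
    return [row.strip() for row in text.splitlines() if row.strip()]


# Example input with a blank row between two sentences.
script = "こんにちは。\n\n今日はいい天気ですね。\n"

rows = split_into_rows(script)
cleaned_input = "\n".join(rows)  # paste this into the text box
print(cleaned_input)
```

Each remaining row is then inferenced separately, and no empty row is left to cause stray sounds.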

-------------------------

👍 **If generation fails with an error, first check whether your input contains rare or difficult kanji (Chinese characters), and if so, replace them with hiragana or katakana.** 👍
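To find candidates quickly, you can list every kanji in your input and review them by eye. This is a rough sketch, not something this Space does for you: `list_kanji` is a hypothetical helper, and it assumes that the CJK Unified Ideographs block (U+4E00 to U+9FFF) is a good enough definition of "kanji". It cannot tell which ones are rare; that judgment is still yours.

```python
import re

# Assumption: the basic CJK Unified Ideographs block covers the kanji of interest.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def list_kanji(text: str) -> list[str]:
    """Return the distinct kanji in `text`, in order of first appearance."""
    seen: list[str] = []
    for ch in KANJI.findall(text):
        if ch not in seen:
            seen.append(ch)
    return seen


print(list_kanji("薔薇が咲いた"))  # ['薔', '薇', '咲']
```

In this example, 薔薇 could be rewritten as ばら if it turns out to cause an error.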

------------------------

🎈 **Please make good use of the magic of punctuation marks.** 🎈

---------------------

📚 **Looking for a character's Chinese name? Please refer to the [Umamusume Bilibili Wiki](https://wiki.biligame.com/umamusume/%E8%B5%9B%E9%A9%AC%E5%A8%98%E4%B8%80%E8%A7%88).** 📚

---------------

## Training Details - For those who may be interested

🎈 **This work replaces [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) with [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese), in the hope of better performance and, well, just for fun.** 🥰

❤ Thanks to **SUSTech Center for Computational Science and Engineering**. ❤ This model was trained on 2 x A100 (40GB) GPUs with a total **batch size of 32**.

💪 This model has been trained for **3 cycles, 270K steps (= 180 epochs)**. 💪
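For a sense of scale, the step and epoch counts can be turned into a rough per-epoch figure. This is back-of-the-envelope arithmetic only: it assumes one optimizer step per batch of 32 (no gradient accumulation) and that an epoch is one pass over the data. The dataset size itself is not stated here, so the result is an estimate.

```python
total_steps = 270_000
epochs = 180
batch_size = 32  # total across both GPUs

steps_per_epoch = total_steps // epochs
# Assumes no gradient accumulation and full batches throughout.
samples_per_epoch = steps_per_epoch * batch_size

print(steps_per_epoch)    # 1500
print(samples_per_epoch)  # 48000
```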

📕 This work uses a linear LR scheduler with warmup **(7.5% of total steps)** and `max_lr=1e-4`. 📕
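The shape of such a schedule can be sketched in plain Python. This is not the training code: the function name `linear_warmup_lr`, the linear decay to zero at the final step, and the 90,000-step cycle length (270K steps split evenly over 3 cycles) are all assumptions. Only the 7.5% warmup and `max_lr=1e-4` come from the note above.

```python
def linear_warmup_lr(
    step: int,
    total_steps: int,
    max_lr: float = 1e-4,
    warmup_frac: float = 0.075,
) -> float:
    """Learning rate at `step`: linear ramp up to max_lr, then linear decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warmup phase: 0 -> max_lr over the first 7.5% of steps.
        return max_lr * (step / warmup_steps)
    # Decay phase (assumed): max_lr -> 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return max_lr * (remaining / max(1, total_steps - warmup_steps))


cycle_steps = 90_000  # hypothetical: one of three equal cycles
for step in (0, 3_375, 6_750, 48_375, 90_000):
    print(step, linear_warmup_lr(step, cycle_steps))
```

With these assumptions the peak is reached at step 6,750 of each cycle.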

✂ This work **clips gradient values to 10**. ✂
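Read literally, value clipping clamps every gradient element to the range [-10, 10]. The sketch below shows that reading on a plain list of numbers; `clip_gradient_values` is a hypothetical helper for illustration, and whether the training code clamps element-wise in exactly this way is an assumption. In PyTorch, the equivalent utility is `torch.nn.utils.clip_grad_value_`.

```python
def clip_gradient_values(grads: list[float], clip_value: float = 10.0) -> list[float]:
    """Clamp each gradient element to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


grads = [-25.0, 3.5, 10.0, 42.0]
print(clip_gradient_values(grads))  # [-10.0, 3.5, 10.0, 10.0]
```

Unlike norm clipping, this changes only the elements that exceed the bound, so the direction of the overall gradient can shift.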

⚠ Finetuning the model on **single-speaker datasets separately** will definitely give better results than training on **one huge dataset comprising many speakers**. Sharing the same model across speakers leads to unexpected mixing of the speakers' voices. ⚠