Text-to-Speech
Fairseq
English
audio
arampacha committed
Commit 8b3f165
1 Parent(s): 9ead7dc
Files changed (8)
  1. .gitattributes +1 -0
  2. README.md +70 -0
  3. config.yaml +30 -0
  4. fbank_mfa_gcmvn_stats.npz +3 -0
  5. hifigan.bin +3 -0
  6. hifigan.json +37 -0
  7. pytorch_model.pt +3 -0
  8. vocab.txt +71 -0
.gitattributes CHANGED
@@ -25,3 +25,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zstandard filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ fbank_mfa_gcmvn_stats.npz filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,70 @@
+ ---
+ library_name: fairseq
+ task: text-to-speech
+ tags:
+ - fairseq
+ - audio
+ - text-to-speech
+ language: en
+ datasets:
+ - ljspeech
+ widget:
+ - text: "Hello, this is a test run."
+   example_title: "Hello, this is a test run."
+ ---
+ # fastspeech2-en-ljspeech
+
+ [FastSpeech 2](https://arxiv.org/abs/2006.04558) text-to-speech model from fairseq S^2 ([paper](https://arxiv.org/abs/2109.06912)/[code](https://github.com/pytorch/fairseq/tree/main/examples/speech_synthesis)):
+ - English
+ - Single-speaker female voice
+ - Trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
+
+ ## Usage
+
+ ```python
+ from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
+ from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
+ import IPython.display as ipd
+
+
+ models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
+     "facebook/fastspeech2-en-ljspeech",
+     arg_overrides={"vocoder": "hifigan", "fp16": False}
+ )
+ model = models[0]
+ TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
+ generator = task.build_generator([model], cfg)
+
+ text = "Hello, this is a test run."
+
+ sample = TTSHubInterface.get_model_input(task, text)
+ wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
+
+ ipd.Audio(wav, rate=rate)
+ ```
+
+ See also [fairseq S^2 example](https://github.com/pytorch/fairseq/blob/main/examples/speech_synthesis/docs/ljspeech_example.md).
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{wang-etal-2021-fairseq,
+     title = "fairseq S{\^{}}2: A Scalable and Integrable Speech Synthesis Toolkit",
+     author = "Wang, Changhan and
+       Hsu, Wei-Ning and
+       Adi, Yossi and
+       Polyak, Adam and
+       Lee, Ann and
+       Chen, Peng-Jen and
+       Gu, Jiatao and
+       Pino, Juan",
+     booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
+     month = nov,
+     year = "2021",
+     address = "Online and Punta Cana, Dominican Republic",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/2021.emnlp-demo.17",
+     doi = "10.18653/v1/2021.emnlp-demo.17",
+     pages = "143--152",
+ }
+ ```
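The `ipd.Audio` call at the end of the usage snippet only plays the audio inline in a notebook. A minimal follow-up sketch for saving the result to disk instead, assuming `wav` is the 1-D `torch.Tensor` and `rate` the integer sample rate returned by `TTSHubInterface.get_prediction` above; the `soundfile` dependency and the output filename are illustrative, not part of this repository:

```python
# Sketch: write the synthesized waveform to a WAV file instead of playing it.
# Assumes `wav` (torch.Tensor) and `rate` (int) from the usage snippet above.
import soundfile as sf  # extra dependency, not required by fairseq itself

sf.write("fastspeech2_sample.wav", wav.cpu().numpy(), rate)
```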
config.yaml ADDED
@@ -0,0 +1,30 @@
+ features:
+   energy_max: 3.2244551181793213
+   energy_min: -4.9544901847839355
+   eps: 1.0e-05
+   f_max: 8000
+   f_min: 0
+   hop_len_t: 0.011609977324263039
+   hop_length: 256
+   n_fft: 1024
+   n_mels: 80
+   n_stft: 513
+   pitch_max: 5.733940816898645
+   pitch_min: -4.660287183665281
+   sample_rate: 22050
+   type: spectrogram+melscale+log
+   win_len_t: 0.046439909297052155
+   win_length: 1024
+   window_fn: hann
+ global_cmvn:
+   stats_npz_path: fbank_mfa_gcmvn_stats.npz
+ transforms:
+   '*':
+   - global_cmvn
+ vocab_filename: vocab.txt
+ vocoder:
+   type: hifigan
+   config: hifigan.json
+   checkpoint: hifigan.bin
+ hub:
+   phonemizer: g2p
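The `features` block specifies the acoustic targets: 80-bin log-mel spectrograms at 22 050 Hz with a 1024-point FFT and a 256-sample hop, normalized per mel bin with the global CMVN statistics in `fbank_mfa_gcmvn_stats.npz`. A rough sketch of that pipeline, using torchaudio for illustration (fairseq uses its own audio utilities) and assuming the stats archive stores `mean` and `std` arrays as fairseq's global CMVN transform expects:

```python
# Illustrative re-creation of the config.yaml feature pipeline:
# spectrogram+melscale+log followed by global CMVN. Parameter values are
# copied from config.yaml; the torchaudio transform and the "mean"/"std"
# keys of fbank_mfa_gcmvn_stats.npz are assumptions for this sketch.
import numpy as np
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, win_length=1024, hop_length=256,
    f_min=0, f_max=8000, n_mels=80, window_fn=torch.hann_window,
)

def extract_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) at 22 050 Hz -> (80, num_frames) normalized log-mels."""
    feats = torch.log(mel(waveform).clamp(min=1.0e-05)).squeeze(0)
    stats = np.load("fbank_mfa_gcmvn_stats.npz")
    mean = torch.from_numpy(stats["mean"]).float().unsqueeze(-1)
    std = torch.from_numpy(stats["std"]).float().unsqueeze(-1)
    return (feats - mean) / std
```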
fbank_mfa_gcmvn_stats.npz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6735b35875c2614cee80bf861c6a604aba35671887e6f04b4449dc257bb15d34
+ size 1140
hifigan.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d4f4f016c791fd9ca9859a9e25e7eb0a823fee2ea997c1e5ae8e1a9ea5f99b1f
+ size 55825897
hifigan.json ADDED
@@ -0,0 +1,37 @@
+ {
+     "resblock": "1",
+     "num_gpus": 0,
+     "batch_size": 16,
+     "learning_rate": 0.0002,
+     "adam_b1": 0.8,
+     "adam_b2": 0.99,
+     "lr_decay": 0.999,
+     "seed": 1234,
+
+     "upsample_rates": [8,8,2,2],
+     "upsample_kernel_sizes": [16,16,4,4],
+     "upsample_initial_channel": 512,
+     "resblock_kernel_sizes": [3,7,11],
+     "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
+
+     "segment_size": 8192,
+     "num_mels": 80,
+     "num_freq": 1025,
+     "n_fft": 1024,
+     "hop_size": 256,
+     "win_size": 1024,
+
+     "sampling_rate": 22050,
+
+     "fmin": 0,
+     "fmax": 8000,
+     "fmax_for_loss": null,
+
+     "num_workers": 4,
+
+     "dist_config": {
+         "dist_backend": "nccl",
+         "dist_url": "tcp://localhost:54321",
+         "world_size": 1
+     }
+ }
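`hifigan.json` holds the HiFi-GAN generator hyperparameters; the training-only fields (`batch_size`, `learning_rate`, `num_gpus`, `dist_config`, ...) are irrelevant at inference time. The structural constraint worth noting is that the upsampling factors must multiply out to the mel hop size, so each predicted 80-dim spectrogram frame expands to exactly 256 waveform samples at 22 050 Hz. A small consistency check over the file above (file names as in this repository):

```python
# Consistency check: HiFi-GAN's total upsampling must equal the mel hop size,
# and the audio/mel settings must match the features block in config.yaml.
import json
import math

with open("hifigan.json") as f:
    cfg = json.load(f)

total_upsampling = math.prod(cfg["upsample_rates"])  # 8 * 8 * 2 * 2 = 256
assert total_upsampling == cfg["hop_size"] == 256
assert cfg["sampling_rate"] == 22050 and cfg["num_mels"] == 80
```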
pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a48d454fe66939079d0ddb70f1c062ec669f521a7cfadc608968746e312986ab
+ size 494816801
vocab.txt ADDED
@@ -0,0 +1,71 @@
+ AH0 71007
+ N 63410
+ T 60842
+ S 40263
+ D 39886
+ R 35965
+ L 30358
+ sp 27584
+ IH0 27113
+ DH 26584
+ K 25851
+ IH1 25683
+ Z 25387
+ EH1 21690
+ AE1 21648
+ M 21537
+ W 18760
+ P 18458
+ ER0 18446
+ V 18169
+ IY0 17832
+ AH1 16995
+ F 15549
+ B 14227
+ HH 13468
+ IY1 12751
+ EY1 12141
+ AO1 11595
+ AA1 10589
+ AY1 9624
+ UW1 8865
+ SH 7449
+ OW1 7441
+ NG 6705
+ G 5472
+ ER1 4898
+ Y 4548
+ JH 4486
+ CH 4355
+ TH 3980
+ AW1 3607
+ UH1 2469
+ EH2 1881
+ spn 1774
+ AO0 1357
+ OW0 1328
+ EY2 1258
+ IH2 1251
+ AE2 1104
+ UW0 1077
+ AY2 1062
+ AA2 774
+ OY1 771
+ AO2 622
+ ZH 587
+ EH0 568
+ OW2 557
+ EY0 443
+ IY2 435
+ UW2 431
+ AY0 390
+ AE0 374
+ AH2 316
+ AW2 290
+ AA0 259
+ ER2 136
+ UH2 127
+ OY2 44
+ UH0 36
+ AW0 35
+ OY0 4
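`vocab.txt` lists the model's input symbols: ARPAbet phones with stress digits plus the `sp` (short pause) and `spn` (spoken noise) markers from forced alignment, each followed by its frequency in the training data. Together with `hub: phonemizer: g2p` in `config.yaml`, this means input text is converted to phones before lookup. An illustrative check, assuming the `g2p_en` package is the grapheme-to-phoneme tool behind that setting:

```python
# Sketch: phonemize a sentence and confirm every phone appears in vocab.txt.
# Assumes g2p_en is the "g2p" phonemizer named in config.yaml (an assumption);
# vocab.txt lines have the form "<symbol> <training-set count>".
from g2p_en import G2p

with open("vocab.txt") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

g2p = G2p()
# Keep only phone tokens; g2p_en also emits spaces and punctuation.
phones = [p for p in g2p("Hello, this is a test run.") if p.strip().isalnum()]
print(phones)                           # e.g. ['HH', 'AH0', 'L', 'OW1', ...]
print(all(p in vocab for p in phones))  # expected: True
```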