
FastSpeech2ConformerWithHifiGan

This model combines FastSpeech2Conformer and FastSpeech2ConformerHifiGan into a single model for simpler, more convenient use.

FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the Conformer architecture to generate high-quality speech from text quickly and efficiently. The HiFi-GAN vocoder then converts the generated mel-spectrograms into speech waveforms.

🤗 Transformers Usage

You can run FastSpeech2Conformer locally with the 🤗 Transformers library.

  1. First install the 🤗 Transformers library and g2p-en, plus soundfile, which the example below uses to write the audio file:
pip install --upgrade pip
pip install --upgrade transformers g2p-en soundfile
  2. Run inference via the Transformers modelling code with the model and HiFi-GAN vocoder combined:

from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

# Convert the input text to token IDs
tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

# Load the combined acoustic model and HiFi-GAN vocoder, then generate the waveform
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

# Save the waveform as a WAV file at a 22,050 Hz sampling rate
sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
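
For comparison, the combined model can be thought of as two stages run back to back: the acoustic model produces a mel-spectrogram, and the vocoder turns it into a waveform. The sketch below shows what running those stages separately might look like. It rests on assumptions not stated in this card: the function name `synthesize_two_stage` is hypothetical, the standalone vocoder checkpoint is assumed to be published as `espnet/fastspeech2_conformer_hifigan`, and the acoustic model's output is assumed to expose a `spectrogram` entry, as in the Transformers documentation.

```python
SAMPLE_RATE = 22050  # same sampling rate as the combined example above


def synthesize_two_stage(text, out_path="speech.wav"):
    """Hypothetical helper: run the acoustic model and the vocoder separately."""
    # Imports are deferred so that defining this function does not require
    # transformers or soundfile to be installed.
    import soundfile as sf
    from transformers import (
        FastSpeech2ConformerHifiGan,
        FastSpeech2ConformerModel,
        FastSpeech2ConformerTokenizer,
    )

    tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

    # Stage 1: text -> mel-spectrogram (output key assumed to be "spectrogram")
    model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
    spectrogram = model(input_ids, return_dict=True)["spectrogram"]

    # Stage 2: mel-spectrogram -> waveform (checkpoint name is an assumption)
    hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
    waveform = hifigan(spectrogram)

    sf.write(out_path, waveform.squeeze().detach().numpy(), samplerate=SAMPLE_RATE)
    return waveform
```

Calling `synthesize_two_stage("Hello, my dog is cute.")` would download both checkpoints on first use. The combined model above remains the simpler option when the intermediate spectrogram is not needed.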