HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Abstract
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method achieves quality comparable to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU. We further show the generality of HiFi-GAN to mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real time on CPU with quality comparable to an autoregressive counterpart.
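The abstract's central claim is that modeling the periodic structure of speech is what lifts sample quality; in the paper this is realized by a multi-period discriminator that reshapes the 1D waveform into 2D so that each column collects samples spaced one period apart. Below is a minimal PyTorch sketch of that idea, not the paper's exact configuration: the `PeriodDiscriminator` class name, channel widths, and kernel sizes are illustrative assumptions, while the prime periods 2, 3, 5, 7, 11 follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    """Sketch of one period sub-discriminator: the raw waveform is folded
    into a 2D map of shape (T / period, period) before 2D convolutions."""
    def __init__(self, period, channels=(32, 128, 512, 1024)):  # channel widths are illustrative
        super().__init__()
        self.period = period
        convs, in_ch = [], 1
        for out_ch in channels:
            # kernels and strides act only along the time axis (height);
            # the width axis indexes positions within a single period
            convs.append(nn.Conv2d(in_ch, out_ch, kernel_size=(5, 1),
                                   stride=(3, 1), padding=(2, 0)))
            in_ch = out_ch
        self.convs = nn.ModuleList(convs)
        self.post = nn.Conv2d(in_ch, 1, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x):
        # x: (batch, 1, T) raw waveform
        b, c, t = x.shape
        if t % self.period != 0:  # pad so T is divisible by the period
            pad = self.period - (t % self.period)
            x = F.pad(x, (0, pad), mode="reflect")
            t += pad
        x = x.view(b, c, t // self.period, self.period)  # (B, 1, T/p, p)
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
        return self.post(x)  # per-patch real/fake scores

# One sub-discriminator per period; prime periods minimize overlap between them.
periods = [2, 3, 5, 7, 11]
mpd = nn.ModuleList(PeriodDiscriminator(p) for p in periods)
wave = torch.randn(1, 1, 22050)  # one second of 22.05 kHz audio
scores = [d(wave) for d in mpd]
```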
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder (2024)
- HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution (2025)
- FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation (2025)
- Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey (2024)
- Scaling Transformers for Low-Bitrate High-Quality Speech Coding (2024)
- Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners (2024)
- Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement (2025)
Models citing this paper: 23
Datasets citing this paper: 0