Jungil Kong, Jaehyeon Kim, Jaekyoung Bae


HiFi-GAN is a generative adversarial network (GAN) for speech synthesis that generates high-fidelity audio efficiently. It models the periodic patterns of speech audio to improve sample quality, and it improves on earlier models such as WaveNet and WaveGlow in both quality and synthesis speed.


Training and Data:
HiFi-GAN trains a generator adversarially against two types of discriminators (multi-scale and multi-period). In addition to the adversarial loss, the generator is trained with a mel-spectrogram loss and a feature matching loss, which stabilize training and improve audio fidelity.
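The way these loss terms combine can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes least-squares GAN objectives, treats discriminator outputs, feature maps, and mel-spectrograms as flat lists of floats, and covers a single sub-discriminator. All function names are hypothetical, and the weights `lambda_fm=2.0` and `lambda_mel=45.0` follow the values reported in the paper but are not stated in this summary.

```python
# Sketch of the generator and discriminator objectives on plain Python lists.
# Function names are hypothetical; least-squares GAN losses are assumed.

def mean(values):
    return sum(values) / len(values)

def discriminator_adv_loss(d_real, d_fake):
    # Push discriminator outputs toward 1 on real audio and toward 0 on generated audio.
    return mean([(r - 1.0) ** 2 for r in d_real]) + mean([f ** 2 for f in d_fake])

def generator_adv_loss(d_fake):
    # The generator is rewarded when generated audio is scored as real (1).
    return mean([(f - 1.0) ** 2 for f in d_fake])

def l1_distance(a, b):
    return mean([abs(x - y) for x, y in zip(a, b)])

def feature_matching_loss(real_feats, fake_feats):
    # Sum of per-layer L1 distances between discriminator feature maps.
    return sum(l1_distance(r, f) for r, f in zip(real_feats, fake_feats))

def generator_total_loss(d_fake, real_feats, fake_feats, mel_real, mel_fake,
                         lambda_fm=2.0, lambda_mel=45.0):
    # Adversarial term plus weighted feature matching and mel-spectrogram terms.
    return (generator_adv_loss(d_fake)
            + lambda_fm * feature_matching_loss(real_feats, fake_feats)
            + lambda_mel * l1_distance(mel_real, mel_fake))
```

In a full setup the adversarial and feature matching terms would be summed over every sub-discriminator, while the mel-spectrogram term is computed once per batch.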


HiFi-GAN achieves a higher mean opinion score (MOS) than previous models, indicating better perceived audio quality. It synthesizes speech significantly faster than real time on both GPU and CPU. The model also generalizes well to unseen speakers and, with fine-tuning, performs well in end-to-end TTS pipelines.


The model may still require significant computational resources for training, and fine-tuning might be necessary for optimal performance in end-to-end TTS pipelines. The trade-off between model size and synthesis quality needs to be considered for specific applications.


Model Architecture:
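The architecture consists of a fully convolutional generator with multi-receptive field fusion (MRF) modules, which observe patterns of various lengths in parallel. The discriminators are a multi-period discriminator (MPD), whose sub-discriminators each examine equally spaced samples of the input audio to capture periodic patterns, and a multi-scale discriminator (MSD), which evaluates the audio at different scales to capture consecutive patterns and long-term dependencies.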
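The periodic view used by the multi-period discriminator can be illustrated by reshaping a 1D signal into a 2D grid of width `period`, so that each column holds samples spaced `period` steps apart. The sketch below is illustrative only: the helper names are hypothetical, and zero padding is an assumption made to keep the example short.

```python
# Sketch: reshape 1D audio into a 2D grid so that each column contains
# samples that are `period` steps apart. Helper names are hypothetical.

def reshape_by_period(audio, period):
    padded = list(audio)
    remainder = len(padded) % period
    if remainder:
        # Zero padding is an assumption made for brevity.
        padded.extend([0.0] * (period - remainder))
    # Each row is one consecutive chunk of `period` samples.
    return [padded[i:i + period] for i in range(0, len(padded), period)]

def column(grid, k):
    # Samples at positions k, k + period, k + 2 * period, ...
    return [row[k] for row in grid]

audio = list(range(10))
grid = reshape_by_period(audio, 3)
print(column(grid, 0))  # samples 0, 3, 6, 9 of the original signal
```

A 2D convolution with width-1 kernels over such a grid processes each column independently, which is how a sub-discriminator can focus on one period. The paper uses the periods 2, 3, 5, 7, and 11, chosen as primes to limit overlap between sub-discriminators.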


Dependencies include PyTorch for training the convolutional neural networks, tools for audio preprocessing and mel-spectrogram extraction, and sufficient compute for training and inference.


HiFi-GAN uses a GAN-based architecture whose discriminators capture both periodic patterns and long-term dependencies in audio signals. The generator synthesizes raw waveforms from mel-spectrograms, and adversarial training combined with the additional loss functions yields high-fidelity, efficient synthesis.


The datasets used are LJSpeech for single-speaker TTS and VCTK for multi-speaker TTS, each preprocessed in the same way for consistent training and evaluation. Together they provide the audio samples used to train and test the generative models.


Raw audio is preprocessed into mel-spectrograms, which serve as input conditions for the generator. The paper uses 80-band mel-spectrograms with FFT, window, and hop sizes of 1024, 1024, and 256, respectively. Applying the same preprocessing everywhere keeps the inputs consistent and compatible with the model’s architecture.
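The hop size ties preprocessing to the generator: each mel-spectrogram frame covers one hop of audio, so the generator has to upsample by exactly that factor to return to the waveform rate. The arithmetic below is a rough sketch under stated assumptions: the hop size of 256 and the 22,050 Hz sampling rate match the LJSpeech setup in the paper, the upsampling rates `(8, 8, 2, 2)` are an illustrative choice, and the function names are hypothetical.

```python
import math

# Sketch of the frame and upsampling arithmetic. The hop size, sampling rate,
# and upsampling rates are assumptions used for illustration.

def num_frames(num_samples, hop_size):
    # One frame per hop, assuming the audio is padded so the last partial hop is kept.
    return math.ceil(num_samples / hop_size)

def total_upsampling(rates):
    # Product of the per-layer upsampling factors of the generator.
    return math.prod(rates)

def output_length(frames, rates):
    # Number of waveform samples the generator produces for `frames` input frames.
    return frames * total_upsampling(rates)

hop_size = 256
rates = (8, 8, 2, 2)
frames = num_frames(22050, hop_size)  # one second of 22,050 Hz audio

# The generator's overall upsampling factor must equal the hop size.
assert total_upsampling(rates) == hop_size
print(frames, output_length(frames, rates))
```

If the product of the upsampling rates differs from the hop size, the generated waveform no longer lines up with the ground-truth audio, and the mel-spectrogram loss compares misaligned signals.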


Evaluation Metrics:

The model is evaluated with MOS tests for perceptual quality, synthesis speed measurements on GPU and CPU, and model size. MOS values are reported with 95% confidence intervals, and synthesis speed is measured as the number of raw audio samples generated per second.
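Both metrics are simple to compute from raw measurements. The sketch below assumes a normal approximation for the 95% confidence interval (z = 1.96), which may differ from the procedure the authors used. The function names are hypothetical.

```python
import math
import statistics

# Sketch of the evaluation arithmetic. The normal-approximation interval
# is an assumption, not a procedure confirmed by the source.

def mos_with_ci(ratings, z=1.96):
    # Mean opinion score and the half-width of its confidence interval.
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

def synthesis_speed_khz(num_samples, seconds):
    # Thousands of raw audio samples generated per second of wall-clock time.
    return num_samples / seconds / 1000.0

def realtime_factor(num_samples, seconds, sample_rate):
    # Values above 1 mean synthesis runs faster than real time.
    return (num_samples / seconds) / sample_rate

mos, ci = mos_with_ci([5, 4, 4, 5, 3, 4, 5, 4])
print(f"MOS {mos:.2f} (+/- {ci:.2f})")
```

For example, generating 44,100 samples in one second corresponds to a speed of 44.1 kHz, which is twice real time for audio sampled at 22,050 Hz.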


Compared with previous models, HiFi-GAN performs favorably in MOS, synthesis speed, and model size. It generates high-fidelity audio, synthesizes efficiently on both GPU and CPU, generalizes to unseen speakers, and is effective in end-to-end TTS pipelines.

