Authors:
Jungil Kong, Jaehyeon Kim, Jaekyoung Bae
Description:
HiFi-GAN is a generative adversarial network for speech synthesis that achieves high-fidelity audio generation with efficient computation. By explicitly modeling the periodic patterns of speech, it improves sample quality and demonstrates significant improvements over previous models such as WaveNet and WaveGlow.
Training and Data:
HiFi-GAN trains a generator adversarially against two types of discriminators (multi-scale and multi-period). In addition to the adversarial objective, a mel-spectrogram loss and a feature matching loss are applied to stabilize training and improve audio fidelity, as sketched below.
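A minimal sketch of the generator-side objective, assuming each discriminator (every MPD and MSD sub-discriminator) returns a score together with its intermediate feature maps, and with mel_transform as a placeholder that maps waveforms to mel-spectrograms; the loss weights are the ones reported in the paper.

    import torch
    import torch.nn.functional as F

    LAMBDA_FM, LAMBDA_MEL = 2.0, 45.0  # loss weights reported in the paper

    def generator_loss(discriminators, mel_transform, real_wav, fake_wav, real_mel):
        # Mel-spectrogram reconstruction loss (L1 between real and generated mels).
        loss = LAMBDA_MEL * F.l1_loss(mel_transform(fake_wav), real_mel)
        for d in discriminators:  # every MPD and MSD sub-discriminator
            fake_score, fake_feats = d(fake_wav)
            _, real_feats = d(real_wav)
            # Least-squares adversarial loss for the generator.
            loss = loss + torch.mean((fake_score - 1.0) ** 2)
            # Feature matching loss over intermediate discriminator features.
            for rf, ff in zip(real_feats, fake_feats):
                loss = loss + LAMBDA_FM * F.l1_loss(ff, rf)
        return loss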
Advantages:
HiFi-GAN achieves higher MOS scores than previous models, indicating better audio quality. It synthesizes speech significantly faster than real-time on both GPU and CPU. The model also generalizes well to unseen speakers and performs well in end-to-end TTS pipelines with fine-tuning.
Limitations:
The model may still require significant computational resources for training, and fine-tuning might be necessary for optimal performance in end-to-end TTS pipelines. The trade-off between model size and synthesis quality needs to be considered for specific applications.
Model Architecture:
The architecture comprises a fully convolutional generator whose multi-receptive field fusion (MRF) modules observe patterns of various lengths in parallel. Two discriminators are used: a multi-period discriminator (MPD), whose sub-discriminators each examine the periodic structure of the waveform at a different period, and a multi-scale discriminator (MSD) that evaluates the audio at multiple resolutions to capture consecutive patterns and long-term dependencies.
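A rough sketch of the core idea behind one MPD sub-discriminator: the 1D waveform is folded into a 2D grid so that each column contains samples spaced period apart, then processed with 2D convolutions whose kernels span only the height axis. The layer sizes here are illustrative, not the paper's exact configuration; the paper uses periods 2, 3, 5, 7, and 11.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PeriodDiscriminator(nn.Module):
        def __init__(self, period, channels=32):
            super().__init__()
            self.period = period
            self.convs = nn.ModuleList([
                nn.Conv2d(1, channels, (5, 1), stride=(3, 1), padding=(2, 0)),
                nn.Conv2d(channels, channels, (5, 1), stride=(3, 1), padding=(2, 0)),
            ])
            self.out = nn.Conv2d(channels, 1, (3, 1), padding=(1, 0))

        def forward(self, wav):                            # wav: (batch, 1, T)
            b, c, t = wav.shape
            pad = (self.period - t % self.period) % self.period
            wav = F.pad(wav, (0, pad), mode="reflect")     # make T divisible by the period
            wav = wav.view(b, c, -1, self.period)          # fold 1D audio into 2D (T/p, p)
            feats = []
            for conv in self.convs:
                wav = F.leaky_relu(conv(wav), 0.1)
                feats.append(wav)                          # keep features for feature matching
            return self.out(wav), feats                    # score map and intermediate features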
Dependencies:
Dependencies include PyTorch for building and training the convolutional networks, audio tooling for preprocessing and mel-spectrogram extraction, and sufficient compute (ideally a GPU) for training and fast inference.
Synthesis:
Uses a GAN-based architecture whose discriminators capture both periodic patterns and long-term dependencies in audio signals. The generator synthesizes raw waveforms directly from mel-spectrograms; adversarial training combined with the auxiliary losses yields high fidelity at fast synthesis speeds.
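An illustrative stand-in for the generator's interface: a stack of transposed convolutions upsamples an 80-band mel-spectrogram by a factor of 256 (the hop size), so each mel frame becomes 256 raw audio samples. The channel widths and upsampling rates here are placeholders; the real generator interleaves MRF blocks between its upsampling layers.

    import torch
    import torch.nn as nn

    class ToyGenerator(nn.Module):
        def __init__(self, num_mels=80, channels=128):
            super().__init__()
            self.pre = nn.Conv1d(num_mels, channels, 7, padding=3)
            layers = []
            for rate in (8, 8, 4):                         # 8 * 8 * 4 = 256x total upsampling
                layers.append(nn.ConvTranspose1d(channels, channels // 2,
                                                 kernel_size=2 * rate, stride=rate,
                                                 padding=rate // 2))
                layers.append(nn.LeakyReLU(0.1))
                channels //= 2
            self.ups = nn.Sequential(*layers)
            self.post = nn.Conv1d(channels, 1, 7, padding=3)

        def forward(self, mel):                            # mel: (batch, num_mels, frames)
            return torch.tanh(self.post(self.ups(self.pre(mel))))

    mel = torch.randn(1, 80, 100)                          # 100 mel frames
    print(ToyGenerator()(mel).shape)                       # torch.Size([1, 1, 25600])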
Dataset:
Datasets include LJSpeech for single-speaker TTS and VCTK, a multi-speaker corpus used to evaluate generalization to unseen speakers, with consistent preprocessing for training and evaluation. Together they provide diverse audio samples for training and testing the generative models.
Preprocessing:
Raw audio is preprocessed into mel-spectrograms, which serve as the input conditions for the generator. The FFT size, window size, and hop size are fixed for mel-spectrogram extraction so that the number of mel frames maps cleanly onto the generator's upsampling factor, keeping preprocessing consistent and compatible with the model's architecture.
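A preprocessing sketch using torchaudio, assuming the configuration commonly used with HiFi-GAN on LJSpeech (22.05 kHz audio, 1024-point FFT, 1024-sample window, 256-sample hop, 80 mel bands). The official implementation computes its mel-spectrograms with a manual STFT and dynamic-range clamping, so treat this as an approximation.

    import torch
    import torchaudio

    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050, n_fft=1024, win_length=1024,
        hop_length=256, n_mels=80, f_min=0.0, f_max=8000.0,
    )

    wav, sr = torchaudio.load("sample.wav")                # (channels, samples); path is illustrative
    wav = wav.mean(dim=0, keepdim=True)                    # mix down to mono
    mel = torch.log(mel_transform(wav).clamp(min=1e-5))    # log-mel conditioning, (1, 80, frames)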
Evaluation Metrics:
Evaluated using MOS tests for perceptual quality, synthesis speed measurements on GPU and CPU, and model size. MOS scores are recorded with 95% confidence intervals, and synthesis speeds are measured in terms of raw audio samples generated per second.
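A hypothetical benchmark of synthesis speed, reported as generated audio samples per second and as a multiple of real time at 22.05 kHz; generator and mel stand in for a trained model and a prepared mel-spectrogram.

    import time
    import torch

    def benchmark(generator, mel, sample_rate=22050, runs=10):
        generator.eval()
        with torch.no_grad():
            start = time.perf_counter()
            for _ in range(runs):
                wav = generator(mel)
            # For GPU timing, call torch.cuda.synchronize() before reading the clock.
            elapsed = time.perf_counter() - start
        samples_per_sec = wav.numel() * runs / elapsed
        return samples_per_sec, samples_per_sec / sample_rate  # throughput, x real-time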
Results:
HiFi-GAN achieves superior performance in terms of MOS scores, synthesis speed, and model size compared to previous models. It provides high-fidelity audio generation and efficient synthesis on both GPU and CPU, and it generalizes to unseen speakers and works effectively in end-to-end TTS pipelines.
Last Accessed: 7/16/2024