WaveGlow

Authors:
Ryan Prenger, Rafael Valle, Bryan Catanzaro

 

Description:
WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without auto-regression. It combines insights from Glow and WaveNet to achieve this goal.

 

Training and Data:
WaveGlow is trained with a single network and a single cost function: maximizing the likelihood of the training data. Because every layer of the network is invertible, the exact log-likelihood is tractable, which keeps the training procedure simple.
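For concreteness, a minimal sketch of that likelihood objective follows the form of the loss in the paper. The tensor names (`z`, `log_s_list`, `log_det_w_list`) are illustrative and assume the forward pass has already mapped audio to the latent space and collected each layer's log-determinant terms.

```python
import torch

def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
    """Negative log-likelihood of audio under the flow.

    z:              latent tensor produced by the forward (audio -> latent) pass
    log_s_list:     per-coupling-layer log-scale terms
    log_det_w_list: log|det W| terms from the invertible 1x1 convolutions
    sigma:          standard deviation of the spherical Gaussian prior
    """
    # Gaussian prior term: z^T z / (2 * sigma^2)
    loss = torch.sum(z * z) / (2 * sigma ** 2)
    # Subtract the log-determinant contribution of each invertible layer
    for log_s in log_s_list:
        loss = loss - torch.sum(log_s)
    for log_det_w in log_det_w_list:
        loss = loss - torch.sum(log_det_w)
    # Normalize so the loss is per element rather than per batch
    return loss / z.numel()
```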

 

Advantages:
WaveGlow provides high-quality, efficient, and fast audio synthesis. It simplifies training and inference compared to auto-regressive models and achieves real-time or faster-than-real-time performance on modern GPUs.

 

Limitations:
The model’s performance and quality are evaluated primarily on a single-speaker dataset (LJ Speech); future work could explore its effectiveness across more diverse datasets and multi-speaker scenarios.

 

Model Architecture:
The model is based on a series of invertible transformations, including invertible 1×1 convolutions and affine coupling layers. It conditions the audio generation on mel-spectrograms and uses a normalizing flow approach for tractable likelihood.
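A condensed PyTorch sketch of the two building blocks is below. It is illustrative, not the reference implementation: WaveGlow's actual coupling network is a gated, WaveNet-like dilated convolution stack, replaced here by a small stand-in network, and the mel-spectrogram is assumed to be already upsampled to the audio's time resolution.

```python
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    """1x1 convolution over channels, initialized as a random orthogonal
    matrix so it starts invertible (as in Glow)."""
    def __init__(self, channels):
        super().__init__()
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        if torch.det(w) < 0:
            w[:, 0] = -w[:, 0]            # force det = +1 so logdet is real
        self.conv = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.conv.weight.data = w.unsqueeze(-1)

    def forward(self, x):                 # x: (batch, channels, time)
        # log|det W| is accumulated once per batch element and time step
        log_det_w = x.size(0) * x.size(2) * torch.logdet(self.conv.weight.squeeze(-1))
        return self.conv(x), log_det_w

class AffineCoupling(nn.Module):
    """Affine coupling layer: half the channels pass through unchanged and,
    together with the mel conditioning, predict a scale and shift for the
    other half, so the transform is trivially invertible."""
    def __init__(self, channels, mel_channels, hidden=256):
        super().__init__()
        # Stand-in for WaveGlow's gated, WaveNet-like dilated conv stack
        self.transform_net = nn.Sequential(
            nn.Conv1d(channels // 2 + mel_channels, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),  # outputs log_s and t
        )

    def forward(self, x, mel):            # mel assumed upsampled to x's length
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.transform_net(torch.cat([x_a, mel], dim=1)).chunk(2, dim=1)
        return torch.cat([x_a, torch.exp(log_s) * x_b + t], dim=1), log_s
```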

 

Dependencies:
Implemented in PyTorch and trained and run on NVIDIA GPUs, with specific preprocessing parameters for mel-spectrogram generation (detailed under Preprocessing below).

 

Synthesis Process:
WaveGlow generates speech by sampling from a zero-mean spherical Gaussian distribution and transforming these samples through a series of invertible layers conditioned on the input mel-spectrograms.
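A hypothetical inference sketch is shown below. It assumes each flow layer exposes an `inverse(z, mel)` method; that method name and the `group_size` audio grouping are assumptions for illustration, not the reference API.

```python
import torch

@torch.no_grad()
def infer(waveglow_layers, mel, sigma=0.6, group_size=8):
    """Sample a latent from a zero-mean spherical Gaussian and run the
    flow in reverse, conditioned on the mel-spectrogram."""
    batch, _, frames = mel.size()
    # One Gaussian latent per audio sample, shaped to match the grouped audio
    z = sigma * torch.randn(batch, group_size, frames, device=mel.device)
    for layer in reversed(waveglow_layers):   # invert layers in reverse order
        z = layer.inverse(z, mel)
    return z.flatten(1)                       # (batch, samples) waveform
```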

 

Dataset:
Trained on the LJ Speech dataset, which consists of 13,100 short audio clips (approximately 24 hours of speech) recorded in a home environment on a MacBook Pro’s built-in microphone.

 

Preprocessing:
Audio is sampled at 22,050 Hz, and mel-spectrograms are computed with 80 mel bins, an FFT size of 1024, a window size of 1024, and a hop size of 256.
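A librosa-based approximation of this preprocessing is sketched below; the reference implementation uses its own PyTorch STFT code, so treat this as an illustration of the parameters rather than a bit-exact reproduction.

```python
import librosa
import numpy as np

# Mel-spectrogram settings matching the preprocessing described above
SAMPLE_RATE = 22050   # LJ Speech sampling rate
N_FFT = 1024
WIN_LENGTH = 1024
HOP_LENGTH = 256
N_MELS = 80

def mel_spectrogram(path):
    """Compute a log mel-spectrogram for one audio clip."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=N_FFT,
        win_length=WIN_LENGTH, hop_length=HOP_LENGTH, n_mels=N_MELS)
    # Clip before the log for dynamic-range compression and numerical safety
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```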

 

Evaluation Metrics:
Mean Opinion Score (MOS) for perceptual quality comparison, and synthesis speed measured in kHz to assess efficiency and real-time capability.
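Synthesis speed in kHz is simply audio samples generated per second, divided by 1,000. A small timing sketch, assuming a `model.infer(mel, sigma=...)` entry point (adapt the call to your implementation):

```python
import time
import torch

def synthesis_speed_khz(model, mel, sigma=0.6):
    """Measure synthesis speed in kHz of generated audio."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # ensure prior GPU work is done
    start = time.perf_counter()
    audio = model.infer(mel, sigma=sigma)
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # wait for generation to finish
    elapsed = time.perf_counter() - start
    # LJ Speech audio is 22.05 kHz, so readings above ~22 are faster than real time
    return audio.numel() / elapsed / 1000.0
```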

 

Performance:
Achieves synthesis speeds of over 500 kHz on an NVIDIA V100 GPU, far above the 22.05 kHz sampling rate needed for real-time output. Its mean opinion scores are close to those of real audio, higher than Griffin-Lim, and comparable to the best publicly available WaveNet implementation.

 

Contributions:
Introduced a novel flow-based generative model for speech synthesis that combines ideas from Glow and WaveNet, achieving high-quality, efficient, and fast audio generation with a simpler training and inference process.
Paper: WaveGlow: A Flow-based Generative Network for Speech Synthesis, arXiv:1811.00002, https://arxiv.org/abs/1811.00002


Last Accessed: 7/17/2024

NSF Award #2346473