Authors:
Ryan Prenger, Rafael Valle, Bryan Catanzaro
Description:
WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without auto-regression. It combines insights from Glow and WaveNet to achieve this goal.
Training and Data:
WaveGlow is trained with a single network and a single loss function: maximizing the likelihood of the training data. Because every layer is invertible, the exact likelihood is tractable, which keeps the training procedure simple.
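A minimal sketch of this objective, assuming the standard normalizing-flow decomposition: the Gaussian log-density of the latents plus the log-determinant terms contributed by each affine coupling layer (log s) and each invertible 1×1 convolution (log det W). The function name and argument layout here are illustrative, not the official repository's API.

    import torch

    def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
        # log p(x) = log N(z; 0, sigma^2 I) + sum(log|s|) + sum(log|det W|);
        # maximizing the likelihood means minimizing the negative of this sum.
        log_p_z = -torch.sum(z * z) / (2 * sigma ** 2)
        log_det = sum(torch.sum(ls) for ls in log_s_list) + sum(log_det_w_list)
        return -(log_p_z + log_det) / z.numel()  # normalized per element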
Advantages:
WaveGlow delivers fast, high-quality audio synthesis. Training and inference are simpler than for auto-regressive models, and synthesis runs at real-time or faster-than-real-time speeds on modern GPUs.
Limitations:
The model's performance and quality are evaluated only on a single-speaker dataset (LJ Speech); future work could explore its effectiveness across more diverse datasets and multi-speaker scenarios.
Model Architecture:
The model is a series of invertible transformations that alternate invertible 1×1 convolutions with affine coupling layers. Audio generation is conditioned on mel-spectrograms, and the normalizing-flow formulation keeps the likelihood tractable.
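The sketch below illustrates the affine coupling mechanism under simplifying assumptions: half of the channels pass through unchanged and parameterize an affine transform of the other half, so the layer can be inverted exactly. In the real model, the scale and bias come from a WaveNet-like network conditioned on the upsampled mel-spectrogram; a plain convolution stands in here for brevity.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Stand-in for WaveGlow's WN transform (dilated convolutions
            # conditioned on the mel-spectrogram in the actual model).
            self.net = nn.Conv1d(channels // 2, channels,
                                 kernel_size=3, padding=1)

        def forward(self, x):
            xa, xb = x.chunk(2, dim=1)               # split channels in half
            log_s, t = self.net(xa).chunk(2, dim=1)  # predict scale and bias
            yb = torch.exp(log_s) * xb + t
            return torch.cat([xa, yb], dim=1), log_s  # log_s feeds the loss

        def inverse(self, y):
            ya, yb = y.chunk(2, dim=1)
            log_s, t = self.net(ya).chunk(2, dim=1)
            xb = (yb - t) * torch.exp(-log_s)        # exact inverse of forward
            return torch.cat([ya, xb], dim=1)

Because the first half of the channels is unchanged, the same network output can be recomputed during inversion; that is what makes the inverse exact and cheap regardless of how complex the transform network is.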
Dependencies:
Implemented in PyTorch; training and inference run on NVIDIA GPUs, and mel-spectrogram generation uses specific preprocessing parameters (see Preprocessing below).
Synthesis Process:
WaveGlow generates speech by sampling from a zero-mean spherical Gaussian distribution and transforming these samples through a series of invertible layers conditioned on the input mel-spectrograms.
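A hedged sketch of that inference loop, assuming each layer exposes an inverse() method like the coupling layer above (the real layers are also conditioned on the upsampled mel-spectrogram, omitted here): draw z from a zero-mean Gaussian, optionally with a reduced standard deviation, and run the flow in reverse. The hop size of 256 matches the preprocessing parameters below.

    import torch

    @torch.no_grad()
    def synthesize(layers, mel, sigma=0.6, hop=256, groups=8):
        # mel: (batch, n_mels, frames); each frame covers `hop` audio samples.
        batch, _, frames = mel.shape
        samples = frames * hop
        # Sample the latent with the same grouped shape as the audio.
        z = sigma * torch.randn(batch, groups, samples // groups)
        x = z
        for layer in reversed(layers):  # apply the invertible layers backward
            x = layer.inverse(x)        # real layers would also take `mel`
        return x.reshape(batch, -1)     # ungroup into a waveform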
Dataset:
Trained on the LJ Speech dataset: 13,100 short audio clips (approximately 24 hours of speech) recorded in a home environment with a MacBook Pro's built-in microphone.
Preprocessing:
Audio is converted to mel-spectrograms for conditioning. The paper uses 80 mel bands with an FFT size of 1024, a window size of 1024, and a hop size of 256 on 22,050 Hz audio.
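A sketch of equivalent preprocessing using torchaudio with the parameters above; the official implementation builds its mel filters with librosa-style defaults, so treat this as an approximation rather than a byte-for-byte match. The filename is a placeholder for any LJ Speech clip.

    import torchaudio

    # The paper's parameters: 22,050 Hz audio, FFT/window size 1024,
    # hop size 256, 80 mel bands.
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050, n_fft=1024, win_length=1024,
        hop_length=256, n_mels=80,
    )

    waveform, sr = torchaudio.load("LJ001-0001.wav")  # placeholder clip
    mel = mel_fn(waveform)                            # shape: (1, 80, frames)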
Evaluation Metrics:
Mean Opinion Score (MOS) for perceptual quality, and synthesis speed measured in kHz (thousands of audio samples generated per second of wall-clock time) to assess efficiency and real-time capability.
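MOS requires human listeners, but the speed metric is mechanical. A hedged timing sketch (generate is a stand-in for any mel-to-waveform callable, and the CUDA synchronization assumes the GPU setting used in the benchmarks):

    import time
    import torch

    def synthesis_speed_khz(generate, mel):
        # Above 22.05 kHz is faster than real time for 22,050 Hz audio.
        torch.cuda.synchronize()
        start = time.perf_counter()
        audio = generate(mel)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return audio.numel() / elapsed / 1000.0  # kHz of audio produced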
Performance:
Achieves synthesis speeds above 500 kHz on an NVIDIA V100 GPU, far beyond the 22.05 kHz needed for real time. MOS scores approach those of real audio, exceed Griffin-Lim, and are comparable to the best WaveNet implementations.
Contributions:
Introduced a novel flow-based generative model for speech synthesis that combines ideas from Glow and WaveNet, achieving high-quality, efficient, and fast audio generation with a simpler training and inference process.