Authors:
Ryan Prenger, Rafael Valle, Bryan Catanzaro
Description:
WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without auto-regression. It combines insights from Glow and WaveNet to achieve this goal.
Training and Data:
WaveGlow is trained with a single network and a single loss function: maximizing the likelihood of the training data. Because every layer is invertible, the exact likelihood is tractable, which keeps the training procedure simple.
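A minimal sketch of this objective, assuming the standard normalizing-flow decomposition: the Gaussian log-density of the latents plus the log-determinant terms contributed by each affine coupling layer (log s) and each invertible 1×1 convolution (log det W). The function name and argument layout here are illustrative, not the official repository's API.

    import torch

    def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
        # log p(x) = log N(z; 0, sigma^2 I) + sum(log|s|) + sum(log|det W|);
        # maximizing the likelihood means minimizing the negative of this sum.
        log_p_z = -torch.sum(z * z) / (2 * sigma ** 2)
        log_det = sum(torch.sum(ls) for ls in log_s_list) + sum(log_det_w_list)
        return -(log_p_z + log_det) / z.numel()  # normalized per element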
Advantages:
WaveGlow delivers fast, high-quality audio synthesis. Training and inference are simpler than for auto-regressive models, and synthesis runs at real-time or faster-than-real-time speeds on modern GPUs.
Limitations:
The model's performance and quality are evaluated only on a single-speaker dataset (LJ Speech); future work could explore its effectiveness across more diverse datasets and multi-speaker scenarios.
Model Architecture:
The model is a series of invertible transformations that alternate invertible 1×1 convolutions with affine coupling layers. Audio generation is conditioned on mel-spectrograms, and the normalizing-flow formulation keeps the likelihood tractable.
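The sketch below illustrates the affine coupling mechanism under simplifying assumptions: half of the channels pass through unchanged and parameterize an affine transform of the other half, so the layer can be inverted exactly. In the real model, the scale and bias come from a WaveNet-like network conditioned on the upsampled mel-spectrogram; a plain convolution stands in here for brevity.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Stand-in for WaveGlow's WN transform (dilated convolutions
            # conditioned on the mel-spectrogram in the actual model).
            self.net = nn.Conv1d(channels // 2, channels,
                                 kernel_size=3, padding=1)

        def forward(self, x):
            xa, xb = x.chunk(2, dim=1)               # split channels in half
            log_s, t = self.net(xa).chunk(2, dim=1)  # predict scale and bias
            yb = torch.exp(log_s) * xb + t
            return torch.cat([xa, yb], dim=1), log_s  # log_s feeds the loss

        def inverse(self, y):
            ya, yb = y.chunk(2, dim=1)
            log_s, t = self.net(ya).chunk(2, dim=1)
            xb = (yb - t) * torch.exp(-log_s)        # exact inverse of forward
            return torch.cat([ya, xb], dim=1)

Because the first half of the channels is unchanged, the same network output can be recomputed during inversion; that is what makes the inverse exact and cheap regardless of how complex the transform network is.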
Dependencies:
Implemented in PyTorch; training and inference run on NVIDIA GPUs, and mel-spectrogram generation uses specific preprocessing parameters (see Preprocessing below).
Synthesis Process:
WaveGlow generates speech by sampling from a zero-mean spherical Gaussian distribution and transforming these samples through a series of invertible layers conditioned on the input mel-spectrograms.
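A hedged sketch of that inference loop, assuming each layer exposes an inverse() method like the coupling layer above (the real layers are also conditioned on the upsampled mel-spectrogram, omitted here): draw z from a zero-mean Gaussian, optionally with a reduced standard deviation, and run the flow in reverse. The hop size of 256 matches the preprocessing parameters below.

    import torch

    @torch.no_grad()
    def synthesize(layers, mel, sigma=0.6, hop=256, groups=8):
        # mel: (batch, n_mels, frames); each frame covers `hop` audio samples.
        batch, _, frames = mel.shape
        samples = frames * hop
        # Sample the latent with the same grouped shape as the audio.
        z = sigma * torch.randn(batch, groups, samples // groups)
        x = z
        for layer in reversed(layers):  # apply the invertible layers backward
            x = layer.inverse(x)        # real layers would also take `mel`
        return x.reshape(batch, -1)     # ungroup into a waveform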
Dataset:
Trained on the LJ Speech dataset: 13,100 short audio clips (approximately 24 hours of speech) recorded in a home environment with a MacBook Pro's built-in microphone.
Preprocessing:
Audio is converted to mel-spectrograms for conditioning. The paper uses 80 mel bands with an FFT size of 1024, a window size of 1024, and a hop size of 256 on 22,050 Hz audio.
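A sketch of equivalent preprocessing using torchaudio with the parameters above; the official implementation builds its mel filters with librosa-style defaults, so treat this as an approximation rather than a byte-for-byte match. The filename is a placeholder for any LJ Speech clip.

    import torchaudio

    # The paper's parameters: 22,050 Hz audio, FFT/window size 1024,
    # hop size 256, 80 mel bands.
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050, n_fft=1024, win_length=1024,
        hop_length=256, n_mels=80,
    )

    waveform, sr = torchaudio.load("LJ001-0001.wav")  # placeholder clip
    mel = mel_fn(waveform)                            # shape: (1, 80, frames)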
Evaluation Metrics:
Mean Opinion Score (MOS) for perceptual quality, and synthesis speed measured in kHz (thousands of audio samples generated per second of wall-clock time) to assess efficiency and real-time capability.
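MOS requires human listeners, but the speed metric is mechanical. A hedged timing sketch (generate is a stand-in for any mel-to-waveform callable, and the CUDA synchronization assumes the GPU setting used in the benchmarks):

    import time
    import torch

    def synthesis_speed_khz(generate, mel):
        # Above 22.05 kHz is faster than real time for 22,050 Hz audio.
        torch.cuda.synchronize()
        start = time.perf_counter()
        audio = generate(mel)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        return audio.numel() / elapsed / 1000.0  # kHz of audio produced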
Performance:
Achieves synthesis speeds above 500 kHz on an NVIDIA V100 GPU, far beyond the 22.05 kHz needed for real time. MOS scores approach those of real audio, exceed Griffin-Lim, and are comparable to the best WaveNet implementations.
Contributions:
Introduced a novel flow-based generative model for speech synthesis that combines ideas from Glow and WaveNet, achieving high-quality, efficient, and fast audio generation with a simpler training and inference process.