Authors:
Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, Aaron Courville (Mila)
Description:
MelGAN is a non-autoregressive feed-forward convolutional architecture designed for audio waveform generation in a GAN setup, aimed at generating high-quality coherent waveforms for tasks such as speech synthesis and music domain translation.
Training and Data:
Trained using a hinge-loss version of the GAN objective combined with a feature-matching objective computed on discriminator feature maps. The generator is a fully convolutional feed-forward network that takes mel-spectrograms as input and produces raw waveforms as output. The model is trained with the Adam optimizer at a learning rate of 1e-4.
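A minimal PyTorch sketch of this training objective, assuming lists of per-scale discriminator scores and per-layer feature maps; the tensor names and the feature-matching weight are illustrative (the paper reports lambda = 10):

    import torch
    import torch.nn.functional as F

    LAMBDA_FM = 10.0  # feature-matching weight reported in the paper

    def discriminator_loss(real_scores, fake_scores):
        # Hinge loss for the discriminator, summed over the multi-scale outputs.
        loss = 0.0
        for real, fake in zip(real_scores, fake_scores):
            loss = loss + F.relu(1.0 - real).mean() + F.relu(1.0 + fake).mean()
        return loss

    def generator_loss(fake_scores, real_feats, fake_feats):
        # Hinge-style adversarial term plus L1 feature matching between
        # real and generated activations at every discriminator layer.
        adv = sum(-fake.mean() for fake in fake_scores)
        fm = 0.0
        for rf, ff in zip(real_feats, fake_feats):   # per discriminator scale
            for r, f in zip(rf, ff):                 # per layer within a scale
                fm = fm + F.l1_loss(f, r.detach())
        return adv + LAMBDA_FM * fm

Both networks would then be optimized with torch.optim.Adam(..., lr=1e-4) as stated above.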
Advantages:
Suitable for high-quality text-to-speech synthesis, music domain translation, and unconditional music synthesis. Fast and parallelizable, with significantly fewer parameters than competing models; generalizes to unseen speakers for mel-spectrogram inversion; and runs more than 100x faster than real time on a GTX 1080 Ti GPU and more than 2x faster than real time on CPU.
Limitations:
Slight quality degradation compared to autoregressive models; requires time-aligned conditioning information; and needs paired ground-truth data for the feature-matching objective.
Model Architecture:
The generator consists of stacks of transposed convolutional layers for upsampling, each followed by residual blocks with dilated convolutions to enlarge the receptive field. The discriminator is a multi-scale architecture with three structurally identical networks operating on the raw audio and on 2x and 4x downsampled versions, trained with window-based (patch-level) objectives; the feature-matching loss is computed on their intermediate feature maps.
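A hedged sketch of one generator stage, assuming the paper's pattern of a transposed convolution followed by a residual stack with dilations 1, 3, 9 and weight normalization; exact kernel sizes are assumptions:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Dilated convolution with a residual connection, as used in the
        # generator's residual stacks.
        def __init__(self, channels, dilation):
            super().__init__()
            self.block = nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, kernel_size=3,
                                               dilation=dilation, padding=dilation)),
                nn.LeakyReLU(0.2),
                nn.utils.weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
            )

        def forward(self, x):
            return x + self.block(x)

    def upsample_stage(in_ch, out_ch, stride):
        # One generator stage: a transposed convolution (kernel = 2 * stride,
        # giving exact stride-x upsampling) followed by dilated residual blocks.
        return nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.utils.weight_norm(nn.ConvTranspose1d(in_ch, out_ch,
                                                    kernel_size=2 * stride,
                                                    stride=stride,
                                                    padding=stride // 2)),
            ResidualBlock(out_ch, dilation=1),
            ResidualBlock(out_ch, dilation=3),
            ResidualBlock(out_ch, dilation=9),
        )

Chaining stages whose strides multiply to 256 (e.g. 8, 8, 2, 2) matches the 256x upsampling from spectrogram frames to waveform samples described under Preprocessing.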
Synthesis:
The system generates raw audio waveforms from mel-spectrograms using a transposed convolutional generator and a multi-scale discriminator that operates on different audio scales.
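As a usage illustration, synthesis is a single feed-forward pass; the Generator class and checkpoint path below are hypothetical stand-ins for an implementation of the architecture above:

    import torch

    generator = Generator()  # hypothetical class built from the stages above
    generator.load_state_dict(torch.load("melgan_checkpoint.pt"))  # hypothetical path
    generator.eval()

    # mel: (batch, n_mels, frames); with 256x upsampling the output
    # waveform has frames * 256 samples.
    with torch.no_grad():
        mel = torch.randn(1, 80, 200)   # e.g. 80 mel bands, 200 frames
        audio = generator(mel)          # -> (1, 1, 51200) raw waveform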
Dataset:
Trained on datasets like LJ Speech and an internal 6-speaker dataset containing roughly 10 hours of speech per speaker.
Preprocessing:
Uses mel-spectrograms as the intermediate representation, at a temporal resolution 256x lower than the raw audio waveform (one spectrogram frame per 256 samples).
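A sketch of such a front end with torchaudio, assuming common settings for LJ Speech-style data (22050 Hz audio, 1024-point FFT, 80 mel bands); hop_length=256 is what yields the 256x lower resolution, while the other STFT parameters are assumptions:

    import torch
    import torchaudio

    # hop_length = 256 -> one spectrogram frame per 256 waveform samples.
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050,
        n_fft=1024,
        win_length=1024,
        hop_length=256,
        n_mels=80,
    )

    waveform, sr = torchaudio.load("speech.wav")  # hypothetical input file
    mel = torch.log(mel_transform(waveform).clamp(min=1e-5))  # log-mel features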
Performance:
The lightweight, non-autoregressive GAN architecture produces audio quality competitive with autoregressive vocoders while offering significantly faster inference, and proves effective across a variety of audio synthesis tasks.
Contributions:
Demonstrated a novel non-autoregressive feed-forward GAN architecture for high-quality raw waveform generation, showing that GANs can be trained reliably for coherent waveform synthesis and that the resulting model combines fast inference with broad applicability to audio synthesis tasks.