MelGAN

Authors:
Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville

 

Description:
MelGAN is a non-autoregressive, feed-forward convolutional architecture for audio waveform generation in a GAN setup, producing high-quality, coherent waveforms for tasks such as speech synthesis and music domain translation.

 

Training and Data:
Trained using a hinge-loss version of the GAN objective combined with a feature-matching objective computed on discriminator features. The generator is a fully convolutional feed-forward network that takes mel-spectrograms as input and produces raw waveforms as output. The model is trained with the Adam optimizer at a learning rate of 1e-4.
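
For concreteness, below is a minimal PyTorch sketch of these two objectives; the function arguments are hypothetical placeholders for the per-scale discriminator scores and intermediate feature maps, and the feature-matching weight of 10 follows the paper.

import torch
import torch.nn.functional as F

LAMBDA_FM = 10.0  # feature-matching weight reported in the paper

def discriminator_loss(real_scores, fake_scores):
    # Hinge loss summed over the discriminator scales.
    loss = 0.0
    for real, fake in zip(real_scores, fake_scores):
        loss = loss + torch.mean(F.relu(1.0 - real)) + torch.mean(F.relu(1.0 + fake))
    return loss

def generator_loss(fake_scores, real_features, fake_features):
    # Hinge generator loss plus L1 feature matching on discriminator features.
    adv = sum(-torch.mean(fake) for fake in fake_scores)
    fm = 0.0
    for real_maps, fake_maps in zip(real_features, fake_features):
        for r, f in zip(real_maps, fake_maps):
            fm = fm + F.l1_loss(f, r.detach())
    return adv + LAMBDA_FM * fm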

 

Advantages:
Suitable for high-quality text-to-speech synthesis, music domain translation, and unconditional music synthesis. Fast and parallelizable, with significantly fewer parameters than competing models; generalizes to unseen speakers for mel-spectrogram inversion; and runs more than 100x faster than real time on a GTX 1080 Ti GPU and more than 2x faster than real time on CPU.

 

Limitations:
Slight quality degradation compared to autoregressive models, a requirement for time-aligned conditioning information, and a need for paired ground-truth data for the feature-matching objective.

 

Model Architecture:
The generator is a stack of transposed convolutional layers for upsampling, each followed by residual blocks with dilated convolutions. The discriminator is a multi-scale architecture: three structurally identical networks operate on the raw audio and on downsampled versions of it, each trained with a window-based objective, and their intermediate feature maps supply the feature-matching signal.
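
A minimal PyTorch sketch of this generator layout is given below, assuming an 8x-8x-2x-2x upsampling schedule that matches the 256x hop size; the channel widths are illustrative, and the weight normalization applied to all layers in the paper is omitted for brevity.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Dilated residual block; each upsampling stage stacks three of these
    # with dilations 1, 3, and 9 to enlarge the receptive field.
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class MelGANGenerator(nn.Module):
    def __init__(self, mel_channels=80, base_channels=512):
        super().__init__()
        layers = [nn.Conv1d(mel_channels, base_channels, kernel_size=7, padding=3)]
        channels = base_channels
        for rate in (8, 8, 2, 2):  # 8 * 8 * 2 * 2 = 256x total upsampling
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=rate * 2, stride=rate,
                                   padding=rate // 2),
            ]
            channels //= 2
            layers += [ResidualBlock(channels, d) for d in (1, 3, 9)]
        layers += [
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, mel_channels, frames) -> (batch, 1, frames * 256)
        return self.net(mel)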

 

Synthesis:
The system generates raw audio waveforms from mel-spectrograms in a single parallel forward pass through the transposed convolutional generator; the multi-scale discriminator operating on different audio scales is used only during training.
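
As a hypothetical usage example, reusing the MelGANGenerator sketch from the Model Architecture section:

import torch

generator = MelGANGenerator()  # sketch defined above, randomly initialized
generator.eval()

mel = torch.randn(1, 80, 200)  # stand-in mel-spectrogram: (batch, mel bands, frames)
with torch.no_grad():
    audio = generator(mel)     # (1, 1, 200 * 256): 256 waveform samples per frame
print(audio.shape)             # torch.Size([1, 1, 51200])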

 

Dataset:
Trained on datasets such as LJ Speech and an internal 6-speaker dataset containing roughly 10 hours of speech per speaker.

 

Preprocessing:
Uses mel-spectrograms as the intermediate representation, at a temporal resolution 256x lower than that of the raw audio waveform.
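
A minimal preprocessing sketch using librosa is shown below. The 256-sample hop produces the 256x temporal compression described above; the sampling rate, FFT size, and 80 mel bands are common assumptions rather than values quoted in this entry.

import librosa
import numpy as np

def wav_to_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression; each mel frame covers hop_length = 256 waveform samples.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))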

 

Performance:
Delivers high-quality audio waveform generation at significantly faster inference times than autoregressive vocoders, with perceptual quality approaching theirs, and demonstrates effectiveness and generality across mel-spectrogram inversion, text-to-speech, and music translation tasks.

 

Contributions:
Introduced a novel, lightweight, non-autoregressive GAN architecture for raw audio waveform generation, showing that GANs can be trained reliably for high-quality waveform synthesis without distillation or additional perceptual losses, and demonstrating its effectiveness across a variety of audio synthesis tasks.

Link to paper: https://arxiv.org/abs/1910.06711


Audio Samples


Last Accessed: 7/10/2024

NSF Award #2346473