Parallel WaveGAN

Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim


The paper proposes Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method based on a generative adversarial network (GAN). It jointly optimizes multi-resolution spectrogram and adversarial loss functions, which lets it capture the time-frequency distribution of realistic speech waveforms. The model is compact, with only 1.44M parameters, and generates 24 kHz speech 28.68 times faster than real time on a single GPU.


Training and Data:
The model is trained using a combination of multi-resolution short-time Fourier transform (STFT) loss and adversarial loss, enabling it to capture time-frequency distributions effectively without the need for a teacher-student framework.
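
To make the multi-resolution STFT loss concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each resolution contributes a spectral-convergence term plus a log-magnitude term and that the results are averaged over resolutions. The function names, the Hann window, the `eps` constant, and the tiny toy resolutions are illustrative assumptions; a practical implementation would use an FFT library and much larger FFT sizes.

```python
import cmath
import math


def stft_magnitude(signal, fft_size, hop, win_length):
    """Magnitude spectrogram via a naive DFT with a Hann window (illustrative, slow)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_length) for n in range(win_length)]
    frames = []
    for start in range(0, len(signal) - win_length + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win_length)]
        frame += [0.0] * (fft_size - win_length)  # zero-pad up to fft_size
        bins = []
        for k in range(fft_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / fft_size)
                      for n in range(fft_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def single_stft_loss(x, x_hat, fft_size, hop, win_length, eps=1e-7):
    """Spectral convergence plus log-magnitude distance at one STFT resolution."""
    ref = [v for frame in stft_magnitude(x, fft_size, hop, win_length) for v in frame]
    est = [v for frame in stft_magnitude(x_hat, fft_size, hop, win_length) for v in frame]
    # Spectral convergence: norm of the magnitude difference relative to the reference norm.
    diff_norm = math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)))
    ref_norm = math.sqrt(sum(r ** 2 for r in ref))
    spectral_convergence = diff_norm / (ref_norm + eps)
    # Log-magnitude term: mean absolute difference of log magnitudes.
    log_magnitude = sum(abs(math.log(r + eps) - math.log(e + eps))
                        for r, e in zip(ref, est)) / len(ref)
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(x, x_hat, resolutions):
    """Average the single-resolution loss over (fft_size, hop, win_length) tuples."""
    losses = [single_stft_loss(x, x_hat, *res) for res in resolutions]
    return sum(losses) / len(losses)


# Toy resolutions chosen only to keep the naive DFT fast; they are not the paper's settings.
TOY_RESOLUTIONS = [(32, 8, 16), (64, 16, 32), (16, 4, 8)]
```

Averaging over several resolutions keeps the generator from overfitting to one fixed trade-off between time and frequency resolution, which is the motivation for using more than one STFT configuration.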


Strengths include a distillation-free training process, fast training and inference, a small model size, and high perceptual quality of the generated speech.


A potential limitation is that the quality of the synthesized speech may vary with the complexity of the input features and with how well the multi-resolution STFT loss captures speech characteristics.


Model Architecture:
The model consists of a non-autoregressive WaveNet-based generator with 30 layers of dilated residual convolution blocks and a discriminator with ten layers of non-causal dilated 1-D convolutions.
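
As a rough illustration of what 30 dilated layers imply, the sketch below computes a dilation schedule and the resulting receptive field. The summary does not state how the dilations are arranged, so the three cycles of exponentially increasing dilation and the kernel size of 3 are assumptions, and the function names are hypothetical.

```python
def dilation_schedule(num_layers=30, num_cycles=3):
    """Dilation per layer: doubles within a cycle, then resets to 1 (assumed layout)."""
    layers_per_cycle = num_layers // num_cycles
    return [2 ** (i % layers_per_cycle) for i in range(num_layers)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field in samples of a stack of stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilations = dilation_schedule()
print(dilations[:10])              # first cycle: 1, 2, 4, ..., 512
print(receptive_field(dilations))  # 6139 samples under the assumptions above
```

Under these assumptions the generator sees roughly a quarter of a second of context at 24 kHz (6139 / 24000 ≈ 0.26 s), and because the convolutions are non-causal that context spans both sides of each output sample.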


Training combines a multi-resolution STFT loss with an adversarial loss, applies weight normalization to the convolutional layers, and uses the RAdam optimizer to stabilize training.
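
One way to write the combined objective is sketched below. The summary only names the two loss terms, so the weighted-sum form, the least-squares adversarial terms, and the symbols $M$ (number of STFT resolutions), $L_{\mathrm{s}}^{(m)}$ (STFT loss at resolution $m$), and $\lambda_{\mathrm{adv}}$ (adversarial weight) are assumptions introduced here for illustration.

```latex
\begin{aligned}
L_G(G, D) &= \frac{1}{M} \sum_{m=1}^{M} L_{\mathrm{s}}^{(m)}(G) + \lambda_{\mathrm{adv}}\, L_{\mathrm{adv}}(G, D), \\
L_{\mathrm{adv}}(G, D) &= \mathbb{E}_{z}\!\left[ \bigl(1 - D(G(z))\bigr)^{2} \right], \\
L_D(G, D) &= \mathbb{E}_{x}\!\left[ \bigl(1 - D(x)\bigr)^{2} \right] + \mathbb{E}_{z}\!\left[ D(G(z))^{2} \right].
\end{aligned}
```

Here $G$ maps input noise $z$ (conditioned on the auxiliary features) to a waveform, $D$ scores waveforms as real or generated, and $x$ is a recorded waveform. The generator minimizes $L_G$ while the discriminator minimizes $L_D$.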


Parallel WaveGAN generates high-quality speech by learning the distribution of realistic waveforms through a non-autoregressive WaveNet-based generator and a discriminator network, producing speech faster than real-time.


The dataset is a phonetically and prosodically balanced speech corpus recorded by a professional female Japanese speaker, sampled at 24 kHz, with 11,449 utterances for training, 250 for validation, and 250 for evaluation.


80-band log-mel spectrograms with a band-limited frequency range (70 to 8000 Hz) were extracted, normalized to have zero mean and unit variance, and used as input auxiliary features for waveform generation.
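
The normalization step can be sketched as follows. This assumes the mean and standard deviation are computed per mel band over all frames, which the summary does not specify; the function names and the `eps` guard are hypothetical, and the toy input uses 3 bands instead of 80 for brevity.

```python
import statistics


def fit_normalizer(frames):
    """Per-band mean and standard deviation over all frames (frames: equal-length lists)."""
    bands = list(zip(*frames))
    means = [statistics.fmean(band) for band in bands]
    stds = [statistics.pstdev(band) for band in bands]
    return means, stds


def normalize(frames, means, stds, eps=1e-8):
    """Shift and scale every band to zero mean and unit variance."""
    return [[(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)]
            for frame in frames]


# Toy log-mel frames: 4 frames x 3 bands (made-up values for illustration only).
toy_frames = [
    [-4.0, -2.5, -1.0],
    [-3.0, -2.0, -0.5],
    [-5.0, -3.5, -1.5],
    [-4.0, -3.0, -1.0],
]
means, stds = fit_normalizer(toy_frames)
normalized = normalize(toy_frames, means, stds)
```

The statistics would typically be fitted once on the training set and then reused unchanged for validation and evaluation data.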


Evaluation Metrics:
The evaluation uses Mean Opinion Score (MOS) tests for perceptual quality, inference speed (measured as a multiple of real time), and the training time required to reach the best-performing models.
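
For readers unfamiliar with MOS, the sketch below shows how a score and an approximate 95% confidence interval could be computed from listener ratings on a 1-to-5 scale. The ratings are made up for illustration and are not results from the paper; the normal-approximation interval and the function name are assumptions.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, not data from the paper.
example_ratings = [5, 4, 4, 3, 5, 4, 4, 5, 3, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```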


Parallel WaveGAN achieved a MOS of 4.16 within a Transformer-based TTS framework, generated 24 kHz speech waveforms 28.68 times faster than real time, and required only 2.8 days of training.
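
The speed figure can be turned into more tangible numbers with a little arithmetic. The sketch below uses only the reported sample rate and speed-up; the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # reported sample rate
SPEEDUP = 28.68          # reported multiple of real time


def generation_time_seconds(audio_seconds, speedup=SPEEDUP):
    """Wall-clock time needed to synthesize the given duration of audio."""
    return audio_seconds / speedup


def samples_per_second(sample_rate_hz=SAMPLE_RATE_HZ, speedup=SPEEDUP):
    """Waveform samples generated per second of wall-clock time."""
    return sample_rate_hz * speedup


print(f"{generation_time_seconds(10.0):.3f} s to generate 10 s of audio")
print(f"{samples_per_second():,.0f} samples generated per second")
```

At the reported rate, ten seconds of speech take roughly 0.35 seconds to generate, which corresponds to about 688,000 samples per second.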


The paper introduced a joint training method that combines multi-resolution STFT loss with adversarial loss for effective waveform generation, demonstrated significant improvements in training and inference speed, and achieved competitive perceptual quality.
Link to paper

Last Accessed: 7/18/2024

NSF Award #2346473