Parallel WaveGAN

Authors:
Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

 

Description:
Proposes Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method based on a generative adversarial network (GAN). It jointly optimizes multi-resolution spectrogram and adversarial loss functions, capturing the time-frequency distribution of realistic speech waveforms. The model is compact, with only 1.44M parameters, and generates 24 kHz speech 28.68 times faster than real-time on a single GPU.

 

Training and Data:
The model is trained with a combination of multi-resolution short-time Fourier transform (STFT) loss and adversarial loss, enabling it to capture the time-frequency distribution of speech effectively without a teacher-student (distillation) framework.
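A minimal PyTorch sketch of such a multi-resolution STFT loss is shown below, combining a spectral-convergence term with a log STFT-magnitude term at several analysis resolutions; the (FFT size, hop size, window size) triples are illustrative placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, fft_size, hop_size, win_size):
    """|STFT| of a batch of waveforms: (B, T) -> (B, frames, bins)."""
    window = torch.hann_window(win_size, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_size, window,
                      return_complex=True)
    return spec.abs().transpose(1, 2).clamp(min=1e-7)

def single_stft_loss(y_hat, y, fft_size, hop_size, win_size):
    mag_hat = stft_magnitude(y_hat, fft_size, hop_size, win_size)
    mag = stft_magnitude(y, fft_size, hop_size, win_size)
    # Spectral convergence: relative Frobenius-norm error of the magnitudes.
    sc = torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
    # Log STFT-magnitude loss: L1 distance between log magnitudes.
    mag_loss = F.l1_loss(torch.log(mag_hat), torch.log(mag))
    return sc + mag_loss

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Average the single-resolution losses over several analysis setups."""
    losses = [single_stft_loss(y_hat, y, *r) for r in resolutions]
    return sum(losses) / len(losses)
```

Averaging over several resolutions keeps the generator from overfitting to any single STFT configuration and helps it learn the time-frequency characteristics of natural speech.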

 

Advantages:
Distillation-free training process, fast training and inference times, small model size, and high perceptual quality of generated speech.

 

Limitations:
The quality of synthesized speech may vary depending on the complexity of the input features and the effectiveness of the multi-resolution STFT loss in capturing speech characteristics.

 

Model Architecture:
The model consists of a non-autoregressive WaveNet-based generator with 30 layers of dilated residual convolution blocks and a discriminator with 10 layers of non-causal dilated 1-D convolutions with leaky ReLU activations; a sketch of the discriminator topology follows.
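As a rough illustration of the discriminator described above, here is a hedged PyTorch sketch; the channel count, kernel size, and dilation schedule are assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ParallelWaveGANDiscriminator(nn.Module):
    """Stack of non-causal dilated 1-D convolutions with leaky ReLU."""
    def __init__(self, layers=10, channels=64, kernel_size=3):
        super().__init__()
        convs, in_ch = [], 1
        for i in range(layers - 1):
            dilation = max(1, i)  # growing receptive field (schedule assumed)
            convs += [nn.Conv1d(in_ch, channels, kernel_size,
                                padding=(kernel_size - 1) // 2 * dilation,
                                dilation=dilation),
                      nn.LeakyReLU(0.2)]
            in_ch = channels
        # Final layer maps to one "real vs. generated" score per time step.
        convs.append(nn.Conv1d(in_ch, 1, kernel_size,
                               padding=(kernel_size - 1) // 2))
        self.net = nn.Sequential(*convs)

    def forward(self, x):   # x: (B, 1, T) raw waveform
        return self.net(x)  # (B, 1, T) per-sample scores
```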

 

Dependencies:
Utilizes a combination of multi-resolution STFT loss and adversarial loss, weight normalization for all convolutional layers, and the RAdam optimizer to stabilize training.
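In PyTorch, these two dependencies might be wired up as in the sketch below; the model is a stand-in and the learning rate is a placeholder, not the paper's exact value.

```python
import torch
from torch import nn
from torch.nn.utils import weight_norm

# Stand-in model; in practice this would be the generator/discriminator.
model = nn.Sequential(nn.Conv1d(1, 64, 3, padding=1),
                      nn.LeakyReLU(0.2),
                      nn.Conv1d(64, 1, 3, padding=1))

# Apply weight normalization to every convolutional layer.
for m in model.modules():
    if isinstance(m, nn.Conv1d):
        weight_norm(m)

# RAdam (rectified Adam) helps stabilize early training; the learning
# rate here is a placeholder, not the paper's exact value.
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)
```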

 

Synthesis:
Parallel WaveGAN learns the distribution of realistic waveforms by training a non-autoregressive WaveNet-based generator against a discriminator network. At synthesis time, only the generator is needed: conditioned on auxiliary features, it transforms Gaussian noise into a waveform in a single parallel pass, producing speech faster than real-time.
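A minimal sketch of this non-autoregressive inference step is below; the generator call signature and the hop size (300 samples per frame at 24 kHz) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def synthesize(generator, mel, hop_size=300):
    """mel: (B, n_mels, frames) -> waveform of length frames * hop_size."""
    batch, _, frames = mel.shape
    # One Gaussian noise sample per output audio sample.
    z = torch.randn(batch, 1, frames * hop_size)
    # All output samples are produced in a single parallel forward pass,
    # unlike autoregressive WaveNet, which generates one sample at a time.
    return generator(z, mel)
```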

 

Dataset:
A phonetically and prosodically balanced speech corpus recorded by a professional female Japanese speaker, sampled at 24 kHz, with 11,449 utterances for training, 250 for validation, and 250 for evaluation.

 

Preprocessing:
80-band log-mel spectrograms with a band-limited frequency range (70 to 8000 Hz) were extracted, normalized to have zero mean and unit variance, and used as input auxiliary features for waveform generation.
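A sketch of this feature-extraction step using librosa follows; the FFT size and hop length are illustrative assumptions, while the 80 mel bands and the 70 to 8000 Hz band limit come from the description above.

```python
import librosa
import numpy as np

def extract_log_mel(wav_path, sr=24000, n_fft=2048, hop_length=300):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=80, fmin=70, fmax=8000)       # band-limited to 70-8000 Hz
    return np.log(np.maximum(mel, 1e-10))    # (80, frames)

# Normalization uses mean/std computed over the training set so that
# each mel band has zero mean and unit variance.
def normalize(feats, mean, std):
    return (feats - mean[:, None]) / std[:, None]
```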

 

Evaluation Metrics:
Mean opinion score (MOS) tests for perceptual quality, inference speed (expressed as a multiple of real-time), and the training time required to reach the optimal models.
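The "times faster than real-time" figure is typically measured as the ratio of generated audio duration to wall-clock synthesis time; a hedged sketch is below, where the generator call signature and hop size are assumptions.

```python
import time
import torch

@torch.no_grad()
def speedup_over_real_time(generator, mel, hop_size=300, sr=24000):
    z = torch.randn(1, 1, mel.shape[-1] * hop_size)
    start = time.perf_counter()
    wav = generator(z, mel)
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # wait for GPU kernels to finish
    elapsed = time.perf_counter() - start
    # Ratio of audio duration to wall-clock time;
    # e.g. 28.68 means 28.68x faster than real-time.
    return (wav.shape[-1] / sr) / elapsed
```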

 

Performance:
Parallel WaveGAN achieved a MOS of 4.16 within a Transformer-based TTS framework, generated 24 kHz speech waveforms 28.68 times faster than real-time, and required only 2.8 days of training.

 

Contributions:
Introduced a joint training method combining multi-resolution STFT loss and adversarial loss for effective waveform generation, demonstrated significant improvements in training and inference speed, and achieved competitive perceptual quality.
Link to paper


Last Accessed: 7/18/2024

NSF Award #2346473