Authors:
Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
Description:
Proposes Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a GAN. It optimizes multi-resolution spectrogram and adversarial loss functions, capturing the time-frequency distribution of realistic speech waveforms. The model is compact with only 1.44M parameters and generates 24 kHz speech 28.68 times faster than real-time on a single GPU.
Training and Data:
The model is trained using a combination of multi-resolution short-time Fourier transform (STFT) loss and adversarial loss, enabling it to capture time-frequency distributions effectively without the need for a teacher-student framework.
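A minimal PyTorch sketch of the multi-resolution STFT loss, assuming the standard spectral-convergence plus log-STFT-magnitude formulation; the three (FFT size, hop size, window length) settings follow common implementations and should be treated as illustrative:

```python
import torch
import torch.nn.functional as F

def stft_magnitude(x, fft_size, hop_size, win_length):
    """Magnitude spectrogram of a batch of waveforms x: (B, T)."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_length,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average of spectral-convergence + log-STFT-magnitude losses
    over several (fft_size, hop_size, win_length) resolutions."""
    loss = 0.0
    for fft_size, hop_size, win_length in resolutions:
        mag = stft_magnitude(y, fft_size, hop_size, win_length)
        mag_hat = stft_magnitude(y_hat, fft_size, hop_size, win_length)
        # spectral convergence: relative Frobenius-norm error
        sc_loss = torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
        # L1 distance between log magnitudes
        mag_loss = F.l1_loss(torch.log(mag_hat), torch.log(mag))
        loss = loss + sc_loss + mag_loss
    return loss / len(resolutions)
```

In training, this spectral loss is added to the adversarial loss on the generator side, which is what removes the need for a teacher-student distillation setup.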
Advantages:
Distillation-free training process, fast training and inference times, small model size, and high perceptual quality of generated speech.
Limitations:
The quality of synthesized speech may vary depending on the complexity of the input features and the effectiveness of the multi-resolution STFT loss in capturing speech characteristics.
Model Architecture:
The model consists of a non-autoregressive WaveNet-based generator with 30 layers of dilated residual convolution blocks and a discriminator with 10 layers of non-causal dilated 1-D convolutions.
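A hedged PyTorch sketch of the discriminator side of this architecture; the channel width (64) and kernel size (3) are assumptions, while the 10 non-causal dilated convolutions with LeakyReLU follow the description above:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the 10-layer dilated 1-D convolution discriminator.
    Channel width and kernel size are assumptions, not confirmed
    hyperparameters from the paper."""

    def __init__(self, layers=10, channels=64, kernel_size=3):
        super().__init__()
        blocks, in_channels = [], 1
        for i in range(layers - 1):
            dilation = max(1, i)  # linearly increasing: 1, 1, 2, 3, ...
            padding = (kernel_size - 1) // 2 * dilation
            blocks += [
                nn.utils.weight_norm(
                    nn.Conv1d(in_channels, channels, kernel_size,
                              padding=padding, dilation=dilation)),
                nn.LeakyReLU(0.2),
            ]
            in_channels = channels
        # final layer projects to a per-timestep real/fake score
        blocks.append(nn.utils.weight_norm(
            nn.Conv1d(in_channels, 1, kernel_size,
                      padding=(kernel_size - 1) // 2)))
        self.layers = nn.Sequential(*blocks)

    def forward(self, x):
        # x: (B, 1, T) waveform -> (B, 1, T) scores
        return self.layers(x)
```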
Dependencies:
Utilizes a combination of multi-resolution STFT loss and adversarial loss, weight normalization for convolutional layers, and the RAdam optimizer for stabilizing training.
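A brief sketch of the latter two dependencies in PyTorch, using `nn.utils.weight_norm` and `torch.optim.RAdam`; the learning rate and epsilon below are illustrative assumptions, not necessarily the paper's settings:

```python
import torch
import torch.nn as nn

# Weight normalization reparameterizes each weight as g * v / ||v||,
# which the paper applies to the convolutional layers.
conv = nn.utils.weight_norm(nn.Conv1d(80, 64, kernel_size=3, padding=1))

# RAdam optimizer for training stability; lr and eps are assumptions.
optimizer = torch.optim.RAdam(conv.parameters(), lr=1e-4, eps=1e-6)
```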
Synthesis:
Parallel WaveGAN generates high-quality speech by learning the distribution of realistic waveforms through a non-autoregressive WaveNet-based generator and a discriminator network, producing speech faster than real-time.
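A hypothetical inference sketch illustrating the non-autoregressive pass: Gaussian noise of the target waveform length, conditioned on mel-spectrogram features, is mapped to speech in a single forward call. The `generator` signature and `hop_size` stand in for a trained model and an assumed frame shift:

```python
import torch

@torch.no_grad()
def synthesize(generator, mel, hop_size=300):
    """One-shot non-autoregressive synthesis. `generator` is a trained
    Parallel WaveGAN model (hypothetical signature: noise + mel -> wav)
    and `hop_size` is an assumed frame shift for 24 kHz audio."""
    batch, _, frames = mel.shape                       # mel: (B, 80, frames)
    noise = torch.randn(batch, 1, frames * hop_size)   # Gaussian prior
    return generator(noise, mel)                       # (B, 1, T) waveform
```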
Dataset:
A phonetically and prosodically balanced speech corpus recorded by a professional female Japanese speaker, sampled at 24 kHz, with 11,449 utterances for training, 250 for validation, and 250 for evaluation.
Preprocessing:
80-band log-mel spectrograms with a band-limited frequency range (70 to 8000 Hz) were extracted, normalized to have zero mean and unit variance, and used as input auxiliary features for waveform generation.
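A sketch of this feature pipeline using librosa, keeping the 80 bands and 70 to 8000 Hz band limits from above; the FFT size and hop length are assumptions, and in practice the normalization statistics would be computed over the full training set:

```python
import librosa
import numpy as np

def extract_logmel(path, sr=24000, n_fft=2048, hop_length=300):
    """80-band log-mel features limited to 70-8000 Hz, z-score normalized.
    n_fft and hop_length are illustrative assumptions."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=80, fmin=70, fmax=8000)
    logmel = np.log(np.maximum(mel, 1e-10))
    # per-utterance normalization as a stand-in for corpus-level stats
    mean = logmel.mean(axis=1, keepdims=True)
    std = logmel.std(axis=1, keepdims=True) + 1e-8
    return (logmel - mean) / std
```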
Evaluation Metrics:
Mean Opinion Score (MOS) tests for perceptual quality, inference speed measured as a real-time factor (times faster than real-time), and the training time required to reach the optimal model.
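The speed metric can be computed as audio duration divided by wall-clock synthesis time; a hedged sketch reusing the hypothetical `synthesize` function from the Synthesis section above:

```python
import time
import torch

def realtime_factor(generator, mel, sr=24000, hop_size=300):
    """Audio seconds generated per wall-clock second; values above 1
    mean faster than real-time. For GPU timing, torch.cuda.synchronize()
    should bracket the synthesis call for an accurate measurement."""
    start = time.perf_counter()
    wav = synthesize(generator, mel, hop_size)  # sketch from Synthesis above
    elapsed = time.perf_counter() - start
    return (wav.shape[-1] / sr) / elapsed
```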
Performance:
Parallel WaveGAN achieved a MOS of 4.16 within a Transformer-based TTS framework, generated 24 kHz speech waveforms 28.68 times faster than real-time on a single GPU, and required only 2.8 days of training.
Contributions:
Introduced a joint training method combining multi-resolution STFT loss and adversarial loss for effective waveform generation, demonstrated significant improvements in training and inference speed, and achieved competitive perceptual quality.