Authors:
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu
Description:
FastSpeech 2 is a non-autoregressive text-to-speech (TTS) model designed to synthesize speech faster and with higher quality than previous models. It addresses the limitations of FastSpeech, which relied on a teacher model and knowledge distillation, by training directly on ground-truth targets and conditioning on additional speech variation information such as pitch, energy, and more accurate phoneme duration.
Training and Data:
FastSpeech 2 is trained directly on ground-truth mel-spectrograms, with pitch, energy, and accurate phoneme durations extracted from the speech waveforms as variance targets. The training pipeline is simplified because no teacher model or distillation step is needed. FastSpeech 2s extends this by generating speech waveforms directly from text, removing the mel-spectrogram intermediate and the separate vocoder at inference.
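As an illustration, a minimal sketch of how such a multi-target objective could be combined, assuming hypothetical `pred`/`target` dictionaries of aligned tensors (not the paper's exact losses or weights):

```python
import torch
import torch.nn.functional as F

def fastspeech2_style_loss(pred, target):
    """Hypothetical combined objective: mel reconstruction plus MSE losses
    on the variance targets (duration in the log domain, pitch, energy)."""
    mel_loss = F.l1_loss(pred["mel"], target["mel"])
    # Durations are modeled in the log domain for numerical stability.
    dur_loss = F.mse_loss(pred["log_duration"],
                          torch.log(target["duration"].float() + 1.0))
    pitch_loss = F.mse_loss(pred["pitch"], target["pitch"])
    energy_loss = F.mse_loss(pred["energy"], target["energy"])
    return mel_loss + dur_loss + pitch_loss + energy_loss
```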
Advantages:
FastSpeech 2 simplifies the training process, improves speech quality by using ground-truth data, and incorporates detailed speech variation information to better handle the one-to-many mapping problem. FastSpeech 2s further simplifies the inference process by generating waveforms directly, leading to faster inference speeds.
Limitations:
The model requires external tools for accurate alignment and pitch extraction, which adds some complexity. Real-world performance might still depend on the quality and representativeness of the training data. Further improvements could explore simplifying the process even more and incorporating more variation information.
Model Architecture:
The architecture includes a feed-forward Transformer encoder, a variance adaptor (with duration, pitch, and energy predictors), and a mel-spectrogram decoder. FastSpeech 2s adds a waveform decoder that generates speech waveforms directly from the hidden text representation. The pitch predictor models the pitch contour in the continuous wavelet transform (CWT) domain, and the waveform decoder is trained adversarially to recover phase information.
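A simplified sketch of how a variance adaptor of this kind could be wired together in PyTorch; the predictor internals, bin counts, and normalization are illustrative placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

class VarianceAdaptorSketch(nn.Module):
    """Illustrative variance adaptor: predict duration, pitch, and energy from
    the encoder output, add quantized pitch/energy embeddings, then expand the
    phoneme sequence to frame level with a length regulator."""
    def __init__(self, d_model=256, n_bins=256):
        super().__init__()
        # Real predictors are small Conv1D + LayerNorm stacks; linear layers
        # stand in for them here.
        self.duration_predictor = nn.Linear(d_model, 1)
        self.pitch_predictor = nn.Linear(d_model, 1)
        self.energy_predictor = nn.Linear(d_model, 1)
        self.pitch_embedding = nn.Embedding(n_bins, d_model)
        self.energy_embedding = nn.Embedding(n_bins, d_model)
        self.n_bins = n_bins

    def forward(self, x):                              # x: (batch, phonemes, d_model)
        log_dur = self.duration_predictor(x).squeeze(-1)
        pitch = self.pitch_predictor(x).squeeze(-1)    # assumed normalized to [0, 1)
        energy = self.energy_predictor(x).squeeze(-1)  # assumed normalized to [0, 1)
        # Quantize pitch/energy into bins and add their embeddings to x.
        p_bins = torch.clamp((pitch * self.n_bins).long(), 0, self.n_bins - 1)
        e_bins = torch.clamp((energy * self.n_bins).long(), 0, self.n_bins - 1)
        x = x + self.pitch_embedding(p_bins) + self.energy_embedding(e_bins)
        # Length regulator: repeat each phoneme's hidden state by its duration
        # (predicted here; ground-truth durations are used during training).
        durations = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        expanded = [torch.repeat_interleave(seq, dur, dim=0)
                    for seq, dur in zip(x, durations)]
        frames = pad_sequence(expanded, batch_first=True)
        return frames, log_dur, pitch, energy
```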
Dependencies:
Dependencies include PyTorch for model training, the Montreal Forced Aligner (MFA) for phoneme duration extraction, PyWorldVocoder (a Python wrapper of the WORLD vocoder) for pitch extraction, and Parallel WaveGAN for waveform synthesis. GPUs are required for efficient training.
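For example, a frame-level F0 contour could be extracted with the WORLD tools exposed by the PyWorld wrapper; the frame period below is an assumed value chosen to roughly match a typical mel-spectrogram hop size, not necessarily the paper's setting:

```python
import numpy as np
import pyworld as pw     # Python wrapper around the WORLD vocoder
import soundfile as sf

def extract_f0(wav_path, frame_period_ms=11.6):
    """Estimate a frame-level F0 contour with WORLD's DIO + StoneMask.
    frame_period_ms is an assumed value (~256-sample hop at 22.05 kHz)."""
    wav, sr = sf.read(wav_path)
    wav = wav.astype(np.float64)                            # WORLD expects float64
    f0, t = pw.dio(wav, sr, frame_period=frame_period_ms)   # coarse F0 estimate
    f0 = pw.stonemask(wav, f0, t, sr)                       # refined F0
    return f0
```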
Synthesis:
Synthesis combines phoneme encoding, variance adaptation (duration, pitch, and energy prediction), and mel-spectrogram or waveform decoding to generate high-quality speech. The variance adaptor supplies the fine-grained information needed to handle the one-to-many mapping problem in TTS.
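A minimal sketch of this inference flow, with all module names (`g2p`, `encoder`, `variance_adaptor`, `mel_decoder`, `vocoder`) and the `phoneme_to_id` mapping as hypothetical stand-ins for a trained model:

```python
import torch

@torch.no_grad()
def synthesize(text, g2p, phoneme_to_id, encoder, variance_adaptor, mel_decoder, vocoder):
    """Text -> phonemes -> phoneme-level hidden states -> frame-level hidden
    states (via predicted duration, pitch, energy) -> mel-spectrogram -> waveform."""
    phonemes = g2p(text)                                        # grapheme-to-phoneme
    ids = torch.tensor([[phoneme_to_id[p] for p in phonemes]])  # (1, num_phonemes)
    hidden = encoder(ids)                                       # phoneme-level features
    frames, *_ = variance_adaptor(hidden)                       # expand + add variance info
    mel = mel_decoder(frames)                                   # (1, num_frames, n_mels)
    return vocoder(mel)                                         # e.g. Parallel WaveGAN
```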
Dataset:
The LJSpeech dataset, containing 13,100 English audio clips and corresponding text transcripts, is used for training and evaluation. The dataset is preprocessed into phoneme sequences and mel-spectrograms for training FastSpeech 2 and waveform clips for FastSpeech 2s.
Preprocessing:
Text data is converted into phoneme sequences using a grapheme-to-phoneme tool. Speech data is transformed into mel-spectrograms with specific frame and hop sizes, and pitch and energy are extracted from the speech waveform for training the variance predictors.
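As an illustration, the mel-spectrogram and a frame-level energy contour (taken here as the L2 norm of each STFT frame's magnitudes) might be computed as follows; the sample rate, frame size, and hop size are assumed values rather than the paper's exact configuration:

```python
import numpy as np
import librosa

def mel_and_energy(wav_path, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Compute a log-mel-spectrogram and a per-frame energy contour.
    Frame/hop sizes are assumptions for illustration only."""
    wav, _ = librosa.load(wav_path, sr=sr)
    stft = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, win_length=n_fft))
    mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    energy = np.linalg.norm(stft, axis=0)      # one scalar per STFT frame
    return log_mel, energy
```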
Evaluation Metrics:
Perceptual quality is evaluated with MOS and CMOS listening tests; pitch accuracy with the standard deviation, skewness, kurtosis, and average DTW distance of the pitch contours; and energy similarity to ground-truth speech with the mean absolute error (MAE). Training time and inference speed are also reported, the latter in terms of real-time factor (RTF).
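For the objective measures, a small sketch of how the pitch-distribution moments and energy MAE could be computed (DTW omitted; `f0` and the energy contours are assumed to be precomputed arrays):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def pitch_moments(f0):
    """Moments of the voiced part of an F0 contour, used to compare the pitch
    distribution of synthesized speech against ground truth."""
    voiced = f0[f0 > 0]                      # drop unvoiced frames (F0 == 0)
    return {"std": float(np.std(voiced)),
            "skew": float(skew(voiced)),
            "kurtosis": float(kurtosis(voiced))}

def energy_mae(pred_energy, gt_energy):
    """Mean absolute error between two frame-level energy contours."""
    n = min(len(pred_energy), len(gt_energy))
    return float(np.mean(np.abs(np.asarray(pred_energy[:n]) - np.asarray(gt_energy[:n]))))
```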
Results:
FastSpeech 2 achieves higher MOS scores than both FastSpeech and autoregressive models, with a simpler and faster training pipeline than FastSpeech. FastSpeech 2s further improves inference speed by generating waveforms directly, making it suitable for real-time applications while maintaining high audio quality.
Contributions:
Introduced a simplified and more effective training pipeline for TTS, demonstrated significant improvements in audio quality and training/inference speed, provided comprehensive evaluations and analyses, and released audio samples and implementation details for reproducibility.