Authors:
Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon
Description:
Glow-TTS is a flow-based generative model for parallel text-to-speech (TTS) synthesis that eliminates the need for external aligners by internally learning the alignment between text and the latent representation of speech. It combines the properties of flows and dynamic programming to search for the most probable monotonic alignment, which allows for robust, diverse, and controllable speech synthesis.
Training and Data:
Glow-TTS is trained to maximize the log-likelihood of the data under the most probable monotonic alignment between text and the latent representation of speech. Training alternates between two steps: searching for that alignment with dynamic programming (Monotonic Alignment Search), and updating the model parameters to maximize the likelihood given the alignment found; the duration predictor is trained to match the durations derived from that alignment.
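The alignment step is a Viterbi-style dynamic program. Below is a minimal NumPy sketch of it; the function and variable names are illustrative, and the released implementation uses a faster optimized version.

```python
import numpy as np

def monotonic_alignment_search(log_prob):
    """Most probable monotonic alignment between text tokens and mel frames.

    log_prob: [T_text, T_mel] array where log_prob[i, j] is the log-likelihood
    of latent frame j under the prior statistics of text token i.
    Returns an int array of length T_mel mapping each frame to a token index.
    """
    t_text, t_mel = log_prob.shape
    Q = np.full((t_text, t_mel), -np.inf)
    Q[0, 0] = log_prob[0, 0]
    for j in range(1, t_mel):
        for i in range(min(j + 1, t_text)):
            # A monotonic alignment either stays on the same token
            # or advances by exactly one token per frame.
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_prob[i, j]
    # Backtrack from the last token at the last frame.
    alignment = np.zeros(t_mel, dtype=np.int64)
    i = t_text - 1
    for j in range(t_mel - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment
```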
Advantages:
Fast Synthesis: Generates mel-spectrograms 15.7 times faster than the autoregressive Tacotron 2.
Robustness: Remains robust on long texts, outperforming Tacotron 2 on long utterances.
Diversity and Controllability: Generates speech with varied intonation patterns and allows pitch and speaking rate to be regulated (see the sketch below).
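A hedged sketch of how the two control knobs are typically exposed at inference time; `model` and its methods are hypothetical handles, not the released Glow-TTS API.

```python
import torch

# Hypothetical inference-time controls; names are illustrative only.
@torch.no_grad()
def synthesize(model, phoneme_ids, length_scale=1.0, temperature=0.667):
    """length_scale > 1 slows speech down; temperature scales the prior
    standard deviation, trading intonation diversity for stability."""
    mu, log_sigma = model.encode(phoneme_ids)            # per-token prior stats
    durations = model.predict_durations(phoneme_ids)     # frames per token
    durations = torch.ceil(durations * length_scale)     # speaking-rate control
    mu, log_sigma = model.expand(mu, log_sigma, durations)  # token -> frame
    # Sample a latent from the temperature-scaled prior and invert the flow.
    z = mu + torch.exp(log_sigma) * temperature * torch.randn_like(mu)
    mel = model.decoder_inverse(z)
    return mel
```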
Limitations:
Training Complexity: Requires an iterative training process that alternates alignment search with parameter updates.
Data Dependency: Performance depends on the quality and variety of the training dataset; multi-speaker settings require a significant amount of training data from different speakers.
Model Architecture:
Comprises a text encoder, a duration predictor, and a flow-based decoder. The decoder is built from activation normalization, invertible 1×1 convolution, and affine coupling layers, and alignment is kept monotonic via Monotonic Alignment Search (MAS).
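For concreteness, a minimal PyTorch sketch of an affine coupling layer; the small conv stack here is a toy stand-in for the paper's WaveNet-style coupling network, and the ActNorm and invertible 1×1 convolution layers are omitted.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half of the channels are transformed with a scale and shift predicted
    from the other half, so the Jacobian is triangular and inversion is cheap."""
    def __init__(self, channels, hidden):
        super().__init__()
        # Stand-in for the WaveNet-like coupling network used in Glow-TTS.
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = torch.exp(log_s) * xb + t
        logdet = log_s.sum(dim=(1, 2))  # contributes to the flow log-likelihood
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=1)
```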
Dependencies:
Python 3, PyTorch, and additional libraries for audio and text processing.
Synthesis:
Generates mel-spectrograms from text, which are then converted to raw waveforms using the WaveGlow vocoder. Supports real-time applications with an order-of-magnitude speed-up over Tacotron 2.
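Putting the two stages together, a hedged end-to-end sketch; `load_glow_tts`, `load_waveglow`, and `text_to_phoneme_ids` are hypothetical helpers standing in for however the released checkpoints are loaded, not names from the official code.

```python
import torch

# Hypothetical loaders for the publicly released checkpoints.
glow_tts = load_glow_tts("pretrained_glow_tts.pt").eval()
waveglow = load_waveglow("pretrained_waveglow.pt").eval()

with torch.no_grad():
    phoneme_ids = text_to_phoneme_ids("Glow-TTS synthesizes in parallel.")
    mel = glow_tts.infer(phoneme_ids)   # text -> mel-spectrogram, in parallel
    audio = waveglow.infer(mel)         # mel -> raw waveform
```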
Dataset:
LJSpeech (single speaker, 24 hours) and the train-clean-100 subset of LibriTTS (multi-speaker, about 54 hours). Preparation involves trimming leading and trailing silence and filtering out overly long texts.
Preprocessing:
Converts input text to phoneme sequences and trims leading and trailing silence from the audio; at synthesis time, the WaveGlow vocoder converts generated mel-spectrograms to waveforms.
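A short preprocessing sketch using two common open-source tools, phonemizer for grapheme-to-phoneme conversion and librosa for silence trimming; the released preprocessing may differ in details such as the trimming threshold.

```python
import librosa
from phonemizer import phonemize

def preprocess(text, wav_path, top_db=30):
    # Grapheme-to-phoneme conversion (espeak backend); the model is
    # trained on phoneme sequences rather than raw characters.
    phonemes = phonemize(text, language="en-us", backend="espeak")
    # Trim leading/trailing silence; top_db=30 is an assumed threshold.
    audio, sr = librosa.load(wav_path, sr=22050)
    audio, _ = librosa.effects.trim(audio, top_db=top_db)
    return phonemes, audio, sr
```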
Evaluation Metrics:
Mean Opinion Score (MOS) for audio quality, Character Error Rate (CER) for robustness, and inference time comparison with Tacotron 2 for synthesis speed.
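Of these metrics, CER is straightforward to reproduce; a self-contained sketch follows (MOS requires human raters and is not shown).

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance between the ASR transcript of the
    synthesized audio and the reference text, normalized by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Classic dynamic-programming edit distance with a rolling row.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (r != h))    # substitution
            prev, dist[j] = dist[j], cur
    return dist[len(hyp)] / max(len(ref), 1)

# Example: character_error_rate("hello world", "helo world") == 1/11
```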
Results:
Comparable MOS to Tacotron 2, robustness to long utterances, and efficient parallel synthesis.
Contributions:
Introduced Monotonic Alignment Search (MAS), flow-based architecture for TTS, and made source code and pre-trained models publicly available.