Authors:
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
Description:
Tacotron 2 is a neural network architecture for speech synthesis directly from text. It combines a recurrent sequence-to-sequence feature prediction network with a modified WaveNet vocoder to generate time-domain waveforms from mel-spectrograms.
Training and Data:
The feature prediction network is trained with a batch size of 64 on a single GPU using the Adam optimizer, with a learning rate of 10^-3 decaying exponentially to 10^-5. The modified WaveNet is trained with a batch size of 128 across 32 GPUs with synchronous updates, using Adam with a fixed learning rate of 10^-4.
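A minimal sketch of this optimizer setup, assuming TensorFlow 2.x / Keras. The 10^-3 starting rate, the 10^-5 floor, and the 10^-4 WaveNet rate come from the text above; the decay interval and per-interval rate, and the custom schedule class itself, are illustrative assumptions rather than the authors' exact configuration.

import tensorflow as tf

class FlooredExponentialDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    # Exponential decay clamped at a minimum, so the rate falls from
    # 1e-3 toward the 1e-5 floor described above.
    def __init__(self, initial_lr=1e-3, final_lr=1e-5,
                 decay_steps=10_000, decay_rate=0.6):
        # decay_steps and decay_rate are hypothetical tuning constants.
        self.inner = tf.keras.optimizers.schedules.ExponentialDecay(
            initial_lr, decay_steps, decay_rate, staircase=True)
        self.final_lr = final_lr

    def __call__(self, step):
        return tf.maximum(self.inner(step), self.final_lr)

# Feature prediction network: Adam at 1e-3 decaying to 1e-5 (batch size 64).
feature_net_opt = tf.keras.optimizers.Adam(learning_rate=FlooredExponentialDecay())

# Modified WaveNet: fixed 1e-4 rate (batch size 128 across 32 GPUs).
wavenet_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)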
Advantages:
High-quality, natural-sounding speech synthesis comparable to human speech, suitable for applications such as voice assistants, audiobook generation, and other TTS systems. The system also simplifies the traditional TTS pipeline: it eliminates hand-engineered features and conditions the vocoder on a more compact acoustic representation (mel-spectrograms).
Limitations:
Occasional mispronunciations and unnatural prosody, particularly on out-of-domain text. Avoiding these pronunciation difficulties requires extensive training data that covers the intended usage.
Model Architecture:
The architecture pairs an attention-based encoder-decoder that predicts mel-spectrograms with a modified WaveNet vocoder. The encoder stacks convolutional layers followed by a bidirectional LSTM; the decoder is an autoregressive network of uni-directional LSTM layers, with a pre-net on its input and a convolutional post-net that refines the predicted spectrogram. A sketch of the encoder follows.
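A hedged sketch of the encoder just described, again assuming TensorFlow/Keras. The layer counts and sizes (three convolutional blocks of 512 filters with kernel size 5, dropout 0.5 on the convolutions, and one 512-unit bidirectional LSTM) follow the Tacotron 2 paper; the vocabulary size is a hypothetical placeholder.

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(vocab_size=150, embed_dim=512):  # vocab_size is hypothetical
    chars = tf.keras.Input(shape=(None,), dtype=tf.int32)  # character IDs
    x = layers.Embedding(vocab_size, embed_dim)(chars)
    for _ in range(3):                                     # 3 conv blocks
        x = layers.Conv1D(512, 5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.5)(x)
    # One bidirectional LSTM: 256 units per direction, 512 in total.
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    return tf.keras.Model(chars, x, name="tacotron2_encoder")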
Dependencies:
Running the models requires a Linux environment with compatible Python and TensorFlow installations.
Synthesis:
The system generates mel-spectrograms from text using the feature prediction network and converts these spectrograms to time-domain waveforms using a modified WaveNet vocoder.
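A hedged sketch of this two-stage data flow. The feature_prediction_net and wavenet_vocoder callables and the char_to_id mapping are hypothetical stand-ins for trained models; only the text to mel-spectrogram to waveform pipeline is from the text above.

import numpy as np

def synthesize(text, feature_prediction_net, wavenet_vocoder, char_to_id):
    # Stage 1: the seq2seq network maps character IDs to a mel-spectrogram
    # (the paper uses 80 mel channels).
    char_ids = np.array([[char_to_id[c] for c in text]])
    mel = feature_prediction_net(char_ids)   # shape: (1, n_frames, 80)
    # Stage 2: the modified WaveNet conditions on the mel frames and
    # generates the time-domain waveform sample by sample.
    return wavenet_vocoder(mel)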
Dataset:
Trained on an internal US English dataset containing 24.6 hours of speech from a single professional female speaker.
Preprocessing:
Mel-spectrograms are computed from the short-time Fourier transform (STFT) using a 50 ms frame size, 12.5 ms frame hop, and a Hann window function, followed by log dynamic range compression.
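A sketch of this preprocessing step, assuming librosa and a 24 kHz sampling rate. The 80 mel channels, the 125 Hz to 7.6 kHz filterbank span, and the 0.01 clipping floor before the log come from the Tacotron 2 paper rather than the text above.

import numpy as np
import librosa

def compute_mel_spectrogram(wav, sr=24000):
    n_fft = int(0.050 * sr)   # 50 ms frame size -> 1200 samples at 24 kHz
    hop = int(0.0125 * sr)    # 12.5 ms frame hop -> 300 samples
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann")
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=80,
                                    fmin=125, fmax=7600)
    mel = mel_basis @ np.abs(stft)           # mel filterbank on magnitudes
    # Log dynamic range compression, clipping magnitudes to a 0.01 floor.
    return np.log(np.maximum(mel, 1e-2))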
Performance:
The system achieves a mean opinion score (MOS) of 4.526, comparable to the MOS of 4.582 for professionally recorded speech, and significantly outperforms previous TTS systems.
Contributions:
Introduced Tacotron 2, a unified neural approach to TTS that combines a sequence-to-sequence feature prediction network with a WaveNet vocoder, achieving state-of-the-art sound quality close to that of natural human speech.