Tacotron 2

Authors:
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu

Description:
Tacotron 2 is a neural network architecture for speech synthesis directly from text. It combines a recurrent sequence-to-sequence feature prediction network with a modified WaveNet vocoder to generate time-domain waveforms from mel-spectrograms.

Training and Data:
The feature prediction network is trained with a batch size of 64 on a single GPU using the Adam optimizer, with a learning rate of 10^-3 that decays exponentially to 10^-5. The modified WaveNet is trained with a batch size of 128 across 32 GPUs with synchronous updates, using the Adam optimizer with a fixed learning rate of 10^-4.
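
To make this concrete, here is a minimal TensorFlow sketch of the two optimizers. Only the start and end learning rates are stated above, so the decay shape and step count below are illustrative assumptions.

    import tensorflow as tf

    # Sketch of the two optimizers described above. The endpoints
    # (10^-3 decaying to 10^-5, and a fixed 10^-4 for WaveNet) come from
    # the text; the decay shape and step count are assumptions.
    class DecayToFloor(tf.keras.optimizers.schedules.LearningRateSchedule):
        def __init__(self, initial=1e-3, floor=1e-5, decay_steps=50_000):
            self.initial, self.floor, self.decay_steps = initial, floor, decay_steps

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            lr = self.initial * tf.pow(0.5, step / self.decay_steps)
            return tf.maximum(lr, self.floor)  # hold at the final rate

    feature_net_opt = tf.keras.optimizers.Adam(learning_rate=DecayToFloor())
    wavenet_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)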

Advantages:
Produces high-quality, natural-sounding speech comparable to professionally recorded human speech, making it well suited to voice assistants, audiobook generation, and other TTS applications. The system also simplifies the traditional TTS pipeline: it eliminates hand-engineered linguistic features and uses mel-spectrograms, a compact acoustic representation, as the interface between its two stages.

Limitations:
Occasional mispronunciations and unnatural prosody. The system also requires extensive training data covering the intended usage domain to avoid pronunciation difficulties, which are especially common with out-of-domain text.

Model Architecture:
The architecture consists of an attention-based encoder-decoder that predicts mel-spectrograms, followed by a modified WaveNet vocoder. The encoder stacks convolutional layers and a bi-directional LSTM; the decoder is an autoregressive network of uni-directional LSTM layers, with a pre-net that processes the previously predicted frame and a post-net that refines the predicted spectrogram.
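
A minimal Keras sketch of the encoder shape is shown below. The layer sizes (three 512-filter convolutions with kernel size 5, and 256 LSTM units per direction) follow the configuration reported in the paper; the vocabulary size is a placeholder, and this is an illustration rather than the authors' implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_encoder(vocab_size=256, embed_dim=512):  # vocab size is a placeholder
        inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)
        x = layers.Embedding(vocab_size, embed_dim)(inputs)
        for _ in range(3):  # three conv layers: 512 filters, kernel size 5
            x = layers.Conv1D(512, 5, padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)
        # One bi-directional LSTM, 256 units per direction (512 total)
        x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
        return tf.keras.Model(inputs, x, name="tacotron2_encoder")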

Dependencies:
Running the models requires Python, TensorFlow, and Linux.

Synthesis:
The system generates mel-spectrograms from text using the feature prediction network and converts these spectrograms to time-domain waveforms using a modified WaveNet vocoder.
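
The two-stage flow can be summarized with the following sketch, where feature_net and wavenet_vocoder are hypothetical stand-ins for trained models, not a real API.

    def synthesize(text, feature_net, wavenet_vocoder):
        # Stage 1: text -> mel-spectrogram, decoded autoregressively
        # frame by frame until a stop token is predicted.
        mel = feature_net.predict_mel(text)
        # Stage 2: mel-spectrogram -> time-domain waveform samples.
        return wavenet_vocoder.generate(mel)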

Dataset:
Trained on an internal US English dataset containing 24.6 hours of speech from a single professional female speaker.

Preprocessing:
Mel-spectrograms are computed from the STFT magnitude using a 50 ms frame size, 12.5 ms frame hop, and a Hann window function, followed by an 80-channel mel filterbank and log dynamic range compression.
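
A sketch of this preprocessing using librosa is shown below. The frame size, hop, window, and log compression come from the description above; the 24 kHz sample rate, 80 mel channels, 125 Hz-7.6 kHz band, and 0.01 clipping floor follow the paper's reported setup.

    import librosa
    import numpy as np

    def compute_mel(wav_path, sr=24_000):
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(
            y=y,
            sr=sr,
            n_fft=int(0.050 * sr),        # 50 ms frame size (1200 samples)
            hop_length=int(0.0125 * sr),  # 12.5 ms frame hop (300 samples)
            window="hann",
            power=1.0,                    # magnitude, not power, spectrogram
            n_mels=80,
            fmin=125,
            fmax=7600,
        )
        # Log dynamic range compression, clipping magnitudes to 0.01
        return np.log(np.clip(mel, 1e-2, None))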

Performance:
The system achieves a mean opinion score (MOS) of 4.526, comparable to the MOS of 4.582 for professionally recorded speech, and significantly outperforms previous TTS systems.
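
For reference, MOS is the arithmetic mean of subjective naturalness ratings on a 1-to-5 scale, typically reported with a 95% confidence interval. A toy computation with made-up ratings:

    import numpy as np

    ratings = np.array([5, 4, 5, 5, 4, 3, 5, 4, 5, 4], dtype=float)  # made-up data
    mos = ratings.mean()
    ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))  # normal approximation
    print(f"MOS = {mos:.3f} +/- {ci95:.3f}")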

Contributions:
Introduced Tacotron 2, a unified neural approach to TTS that combines a sequence-to-sequence feature prediction network with a modified WaveNet vocoder, achieving state-of-the-art sound quality close to natural human speech.

Link to paper


Audio Samples


Last Accessed: 7/10/2024

NSF Award #2346473