Authors:
Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arık, Ajay Kannan, Sharan Narang
Description:
Deep Voice 3 is a fully-convolutional, attention-based neural text-to-speech (TTS) system designed to match state-of-the-art naturalness in synthesized speech while training an order of magnitude faster than comparable recurrent sequence-to-sequence models such as Tacotron. It scales to large datasets and supports multi-speaker synthesis.
Training and Data:
Deep Voice 3 uses a fully-convolutional sequence-to-sequence architecture with attention. It is trained on large datasets, including the 820-hour LibriSpeech corpus. Because the architecture contains no recurrent connections, computation is parallelized across all sequence positions during training, yielding significantly faster training than recurrent models.
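Below is a minimal PyTorch sketch (not the authors' implementation) of the kind of gated convolutional block with a residual connection that the architecture stacks in place of recurrent layers; the channel count, kernel size, and dropout rate are assumed values, not the paper's exact hyperparameters.

```python
import math
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One convolution + gated linear unit + scaled residual connection."""
    def __init__(self, channels=256, kernel_size=5, dropout=0.05):
        super().__init__()
        # Twice the channels: one half carries values, the other half the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, channels, time)
        h = self.conv(self.dropout(x))
        values, gate = h.chunk(2, dim=1)
        h = values * torch.sigmoid(gate)         # gated linear unit
        return (x + h) * math.sqrt(0.5)          # residual, scaled to preserve variance

# Stacking such blocks lets every time step be computed in parallel during training.
y = GatedConvBlock()(torch.randn(2, 256, 50))    # -> (2, 256, 50)
```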
Advantages:
Deep Voice 3 trains an order of magnitude faster than Tacotron, scales to very large datasets, handles multiple speakers effectively, and achieves high naturalness in synthesized speech. The model can be deployed efficiently to handle high query throughput on GPU servers.
Limitations:
The quality of synthesized speech depends on the waveform synthesis method used. While WaveNet provides the highest naturalness, it is more computationally expensive than WORLD or Griffin-Lim. Performance on noisy datasets like LibriSpeech is lower than on clean, single-speaker data due to varying recording conditions and background noise.
Model Architecture:
The architecture consists of three main components: an encoder, a decoder, and a converter. The encoder converts text into internal learned representations; the decoder decodes these representations into mel-spectrograms autoregressively, using attention over the encoder output; and the converter maps the decoder's hidden states to vocoder parameters for waveform synthesis. The whole network is fully-convolutional and attention-based.
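As a rough illustration of how the three components fit together (the class name, method signatures, and intermediate outputs below are assumptions for readability, not the authors' code):

```python
import torch.nn as nn

class DeepVoice3Sketch(nn.Module):
    """Hypothetical wrapper wiring encoder -> attention-based decoder -> converter."""
    def __init__(self, encoder, decoder, converter):
        super().__init__()
        self.encoder = encoder      # text -> internal representation (attention keys/values)
        self.decoder = decoder      # autoregressive mel-spectrogram prediction with attention
        self.converter = converter  # decoder hidden states -> vocoder parameters

    def forward(self, text, mel_targets):
        keys, values = self.encoder(text)
        mel_out, decoder_states = self.decoder(mel_targets, keys, values)
        vocoder_params = self.converter(decoder_states)   # e.g. linear spectrogram or WORLD features
        return mel_out, vocoder_params
```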
Dependencies:
Dependencies include TensorFlow or PyTorch for model training and inference, librosa for audio processing and spectrogram computation, and specific vocoder implementations like WORLD and WaveNet. Custom GPU kernels are implemented for optimized inference speed and handling high query throughput.
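For instance, the log-mel-spectrogram features can be computed with librosa's standard API; the file path and frame parameters below are placeholders rather than the paper's settings:

```python
import librosa
import numpy as np

audio, sr = librosa.load("sample.wav", sr=22050)          # placeholder path and sample rate
mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))                # log compression for training targets
print(log_mel.shape)                                      # (n_mels, n_frames)
```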
Synthesis:
Uses attention-based sequence-to-sequence conversion from text to mel-spectrograms, followed by vocoder parameter prediction for waveform synthesis. Supports multiple vocoder methods, including Griffin-Lim, WORLD, and WaveNet, allowing flexibility in choosing the synthesis method based on quality and computational requirements.
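As a hedged sketch of the simplest of these options, Griffin-Lim, which iteratively estimates the phase discarded by a magnitude spectrogram (here derived from a placeholder audio file instead of the converter's predicted spectrogram):

```python
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=22050)               # placeholder input audio
mag = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # stand-in for the predicted linear spectrogram

# Griffin-Lim alternates between time and frequency domains to recover a plausible phase.
recon = librosa.griffinlim(mag, n_iter=60, hop_length=256, win_length=1024)
sf.write("reconstructed.wav", recon, sr)
```

WORLD and WaveNet trade this simplicity for better naturalness at higher computational cost.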
Dataset:
Used LibriSpeech (820 hours, 2484 speakers) for large-scale training, VCTK (44 hours, 108 speakers) for multi-speaker synthesis, and an internal single-speaker dataset (20 hours, 48 kHz) for single-speaker synthesis. Text preprocessing includes converting text to phonemes and normalizing punctuation and spacing.
Preprocessing:
Text preprocessing includes converting text to phonemes, uppercasing characters, removing intermediate punctuation marks, and replacing spaces with special separator characters to indicate pause durations. Audio preprocessing involves converting waveforms to mel-spectrograms and other vocoder parameters.
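A minimal sketch of the text-normalization steps (the separator character and example sentence are illustrative; the phoneme conversion step, e.g. via a pronunciation dictionary, is not shown):

```python
import re

SEPARATOR = "/"   # hypothetical pause marker; the paper uses several separator characters

def normalize_text(text):
    text = text.strip().upper()                 # uppercase all characters
    text = re.sub(r"[,;:()'-]", "", text)       # drop intermediate punctuation
    text = re.sub(r"\s+", SEPARATOR, text)      # spaces -> separator characters
    return text

print(normalize_text("Deep Voice 3, a convolutional TTS system."))
# DEEP/VOICE/3/A/CONVOLUTIONAL/TTS/SYSTEM.
```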
Evaluation Metrics:
Evaluated using mean opinion score (MOS) for naturalness; attention error counts covering repeated words, mispronunciations, and skipped words; and comparisons of training and inference speed. Evaluations covered multiple waveform synthesis methods (Griffin-Lim, WORLD, WaveNet) and included comparisons with Tacotron and Deep Voice 2.
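For reference, MOS is the mean of 1-5 listener ratings and is typically reported with a 95% confidence interval; the ratings below are made-up placeholders, not results from the paper:

```python
import numpy as np

ratings = np.array([4, 5, 4, 3, 5, 4, 4, 5])          # hypothetical listener scores (1-5 scale)
mos = ratings.mean()
ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```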
Results:
Deep Voice 3 achieves roughly a 10x training speedup over Tacotron, converging in approximately 500k iterations. With custom GPU kernels, inference reaches 115 queries per second on a single Nvidia Tesla P100 GPU, allowing a single-GPU server to handle about ten million queries per day. MOS ratings indicate high naturalness for the synthesized speech.
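The deployment figure follows directly from the stated rate:

```python
qps = 115                          # reported queries per second on one Tesla P100
per_day = qps * 24 * 60 * 60
print(f"{per_day:,}")              # 9,936,000 -> roughly ten million queries per day
```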