Authors:
Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio
Description:
The paper presents Char2Wav, an end-to-end model for speech synthesis with two main components: a reader (encoder-decoder model with attention) and a neural vocoder (conditional SampleRNN). The reader converts text or phonemes to vocoder acoustic features, and the neural vocoder generates raw waveform samples.
Training and Data:
The reader and neural vocoder are pretrained separately using normalized WORLD vocoder features, and then fine-tuned end-to-end. The reader uses a bidirectional RNN encoder and an RNN decoder with attention, while the neural vocoder uses a conditional version of SampleRNN to generate raw waveforms.
Advantages:
The model eliminates the need for expert linguistic knowledge, which simplifies the creation of speech synthesizers for new languages. Because the entire process is learned end-to-end, it produces audio directly from text, and the hierarchical vocoder can capture long-term dependencies in the waveform.
Limitations:
Quantitative evaluation and comparison with other models are not provided in the paper. The performance may vary with different datasets and languages.
Model Architecture:
The reader consists of a bidirectional RNN encoder and an RNN decoder with attention. The neural vocoder is a conditional SampleRNN that captures long-term dependencies in sequential data, using a hierarchical structure to model dynamics at different time scales.
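To make the attention step concrete, the sketch below computes attention weights and a context vector for a single decoder step. It uses generic dot-product attention purely as an illustration; the summary above does not say which attention variant the reader uses, so this should not be read as the paper's mechanism. The names `attend`, `decoder_state`, and `encoder_states` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: plain dot-product scoring, chosen for brevity. It is
    not necessarily the scoring function used in Char2Wav.
    """
    # One score per encoder position: similarity to the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy input: three encoder states (one per character), two dimensions each.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

In this toy case the first and third encoder states receive equal weight because they have the same dot product with the decoder state, and the second receives the least. In a full reader, the context vector would be fed to the RNN decoder together with the previous output frame.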
Dependencies:
PyTorch for the neural network implementation, the WORLD vocoder for feature extraction, and SampleRNN for the neural vocoder.
Synthesis:
Synthesis uses normalized WORLD vocoder features as the intermediate representation. The reader converts text or phonemes to vocoder features, and the neural vocoder (conditional SampleRNN) generates raw waveform samples from these features.
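One practical detail in this pipeline is that vocoder features arrive once per frame while the waveform is generated once per sample. The sketch below shows one simple way to align the two rates by repeating each frame. The 16 kHz sample rate and 5 ms frame period are assumed values for illustration, and the function names are hypothetical; the summary does not describe how the conditional SampleRNN actually consumes its conditioning features.

```python
def samples_per_frame(sample_rate_hz, frame_period_ms):
    """Number of waveform samples covered by one vocoder frame."""
    return int(round(sample_rate_hz * frame_period_ms / 1000.0))

def upsample_features(frames, hop):
    """Repeat each frame `hop` times: one conditioning vector per sample."""
    return [frame for frame in frames for _ in range(hop)]

# Assumed values: 16 kHz audio, 5 ms between vocoder frames.
hop = samples_per_frame(16000, 5.0)

# Two toy feature frames with two dimensions each.
frames = [[0.1, 0.2], [0.3, 0.4]]
per_sample = upsample_features(frames, hop)
```

Under these assumptions each frame spans 80 samples, so two frames condition 160 waveform samples. A hierarchical model could instead feed the frame-rate features to its slowest tier and avoid explicit repetition.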
Dataset:
The VCTK dataset for English and the DIMEX-100 dataset for Spanish were used for training and evaluation. The VCTK dataset contains recordings of multiple speakers, while the DIMEX-100 dataset includes Spanish text and corresponding recordings.
Preprocessing:
Preprocessing involves normalizing WORLD vocoder features, which include Mel-cepstral coefficients (MCCs), logarithmic fundamental frequency (log F0), and aperiodicities (APs). These features are used as targets for the reader and inputs for the neural vocoder during pretraining.
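As an illustration of the normalization step, the sketch below standardizes each feature dimension to zero mean and unit variance and shows the inverse transform needed before the features are passed to a vocoder. Per-dimension z-scoring is an assumption; the summary only says the features are normalized. The toy frames and the function names are hypothetical, and real preprocessing would also need a policy for unvoiced frames, where F0 is undefined.

```python
import statistics

def fit_normalizer(frames):
    """Compute per-dimension mean and standard deviation over training frames."""
    columns = list(zip(*frames))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant dimensions to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def normalize(frames, means, stds):
    """Map raw features to zero mean and unit variance per dimension."""
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]

def denormalize(frames, means, stds):
    """Invert `normalize` to recover features in their original scale."""
    return [[x * s + m for x, m, s in zip(frame, means, stds)]
            for frame in frames]

# Toy frames: [one MCC value, log F0, one aperiodicity value].
raw = [[1.0, 5.0, 0.2], [3.0, 5.0, 0.4], [5.0, 5.0, 0.6]]
means, stds = fit_normalizer(raw)
scaled = normalize(raw, means, stds)
restored = denormalize(scaled, means, stds)
```

The statistics should be computed on the training set only and reused at synthesis time, so that the reader's predicted features can be mapped back to the scale the vocoder expects.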
Evaluation Metrics:
Evaluation is subjective, based on the intelligibility and naturalness of the generated speech. Audio samples are provided for qualitative assessment.
Char2Wav generates intelligible and natural-sounding speech directly from text, and subjective assessment of the provided samples suggests high-quality synthesis in more than one language. The reader's attention mechanism and the vocoder's hierarchical structure help the model capture long-term dependencies and generate realistic audio.