WaveNet

Authors:
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

 

Description:
WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, conditioning the predictive distribution of each audio sample on all previous ones. It achieves state-of-the-art performance in text-to-speech (TTS) and can also model music and other audio modalities.
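
The autoregressive structure described above corresponds to the factorization given in the paper: the joint probability of a waveform x = (x_1, ..., x_T) is written as a product of per-sample conditionals,

    p(x) = \prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

so each sample x_t is predicted from all earlier samples and is fed back into the model during generation.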

 

Training and Data:
Training involves transforming raw audio into quantized 8-bit µ-law encoded values. The model is trained on datasets such as VCTK (for multi-speaker speech) and proprietary TTS datasets in English and Mandarin. Training uses a softmax distribution over the 256 quantized sample values to model the conditional probabilities.
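
As a minimal illustrative sketch (not the paper's implementation), treating each µ-law code as one of 256 classes turns training into a standard categorical cross-entropy problem. The function name and the random stand-in values below are assumptions for the example only.

    import numpy as np

    def categorical_cross_entropy(logits, targets):
        """logits: (T, 256) raw network outputs; targets: (T,) integer codes in [0, 256)."""
        # Softmax over the 256 quantized amplitude classes (log-space for stability).
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        # Negative log-likelihood of the true next-sample class, averaged over time.
        return -log_probs[np.arange(len(targets)), targets].mean()

    # Random values stand in for real model outputs and real µ-law targets.
    rng = np.random.default_rng(0)
    loss = categorical_cross_entropy(rng.normal(size=(100, 256)),
                                     rng.integers(0, 256, size=100))
    print(loss)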

 

Advantages:
WaveNet can generate highly natural-sounding speech, outperforming previous TTS systems. A single model can capture the characteristics of many different speakers with high fidelity, and it can also generate realistic musical fragments. Its use of dilated convolutions allows it to capture long-range dependencies without the computational cost associated with recurrent networks.
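
To make the receptive-field claim concrete, the small calculation below shows how far back the top of a stack of width-2 dilated causal convolutions can see; the doubling dilation pattern 1, 2, 4, ..., 512 (repeated) follows the paper, while the helper function itself is only illustrative.

    def receptive_field(dilations, kernel_size=2):
        """Number of past-and-present samples visible to the top layer of the stack."""
        return sum((kernel_size - 1) * d for d in dilations) + 1

    one_block = [2 ** i for i in range(10)]    # dilations 1, 2, 4, ..., 512
    print(receptive_field(one_block))          # 1024 samples for one block
    print(receptive_field(one_block * 3))      # 3070 samples for three stacked blocks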

 

Limitations:
The model requires significant computational resources for training, and because audio is generated one sample at a time, autoregressive inference is slow. While WaveNet can generate high-quality audio, it can still struggle to maintain long-range coherence in generated music and consistent prosody in synthesized speech.

 

Model Architecture:
The architecture includes stacks of dilated causal convolutional layers to increase the receptive field exponentially with depth. It uses gated activation units, residual and skip connections, and can be globally or locally conditioned on additional inputs (e.g., speaker identity, linguistic features).
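
The sketch below illustrates how one such residual block fits together: a dilated causal convolution feeds the gated activation tanh(W_f * x) * sigmoid(W_g * x), and 1x1 projections produce the residual and skip outputs. It is a plain NumPy illustration with assumed shapes and weight names (conditioning inputs omitted), not an excerpt of any official implementation.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def dilated_causal_conv(x, w, dilation):
        """x: (in_ch, T) signal; w: (out_ch, in_ch, 2) width-2 kernel."""
        # Left-pad so the output at time t depends only on inputs at times <= t.
        padded = np.pad(x, ((0, 0), (dilation, 0)))
        past, present = padded[:, :-dilation], padded[:, dilation:]
        return (np.einsum('oi,it->ot', w[:, :, 0], past) +
                np.einsum('oi,it->ot', w[:, :, 1], present))

    def residual_block(x, w_filter, w_gate, w_res, w_skip, dilation):
        """One gated block; returns (input to the next block, skip contribution)."""
        z = (np.tanh(dilated_causal_conv(x, w_filter, dilation)) *
             sigmoid(dilated_causal_conv(x, w_gate, dilation)))
        residual = np.einsum('oi,it->ot', w_res, z) + x   # residual connection
        skip = np.einsum('oi,it->ot', w_skip, z)          # summed across all blocks
        return residual, skip

    # Toy usage with random weights; channel count and length are arbitrary.
    rng = np.random.default_rng(0)
    C, T = 32, 1000
    x = 0.1 * rng.normal(size=(C, T))
    w_f, w_g = rng.normal(size=(C, C, 2)), rng.normal(size=(C, C, 2))
    w_r, w_s = rng.normal(size=(C, C)), rng.normal(size=(C, C))
    out, skip = residual_block(x, w_f, w_g, w_r, w_s, dilation=4)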

 

Dependencies:
Dependencies include a deep learning framework such as TensorFlow for training the convolutional network, tools for audio preprocessing and µ-law encoding, and substantial computational resources for training and inference.

 

Synthesis Process:
Synthesis proceeds sample by sample: dilated causal convolutions model the distribution of each audio sample conditioned on all previous samples and on additional features such as speaker identity and linguistic information. The model generates raw audio waveforms directly, enabling high-fidelity synthesis across a range of tasks.
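
A hedged sketch of this sample-by-sample generation loop is shown below; predict_logits is a stand-in for a trained WaveNet (here it returns flat logits so the snippet runs on its own), and the names, the 1024-sample context, and the 16 kHz figure are illustrative assumptions.

    import numpy as np

    def predict_logits(context, conditioning=None):
        # Placeholder for the trained network's 256-way output at the next step.
        return np.zeros(256)

    def synthesize(num_samples, receptive_field=1024, seed=0):
        rng = np.random.default_rng(seed)
        generated = [128] * receptive_field              # seed the context with "silence"
        for _ in range(num_samples):
            logits = predict_logits(generated[-receptive_field:])
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            generated.append(rng.choice(256, p=probs))   # sample the next µ-law code
        return np.array(generated[receptive_field:], dtype=np.uint8)

    codes = synthesize(16000)   # roughly one second of 8-bit codes at 16 kHz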

 

Dataset:
Used datasets such as VCTK for multi-speaker speech generation, proprietary datasets for TTS in English and Mandarin, and MagnaTagATune and YouTube piano datasets for music modeling. These datasets provide diverse audio samples for training the generative models.

 

Preprocessing:
Raw audio is preprocessed using µ-law companding and quantized into 256 possible values. Linguistic features for TTS are derived from the input text, and additional conditioning information (e.g., speaker identity, encoded as a one-hot vector) is fed through the model’s global or local conditioning mechanisms.
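
A minimal sketch of this companding step (and the inverse applied after synthesis) is given below, assuming the standard µ-law transform with µ = 255 and 256 output codes; the helper names are ours, not from the paper.

    import numpy as np

    MU = 255

    def mu_law_encode(audio):
        """Float audio in [-1, 1] -> integer codes in [0, 255]."""
        companded = np.sign(audio) * np.log1p(MU * np.abs(audio)) / np.log1p(MU)
        return ((companded + 1) / 2 * MU + 0.5).astype(np.int32)

    def mu_law_decode(codes):
        """Integer codes in [0, 255] -> approximate float audio in [-1, 1]."""
        companded = 2 * (codes.astype(np.float64) / MU) - 1
        return np.sign(companded) * ((1 + MU) ** np.abs(companded) - 1) / MU

    x = np.linspace(-1.0, 1.0, 5)
    print(mu_law_encode(x))                  # resolution is concentrated near zero
    print(mu_law_decode(mu_law_encode(x)))   # approximate round trip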

 

Evaluation Metrics:
Evaluated using subjective listening tests, including mean opinion score (MOS) ratings for naturalness and preference scores from paired comparison tests. Performance is compared against baseline TTS systems, and the model is also assessed on the quality of generated music and on phoneme recognition accuracy.
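
As an illustrative sketch of how such a naturalness score is computed (the five-point ratings below are invented; real evaluations aggregate many raters and utterances and usually report a confidence interval alongside the mean):

    import numpy as np

    def mean_opinion_score(ratings):
        """Mean of 1-5 ratings with a normal-approximation 95% confidence half-width."""
        ratings = np.asarray(ratings, dtype=float)
        half_width = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
        return ratings.mean(), half_width

    mos, ci = mean_opinion_score([5, 4, 4, 5, 3, 4, 5, 4, 4, 5])
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")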

 

Performance:
WaveNet achieves superior performance in TTS, with MOS ratings above 4.0 for naturalness in both English and Mandarin, significantly better than the baseline systems. It also generates realistic musical fragments and performs well on phoneme recognition, demonstrating its versatility across audio generation and recognition tasks.
Link to paper: https://arxiv.org/abs/1609.03499


Audio Samples


Last Accessed: 7/15/2024

NSF Award #2346473