AdaSpeech

Authors:
Mingjian Chen, Xu Tan, Bohan Li, Yanqing Liu, Tao Qin, Sheng Zhao, Tie-Yan Liu

 

Description:
AdaSpeech is an adaptive text-to-speech (TTS) system designed to customize new voices efficiently and with high quality. It addresses the challenges of diverse acoustic conditions and memory efficiency in custom voice adaptation.

 

Training and Data:
AdaSpeech employs a three-stage pipeline: pre-training on a large multi-speaker dataset, fine-tuning on a small amount of target-speaker data that may exhibit diverse acoustic conditions, and inference using a combination of shared pre-trained parameters and the adapted speaker-specific parameters.
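
During adaptation, the shared backbone stays frozen and only the speaker-specific parameters (the conditional layer normalization parameters and the speaker embedding) are updated. A minimal PyTorch sketch of this step; the parameter-name substrings "cond_ln" and "speaker_emb" are hypothetical markers for illustration, not names from any released code:

    import torch

    def freeze_for_adaptation(model: torch.nn.Module) -> torch.optim.Optimizer:
        # Freeze everything except the (hypothetically named) speaker-specific
        # parameters: conditional layer norm projections and speaker embedding.
        for name, param in model.named_parameters():
            param.requires_grad = ("cond_ln" in name) or ("speaker_emb" in name)
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=2e-4)  # learning rate is an assumption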

 

Advantages:
AdaSpeech adapts to a new voice with a small memory footprint: during fine-tuning, only the conditional layer normalization parameters and the speaker embedding are updated, so each new speaker adds few parameters. Its acoustic condition modeling at the utterance and phoneme levels keeps adaptation quality high even when the target speaker's recording conditions differ from the training data.

 

Limitations:
The model requires external tools for accurate alignment and pitch extraction. Future work could explore fully end-to-end solutions without these dependencies and further optimize adaptation for more diverse acoustic conditions.

 

Model Architecture:
The model is based on FastSpeech 2 and includes a phoneme encoder, mel-spectrogram decoder, variance adaptor, and two additional components: acoustic condition modeling and conditional layer normalization for efficient adaptation.
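
Conditional layer normalization replaces the fixed scale and bias of standard layer normalization with values predicted from the speaker embedding, so adapting a voice only requires tuning these small projections. A minimal PyTorch sketch of the idea, with illustrative module and dimension names:

    import torch
    import torch.nn as nn

    class ConditionalLayerNorm(nn.Module):
        # Layer norm whose scale (gamma) and bias (beta) are predicted from a
        # speaker embedding by two small linear layers.
        def __init__(self, hidden_dim: int, speaker_dim: int):
            super().__init__()
            self.to_scale = nn.Linear(speaker_dim, hidden_dim)
            self.to_bias = nn.Linear(speaker_dim, hidden_dim)
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

        def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, hidden_dim); spk_emb: (batch, speaker_dim)
            gamma = self.to_scale(spk_emb).unsqueeze(1)
            beta = self.to_bias(spk_emb).unsqueeze(1)
            return self.norm(x) * gamma + beta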

 

Dependencies:
Dependencies include PyTorch for model training, MFA (Montreal Forced Aligner) for phoneme duration extraction, and MelGAN for waveform synthesis. GPUs are required for efficient training and adaptation.

 

Synthesis:
The system uses a combination of phoneme encoding, variance adaptation, mel-spectrogram decoding, and acoustic condition modeling to generate high-quality speech. The acoustic conditions are modeled at both utterance and phoneme levels.
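
The utterance-level acoustic condition can be pictured as a single vector pooled from a reference mel-spectrogram; the phoneme-level encoder works similarly but pools within each phoneme's frames. A minimal PyTorch sketch of the utterance-level encoder, with layer sizes as assumptions:

    import torch
    import torch.nn as nn

    class UtteranceEncoder(nn.Module):
        # Pools a reference mel-spectrogram into one acoustic condition vector.
        def __init__(self, n_mels: int = 80, hidden: int = 256):
            super().__init__()
            self.convs = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # mel: (batch, n_mels, frames) -> (batch, hidden)
            return self.convs(mel).mean(dim=2)  # average-pool over time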

 

Dataset:
Pre-trained on LibriTTS, fine-tuned on VCTK and LJSpeech datasets, covering a wide range of acoustic conditions.

 

Preprocessing:
Speech data is resampled to 16 kHz and converted to mel-spectrograms with a 12.5 ms hop size and a 50 ms window size. Text is converted to phoneme sequences for encoder input.
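
At 16 kHz, a 12.5 ms hop corresponds to 200 samples and a 50 ms window to 800 samples. A minimal librosa sketch of the mel extraction; the FFT size (1024) and number of mel bins (80) are assumptions not stated above:

    import librosa
    import numpy as np

    def extract_mel(path: str) -> np.ndarray:
        wav, sr = librosa.load(path, sr=16000)  # resample to 16 kHz
        mel = librosa.feature.melspectrogram(
            y=wav, sr=sr,
            n_fft=1024,      # FFT size: an assumption
            hop_length=200,  # 12.5 ms hop at 16 kHz
            win_length=800,  # 50 ms window at 16 kHz
            n_mels=80)       # mel bins: an assumption
        return np.log(np.maximum(mel, 1e-5))  # log-mel, floored for stability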

 

Evaluation Metrics:
MOS for perceptual quality and SMOS for speaker similarity, the number of adaptation parameters for memory efficiency, and qualitative analysis of acoustic condition modeling via t-SNE visualizations.
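
The t-SNE analysis projects the learned acoustic condition vectors to 2-D so that clustering by recording condition becomes visible. A minimal scikit-learn sketch; the random vectors stand in for utterance-level condition vectors collected from a trained model:

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder for utterance-level acoustic condition vectors
    # (e.g., 200 utterances, 256-dim each).
    vectors = np.random.randn(200, 256).astype(np.float32)

    points = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
    print(points.shape)  # (200, 2): coordinates for a scatter plot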

 

Performance:
AdaSpeech achieves higher MOS and SMOS scores than the baselines, demonstrating better adaptation quality while fine-tuning only a small number of parameters. It also shows robustness across datasets with varying acoustic conditions.

 

Contributions:
Introduced acoustic condition modeling and conditional layer normalization for adaptive TTS, enabling high-quality custom voices with few adaptation parameters, demonstrated through comprehensive evaluations.


Last Accessed: 7/17/2024

NSF Award #2346473