Authors:
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu
Description:
LightSpeech is a lightweight and fast text-to-speech (TTS) model designed with neural architecture search (NAS) to reduce memory footprint and inference latency, making it suitable for deployment on resource-constrained devices such as mobile phones and embedded systems.
Training and Data:
LightSpeech uses NAS, specifically GBDT-NAS, to automatically discover efficient architectures within a carefully designed search space. The process starts from FastSpeech 2: its components are first profiled to locate the memory and latency bottlenecks (chiefly the encoder and decoder), and GBDT-NAS then searches the space for efficient encoder and decoder architectures.
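A minimal, illustrative sketch of the predictor-based search idea behind GBDT-NAS is given below: a gradient boosting regressor is fit on a small pool of evaluated architectures and then used to rank unseen candidates. The operation names, layer count, and scoring function are assumptions for illustration, not the authors' implementation.

    import random
    from sklearn.ensemble import GradientBoostingRegressor

    # Hypothetical per-layer candidate operations (MHSA and SepConv kernel sizes).
    OPS = ["mhsa", "sepconv_k5", "sepconv_k9", "sepconv_k13",
           "sepconv_k17", "sepconv_k21", "sepconv_k25"]
    NUM_LAYERS = 8  # assumed: encoder + decoder layers searched jointly

    def sample_arch():
        return [random.randrange(len(OPS)) for _ in range(NUM_LAYERS)]

    def encode(arch):
        # One-hot encode each layer's operation choice for the GBDT predictor.
        feat = []
        for op in arch:
            onehot = [0] * len(OPS)
            onehot[op] = 1
            feat.extend(onehot)
        return feat

    def measure_quality(arch):
        # Placeholder for the expensive step (training/evaluating the TTS model,
        # e.g. validation mel loss); replaced by a dummy score here.
        return random.random()

    # Fit the predictor on a small pool of evaluated architectures.
    pool = [sample_arch() for _ in range(100)]
    predictor = GradientBoostingRegressor().fit(
        [encode(a) for a in pool], [measure_quality(a) for a in pool])

    # Rank a large set of unseen candidates by predicted quality; the top ones
    # would then be evaluated for real in the next search round.
    candidates = [sample_arch() for _ in range(2000)]
    preds = predictor.predict([encode(a) for a in candidates])
    top_candidates = [candidates[i] for i in preds.argsort()[::-1][:10]]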
Advantages:
LightSpeech achieves a 15x reduction in model size, 16x fewer MACs, and 6.5x faster inference speed on a CPU compared to FastSpeech 2, with comparable audio quality. This makes it highly suitable for deployment in resource-constrained environments without significant loss in performance.
Limitations:
While LightSpeech is highly efficient, obtaining it requires NAS infrastructure and a carefully hand-designed search space. The quality of the resulting model also depends on the search algorithm and the proxy metrics used to guide the NAS process.
Model Architecture:
The final LightSpeech model consists of an encoder and a decoder, both built using a combination of multi-head self-attention (MHSA) and depthwise separable convolutions (SepConv) with varying kernel sizes. The variance predictors (duration and pitch) also use SepConv for reduced computational complexity.
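As a concrete illustration of the SepConv building block, here is a minimal PyTorch sketch of a depthwise separable 1-D convolution layer of the kind used in the encoder and decoder; the hidden size, normalization, activation, and residual connection are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class SepConv1d(nn.Module):
        def __init__(self, channels, kernel_size):
            super().__init__()
            # Depthwise: one filter per channel (groups=channels), then a 1x1 pointwise mix.
            self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2, groups=channels)
            self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
            self.norm = nn.LayerNorm(channels)
            self.act = nn.ReLU()

        def forward(self, x):
            # x: (batch, time, channels); Conv1d expects (batch, channels, time).
            y = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
            return self.norm(self.act(y)) + x  # residual connection (assumed)

    # Example: hidden size 256 with kernel size 9, one of the candidate kernel sizes.
    layer = SepConv1d(channels=256, kernel_size=9)
    out = layer(torch.randn(2, 100, 256))  # (batch=2, 100 frames, 256 channels)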
Dependencies:
Dependencies include TensorFlow or PyTorch for model training and inference, librosa for audio processing and spectrogram computation, and Parallel WaveGAN for waveform generation. The NAS process relies on GBDT-NAS for architecture search and optimization.
Synthesis:
Uses non-autoregressive generation to synthesize speech in parallel. The model generates mel-spectrograms from input text in a single forward pass, and a pre-trained Parallel WaveGAN vocoder then converts them into audio waveforms. Synthesis benefits from the NAS-discovered architecture, which makes this computation faster and more efficient.
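The overall flow can be sketched as follows; the function and argument names are hypothetical placeholders standing in for the grapheme-to-phoneme frontend, the LightSpeech acoustic model, and the Parallel WaveGAN generator.

    import torch

    @torch.no_grad()
    def synthesize(text, g2p, acoustic_model, vocoder, hop_size=256):
        # g2p is assumed to return a list of integer phoneme ids for the input text.
        phoneme_ids = torch.tensor([g2p(text)])   # (1, num_phonemes)
        # One parallel forward pass: predicted durations expand phonemes to frames,
        # and all mel frames are generated at once (non-autoregressive).
        mel = acoustic_model(phoneme_ids)         # (1, num_frames, n_mels)
        # The pretrained Parallel WaveGAN generator maps mel frames to a waveform.
        wav = vocoder(mel)                        # (1, num_frames * hop_size)
        return wav.squeeze(0).cpu().numpy()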
Dataset:
The LJSpeech dataset is used for training and evaluation. It contains 13,100 pairs of text and speech data with approximately 24 hours of speech audio. The dataset is split into training (12,900 samples), dev (100 samples), and test (100 samples) sets. Text is converted to phonemes, and waveforms are transformed into mel-spectrograms.
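A minimal sketch of that split, assuming the standard LJSpeech-1.1 layout (a metadata.csv with one "id|transcript|normalized transcript" line per clip); the exact ordering of the authors' split is not specified here.

    with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
        samples = [line.strip().split("|") for line in f]   # 13,100 entries

    train, dev, test = samples[:12900], samples[12900:13000], samples[13000:]
    print(len(train), len(dev), len(test))  # 12900 100 100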
Preprocessing:
Preprocessing involves converting text to phonemes with a grapheme-to-phoneme (G2P) tool and transforming raw waveforms into mel-spectrograms. The frame (FFT) size is 1024, the hop size is 256, and the sample rate is 22,050 Hz. Before the architecture search, the existing FastSpeech 2 components are profiled to identify the computational bottlenecks to optimize.
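A short sketch of both preprocessing steps with the stated hyperparameters; the specific G2P package (g2p_en) and the 80 mel bins are assumptions borrowed from common FastSpeech 2 setups rather than details given in this summary.

    import librosa
    import numpy as np
    from g2p_en import G2p  # assumed grapheme-to-phoneme tool

    # Text -> phoneme sequence.
    phonemes = G2p()("in being comparatively modern.")

    # Waveform -> log-mel-spectrogram with frame size 1024, hop size 256, 22,050 Hz.
    wav, sr = librosa.load("LJ001-0002.wav", sr=22050)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         win_length=1024, hop_length=256, n_mels=80)
    log_mel = np.log(np.clip(mel, 1e-5, None)).T  # (num_frames, 80)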
Evaluation Metrics:
Evaluated using comparative mean opinion score (CMOS) for audio quality, model size (number of parameters), multiply-accumulate operations (MACs) for computational complexity, and real-time factor (RTF) for inference speed on a CPU. The evaluation includes comparisons with FastSpeech 2 and a manually designed lightweight FastSpeech 2 baseline.
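The efficiency metrics can be computed as sketched below for a PyTorch acoustic model; the thop package for MAC counting is an assumption about tooling, and RTF is taken here as inference time divided by the duration of the generated audio (i.e., time to synthesize one second of speech).

    import time
    import torch
    from thop import profile  # assumed third-party MAC counter

    def efficiency_metrics(model, phoneme_ids, sample_rate=22050, hop_size=256):
        num_params = sum(p.numel() for p in model.parameters())
        macs, _ = profile(model, inputs=(phoneme_ids,))
        start = time.time()
        with torch.no_grad():
            mel = model(phoneme_ids)              # (1, num_frames, n_mels)
        elapsed = time.time() - start
        audio_seconds = mel.shape[1] * hop_size / sample_rate
        return {"params": num_params, "macs": macs, "rtf": elapsed / audio_seconds}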
Performance:
LightSpeech achieves a 15x reduction in model size, 16x fewer MACs, and 6.5x faster inference speed on CPU compared to FastSpeech 2, while maintaining comparable audio quality as measured by CMOS evaluations. This demonstrates its suitability for deployment in environments with limited computational resources.
Contributions:
Introduced a novel lightweight and fast TTS model (LightSpeech) leveraging NAS to optimize architecture for deployment in resource-constrained devices. Demonstrated significant improvements in model efficiency without compromising on audio quality. Provided a comprehensive evaluation of the model’s performance.