DenoiSpeech

Authors:
Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, Tie-Yan Liu

 

Description:
DenoiSpeech is a text-to-speech (TTS) system designed to synthesize clean speech from a model trained on noisy speech data. Its fine-grained, frame-level noise condition module models the noise in each mel-spectrogram frame, outperforming previous methods that rely on coarse-grained (utterance-level) noise conditions or on pre-enhanced speech data.
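
As a rough, shapes-only illustration of the granularity difference, the PyTorch snippet below contrasts a single utterance-level noise embedding with per-frame embeddings; all tensor sizes and names are hypothetical, not taken from the released implementation:

import torch

batch, frames, dim = 4, 200, 256                     # hypothetical sizes

# Coarse-grained conditioning: one noise embedding per utterance,
# broadcast over all mel frames.
utt_noise = torch.randn(batch, 1, dim).expand(batch, frames, dim)

# Fine-grained conditioning (DenoiSpeech): a separate noise embedding
# for every mel-spectrogram frame.
frame_noise = torch.randn(batch, frames, dim)

decoder_input = torch.randn(batch, frames, dim)
conditioned = decoder_input + frame_noise            # applied frame by frame
print(conditioned.shape)                             # torch.Size([4, 200, 256])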

 

Training and Data:
DenoiSpeech jointly trains the TTS model and the noise condition module. The noise extractor leverages both paired noisy data (noisy speech with its text) and unpaired noisy speech data, and is optimized together with the TTS model to improve overall performance and generalization.
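
A minimal sketch of how such a joint objective could be wired up in PyTorch. The loss terms, names, and weights below are illustrative assumptions, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def joint_loss(mel_pred, mel_target, noise_pred, noise_target,
               ctc_adv_loss, w_noise=1.0, w_adv=0.1):
    # TTS reconstruction loss (mel-spectrogram L1, FastSpeech 2-style)
    tts_loss = F.l1_loss(mel_pred, mel_target)
    # Supervision for the noise extractor on noisy data
    noise_loss = F.l1_loss(noise_pred, noise_target)
    # Adversarial CTC term discourages speech content in the extracted noise
    return tts_loss + w_noise * noise_loss + w_adv * ctc_adv_loss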

 

Advantages:
DenoiSpeech significantly improves the quality of synthesized speech over previous methods, especially in real-world noisy conditions. It uses fine-grained frame-level noise conditions to model complex noise patterns, resulting in clearer and more natural synthesized speech.

 

Limitations:
The model requires careful tuning of the noise extractor and noise encoder, and real-world performance may still depend on the diversity and representativeness of the training data. Future work could explore handling even more diverse noise types and applying few-shot learning.

 

Model Architecture:
The architecture includes a phoneme encoder, a length regulator, a noise condition module, a pitch predictor, and a mel-spectrogram decoder. The noise condition module contains a UNet-based noise extractor and a noise encoder, along with an adversarial CTC module that discourages the extractor from capturing speech content, so that only noise is extracted.
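
A structural sketch in PyTorch of how these modules might compose. The submodules are simplified placeholders (a single linear layer stands in for the UNet-based extractor and noise encoder pair), and all sizes are assumptions:

import torch
import torch.nn as nn

class DenoiSpeechSketch(nn.Module):
    """Structural sketch of the described pipeline; layer choices
    and sizes are placeholder assumptions."""
    def __init__(self, n_phonemes=80, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_encoder = nn.Embedding(n_phonemes, d_model)
        self.pitch_predictor = nn.Linear(d_model, 1)
        self.noise_encoder = nn.Linear(n_mels, d_model)
        self.decoder = nn.Linear(d_model, n_mels)

    @staticmethod
    def length_regulate(h, durations):
        # Expand each phoneme hidden state to its frame count
        # (durations shared across the batch here, for simplicity).
        return torch.repeat_interleave(h, durations, dim=1)

    def forward(self, phonemes, durations, frame_noise):
        h = self.phoneme_encoder(phonemes)        # (B, T_phone, d_model)
        h = self.length_regulate(h, durations)    # (B, T_frame, d_model)
        pitch = self.pitch_predictor(h)           # (B, T_frame, 1)
        h = h + self.noise_encoder(frame_noise)   # add frame-level noise condition
        return self.decoder(h), pitch             # predicted mel, predicted pitch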

 

Dependencies:
Dependencies include PyTorch for model training, a UNet architecture for the noise extractor, FastSpeech 2 as the TTS backbone, and tools for phoneme conversion, mel-spectrogram extraction, and audio preprocessing. GPUs are required for efficient training.

 

Synthesis:
DenoiSpeech combines phoneme encoding, length regulation, noise condition modeling, pitch prediction, and mel-spectrogram decoding to generate high-quality speech from a model trained on noisy data. The noise condition module captures detailed, frame-level noise information during training, which improves synthesis quality.
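
A hypothetical inference call for the DenoiSpeechSketch class defined in the Model Architecture section, assuming, as one plausible convention, that an all-zero noise condition stands in for "clean" at synthesis time:

model = DenoiSpeechSketch()                        # sketch class from above
phonemes = torch.randint(0, 80, (1, 12))           # dummy phoneme IDs
durations = torch.randint(2, 6, (12,))             # dummy frames-per-phoneme
n_frames = int(durations.sum())
clean_condition = torch.zeros(1, n_frames, 80)     # zero condition as stand-in for "no noise"
mel, pitch = model(phonemes, durations, clean_condition)
print(mel.shape)                                   # torch.Size([1, n_frames, 80])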

 

Dataset:
The datasets include the VCTK corpus for clean speech, the NonSpeech100 collection as the noise source for constructing artificial noisy datasets, and an internal dataset of real-world noisy speech. Together, these provide a range of noise conditions for robust training and evaluation.
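
As a sketch of how an artificial noisy dataset of this kind can be constructed, the snippet below mixes a noise clip into clean speech at a target signal-to-noise ratio; the SNR range and dummy signals are assumptions, not the paper's settings:

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Loop the noise clip if it is shorter than the speech
    if len(noise) < len(clean):
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[:len(clean)]
    # Scale noise power to hit the target signal-to-noise ratio
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(22050)                 # 1 s of dummy "speech"
noise = rng.standard_normal(8000)                  # dummy noise clip
noisy = mix_at_snr(clean, noise, snr_db=rng.uniform(5, 20))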

 

Preprocessing:
Speech data is preprocessed into phoneme sequences and mel-spectrograms. Phoneme conversion uses an open-source tool, and mel-spectrogram extraction follows established parameters (frame size of 50 ms, hop size of 12.5 ms, and sample rate of 22050 Hz) for consistency.
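
A sketch of mel-spectrogram extraction with librosa using the stated parameters, with frame and hop sizes rounded to whole samples; the 80 mel bins and log compression are common-practice assumptions rather than confirmed settings:

import numpy as np
import librosa

sr = 22050
n_fft = int(0.050 * sr)    # 50 ms frame  -> 1102 samples (rounded)
hop = int(0.0125 * sr)     # 12.5 ms hop  ->  275 samples (rounded)

wav = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # stand-in audio
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                     hop_length=hop, n_mels=80)
log_mel = np.log(np.maximum(mel, 1e-5))            # common log compression
print(log_mel.shape)                               # (80, n_frames)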

 

Evaluation Metrics:
Evaluated using MOS (mean opinion score) tests to measure perceptual quality, comparing synthesized speech against ground-truth clean recordings, noisy recordings, and baseline methods. MOS scores are collected from native speakers who rate the audio quality.
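
A small sketch of how a MOS estimate with an approximate 95% confidence interval is typically computed from listener ratings; the rating matrix here is dummy data:

import numpy as np

def mos_with_ci(ratings, z=1.96):
    # Mean opinion score with an approximate 95% confidence interval
    scores = np.asarray(ratings, dtype=float).ravel()
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(scores.size)
    return mean, half_width

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(20, 30))        # dummy: 20 listeners x 30 utterances, 1-5 scale
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")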

 

Performance:
DenoiSpeech achieves higher MOS scores than baseline methods, demonstrating superior audio quality and robustness when trained on noisy data. The model handles complex noise patterns, producing clear and natural synthesized speech across a variety of noisy environments.

 

Contributions:
Introduced a novel frame-level noise modeling approach for TTS, demonstrated significant improvements in audio quality over previous methods, provided comprehensive evaluations on artificial and real-world datasets, and released audio samples and implementation details for reproducibility.

Link to tool


Last Accessed: 7/17/2024

NSF Award #2346473