AISHELL-3: A MULTI-SPEAKER MANDARIN TTS CORPUS AND THE BASELINES
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020
In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers.
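To give a sense of how such a corpus is typically consumed, below is a minimal sketch of indexing wav files by speaker for multi-speaker TTS training; the directory layout and path are illustrative assumptions, not the corpus's documented structure.

from collections import defaultdict
from pathlib import Path

def index_corpus(root: str) -> dict[str, list[Path]]:
    """Group wav files by speaker, assuming one subdirectory per speaker."""
    by_speaker: dict[str, list[Path]] = defaultdict(list)
    for wav in Path(root).rglob("*.wav"):
        by_speaker[wav.parent.name].append(wav)  # assumed: parent dir names the speaker
    return dict(by_speaker)

index = index_corpus("AISHELL-3/train/wav")  # hypothetical path
print(len(index), "speakers,", sum(len(v) for v in index.values()), "utterances")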
Controllable Context-aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su
2021
Generation, AI x Human Dialogue, Chinese
This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors, such as filled pauses and prolongations, and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.
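As a rough illustration of the behavior-control idea, the PyTorch sketch below conditions a text encoder on per-token spontaneous-behavior labels; the label set, dimensions, and module names are assumptions, not the paper's implementation.

import torch
import torch.nn as nn

BEHAVIORS = {"none": 0, "filled_pause": 1, "prolongation": 2}  # assumed tag set

class BehaviorConditionedEncoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.phone_emb = nn.Embedding(vocab_size, dim)
        self.behavior_emb = nn.Embedding(len(BEHAVIORS), dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, phones: torch.Tensor, behaviors: torch.Tensor) -> torch.Tensor:
        # Adding behavior embeddings lets a downstream decoder realize (or
        # suppress) spontaneous phenomena at specific token positions.
        x = self.phone_emb(phones) + self.behavior_emb(behaviors)
        out, _ = self.rnn(x)
        return out

enc = BehaviorConditionedEncoder(vocab_size=100)
phones = torch.randint(0, 100, (1, 12))
tags = torch.zeros(1, 12, dtype=torch.long)
tags[0, 5] = BEHAVIORS["filled_pause"]  # force a filled pause mid-utterance
hidden = enc(phones, tags)  # (1, 12, 256) context for an acoustic decoder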
Conversational End-to-End TTS for Voice Agents
Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie
2021
Generation, AI x Human Dialogue, Chinese
Building a high-quality conversational TTS system remains a challenge due to limitations in corpus availability and modeling capability. This study aims at building a conversational TTS system for a voice agent under a sequence-to-sequence modeling framework.
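One common way to inject conversational context into a sequence-to-sequence TTS system is to summarize the preceding dialogue turns into a single conditioning vector; the sketch below is illustrative only and is not the paper's architecture.

import torch
import torch.nn as nn

class DialogueContextEncoder(nn.Module):
    """Pool sentence embeddings of previous turns into one context vector."""
    def __init__(self, sent_dim: int = 384, ctx_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(sent_dim, ctx_dim, batch_first=True)

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (batch, n_turns, sent_dim), oldest turn first
        _, last_hidden = self.rnn(turn_embeddings)
        return last_hidden.squeeze(0)  # (batch, ctx_dim) conversation summary

ctx_enc = DialogueContextEncoder()
history = torch.randn(1, 3, 384)  # embeddings of three previous turns (assumed)
ctx = ctx_enc(history)            # e.g., concatenated to the TTS encoder states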
SPONTTS: MODELING AND TRANSFERRING SPONTANEOUS STYLE FOR TTS
Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
2024
The paper introduces SponTTS, a two-stage text-to-speech (TTS) approach that models and transfers spontaneous speaking styles via neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS generates natural, expressive spontaneous speech for target speakers, even in zero-shot scenarios where the target speaker has no spontaneous training data.
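To make the two-stage structure concrete, here is a structural sketch in the spirit of the described pipeline (stage 1: text to bottleneck features; stage 2: bottleneck features plus a speaker embedding to mel frames). The module internals and dimensions are placeholders, not the paper's model.

import torch
import torch.nn as nn

class TextToBN(nn.Module):
    """Stage 1: predict speaker-independent bottleneck (BN) features that
    carry prosody and spontaneous phenomena."""
    def __init__(self, vocab: int = 100, bn_dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, 256)
        self.proj = nn.Linear(256, bn_dim)

    def forward(self, phones: torch.Tensor) -> torch.Tensor:
        return self.proj(self.emb(phones))

class BNToMel(nn.Module):
    """Stage 2: render BN features as mel frames for a target speaker; the
    speaker embedding is what enables zero-shot transfer to unseen voices."""
    def __init__(self, bn_dim: int = 64, spk_dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(bn_dim + spk_dim, n_mels)

    def forward(self, bn: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        spk = spk_emb.unsqueeze(1).expand(-1, bn.size(1), -1)
        return self.proj(torch.cat([bn, spk], dim=-1))

phones = torch.randint(0, 100, (1, 20))
spk_emb = torch.randn(1, 128)  # e.g., from a speaker-verification model
mel = BNToMel()(TextToBN()(phones), spk_emb)  # (1, 20, 80)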