Chinese

AISHELL-3: A MULTI-SPEAKER MANDARIN TTS CORPUS AND THE BASELINES


Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020

Dataset, Chinese


In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers.
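For orientation, here is a minimal sketch of verifying the corpus statistics quoted above by walking a multi-speaker corpus on disk. The directory layout (speaker folders containing wav files) and the path in CORPUS_ROOT are assumptions for illustration, not the official AISHELL-3 specification.

```python
# Sketch: count speakers and total hours in a corpus laid out as
# <root>/<speaker_id>/<utterance_id>.wav (assumed layout, not the
# official AISHELL-3 directory structure).
import wave
from pathlib import Path
from collections import defaultdict

CORPUS_ROOT = Path("AISHELL-3/train/wav")  # hypothetical path

def corpus_stats(root: Path) -> tuple[int, float]:
    """Return (number of speakers, total hours of audio)."""
    seconds_per_speaker = defaultdict(float)
    for wav_path in root.glob("*/*.wav"):
        speaker = wav_path.parent.name
        with wave.open(str(wav_path), "rb") as wf:
            seconds_per_speaker[speaker] += wf.getnframes() / wf.getframerate()
    return len(seconds_per_speaker), sum(seconds_per_speaker.values()) / 3600

if __name__ == "__main__":
    n_speakers, hours = corpus_stats(CORPUS_ROOT)
    print(f"{n_speakers} speakers, {hours:.1f} hours")  # expect ~218 / ~85 h
```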

Controllable Context-aware Conversational Speech Synthesis


Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

2021

Generation, AI x Human Dialogue, Chinese


This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors, such as filled pauses and prolongations, and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.
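To make the "predict, then control" idea concrete, the toy sketch below treats spontaneous behaviors as token-level decisions made before synthesis. The predict_behaviors function is a hypothetical stand-in for the paper's learned predictor, and the FP (filled pause) and PL (prolongation) tags are illustrative labels, not the paper's actual scheme.

```python
# Toy sketch: tag each token with a spontaneous-behavior decision, then
# render the augmented token stream for the TTS front end.

def render_with_behaviors(tokens, tags):
    """Insert behavior markers into the token stream for synthesis."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "FP":
            out.append("<filled_pause>")  # e.g. "uh" / "em" before the word
        out.append(token + ("<prolong>" if tag == "PL" else ""))
    return out

def predict_behaviors(tokens, context):
    """Hypothetical predictor; a trained, context-aware model would go here."""
    # Toy heuristic: hesitate before the first word when replying in context.
    return ["FP" if i == 0 and context else "O" for i, _ in enumerate(tokens)]

tokens = ["我", "觉得", "这个", "方案", "不错"]
tags = predict_behaviors(tokens, context=["你怎么看?"])
print(render_with_behaviors(tokens, tags))
```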

Conversational End-to-End TTS for Voice Agents


Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

2021

Generation, AI x Human Dialogue, Chinese


Building a high-quality conversational TTS system remains a challenge due to limitations in both corpora and modeling capability. This study aims to build a conversational TTS system for a voice agent under a sequence-to-sequence modeling framework.

SPONTTS: MODELING AND TRANSFERRING SPONTANEOUS STYLE FOR TTS


Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie

2024

Generation, Chinese


The paper introduces SponTTS, a two-stage approach for text-to-speech (TTS) that models and transfers spontaneous speaking styles using neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS effectively generates natural and expressive spontaneous speech for target speakers, even in zero-shot scenarios for speakers without prior spontaneous data.
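The sketch below illustrates the two-stage structure described above: stage one maps text to speaker-independent bottleneck (BN) features intended to carry prosody and spontaneous phenomena, and stage two maps BN features plus a speaker embedding to mel spectrograms, so an unseen speaker's embedding can be swapped in for zero-shot transfer. The module names (Text2BN, BN2Mel) and all dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of a two-stage text -> BN -> mel pipeline in the
# spirit of SponTTS; sizes and module names are assumptions for illustration.
import torch
import torch.nn as nn

BN_DIM, MEL_DIM, SPK_DIM, HID = 256, 80, 128, 512

class Text2BN(nn.Module):          # stage 1: text -> bottleneck features
    def __init__(self, vocab=100):
        super().__init__()
        self.emb = nn.Embedding(vocab, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, BN_DIM)
    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)

class BN2Mel(nn.Module):           # stage 2: bottleneck + speaker -> mel
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(BN_DIM + SPK_DIM, HID, batch_first=True)
        self.out = nn.Linear(HID, MEL_DIM)
    def forward(self, bn, spk):
        spk = spk.unsqueeze(1).expand(-1, bn.size(1), -1)
        h, _ = self.rnn(torch.cat([bn, spk], dim=-1))
        return self.out(h)

tokens = torch.randint(0, 100, (1, 12))   # dummy phoneme sequence
spk = torch.randn(1, SPK_DIM)             # target-speaker embedding (zero-shot)
mel = BN2Mel()(Text2BN()(tokens), spk)
print(mel.shape)                          # torch.Size([1, 12, 80])
```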

NSF Award #2346473