Japanese

Towards human-like spoken dialogue generation between AI agents from written dialogue


Kentaro Mitsui, Yukiya Hono, Kei Sawada

2023

AI x AI Dialogue, Generation, Japanese


This study proposes CHATS (CHatty Agents Text-to-Speech), a discrete token-based system designed to generate spoken dialogue from written dialogue. The system generates speech for both the speaker and the listener simultaneously, using only the speaker-side transcription, which eliminates the need to transcribe backchannels or laughter. CHATS also facilitates natural turn-taking: when turns do not overlap, it determines an appropriate duration of silence after each utterance; when they do, it initiates generation of the overlapping speech based on the phoneme sequence of the next utterance.
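
The turn-taking behavior described above can be pictured with a minimal sketch. This is a hypothetical illustration only: the `Turn` fields, the `plan_transition` function, and the toy silence heuristic are assumptions made for exposition, not the actual CHATS model, which predicts these quantities from learned discrete token representations.

```python
# Hypothetical sketch of the turn-transition decision described in the summary.
# Names and heuristics are illustrative placeholders, not the CHATS implementation.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    speaker_text: str                   # transcription (speaker side only)
    next_phonemes: Optional[List[str]]  # phoneme sequence of the next utterance
    overlaps_next: bool                 # whether the next turn starts in overlap

def plan_transition(turn: Turn) -> dict:
    """Decide how the current turn hands over to the next one."""
    if turn.overlaps_next and turn.next_phonemes:
        # Overlap case: begin generating the next utterance's speech from its
        # phoneme sequence before the current utterance has finished.
        return {"mode": "overlap", "condition_on": turn.next_phonemes}
    # Non-overlap case: insert a predicted silence after the utterance.
    silence_sec = 0.2 if turn.speaker_text.endswith("?") else 0.5  # toy heuristic
    return {"mode": "gap", "silence_sec": silence_sec}

# Example usage
turn = Turn(speaker_text="How was your weekend?",
            next_phonemes=["i", "t", "w", "a", "z"],
            overlaps_next=False)
print(plan_transition(turn))  # {'mode': 'gap', 'silence_sec': 0.2}
```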

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora


Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

2019

Dataset, Generation, Japanese


This study shows that multi-speaker TTS models, particularly ensembles of models trained on different subsets of the data, match or outperform single-speaker models in synthetic speech quality, even when only limited data is available per speaker. The ensemble approach notably improves output for underrepresented speakers by pooling data across speakers.
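
The ensemble idea can be sketched as follows. This is a hypothetical illustration under stated assumptions: the round-robin splitting scheme and the `train_tts` stub are stand-ins for exposition, not the authors' exact training recipe.

```python
# Hypothetical sketch: split an imbalanced multi-speaker corpus into subsets,
# train one multi-speaker model per subset, and keep the resulting ensemble.
import random
from collections import defaultdict
from typing import Dict, List

def make_subsets(corpus: Dict[str, List[str]], n_subsets: int) -> List[Dict[str, List[str]]]:
    """Deal each speaker's utterances round-robin into n_subsets splits,
    so every subset still contains all speakers (some with few utterances)."""
    subsets = [defaultdict(list) for _ in range(n_subsets)]
    for speaker, utts in corpus.items():
        shuffled = random.sample(utts, k=len(utts))
        for i, utt in enumerate(shuffled):
            subsets[i % n_subsets][speaker].append(utt)
    return [dict(s) for s in subsets]

def train_tts(subset: Dict[str, List[str]]) -> str:
    """Placeholder for training one multi-speaker TTS model on a subset."""
    return f"model trained on {sum(len(v) for v in subset.values())} utterances"

# Example: one well-resourced speaker and one underrepresented speaker.
corpus = {"spk_A": [f"a{i}" for i in range(1000)],
          "spk_B": [f"b{i}" for i in range(50)]}
ensemble = [train_tts(s) for s in make_subsets(corpus, n_subsets=4)]
print(ensemble)
```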

 

NSF Award #2346473