Generation

Controllable Context-aware Conversational Speech Synthesis


Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

2021

Generation, AI x Human Dialogue, Chinese


This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors (such as filled pauses and prolongations) and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.

Conversational End-to-End TTS for Voice Agents


Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

2021

Generation, AI x Human Dialogue, Chinese


Building a high-quality conversational TTS system remains challenging due to limitations in corpora and modeling capability. This study builds a conversational TTS system for a voice agent within a sequence-to-sequence modeling framework.

Evaluating Deepfake Speech and ASV Systems on African Accents


Kweku Andoh Yamoah, Hussein Baba Fuseini, David Ebo Adjepon-Yamoah, Dennis Asamoah Owusu

2023

Generation, Detection, English


This work investigates whether ASV systems can be fooled by deepfake speech generated in African accents. Prior studies primarily concentrated on native English speakers, whereas this research centers on African English speakers who frequently interact with digital systems. Experiments assessed a selected DNN-based deepfake audio system and an ASV system, demonstrating that ASV systems are less susceptible to deepfake audio deception in African accents.

Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis


Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash

2024

Generation, Discernment, Detection, Dataset, Accessibility, American Sign Language


This research presents a positive application of deepfake technology: generating upper-body video that performs sign language for the Deaf and Hard of Hearing (DHoH) community.

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech


Tomáš Nekvinda, Ondřej Dušek

2020

Generation, Multiple Languages


This paper introduces an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.

SoundStorm: Efficient Parallel Audio Generation


Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

2023

AI x AI Dialogue, Generation, English


We present SoundStorm, a model for efficient, non-autoregressive audio generation. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.

SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems


Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu

2024

AI x AI Dialogue, Generation, English


In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and show excellent scalability even with up to 25 agents, which can be applied to tasks such as drama creation and audio novel generation.

SponTTS: Modeling and Transferring Spontaneous Style for TTS


Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie

2024

Generation, Chinese


The paper introduces SponTTS, a two-stage approach for text-to-speech (TTS) that models and transfers spontaneous speaking styles using neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS effectively generates natural and expressive spontaneous speech for target speakers, even in zero-shot scenarios for speakers without prior spontaneous data.

Towards Human-like Spoken Dialogue Generation Between AI Agents from Written Dialogue


Kentaro Mitsui, Yukiya Hono, Kei Sawada

2023

AI x AI Dialogue, Generation, Japanese


This study proposes CHATS – CHatty Agents Text-to-Speech – a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap.

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora


Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

2019

Dataset, Generation, Japanese


This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.

TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese


Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Joao Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio

2022

Dataset, Generation, Brazilian Portuguese


This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, with a MOS of 4.03.

Voice Conversion and Spoofed Voice Detection from Parallel English and Urdu Corpus using Cyclic GANs


Summra Saleem, Aniqa Dilawari, Muhammad Usman Ghani Khan, Muhammad Husnain

2019

Generation, Detection, English, Urdu


This study addresses the threat of identity theft in automatic speech verification systems using a Cyclic GAN-based model to generate and detect spoofed voices, specifically focusing on Urdu and English speech datasets. By leveraging adversarial examples for spoof detection and using Gradient Boosting to differentiate real from fake voices, the approach shows promise but requires further data and refinement for practical large-scale implementation.

Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers


Kurniawati Azizah

2024

Generation, Accessibility, English


This research enhances zero-shot voice cloning TTS for individuals with dysphonia, improving speaker similarity, intelligibility, and sound quality through adjustments in model architecture and loss functions. The optimized model shows notable improvements over the baseline in cosine similarity, character error rate, and mean opinion score, making dysphonic speech clearer and closer in quality to the original voices of speakers with dysphonia.

NSF Award #2346473