Considering Temporal Connection between Turns for Conversational Speech Synthesis
Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling
2024
Most studies of conversational speech synthesis focus only on the synthesis quality of the current speaker's turn and neglect the temporal relationship between interlocutors' turns. We therefore consider the temporal connection between turns for conversational speech synthesis, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which there is no overlap between turns and only one history turn is considered.
Embodied Conversational AI Agents in a Multi-modal Multi-agent Competitive Dialogue
Rahul R. Divekar, Xiangyang Mou, Lisha Chen, Maira Gatti de Bayser, Melina Alberio Guerra, Hui Su
2019
In a setting where two AI agents embodied as animated humanoid avatars are engaged in a conversation with one human and each other, we see two challenges. One, each AI agent must determine which of them is being addressed. Two, each AI agent must decide whether it may, could, or should speak at the end of a turn. This work brings these two challenges together and explores the participation of AI agents in multiparty conversations.
Evaluating Comprehension of Natural and Synthetic Conversational Speech
Mirjam Wester, Oliver Watts and Gustav Eje Henter
2016
AI x Human Dialogue, Discernment, English
In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, this paper investigates comprehension of natural and synthetic speech dialogues.
Evaluating Deepfake Speech and ASV Systems on African Accents
Kweku Andoh Yamoah, Hussein Baba Fuseini, David Ebo Adjepon-Yamoah, Dennis Asamoah Owusu
2023
Generation, Detection, English
This work investigates whether ASV systems can be fooled by deepfake speech generated from African accents. Prior studies concentrated primarily on native English speakers; this research centers on African English speakers, who frequently interact with digital systems. Experiments assessing a selected DNN-based deepfake audio system and an ASV system show that the ASV system is less susceptible to deception by deepfake audio in African-accented speech.
Faked Speech Detection with Zero Prior Knowledge
Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen
2024
Detection, Dataset, English, Arabic, Multiple Languages
This work introduces a neural-network-based classifier that blindly classifies input audio as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without reference or source recordings.
SoundStorm: Efficient Parallel Audio Generation
Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
2023
AI x AI Dialogue, Generation, English
We present SoundStorm, a model for efficient, non-autoregressive audio generation. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu
2024
AI x AI Dialogue, Generation, English
In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotion, and that it scales well even with up to 25 agents, enabling tasks such as drama creation and audio novel generation.
Chaeeun Han, Prasenjit Mitra, Syed Masum Billah
2024
Discernment, Accessibility, English
This paper explores how blind and sighted individuals perceive real and spoofed audio, highlighting differences and similarities between the groups.
Voice Conversion and Spoofed Voice Detection from Parallel English and Urdu Corpus using Cyclic GANs
Summra Saleem, Aniqa Dilawari, Muhammad Usman Ghani Khan, Muhammad Husnain
2019
Generation, Detection, English, Urdu
This study addresses the threat of identity theft in automatic speaker verification (ASV) systems, using a Cyclic GAN-based model to generate and detect spoofed voices, with a specific focus on Urdu and English speech datasets. By leveraging adversarial examples for spoof detection and using Gradient Boosting to differentiate real from fake voices, the approach shows promise but requires further data and refinement for practical large-scale deployment.
Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
Kurniawati Azizah
2024
Generation, Accessibility, English
This research enhances zero-shot voice cloning TTS for individuals with dysphonia, improving speaker similarity, intelligibility, and sound quality through adjustments in model architecture and loss functions. The optimized model shows notable improvements over the baseline in cosine similarity, character error rate, and mean opinion score, making dysphonic speech clearer and closer in quality to the original voices of speakers with dysphonia.
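The two objective metrics named in this entry, cosine similarity between speaker embeddings and character error rate (CER), are standard and can be sketched independently of the paper. The following is a minimal illustrative implementation, not the study's actual evaluation code; the embedding vectors and transcripts are placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors.

    Values near 1.0 indicate the synthesized voice is close to the
    target speaker's reference embedding.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def character_error_rate(reference, hypothesis):
    """CER = character-level Levenshtein distance / reference length.

    Lower is better; used here as a proxy for intelligibility of the
    synthesized speech after transcription by an ASR system.
    """
    m, n = len(reference), len(hypothesis)
    # Single-row dynamic-programming table (Wagner-Fischer).
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution
            prev = cur
    return dp[n] / m

# Toy usage with hypothetical embeddings and transcripts:
sim = cosine_similarity([0.2, 0.8, 0.1], [0.25, 0.75, 0.15])
cer = character_error_rate("hello world", "hxllo world")
```

In practice the embeddings would come from a pretrained speaker encoder and the hypothesis transcript from an ASR system; the mean opinion score (MOS) reported in the paper is a subjective listening-test metric with no code equivalent.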