Controllable Context-aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su
2021
Generation, AI x Human Dialogue, Chinese
This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors, such as filled pauses and prolongations, and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.
Conversational End-to-End TTS for Voice Agents
Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie
2021
Generation, AI x Human Dialogue, Chinese
Building high-quality conversational TTS remains a challenge due to limitations in corpora and modeling capability. This study aims at building conversational TTS for a voice agent under a sequence-to-sequence modeling framework.
Evaluating Deepfake Speech and ASV Systems on African Accents
Kweku Andoh Yamoah, Hussein Baba Fuseini, David Ebo Adjepon-Yamoah, Dennis Asamoah Owusu
2023
Generation, Detection, English
This work investigates whether ASV systems can be fooled by deepfake speech generated in African accents. Prior studies primarily concentrated on native English speakers; this research centers on African English speakers, who frequently interact with digital systems. Experiments assessed a selected DNN-based deepfake audio system and an ASV system, demonstrating that ASV systems are less susceptible to deception by deepfake audio in African accents.
Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis
Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
2024
Generation, Discernment, Detection, Dataset, Accessibility, American Sign Language
This research presents a positive application of deepfake technology: generating upper-body video that performs sign language for the Deaf and Hard of Hearing (DHoH) community.
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
2020
Generation, Multiple Languages
This paper introduces an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
SoundStorm: Efficient Parallel Audio Generation
Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
2023
AI x AI Dialogue, Generation, English
We present SoundStorm, a model for efficient, non-autoregressive audio generation. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu
2024
AI x AI Dialogue, Generation, English
In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and exhibits excellent scalability even with up to 25 agents, enabling applications such as drama creation and audio novel generation.
SponTTS: Modeling and Transferring Spontaneous Style for TTS
Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
2024
The paper introduces SponTTS, a two-stage approach for text-to-speech (TTS) that models and transfers spontaneous speaking styles using neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS effectively generates natural and expressive spontaneous speech for target speakers, even in zero-shot scenarios for speakers without prior spontaneous data.
Towards Human-like Spoken Dialogue Generation Between AI Agents from Written Dialogue
Kentaro Mitsui, Yukiya Hono, Kei Sawada
2023
AI x AI Dialogue, Generation, Japanese
This study proposes CHATS – CHatty Agents Text-to-Speech – a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap.
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa
2019
This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, João Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluísio
2022
Dataset, Generation, Brazilian Portuguese
This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, with a MOS of 4.03.
Voice Conversion and Spoofed Voice Detection from Parallel English and Urdu Corpus using Cyclic GANs
Summra Saleem, Aniqa Dilawari, Muhammad Usman Ghani Khan, Muhammad Husnain
2019
Generation, Detection, English, Urdu
This study addresses the threat of identity theft in automatic speaker verification systems using a Cyclic GAN-based model to generate and detect spoofed voices, specifically focusing on Urdu and English speech datasets. By leveraging adversarial examples for spoof detection and using Gradient Boosting to differentiate real from fake voices, the approach shows promise but requires further data and refinement for practical large-scale implementation.
Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
Kurniawati Azizah
2024
Generation, Accessibility, English
This research enhances zero-shot voice cloning TTS for individuals with dysphonia, improving speaker similarity, intelligibility, and sound quality through adjustments in model architecture and loss functions. The optimized model shows notable improvements over the baseline in cosine similarity, character error rate, and mean opinion score, making dysphonic speech clearer and closer in quality to the original voices of speakers with dysphonia.