AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020
Dataset, Chinese
In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers.
CMU Wilderness Multilingual Speech Dataset
Alan W Black, Language Technologies Institute, Carnegie Mellon University
2019
Dataset, Multiple Languages
This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset covering over 700 different languages and providing audio, aligned text, and word pronunciations.
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber
2020
Dataset, Multiple Languages
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification).
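Common Voice releases are also mirrored on the Hugging Face Hub; the minimal sketch below shows one way to stream an English split with the `datasets` library. The dataset identifier and field names are assumptions about the hosted mirror, not details from the paper.

```python
# Minimal sketch: streaming a Common Voice split from the Hugging Face Hub.
# The dataset identifier and field names are assumptions about the mirror,
# not something specified in the paper itself.
from datasets import load_dataset

cv_en = load_dataset(
    "mozilla-foundation/common_voice_11_0",  # assumed mirror name
    "en",                                     # language configuration
    split="validation",
    streaming=True,                           # avoid downloading the full split
)

for example in cv_en.take(3):
    # Each example carries the decoded audio plus its transcription.
    print(example["sentence"], example["audio"]["sampling_rate"])
```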
Konstantinos Papadopoulos and Eleni Koustriava
2015
Accessibility, Greek, Discernment
The present study examines the comprehension of texts presented via synthetic and natural speech in individuals with and without visual impairments. Twenty adults with visual impairments and 65 sighted adults participated in the study.
Considering Temporal Connection between Turns for Conversational Speech Synthesis
Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling
2024
AI x Human Dialogue, English
Most studies in conversational speech synthesis only focus on the synthesis performance of the current speaker’s turn and neglect the temporal relationship between turns of interlocutors. Therefore, we consider the temporal connection between turns for conversational speech synthesis, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which there is no overlap between turns and only one history turn is considered.
Controllable Context-aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su
2021
Generation, AI x Human Dialogue, Chinese
This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors, such as filled pauses and prolongations, and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.
Conversational End-to-End TTS for Voice Agents
Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie
2021
Generation, AI x Human Dialogue, Chinese
It is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework.
Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset
Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
2023
Discernment, Detection, Dataset, Urdu
This paper addresses the escalating challenges posed by deepfake attacks on Automatic Speaker Verification (ASV) systems. We present a novel Urdu deepfake audio dataset for deepfake detection, focusing on two spoofing attacks – Tacotron and VITS TTS.
Embodied Conversational AI Agents in a Multi-modal Multi-agent Competitive Dialogue
Rahul R. Divekar, Xiangyang Mou, Lisha Chen, Maira Gatti de Bayser, Melina Alberio Guerra, Hui Su
2019
AI x AI Dialogue, English
In a setting where two AI agents embodied as animated humanoid avatars are engaged in a conversation with one human and each other, we see two challenges. One, determination by the AI agents about which one of them is being addressed. Two, determination by the AI agents if they may/could/should speak at the end of a turn. This work brings these two challenges together and explores the participation of AI agents in multiparty conversations.
Evaluating comprehension of natural and synthetic conversational speech
Mirjam Wester, Oliver Watts and Gustav Eje Henter
2016
AI x Human Dialogue, Discernment, English
In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, this paper investigates comprehension of natural and synthetic speech dialogues.
Evaluating Deepfake Speech and ASV Systems on African Accents
Kweku Andoh Yamoah, Hussein Baba Fuseini, David Ebo Adjepon-Yamoah, Dennis Asamoah Owusu
2023
Generation, Detection, English
This work investigates whether ASV systems can be fooled by deepfake speech generated in African accents. Prior studies concentrated primarily on native English speakers; this research centers on African English speakers, who frequently interact with digital systems. Experiments assessed a selected DNN-based deepfake audio system and an ASV system, demonstrating that ASV systems are less susceptible to deepfake audio deception in African accents.
Faked Speech Detection with Zero Prior Knowledge
Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen
2024
Detection, Dataset, English, Arabic, Multiple Languages
This work introduces a neural network method to develop a classifier that blindly labels an input audio clip as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without reference recordings or access to the real source.
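The paper's exact network is not reproduced in this summary; the following is only a minimal PyTorch sketch of the general setup it describes, i.e., a binary classifier that takes a clip's log-mel spectrogram and predicts real vs. mimicked without any reference recording. The layer sizes and the 16 kHz sample rate are illustrative assumptions.

```python
# Illustrative sketch only: a small binary real-vs-mimicked audio classifier.
# This is NOT the architecture from the paper; sizes and features are assumed.
import torch
import torch.nn as nn
import torchaudio

class FakeSpeechClassifier(nn.Module):
    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Log-mel front end (assumed 16 kHz input).
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 2),  # logits: [real, mimicked]
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> (batch, 1, n_mels, frames)
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)
        return self.net(feats)

model = FakeSpeechClassifier()
logits = model(torch.randn(2, 16_000))  # two one-second dummy clips
print(logits.shape)                      # torch.Size([2, 2])
```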
Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis
Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
2024
Generation, Discernment, Detection, Dataset, Accessibility, ASL
This research presents a positive application of deepfake technology: generating upper-body video that performs sign language for the Deaf and Hard of Hearing (DHoH) community.
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger
2024
Dataset, Multiple Languages
This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), created using 82 TTS models, comprising 33 different architectures, to generate 378.0 hours of synthetic voice in 38 different languages.
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
2020
Generation, Multiple Languages
This paper introduces an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
OpenSLR
Jan “Yenda” Trmal
N/A
Dataset, Multiple Languages
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition.
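OpenSLR resources are plain HTTP downloads organized by resource number. As an illustration only (the URL below follows OpenSLR's usual resources/<number>/<file> layout, with SLR12 being LibriSpeech, and is an assumption rather than something stated on this page), a corpus can be fetched with the standard library:

```python
# Sketch: fetching one OpenSLR resource over plain HTTP.
# The URL (SLR12 / LibriSpeech "test-clean") is assumed from OpenSLR's
# usual resources/<number>/<file> layout.
import urllib.request

url = "https://www.openslr.org/resources/12/test-clean.tar.gz"
urllib.request.urlretrieve(url, "test-clean.tar.gz")
print("downloaded test-clean.tar.gz")
```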
SoundStorm: Efficient Parallel Audio Generation
Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
2023
AI x AI Dialogue, Generation, English
We present SoundStorm, a model for efficient, non-autoregressive audio generation. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.
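SoundStorm fills in masked audio tokens over a small number of parallel refinement rounds, keeping the most confident predictions at each round. The sketch below illustrates that confidence-based schedule in generic form (a MaskGIT-style loop); the `model` callable, vocabulary size, and step count are hypothetical and not the authors' code.

```python
# Generic confidence-based parallel decoding loop (MaskGIT-style), shown only
# to illustrate the kind of non-autoregressive refinement SoundStorm uses.
# `model` is a hypothetical callable returning per-position token logits.
import math
import torch

def parallel_decode(model, seq_len=100, vocab=1024, mask_id=1024, steps=8):
    tokens = torch.full((seq_len,), mask_id)              # start fully masked
    for step in range(steps):
        probs = model(tokens).softmax(-1)                 # (seq_len, vocab)
        conf, pred = probs.max(-1)                        # best token + confidence
        still_masked = tokens == mask_id
        # Cosine schedule: unmask progressively more positions each round.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        conf = conf.masked_fill(~still_masked, float("inf"))  # keep fixed tokens
        to_fill = conf.argsort(descending=True)[: seq_len - keep_masked]
        tokens = tokens.clone()
        tokens[to_fill] = torch.where(
            still_masked[to_fill], pred[to_fill], tokens[to_fill]
        )
    return tokens

# Usage with a dummy "model" that returns random logits:
dummy = lambda toks: torch.randn(toks.shape[0], 1024)
print(parallel_decode(dummy)[:10])
```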
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu
2024
AI x AI Dialogue, Generation, English
In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and it exhibits excellent scalability even with up to 25 agents, enabling applications such as drama creation and audio novel generation.
SponTTS: Modeling and Transferring Spontaneous Style for TTS
Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
2024
Generation, Chinese
The paper introduces SponTTS, a two-stage approach for text-to-speech (TTS) that models and transfers spontaneous speaking styles using neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS effectively generates natural and expressive spontaneous speech for target speakers, even in zero-shot scenarios for speakers without prior spontaneous data.
Systemic Biases in Sign Language AI Research: A Deaf-Led Call to Reevaluate Research Agendas
Aashaka Desai, Maartje De Meulder, Julie A. Hochgesang, Annemarie Kocab, and Alex X. Lu
2024
Accessibility
This study conducts a systematic review of 101 recent papers in sign language AI. The analysis identifies significant biases in the current state of sign language AI research, including an overfocus on addressing perceived communication barriers, a lack of use of representative datasets, use of annotations lacking linguistic foundations, and development of methods that build on flawed models.
Towards human-like spoken dialogue generation between AI agents from written dialogue
Kentaro Mitsui, Yukiya Hono, Kei Sawada
2023
AI x AI Dialogue, Generation, Japanese
This study proposes CHATS – CHatty Agents Text-to-Speech – a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap.
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa
2019
Dataset, Generation, Japanese
This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Joao Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio
2022
Dataset, Generation, Brazilian Portuguese
This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset together with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, with a MOS of 4.03.
Chaeeun Han, Prasenjit Mitra, Syed Masum Billah
2024
Discernment, Accessibility, English
This paper explores how blind and sighted individuals perceive real and spoofed audio, highlighting differences and similarities between the groups.
Voice Conversion and Spoofed Voice Detection from Parallel English and Urdu Corpus using Cyclic GANs
Summra Saleem, Aniqa Dilawari, Muhammad Usman Ghani Khan, Muhammad Husnain
2019
Generation, Detection, English, Urdu
This study addresses the threat of identity theft in automatic speech verification systems using a Cyclic GAN-based model to generate and detect spoofed voices, specifically focusing on Urdu and English speech datasets. By leveraging adversarial examples for spoof detection and using Gradient Boosting to differentiate real from fake voices, the approach shows promise but requires further data and refinement for practical large-scale implementation.
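The study builds on the standard CycleGAN objective; as a reminder of the core idea only (not the authors' implementation), the cycle-consistency term penalizes the round trip source -> target -> source on spectrogram-like features. In the hedged sketch below the two generators are placeholder linear layers and the feature dimension is assumed.

```python
# Sketch of the CycleGAN cycle-consistency term on spectrogram features.
# G_xy / G_yx are placeholder generators, not the models from the paper.
import torch
import torch.nn as nn

feat_dim = 80                          # assumed mel-spectrogram dimension
G_xy = nn.Linear(feat_dim, feat_dim)   # source voice -> target voice
G_yx = nn.Linear(feat_dim, feat_dim)   # target voice -> source voice

def cycle_consistency_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # ||G_yx(G_xy(x)) - x||_1 + ||G_xy(G_yx(y)) - y||_1
    loss_x = (G_yx(G_xy(x)) - x).abs().mean()
    loss_y = (G_xy(G_yx(y)) - y).abs().mean()
    return loss_x + loss_y

x = torch.randn(16, feat_dim)          # frames from the source speaker
y = torch.randn(16, feat_dim)          # frames from the target speaker
print(cycle_consistency_loss(x, y).item())
```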
Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
Kurniawati Azizah
2024
Generation, Accessibility, English
This research enhances zero-shot voice cloning TTS for individuals with dysphonia, improving speaker similarity, intelligibility, and sound quality through adjustments in model architecture and loss functions. The optimized model shows notable improvements over the baseline in cosine similarity, character error rate, and mean opinion score, making dysphonic speech clearer and closer in quality to the original voices of speakers with dysphonia.
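Speaker similarity in the summary above refers to cosine similarity between speaker embeddings of the cloned and reference speech; a minimal sketch of that metric follows, where `embed()` stands in for any speaker-embedding extractor and is purely a placeholder.

```python
# Sketch: speaker similarity as cosine similarity between embeddings.
# `embed()` is a placeholder; a real system would use a trained
# speaker-verification model to produce the embeddings.
import torch
import torch.nn.functional as F

def embed(waveform: torch.Tensor) -> torch.Tensor:
    # Placeholder pooling of the waveform into a fixed-size vector.
    n = 192 * (waveform.shape[-1] // 192)
    return waveform[..., :n].reshape(waveform.shape[0], 192, -1).mean(-1)

reference = torch.randn(1, 16_000)   # original speaker with dysphonia
cloned = torch.randn(1, 16_000)      # zero-shot cloned utterance
similarity = F.cosine_similarity(embed(reference), embed(cloned), dim=-1)
print(similarity.item())
```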