AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020
Dataset, Chinese
In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers.
CMU Wilderness Multilingual Speech Dataset
Alan W Black, Language Technologies Institute, Carnegie Mellon University
2019
Dataset, Multiple Languages
This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset covering over 700 different languages and providing audio, aligned text, and word pronunciations.
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber
2020
Dataset, Multiple Languages
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification).
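Common Voice releases are also mirrored on the Hugging Face Hub; the minimal sketch below shows one way to stream an English split with the `datasets` library. The dataset identifier and field names are assumptions about the hosted mirror, not details from the paper.

```python
# Minimal sketch: streaming a Common Voice split from the Hugging Face Hub.
# The dataset identifier and field names are assumptions about the mirror,
# not something specified in the paper itself.
from datasets import load_dataset

cv_en = load_dataset(
    "mozilla-foundation/common_voice_11_0",  # assumed mirror name
    "en",                                     # language configuration
    split="validation",
    streaming=True,                           # avoid downloading the full split
)

for example in cv_en.take(3):
    # Each example carries the decoded audio plus its transcription.
    print(example["sentence"], example["audio"]["sampling_rate"])
```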
Konstantinos Papadopoulos and Eleni Koustriava
2015
Accessibility, Greek, Discernment
The present study examines the comprehension of texts presented via synthetic and natural speech in individuals with and without visual impairments. Twenty adults with visual impairments and 65 sighted adults participated in the study.
Considering Temporal Connection between Turns for Conversational Speech Synthesis
Kangdi Mei, Zhaoci Liu, Huipeng Du, Hengyu Li, Yang Ai, Liping Chen, Zhenhua Ling
2024
AI x Human Dialogue, English
Most studies in conversational speech synthesis only focus on the synthesis performance of the current speaker’s turn and neglect the temporal relationship between turns of interlocutors. Therefore, we consider the temporal connection between turns for conversational speech synthesis, which is crucial for the naturalness and coherence of conversations. Specifically, this paper formulates a task in which there is no overlap between turns and only one history turn is considered.
Controllable Context-aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su
2021
Generation, AI x Human Dialogue, Chinese
This study presents a framework for synthesizing human-like conversational speech by modeling spontaneous behaviors, such as filled pauses and prolongations, and speech entrainment. By predicting and controlling these behaviors, the approach generates realistic, contextually aligned speech, with experiments demonstrating its effectiveness in producing natural-sounding conversations.
Conversational End-to-End TTS for Voice Agents
Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie
2021
Generation, AI x Human Dialogue, Chinese
It is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework.
Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset
Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
2023
Discernment, Detection, Dataset, Urdu
This paper addresses the escalating challenges posed by deepfake attacks on Automatic Speaker Verification (ASV) systems. We present a novel Urdu deepfake audio dataset for deepfake detection, focusing on two spoofing attacks – Tacotron and VITS TTS.
Embodied Conversational AI Agents in a Multi-modal Multi-agent Competitive Dialogue
Rahul R. Divekar, Xiangyang Mou, Lisha Chen, Maira Gatti de Bayser, Melina Alberio Guerra, Hui Su
2019
AI x AI Dialogue, English
In a setting where two AI agents embodied as animated humanoid avatars are engaged in a conversation with one human and each other, we see two challenges. One, determination by the AI agents about which one of them is being addressed. Two, determination by the AI agents if they may/could/should speak at the end of a turn. This work brings these two challenges together and explores the participation of AI agents in multiparty conversations.
Evaluating comprehension of natural and synthetic conversational speech
Mirjam Wester, Oliver Watts and Gustav Eje Henter
2016
AI x Human Dialogue, Discernment, English
In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, this paper investigates comprehension of natural and synthetic speech dialogues.
Evaluating Deepfake Speech and ASV Systems on African Accents
Kweku Andoh Yamoah, Hussein Baba Fuseini, David Ebo Adjepon-Yamoah, Dennis Asamoah Owusu
2023
Generation, Detection, English
This work investigates whether ASV systems can be fooled by deepfake speech generated in African accents. Prior studies concentrated primarily on native English speakers; this research centers on African English speakers, who frequently interact with digital systems. Experiments assessed a selected DNN-based deepfake audio system and an ASV system, demonstrating that ASV systems are less susceptible to deepfake audio deception in African accents.
Faked Speech Detection with Zero Prior Knowledge
Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen
2024
Detection, Dataset, English, Arabic, Multiple Languages
This work introduces a neural network method to develop a classifier that blindly labels an input audio clip as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without reference recordings or access to the real source.
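The paper's exact network is not reproduced in this summary; the following is only a minimal PyTorch sketch of the general setup it describes, i.e., a binary classifier that takes a clip's log-mel spectrogram and predicts real vs. mimicked without any reference recording. The layer sizes and the 16 kHz sample rate are illustrative assumptions.

```python
# Illustrative sketch only: a small binary real-vs-mimicked audio classifier.
# This is NOT the architecture from the paper; sizes and features are assumed.
import torch
import torch.nn as nn
import torchaudio

class FakeSpeechClassifier(nn.Module):
    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Log-mel front end (assumed 16 kHz input).
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 2),  # logits: [real, mimicked]
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> (batch, 1, n_mels, frames)
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)
        return self.net(feats)

model = FakeSpeechClassifier()
logits = model(torch.randn(2, 16_000))  # two one-second dummy clips
print(logits.shape)                      # torch.Size([2, 2])
```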
Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis
Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
2024
Generation, Discernment, Detection, Dataset, Accessibility, ASL
This research presents a positive application of deepfake technology: generating upper-body video that performs sign language for the Deaf and Hard of Hearing (DHoH) community.
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger
2024
Dataset, Multiple Languages
This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), created using 82 TTS models, comprising 33 different architectures, to generate 378.0 hours of synthetic voice in 38 different languages.
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
Tomáš Nekvinda, Ondřej Dušek
2020
Generation, Multiple Languages
This paper introduces an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
OpenSLR
Jan “Yenda” Trmal
N/A
Dataset, Multiple Languages
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition.
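OpenSLR resources are plain HTTP downloads organized by resource number. As an illustration only (the URL below follows OpenSLR's usual resources/<number>/<file> layout, with SLR12 being LibriSpeech, and is an assumption rather than something stated on this page), a corpus can be fetched with the standard library:

```python
# Sketch: fetching one OpenSLR resource over plain HTTP.
# The URL (SLR12 / LibriSpeech "test-clean") is assumed from OpenSLR's
# usual resources/<number>/<file> layout.
import urllib.request

url = "https://www.openslr.org/resources/12/test-clean.tar.gz"
urllib.request.urlretrieve(url, "test-clean.tar.gz")
print("downloaded test-clean.tar.gz")
```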
SoundStorm: Efficient Parallel Audio Generation
Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
2023
AI x AI Dialogue, Generation, English
We present SoundStorm, a model for efficient, non-autoregressive audio generation. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.
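SoundStorm fills in masked audio tokens over a small number of parallel refinement rounds, keeping the most confident predictions at each round. The sketch below illustrates that confidence-based schedule in generic form (a MaskGIT-style loop); the `model` callable, vocabulary size, and step count are hypothetical and not the authors' code.

```python
# Generic confidence-based parallel decoding loop (MaskGIT-style), shown only
# to illustrate the kind of non-autoregressive refinement SoundStorm uses.
# `model` is a hypothetical callable returning per-position token logits.
import math
import torch

def parallel_decode(model, seq_len=100, vocab=1024, mask_id=1024, steps=8):
    tokens = torch.full((seq_len,), mask_id)              # start fully masked
    for step in range(steps):
        probs = model(tokens).softmax(-1)                 # (seq_len, vocab)
        conf, pred = probs.max(-1)                        # best token + confidence
        still_masked = tokens == mask_id
        # Cosine schedule: unmask progressively more positions each round.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        conf = conf.masked_fill(~still_masked, float("inf"))  # keep fixed tokens
        to_fill = conf.argsort(descending=True)[: seq_len - keep_masked]
        tokens = tokens.clone()
        tokens[to_fill] = torch.where(
            still_masked[to_fill], pred[to_fill], tokens[to_fill]
        )
    return tokens

# Usage with a dummy "model" that returns random logits:
dummy = lambda toks: torch.randn(toks.shape[0], 1024)
print(parallel_decode(dummy)[:10])
```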
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu
2024
AI x AI Dialogue, Generation, English
In this paper, we propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication. Experimental results demonstrate that SpeechAgents can simulate human communication dialogues with consistent content, authentic rhythm, and rich emotions, and it exhibits excellent scalability even with up to 25 agents, enabling applications such as drama creation and audio novel generation.
SponTTS: Modeling and Transferring Spontaneous Style for TTS
Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie
2024
Generation, Chinese
The paper introduces SponTTS, a two-stage approach for text-to-speech (TTS) that models and transfers spontaneous speaking styles using neural bottleneck features. By capturing prosody and spontaneous phenomena, SponTTS effectively generates natural and expressive spontaneous speech for target speakers, even in zero-shot scenarios for speakers without prior spontaneous data.
Systemic Biases in Sign Language AI Research: A Deaf-Led Call to Reevaluate Research Agendas
Aashaka Desai, Maartje De Meulder, Julie A. Hochgesang, Annemarie Kocab, and Alex X. Lu
2024
Accessibility
This study conducts a systematic review of 101 recent papers in sign language AI. The analysis identifies significant biases in the current state of sign language AI research, including an overfocus on addressing perceived communication barriers, a lack of use of representative datasets, use of annotations lacking linguistic foundations, and development of methods that build on flawed models.
Towards human-like spoken dialogue generation between AI agents from written dialogue
Kentaro Mitsui, Yukiya Hono, Kei Sawada
2023
AI x AI Dialogue, Generation, Japanese
This study proposes CHATS – CHatty Agents Text-to-Speech – a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap.
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa
2019
Dataset, Generation, Japanese
This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Joao Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio
2022
Dataset, Generation, Brazilian Portuguese
This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset together with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, with a MOS of 4.03.
Chaeeun Han, Prasenjit Mitra, Syed Masum Billah
2024
Discernment, Accessibility, English
This paper explores how blind and sighted individuals perceive real and spoofed audio, highlighting differences and similarities between the groups.
Voice Conversion and Spoofed Voice Detection from Parallel English and Urdu Corpus using Cyclic GANs
Summra Saleem, Aniqa Dilawari, Muhammad Usman Ghani Khan, Muhammad Husnain
2019
Generation, Detection, English, Urdu
This study addresses the threat of identity theft in automatic speech verification systems using a Cyclic GAN-based model to generate and detect spoofed voices, specifically focusing on Urdu and English speech datasets. By leveraging adversarial examples for spoof detection and using Gradient Boosting to differentiate real from fake voices, the approach shows promise but requires further data and refinement for practical large-scale implementation.
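The study builds on the standard CycleGAN objective; as a reminder of the core idea only (not the authors' implementation), the cycle-consistency term penalizes the round trip source -> target -> source on spectrogram-like features. In the hedged sketch below the two generators are placeholder linear layers and the feature dimension is assumed.

```python
# Sketch of the CycleGAN cycle-consistency term on spectrogram features.
# G_xy / G_yx are placeholder generators, not the models from the paper.
import torch
import torch.nn as nn

feat_dim = 80                          # assumed mel-spectrogram dimension
G_xy = nn.Linear(feat_dim, feat_dim)   # source voice -> target voice
G_yx = nn.Linear(feat_dim, feat_dim)   # target voice -> source voice

def cycle_consistency_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # ||G_yx(G_xy(x)) - x||_1 + ||G_xy(G_yx(y)) - y||_1
    loss_x = (G_yx(G_xy(x)) - x).abs().mean()
    loss_y = (G_xy(G_yx(y)) - y).abs().mean()
    return loss_x + loss_y

x = torch.randn(16, feat_dim)          # frames from the source speaker
y = torch.randn(16, feat_dim)          # frames from the target speaker
print(cycle_consistency_loss(x, y).item())
```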
Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers
Kurniawati Azizah
2024
Generation, Accessibility, English
This research enhances zero-shot voice cloning TTS for individuals with dysphonia, improving speaker similarity, intelligibility, and sound quality through adjustments in model architecture and loss functions. The optimized model shows notable improvements over the baseline in cosine similarity, character error rate, and mean opinion score, making dysphonic speech clearer and closer in quality to the original voices of speakers with dysphonia.
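Speaker similarity in the summary above refers to cosine similarity between speaker embeddings of the cloned and reference speech; a minimal sketch of that metric follows, where `embed()` stands in for any speaker-embedding extractor and is purely a placeholder.

```python
# Sketch: speaker similarity as cosine similarity between embeddings.
# `embed()` is a placeholder; a real system would use a trained
# speaker-verification model to produce the embeddings.
import torch
import torch.nn.functional as F

def embed(waveform: torch.Tensor) -> torch.Tensor:
    # Placeholder pooling of the waveform into a fixed-size vector.
    n = 192 * (waveform.shape[-1] // 192)
    return waveform[..., :n].reshape(waveform.shape[0], 192, -1).mean(-1)

reference = torch.randn(1, 16_000)   # original speaker with dysphonia
cloned = torch.randn(1, 16_000)      # zero-shot cloned utterance
similarity = F.cosine_similarity(embed(reference), embed(cloned), dim=-1)
print(similarity.item())
```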