AISHELL-3: A MULTI-SPEAKER MANDARIN TTS CORPUS AND THE BASELINES
Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020
In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers.
CMU WILDERNESS MULTILINGUAL SPEECH DATASET
Alan W Black, Language Technologies Institute, Carnegie Mellon University
2019
This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset covering over 700 different languages that provides audio, aligned text, and word pronunciations.
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber
2020
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification).
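As a pointer for experimentation, the sketch below streams one Common Voice split; it assumes the Hugging Face datasets library and the mozilla-foundation/common_voice_11_0 dataset configuration (assumptions beyond what the paper specifies), plus a dataset license already accepted on the Hub.

    # Minimal sketch: stream a Common Voice split with Hugging Face datasets.
    # The dataset id/version and the language code "ur" are illustrative choices,
    # not values taken from the paper.
    from itertools import islice
    from datasets import load_dataset

    cv = load_dataset("mozilla-foundation/common_voice_11_0", "ur",
                      split="train", streaming=True)

    for example in islice(cv, 3):
        # Each example carries the decoded audio and its transcript.
        print(example["sentence"], example["audio"]["sampling_rate"])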
Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset
Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
2023
Discernment, Detection, Dataset, Urdu
This paper addresses the escalating challenges posed by deepfake attacks on Automatic Speaker Verification (ASV) systems. We present a novel Urdu deepfake audio dataset for deepfake detection, focusing on two TTS-based spoofing attacks: Tacotron and VITS.
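As a rough illustration of how such TTS-based spoofed utterances can be produced, the sketch below uses the open-source Coqui TTS API with a publicly listed English VITS checkpoint; the model name is only an example and is not one of the paper's Urdu systems.

    # Illustrative sketch only: synthesizing a spoofed utterance with Coqui TTS.
    # The checkpoint is a public English VITS model, used as a stand-in for the
    # paper's Urdu Tacotron/VITS systems, which are not referenced here.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/vits")
    tts.tts_to_file(text="This is a synthetic utterance.",
                    file_path="spoofed_sample.wav")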
Faked Speech Detection with Zero Prior Knowledge
Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen
2024
Detection, Dataset, English, Arabic, Multiple Languages
This work introduces a neural network approach for building a classifier that blindly labels an input audio clip as real or mimicked; 'blindly' refers to the ability to detect mimicked audio without any reference or original source recording.
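The sketch below is not the authors' architecture; it only illustrates the general shape of a reference-free ("blind") real-vs-fake audio classifier in PyTorch, with all layer sizes chosen arbitrarily.

    # Illustrative sketch (not the authors' model): a small CNN that maps a
    # log-mel spectrogram to a single real/fake logit, with no reference audio.
    import torch
    import torch.nn as nn
    import torchaudio

    class BlindFakeDetector(nn.Module):
        def __init__(self, n_mels: int = 64):
            super().__init__()
            self.melspec = torchaudio.transforms.MelSpectrogram(
                sample_rate=16000, n_mels=n_mels)
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 1))  # one logit per clip

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples) at 16 kHz
            mel = torch.log(self.melspec(waveform) + 1e-6).unsqueeze(1)
            return self.net(mel).squeeze(-1)

    model = BlindFakeDetector()
    scores = model(torch.randn(2, 16000))  # two one-second dummy clips
    print(scores.shape)  # torch.Size([2])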
Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis
Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash
2024
Generation, Discernment, Detection, Dataset, Accessibility, American Sign Language
This research presents a positive application of deepfake technology: generating upper-body videos that perform sign language for the Deaf and Hard of Hearing (DHoH) community.
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger
2024
This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), created using 82 TTS models, comprising 33 different architectures, to generate 378.0 hours of synthetic voice in 38 different languages.
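Benchmarks of this kind are usually scored with the equal error rate (EER); the following generic recipe (not MLAAD's own evaluation code) computes it with scikit-learn.

    # Generic EER computation for a spoofing detector's scores; a common
    # evaluation recipe, not code shipped with MLAAD. Example data is made up.
    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        # labels: 1 = bona fide, 0 = spoofed; scores: higher = more bona fide
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two rates cross
        return (fpr[idx] + fnr[idx]) / 2

    labels = np.array([1, 1, 0, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7])
    print(f"EER: {equal_error_rate(labels, scores):.3f}")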
OpenSLR
Jan “Yenda” Trmal
N/A
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition.
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa
2019
This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.
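A common mechanism behind such multi-speaker models is a learned speaker-embedding table whose vectors condition the text encoder; the sketch below illustrates only that generic mechanism, not the authors' exact architecture, and all layer sizes are arbitrary assumptions.

    # Illustrative sketch of speaker conditioning in a multi-speaker TTS encoder:
    # each speaker id selects a learned vector that is broadcast-added to the
    # phoneme-encoder outputs.
    import torch
    import torch.nn as nn

    class SpeakerConditionedEncoder(nn.Module):
        def __init__(self, n_phonemes=100, n_speakers=10, dim=256):
            super().__init__()
            self.phoneme_emb = nn.Embedding(n_phonemes, dim)
            self.speaker_emb = nn.Embedding(n_speakers, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)

        def forward(self, phoneme_ids, speaker_id):
            x, _ = self.rnn(self.phoneme_emb(phoneme_ids))   # (B, T, dim)
            spk = self.speaker_emb(speaker_id).unsqueeze(1)  # (B, 1, dim)
            return x + spk                                   # condition every frame

    enc = SpeakerConditionedEncoder()
    out = enc(torch.randint(0, 100, (2, 12)), torch.tensor([3, 7]))
    print(out.shape)  # torch.Size([2, 12, 256])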
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese
Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Joao Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio
2022
Dataset, Generation, Brazilian Portuguese
This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; trained on it, a Tacotron 2 model with the RTISI-LA vocoder gave the best performance, achieving a MOS of 4.03.
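For context on the 4.03 figure, a MOS is simply the mean of listeners' 1-5 naturalness ratings, usually reported with a 95% confidence interval; the small sketch below uses made-up ratings, not data from the paper.

    # Mean Opinion Score from raw 1-5 listener ratings, with a normal-approximation
    # 95% confidence interval. The ratings here are invented for illustration.
    import numpy as np

    ratings = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 4])
    mos = ratings.mean()
    ci95 = 1.96 * ratings.std(ddof=1) / np.sqrt(len(ratings))
    print(f"MOS = {mos:.2f} +/- {ci95:.2f}")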