Dataset

AISHELL-3: A MULTI-SPEAKER MANDARIN TTS CORPUS AND THE BASELINES


Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li
2020

Dataset, Chinese


In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers.

CMU WILDERNESS MULTILINGUAL SPEECH DATASET


Alan W Black, Language Technologies Institute, Carnegie Mellon University
2019

Dataset, Multiple Languages


This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset covering over 700 languages that provides audio, aligned text, and word pronunciations.

Common Voice: A Massively-Multilingual Speech Corpus


Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, Gregor Weber

2020

Dataset, Multiple Languages


The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification).

Deepfake Defense: Constructing and Evaluating a Specialized Urdu Deepfake Audio Dataset


Sheza Munir, Wassay Sajjad, Mukeet Raza, Emaan Mujahid Abbas, Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza

2023

Discernment, Detection, Dataset, Urdu


This paper addresses the escalating challenges posed by deepfake attacks on Automatic Speaker Verification (ASV) systems. We present a novel Urdu deepfake audio dataset for deepfake detection, focusing on two spoofing attacks: Tacotron and VITS TTS.

Faked Speech Detection with Zero Prior Knowledge


Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen

2024

Detection, Dataset, English, Arabic, Multiple Languages


This work introduces a neural network method to develop a classifier that blindly classifies an input audio as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without references or real sources.

Generation and Detection of Sign Language Deepfakes – A Linguistic and Visual Analysis


Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash

2024

Generation, Discernment, Detection, Dataset, Accessibility, American Sign Language


This research presents a positive application of deepfake technology: generating upper-body video performing sign language for the Deaf and Hard of Hearing (DHoH) community.

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset


Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

2024

Dataset, Multiple Languages


This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), created using 82 TTS models comprising 33 different architectures to generate 378.0 hours of synthetic voice in 38 different languages.

OpenSLR


Jan “Yenda” Trmal

N/A

Dataset, Multiple Languages


OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition.

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora


Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

2019

Dataset, Generation, Japanese


This study shows that multi-speaker TTS models, especially ensemble models trained on subsets of data, outperform or match single-speaker models in synthetic speech quality, even with limited data per speaker. The ensemble approach notably improves output for underrepresented speakers by effectively leveraging available data across multiple speakers.

TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese


Edresson Casanova, Arnaldo Candido Junior, Christopher Shulby, Frederico Santos de Oliveira, Joao Paulo Teixeira, Moacir Antonelli Ponti, Sandra Maria Aluisio

2022

Dataset, Generation, Brazilian Portuguese


This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, with a MOS of 4.03.

NSF Award #2346473