Authors:
Joel Frank, Lea Schönherr
Institution:
Ruhr University Bochum, Horst Görtz Institute for IT-Security
Abstract:
The dataset addresses the threat of audio deepfakes by collecting generated audio samples from six different network architectures across two languages. It aims to facilitate the development of detection methods for audio deepfakes by offering a comprehensive analysis of the samples and baseline classifiers for comparison.
Description of the Dataset:
The dataset consists of generated audio clips (16-bit PCM WAV) from six state-of-the-art architectures: MelGAN, Parallel WaveGAN (PWG), Multi-band MelGAN (MB-MelGAN), Full-band MelGAN (FB-MelGAN), HiFi-GAN, and WaveGlow. Alongside the audio, it includes a comprehensive analysis of the samples and baseline classifiers for comparison.
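For orientation, a minimal sketch of how one might enumerate a local copy of the dataset, assuming each architecture's clips sit in their own subfolder under a root directory; the root path "wavefake" and the folder layout are assumptions for illustration, not guaranteed by the release.

```python
# Count clips per architecture subfolder in a local copy of the dataset.
# The root path and folder layout are assumptions, not part of the release notes.
from pathlib import Path

DATASET_ROOT = Path("wavefake")  # hypothetical local path to the extracted dataset


def count_clips(root: Path) -> dict[str, int]:
    """Return {subfolder_name: number_of_wav_files} for each architecture folder."""
    return {
        sub.name: sum(1 for _ in sub.glob("*.wav"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


if __name__ == "__main__":
    for name, count in count_clips(DATASET_ROOT).items():
        print(f"{name}: {count} clips")
```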
Data Creation Method:
Ten sample sets were collected from six different network architectures, spanning two languages (English and Japanese). Each set contains samples that closely resemble the training distribution of the respective reference dataset, enabling one-to-one comparisons of audio clips between architectures.
Number of Speakers:
- 2 speakers (one for each reference dataset).
Total Size:
- Approximately 196 hours
Number of Real Samples:
- Not specified
Number of Fake Samples:
- 117,985 generated audio clips
Extra Details:
The samples are based on the LJSpeech and JSUT reference datasets, covering passages from non-fiction books (English) and the basic kanji of the Japanese language, respectively. The dataset also provides a detailed analysis of the frequency statistics and prosody of the generated samples.
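A minimal sketch of the kind of frequency analysis mentioned above: averaging the log-magnitude spectrum of a clip so real and generated audio can be compared bin by bin. The file paths are placeholders, and librosa/numpy are assumed to be available.

```python
# Average the log-magnitude STFT spectrum of a clip over time, yielding one
# value per frequency bin. Useful for comparing real vs. generated spectra.
import numpy as np
import librosa


def mean_log_spectrum(path: str, n_fft: int = 2048, hop_length: int = 512) -> np.ndarray:
    """Average log-magnitude spectrum of a clip (one value per frequency bin)."""
    audio, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    stft = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length))
    return np.log(stft + 1e-8).mean(axis=1)


# Example usage with placeholder paths:
# real = mean_log_spectrum("LJSpeech/wavs/LJ001-0001.wav")
# fake = mean_log_spectrum("wavefake/ljspeech_melgan/LJ001-0001_gen.wav")
# difference = real - fake
```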
Data Type:
- 16-bit PCM WAV files
Average Length:
- Around 6 seconds for the LJSpeech-based sets, 4.8 seconds for the JSUT-based sets, and 3.8 seconds for the Text-to-Speech samples.
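To verify the duration figures on a local copy, a short sketch using only the standard-library wave module to read the 16-bit PCM WAV files; the directory path is a placeholder.

```python
# Compute the average clip duration of all WAV files in one folder.
import wave
from pathlib import Path


def clip_duration_seconds(path: Path) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def average_duration(folder: Path) -> float:
    durations = [clip_duration_seconds(p) for p in folder.glob("*.wav")]
    return sum(durations) / len(durations) if durations else 0.0


# print(average_duration(Path("wavefake/ljspeech_melgan")))  # expected ~6 s for LJSpeech-based sets
```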
Keywords:
- Audio Deepfake Detection, Speech Synthesis, Training Data, GANs, TTS, Generative Models
When Published:
- August 26, 2021
Annotation Process:
The dataset comprises ten sample sets from six different network architectures. The samples were generated by first extracting mel spectrograms from the original audio files and then feeding these spectrograms to the respective models to re-synthesize the waveforms (a sketch of this pipeline follows below). The resulting clips are accompanied by frequency analysis and prosody statistics.
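A hedged sketch of this re-synthesis pipeline: extract a log-mel spectrogram from an original clip and pass it to a pretrained vocoder to obtain the generated counterpart. load_pretrained_vocoder is a placeholder, not a real API; each architecture (MelGAN, PWG, HiFi-GAN, ...) ships its own loading code and may expect different mel parameters than the ones assumed here.

```python
# Extract a log-mel spectrogram; the vocoder step is shown only as commented
# pseudocode because the loading API differs per architecture.
import numpy as np
import librosa


def extract_mel(path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    """Log-mel spectrogram of a clip, shape (n_mels, frames)."""
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    return np.log(mel + 1e-8)


# Placeholder re-synthesis step (not a real API):
# mel = extract_mel("LJSpeech/wavs/LJ001-0001.wav")
# vocoder = load_pretrained_vocoder("melgan")       # hypothetical loader
# generated = vocoder(mel)                          # waveform from the mel spectrogram
# import soundfile as sf
# sf.write("LJ001-0001_melgan.wav", generated, 22050, subtype="PCM_16")
```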
Usage Scenarios:
Training and evaluation of audio deepfake detection models, comparison of different generative models, research into audio deepfakes, and development of robust ASR systems.
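To illustrate the detection use case, a deliberately simple sketch that trains a logistic-regression classifier on averaged log-mel features to separate real from generated clips. This is a stand-in for illustration, not one of the paper's baseline classifiers, and the directory paths are assumptions.

```python
# Toy real-vs-fake classifier: averaged log-mel features + logistic regression.
from pathlib import Path
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def log_mel_features(path: Path, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    """One n_mels-dimensional feature vector per clip (time-averaged log-mel)."""
    audio, _ = librosa.load(str(path), sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-8).mean(axis=1)


def load_split(real_dir: Path, fake_dir: Path):
    """Build feature matrix X and labels y (0 = real, 1 = generated)."""
    X, y = [], []
    for label, folder in ((0, real_dir), (1, fake_dir)):
        for wav in folder.glob("*.wav"):
            X.append(log_mel_features(wav))
            y.append(label)
    return np.array(X), np.array(y)


# Placeholder paths to a real reference set and one generated set:
# X, y = load_split(Path("LJSpeech/wavs"), Path("wavefake/ljspeech_melgan"))
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# print("accuracy:", clf.score(X_te, y_te))
```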