Ricardo Reimao (York University)
Vassilios Tzerpos (York University)
Description of the Dataset:
The dataset contains both real and synthetic speech samples and is intended for training machine learning and deep learning models that detect synthetic speech. The synthetic samples were generated with deep-learning-based speech synthesizers (such as DeepVoice, Google Cloud TTS, Amazon Polly, and Microsoft Azure); the real samples were gathered from various sources.
The dataset was created by collecting 87,000 synthetic utterances from various TTS (Text-to-Speech) systems and 111,000 real utterances from open-source speech datasets and online sources such as TED Talks and YouTube. It was then processed into four versions: original, normalized, truncated (2 seconds), and re-recorded, the last of which simulates a voice-channel scenario.
Number of Speakers:
- 33 synthesized voices and a variety of real speakers from multiple datasets and sources.
Total Size:
- 198,000 utterances
Number of Real Samples:
- 111,000 real utterances
Number of Fake Samples:
- 87,000 synthetic utterances
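The counts above imply a mild class imbalance in the raw collection. The following is a minimal sketch of that arithmetic; the function names and the inverse-frequency weighting scheme are illustrative assumptions, not something the dataset authors prescribe.

```python
# Utterance counts as stated in this card.
COUNTS = {"real": 111_000, "synthetic": 87_000}


def class_shares(counts):
    """Return each class's fraction of the total number of utterances."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count).

    Assumption: this is one common heuristic for weighting a loss function;
    it is not taken from the dataset description.
    """
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


if __name__ == "__main__":
    print(sum(COUNTS.values()))      # total number of utterances
    print(class_shares(COUNTS))      # share of real vs. synthetic
    print(balanced_weights(COUNTS))  # larger weight for the smaller class
```

Weights of this kind would only matter when training on the unbalanced collection; a version that is already balanced by class should not need them.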
The dataset is structured into four versions:
- for-original: Collected without any modifications.
- for-norm: Preprocessed and balanced by gender and class (real vs. synthetic).
- for-2seconds: Files truncated to 2 seconds.
- for-rerecorded: Simulates real-world synthetic speech attacks by re-recording audio through a speaker and microphone.
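As an illustration of the truncation step, here is a minimal sketch that keeps the first 2 seconds of a PCM WAV file using only the Python standard library. The function name `truncate_wav` is hypothetical, and keeping the leading portion of the file is an assumption; the card does not say which part of each utterance was retained.

```python
import wave


def truncate_wav(src_path, dst_path, seconds=2.0):
    """Copy the first `seconds` of a PCM WAV file to `dst_path`.

    Returns the number of frames written. Files shorter than `seconds`
    are copied unchanged.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_keep = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_keep)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and rate from the source;
        # the frame count in the header is corrected when the file closes.
        dst.setparams(params)
        dst.writeframes(frames)
    return n_keep
```

A call such as `truncate_wav("in.wav", "out.wav")` (paths are placeholders) would write at most 2 seconds of audio to `out.wav`.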
Data Type:
- Audio files collected as WAV and MP3, then converted to WAV for machine learning use.
Average Length:
- Real speech: 5.05 seconds
- Synthetic speech: 2.35 seconds
- Truncated version: 2 seconds per file
Keywords:
- Synthetic speech detection, Deep neural networks, Machine learning, Text-to-speech, Audio classification
When Published:
- 2019
Files were organized by speech source, gender, and class (real or synthetic). Preprocessing steps included normalization, downsampling, and conversion to mono audio to ensure consistency.
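The preprocessing steps named above (mono conversion, downsampling, normalization) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' pipeline: the function names, the 16000 Hz default target rate, the linear-interpolation resampler, and the choice of peak normalization are all assumptions. Samples are assumed to be floats in the range -1.0 to 1.0.

```python
def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def downsample(samples, src_rate, dst_rate):
    """Resample to a lower rate by linear interpolation.

    Sketch only: no anti-aliasing filter is applied, so a real pipeline
    would normally low-pass filter first or use a dedicated resampler.
    """
    if dst_rate > src_rate:
        raise ValueError("dst_rate must not exceed src_rate")
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples, peak=1.0):
    """Scale samples so the largest absolute value equals `peak`."""
    top = max((abs(s) for s in samples), default=0.0)
    if top == 0:
        return list(samples)                   # silence: nothing to scale
    return [s * peak / top for s in samples]


def preprocess(frames, src_rate, dst_rate=16000):
    """Mono conversion, then downsampling, then peak normalization."""
    return peak_normalize(downsample(to_mono(frames), src_rate, dst_rate))
```

Applying the same fixed sequence to every file is what gives the consistency the description refers to: all outputs end up with one channel, one sample rate, and a common amplitude scale.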
Usage Scenarios:
- Detection of synthetic speech in machine learning and deep learning tasks
- Research on speaker verification and spoofing detection
- Study of audio feature differences between real and synthetic speech
Data Accessibility:
- Provided as four versions: for-original, for-norm, for-2sec (for-2seconds), and for-rerec (for-rerecorded)
Miscellaneous Information:
The dataset is designed to support research on synthetic speech detection, in particular the development of models that distinguish real from synthetic speech using audio processing techniques.
Credits:
Datasets Used:
- Various TTS systems and open-source speech datasets
Speech Synthesis Models Referenced:
- DeepVoice
- Google Cloud TTS
- Amazon Polly
- Microsoft Azure