FoR: Fake or Real Dataset for Synthetic Speech Detection

Authors:
Ricardo Reimao (York University)
Vassilios Tzerpos (York University)


Description of the Dataset:

The dataset includes both real and synthetic speech samples for the purpose of training and evaluating machine learning and deep learning models that detect synthetic speech. The synthetic speech was generated with recent deep-learning-based speech synthesizers (such as DeepVoice, Google Cloud TTS, Amazon Polly, and Microsoft Azure), while the real speech was collected from a variety of sources.


Data Creation Method:
The dataset was created by collecting 87,000 synthetic utterances from various TTS (text-to-speech) systems and 111,000 real utterances from open-source speech datasets and online sources such as TED Talks and YouTube. The collected data was then processed into four versions: original, normalized, truncated (2 seconds), and re-recorded, the last of which simulates voice-channel scenarios.
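
As an illustration of the collection step, here is a minimal sketch of generating one synthetic utterance with Amazon Polly (one of the services named above) via boto3. The voice ID, text, and file names are assumptions for this example; the authors' actual collection scripts are not reproduced here.

    # Minimal sketch: collecting one synthetic utterance from Amazon Polly.
    # Assumes AWS credentials are configured; VoiceId, text, and paths are
    # illustrative, not the settings used to build FoR.
    import boto3

    polly = boto3.client("polly")

    response = polly.synthesize_speech(
        Text="The quick brown fox jumps over the lazy dog.",
        OutputFormat="mp3",  # FoR gathered MP3/WAV audio and later converted it to WAV
        VoiceId="Joanna",    # hypothetical choice among Polly's voices
    )

    # AudioStream is a streaming body; write its bytes to disk.
    with open("synthetic_utterance.mp3", "wb") as f:
        f.write(response["AudioStream"].read())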


Number of Speakers:

  • 33 synthesized voices and a variety of real speakers from multiple datasets and sources.

Total Size:

  • 198,000 utterances

Number of Real Samples:

  • 111,000 real utterances

Number of Fake Samples:

  • 87,000 synthetic utterances


Extra Details:
The dataset is structured into four versions:

  • for-original: Collected without any modifications.
  • for-norm: Preprocessed and balanced by gender and class (real vs. synthetic).
  • for-2sec: Files truncated to 2 seconds (a truncation sketch follows this list).
  • for-rerec: Simulates real-world synthetic speech attacks by re-recording the audio through a speaker and microphone.
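
To make the for-2sec processing concrete, here is a minimal truncation sketch using pydub (an assumed tool; the authors' actual tooling is not specified here). File names are placeholders.

    # Sketch: deriving a 2-second clip in the style of the for-2sec version.
    from pydub import AudioSegment

    audio = AudioSegment.from_wav("utterance.wav")
    clip = audio[:2000]  # pydub slices audio in milliseconds
    clip.export("utterance_2sec.wav", format="wav")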


Data Type:

  • Audio was collected as WAV and MP3 files and converted to WAV for machine learning use.
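
A minimal sketch of that conversion step, again assuming pydub (which delegates MP3 decoding to ffmpeg); paths are placeholders:

    # Sketch: converting a collected MP3 file to WAV for model training.
    from pydub import AudioSegment

    AudioSegment.from_mp3("utterance.mp3").export("utterance.wav", format="wav")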

Average Length:

  • Real speech: 5.05 seconds
  • Synthetic speech: 2.35 seconds (files in the truncated version are fixed at 2 seconds)

Keywords:

  • Synthetic speech detection, Deep neural networks, Machine learning, Text-to-speech, Audio classification

When Published:

  • 2019


Annotation Process:
Files were organized by speech source, gender, and class (real or synthetic). Preprocessing included volume normalization, downsampling, and conversion to mono audio to ensure consistency across sources.
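
A sketch of that preprocessing chain, assuming librosa and soundfile as tools and a 16 kHz target sample rate (the exact rate is not stated here):

    # Sketch of the described preprocessing: mono conversion, downsampling,
    # and peak normalization.
    import librosa
    import numpy as np
    import soundfile as sf

    y, sr = librosa.load("utterance.wav", sr=16000, mono=True)  # resample + mix down to mono
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak  # peak-normalize to the [-1, 1] range
    sf.write("utterance_norm.wav", y, sr)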


Usage Scenarios:

  • Detection of synthetic speech in machine learning and deep learning tasks (a baseline sketch follows this list)
  • Research on speaker verification and spoofing detection
  • Study of audio feature differences between real and synthetic speech
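
As a starting point for the first scenario, here is a minimal baseline sketch: averaged MFCC features fed to a logistic-regression classifier. The folder names ("real", "fake") and feature settings are assumptions, not the dataset's prescribed protocol.

    # Sketch: baseline real-vs-synthetic classifier on averaged MFCC features.
    from pathlib import Path

    import librosa
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def mfcc_features(path, sr=16000, n_mfcc=20):
        y, _ = librosa.load(path, sr=sr, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)  # average over time -> fixed-size vector

    X, labels = [], []
    for label, folder in [(0, "real"), (1, "fake")]:  # hypothetical folder names
        for wav in Path(folder).glob("*.wav"):
            X.append(mfcc_features(str(wav)))
            labels.append(label)

    clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(labels))
    print("training accuracy:", clf.score(np.array(X), np.array(labels)))

A real evaluation would hold out a test set rather than report training accuracy.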


Data Accessibility:

  • The dataset is distributed in four versions: for-original, for-norm, for-2sec, and for-rerec


Miscellaneous Information:
The dataset supports research in synthetic speech detection and is designed to aid the development of models that distinguish real from synthetic speech using a variety of audio processing techniques.


Credits:
Datasets Used:

  • Various TTS systems and open-source speech datasets

Speech Synthesis Models Referenced:

  • DeepVoice
  • Google Cloud TTS
  • Amazon Polly
  • Microsoft Azure

Dataset Link


Main Paper Link


License Link


Last Accessed: 6/24/2024

NSF Award #2346473