VoxCeleb2

Authors:
Joon Son Chung, Arsha Nagrani, Andrew Zisserman


Abstract:
VoxCeleb2 contains audio-visual recordings from over 6,000 speakers, extracted from YouTube videos. The recordings are ‘in the wild’, with varied background noise and recording conditions, making the dataset suitable for training robust speaker recognition systems. The videos span diverse environments, from red-carpet events and outdoor stadiums to quiet indoor interviews, and the speakers, drawn from 145 nationalities, cover a wide range of accents, ethnicities, and languages.


Data Creation Method:
Data is collected from YouTube videos by a fully automatic computer-vision pipeline whose stages are face detection and tracking, active speaker verification (confirming that the visible speaker's lip motion matches the audio), and face verification against the target identity. A hypothetical sketch of this filtering logic follows below.
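
To make the stages concrete, here is a minimal, hypothetical sketch of the filtering logic. All helper functions are placeholders standing in for real face-tracking, SyncNet, and face-verification models; the names and thresholds are assumptions, not the authors' released code.

    import random

    def detect_face_tracks(video_path):
        # Placeholder: a real pipeline would run a face detector and tracker.
        return [f"{video_path}#track{i}" for i in range(3)]

    def syncnet_confidence(track):
        # Placeholder: SyncNet-style audio-to-lip synchronization score.
        return random.uniform(-1.0, 1.0)

    def face_similarity(track, reference_face):
        # Placeholder: similarity between the track's face and reference images.
        return random.uniform(0.0, 1.0)

    def collect_utterances(video_path, reference_face,
                           sync_threshold=0.0, face_threshold=0.5):
        # Keep only segments where the target identity is the active speaker.
        kept = []
        for track in detect_face_tracks(video_path):        # stage 1: detection/tracking
            if syncnet_confidence(track) < sync_threshold:  # stage 2: active speaker check
                continue
            if face_similarity(track, reference_face) < face_threshold:
                continue                                    # stage 3: identity verification
            kept.append(track)
        return kept

    print(collect_utterances("interview.mp4", "reference.jpg"))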


Number of Speakers:

  • 6,112 speakers

Total Size:

  • Over 1 million utterances

Number of Real Samples:

  • All samples are real recordings taken from actual YouTube uploads; no separate count is given because the dataset contains no synthetic audio

Number of Fake Samples:

  • N/A

Description of the Dataset:

  • The dataset pairs each speaker's audio with the corresponding face track from the source video and covers a wide range of accents, ethnicities, and languages. Recordings are ‘in the wild’, with real background noise and channel conditions, which makes the data well suited to robust speaker recognition and other audio-visual machine learning applications (see the loading sketch below).
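
As a rough illustration of how released copies are commonly organized, the sketch below enumerates utterances from a speaker_id/video_id/utterance directory layout (e.g. aac/id00012/21Uxsk56VDQ/00001.m4a). The root path and file extension are assumptions to adjust for a local copy.

    from pathlib import Path

    def list_utterances(root="vox2/dev/aac", ext=".m4a"):
        # The directory layout encodes identity: root/speaker_id/video_id/utterance.
        pairs = []
        for path in sorted(Path(root).rglob(f"*{ext}")):
            speaker_id = path.parts[-3]
            pairs.append((str(path), speaker_id))
        return pairs

    utts = list_utterances()
    print(f"{len(utts)} utterances from {len({spk for _, spk in utts})} speakers")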


Extra Details:
The dataset is split into disjoint development (5,994 speakers) and test (118 speakers) sets, and the development set has no speaker overlap with the VoxCeleb1 dataset.


Data Type:

  • Audio-visual

Average Length:

  • Approximately 7.8 seconds per utterance

Keywords:

  • Speaker identification, speaker verification, large-scale, dataset, convolutional neural network

When Published:

  • 27 Jun 2018


Annotation Process:
Annotation is fully automated, with no manual labeling: SyncNet verifies the active speaker by checking audio-to-lip synchronization, and a face verification network confirms that the detected face matches the target identity. A rough sketch of the synchronization score follows below.
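
In the spirit of SyncNet (not its released code), the snippet below scores audio-to-lip sync by sliding per-frame audio embeddings against lip-region embeddings and taking the median-minus-minimum distance as confidence; the embeddings here are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    n_frames, dim = 100, 512
    audio_emb = rng.normal(size=(n_frames, dim))  # placeholder audio embeddings
    video_emb = rng.normal(size=(n_frames, dim))  # placeholder lip embeddings

    def sync_confidence(audio_emb, video_emb, max_offset=15):
        n = len(audio_emb)
        dists = []
        for off in range(-max_offset, max_offset + 1):
            a = audio_emb[max(off, 0): n + min(off, 0)]
            v = video_emb[max(-off, 0): n + min(-off, 0)]
            dists.append(np.linalg.norm(a - v, axis=1).mean())
        dists = np.array(dists)
        # A sharp minimum at one offset means the audio matches the lip motion.
        return float(np.median(dists) - dists.min())

    print(f"confidence: {sync_confidence(audio_emb, video_emb):.3f}")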


Usage Scenarios:
Suitable for speaker verification and speaker identification, as well as other audio-visual machine learning tasks such as visual speech synthesis, speech separation, and cross-modal learning. A minimal verification-scoring sketch follows below.
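
The core verification task can be illustrated with a short, self-contained sketch: score trial pairs by the cosine similarity of speaker embeddings and report the equal error rate (EER), the metric used in the VoxCeleb2 paper. The embeddings below are random placeholders standing in for the output of a CNN trained on the dataset.

    import numpy as np

    rng = np.random.default_rng(0)

    def cosine_score(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def eer(scores, labels):
        # Equal error rate: threshold where false-accept rate == false-reject rate.
        order = np.argsort(scores)[::-1]
        labels = np.asarray(labels)[order]
        far = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
        frr = 1.0 - np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
        i = int(np.argmin(np.abs(far - frr)))
        return (far[i] + frr[i]) / 2

    # Toy trials: same-speaker pairs share a base embedding plus noise.
    base = rng.normal(size=(50, 256))
    same = [cosine_score(e, e + 0.5 * rng.normal(size=256)) for e in base]
    diff = [cosine_score(e, rng.normal(size=256)) for e in base]
    scores = np.array(same + diff)
    labels = [1] * len(same) + [0] * len(diff)
    print(f"toy EER: {eer(scores, labels):.3f}")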


Miscellaneous Information:
The dataset is designed for robust speaker recognition and other audio-visual machine learning applications.


Credits:
Datasets Used:

  • YouTube videos (VoxCeleb2)

Speech Synthesis Models Referenced:

  • N/A; the paper trains ResNet-based convolutional neural networks for speaker recognition, not speech synthesis models

Dataset Link:


Main Paper Link:


License Link:


Last Accessed: 7/18/2024

NSF Award #2346473