Authors:
Joon Son Chung, Arsha Nagrani, Andrew Zisserman
Abstract:
VoxCeleb2 contains audio-visual recordings from over 6,000 speakers, extracted from YouTube videos. The recordings are ‘in the wild,’ containing various background noises and conditions, making it suitable for training robust speaker recognition systems. The dataset includes videos shot in diverse environments such as red carpets, outdoor stadiums, and indoor settings, which include speeches and interviews. It provides data with varied accents, ethnicities, and languages from speakers of 145 different nationalities.
Data Creation Method:
Data are collected from YouTube videos using a fully automatic computer vision pipeline whose stages include face detection and tracking, audio-visual active speaker verification, and face verification against reference images of the candidate speaker.
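The filtering logic of such a pipeline can be pictured as the sketch below. This is a minimal illustration only, assuming hypothetical helper functions, field names, and thresholds; it is not the authors' released collection code.

```python
# Hypothetical sketch of the automated collection pipeline described above.
# detect_face_tracks, the FaceTrack fields, and all thresholds are
# placeholders for illustration, not the authors' implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class FaceTrack:
    video_id: str
    start: float            # seconds
    end: float
    sync_confidence: float  # audio-visual sync score (active speaker check)
    face_similarity: float  # similarity to the candidate's reference photos

def detect_face_tracks(video_id: str) -> List[FaceTrack]:
    """Placeholder for face detection and tracking on a downloaded video."""
    return []

def keep_utterance(track: FaceTrack,
                   sync_threshold: float = 0.5,
                   face_threshold: float = 0.7,
                   min_duration: float = 4.0) -> bool:
    """Keep a track only if the visible face is actually speaking and
    matches the target identity, and the segment is long enough."""
    long_enough = (track.end - track.start) >= min_duration
    return (long_enough
            and track.sync_confidence >= sync_threshold
            and track.face_similarity >= face_threshold)

def collect_utterances(video_ids: List[str]) -> List[FaceTrack]:
    kept = []
    for vid in video_ids:
        for track in detect_face_tracks(vid):
            if keep_utterance(track):
                kept.append(track)
    return kept

if __name__ == "__main__":
    print(collect_utterances(["example_video_id"]))
```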
Number of Speakers:
- Over 6,000 speakers (6,112)
Total Size:
- Over 1 million utterances
Number of Real Samples:
- All samples are real-world recordings extracted from actual video uploads; no separate count of ‘real’ samples is given because the entire dataset is real.
Number of Fake Samples:
- N/A
Description of the Dataset:
- The dataset includes videos with varied accents, ethnicities, and languages. The recordings are ‘in the wild’ with various background noises and conditions, making it ideal for robust speaker recognition and other audio-visual machine learning applications.
Extra Details:
Automated annotations are done using tools like SyncNet to verify the active speaker; face and voice verification algorithms classify and verify the data without manual labeling.
Data Type:
- Audio-visual
Average Length:
- Approximately 7.8 seconds
Keywords:
- Speaker identification, speaker verification, large-scale, dataset, convolutional neural network
When Published:
- 27 Jun 2018
Annotation Process:
Automated annotations are done using tools like SyncNet to verify the active speaker; face and voice verification algorithms classify and verify the data without manual labeling.
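The active speaker check can be illustrated with the sketch below: a SyncNet-style test accepts a face track only when the audio and the mouth motion agree over time. The embedding shapes, the cosine-similarity scoring, and the threshold are assumptions for illustration rather than the exact SyncNet formulation.

```python
# Illustrative sketch of SyncNet-style active speaker verification:
# time-aligned audio and mouth-region embeddings are compared, and the
# visible face is accepted as the speaker when they agree. The embedding
# models, shapes, and threshold below are assumptions.
import numpy as np

def sync_score(audio_emb: np.ndarray, visual_emb: np.ndarray) -> float:
    """Mean cosine similarity between time-aligned audio and visual
    embeddings, each of shape (num_frames, dim)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * v, axis=1)))

def is_active_speaker(audio_emb, visual_emb, threshold: float = 0.5) -> bool:
    return sync_score(audio_emb, visual_emb) >= threshold

# Example with random stand-in embeddings (100 frames, 256 dimensions):
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 256))
visual = rng.normal(size=(100, 256))
print(sync_score(audio, visual), is_active_speaker(audio, visual))
```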
Usage Scenarios:
Suitable for tasks like speaker verification, speaker identification, and other audio-visual machine learning applications like visual speech synthesis, speech separation, and cross-modal learning.
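A minimal sketch of the verification use case is shown below, assuming a hypothetical embed_utterance speaker encoder and an illustrative decision threshold; it is not tied to any particular model trained on the dataset.

```python
# Minimal example of speaker verification with utterance embeddings:
# embed two utterances with a (hypothetical) speaker encoder and accept
# them as the same speaker if their cosine similarity exceeds a threshold.
import numpy as np

def embed_utterance(waveform: np.ndarray) -> np.ndarray:
    """Placeholder for a trained speaker encoder (e.g., a CNN trained on
    VoxCeleb2); here it just returns a deterministic random unit vector."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2 ** 32))
    e = rng.normal(size=512)
    return e / np.linalg.norm(e)

def same_speaker(wav_a: np.ndarray, wav_b: np.ndarray,
                 threshold: float = 0.6) -> bool:
    a, b = embed_utterance(wav_a), embed_utterance(wav_b)
    return float(np.dot(a, b)) >= threshold

# Two dummy 16 kHz waveforms of ~7.8 s (the dataset's average length):
wav1 = np.zeros(int(16000 * 7.8), dtype=np.float32)
wav2 = np.ones(int(16000 * 7.8), dtype=np.float32)
print(same_speaker(wav1, wav2))
```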
License Link:
Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Miscellaneous Information:
The dataset is designed for robust speaker recognition and other audio-visual machine learning applications.
Credits:
Datasets Used:
- YouTube videos (VoxCeleb2)
Speech Synthesis Models Referenced:
- Various convolutional neural network (CNN) architectures