Authors:
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov
Abstract:
AV-Deepfake1M is a large-scale dataset designed for temporal deepfake localization, containing over 1 million videos, the majority of them deepfakes. The dataset manipulates both audio and visual content to create highly realistic deepfakes, and includes various combinations of real and fake audio-visual segments, providing a comprehensive benchmark for state-of-the-art deepfake detection and localization methods.
Data Creation Method:
Generated using a pipeline that combines a large language model (ChatGPT) for transcript manipulation with recent open-source audio-visual generation methods. Three manipulation strategies are used to create the deepfakes: word replacement, deletion, and insertion (sketched in code below).
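For illustration, here is a minimal Python sketch of the three transcript-level strategies; the manipulate_transcript helper and its fake_word argument are hypothetical stand-ins, since the actual pipeline uses ChatGPT to propose context-aware edits before the audio and video are regenerated.

```python
import random

def manipulate_transcript(words: list[str], strategy: str, fake_word: str = "not") -> list[str]:
    """Apply one word-level manipulation (replace/delete/insert) to a transcript."""
    edited = words.copy()
    idx = random.randrange(len(edited))
    if strategy == "replace":
        edited[idx] = fake_word           # swap one word for the proposed word
    elif strategy == "delete":
        del edited[idx]                   # drop one word entirely
    elif strategy == "insert":
        edited.insert(idx, fake_word)     # splice a new word into the transcript
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return edited

transcript = "the markets closed higher today".split()
for s in ("replace", "delete", "insert"):
    print(s, "->", " ".join(manipulate_transcript(transcript, s)))
```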
Number of Speakers:
- 2,068 unique subjects
Total Size:
- 1,886 hours of audio-visual data, 1,146,760 total videos
Number of Real Samples:
- 286,721 real videos
Number of Fake Samples:
- 860,039 fake videos
Description of the Dataset:
- Video lengths vary, averaging around 9 seconds.
Extra Details:
The dataset is specifically designed for detecting partial manipulations embedded in otherwise real content. It is significantly larger and more challenging than previous deepfake datasets, featuring advanced manipulation strategies and high-quality content.
Data Type:
- Audio-visual data: .wav audio and video files with associated manipulation metadata
Average Length:
- Approximately 9 seconds per video
Keywords:
- Deepfake, Audio-visual, Temporal localization, LLM, Manipulation strategies, Detection, Benchmark
When Published:
- 2024
Annotation Process:
Videos are labeled with frame-level, segment-level, and video-level annotations based on manipulations (replace, delete, insert) and modality (audio, visual, or both). Ground truth labels are used to evaluate deepfake localization performance.
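To make the annotation structure concrete, here is a hedged sketch of a per-video record and how frame-level labels could be derived from segment intervals; the field names (file, modify_type, operations, fake_segments) are assumptions for illustration, not the dataset's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class VideoAnnotation:
    file: str                                   # path to the video (hypothetical field name)
    modify_type: str                            # "real", "audio", "visual", or "both"
    operations: list[str] = field(default_factory=list)                      # e.g. ["replace"]
    fake_segments: list[tuple[float, float]] = field(default_factory=list)  # [start, end] in seconds

    def frame_labels(self, fps: float, num_frames: int) -> list[int]:
        """Derive frame-level labels (1 = fake, 0 = real) from segment intervals."""
        labels = [0] * num_frames
        for start, end in self.fake_segments:
            for f in range(int(start * fps), min(int(end * fps) + 1, num_frames)):
                labels[f] = 1
        return labels

ann = VideoAnnotation("id00001/clip.mp4", "both", ["replace"], [(2.4, 3.1)])
print(ann.frame_labels(fps=25.0, num_frames=250)[55:85])
```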
Usage Scenarios:
Deepfake detection, temporal localization of manipulations, benchmark for deepfake research, training models to handle audio-visual content manipulation.
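As a simplified illustration of segment-level evaluation for temporal localization, the sketch below matches predicted segments to ground-truth fake segments by temporal IoU at a fixed threshold; it is a toy stand-in for the AP-style metrics used in the benchmark, not the official scorer.

```python
def iou_1d(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal IoU of two [start, end] segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(preds, gts, threshold=0.5):
    """Greedily match predictions (assumed sorted by confidence) to ground truth."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou_1d(p, g) >= threshold:
                matched.add(i)
                tp += 1
                break
    return tp

preds = [(2.3, 3.2), (5.0, 5.5)]    # predicted fake segments
gts = [(2.4, 3.1)]                  # ground-truth fake segments
tp = match_segments(preds, gts)
print(f"precision={tp/len(preds):.2f} recall={tp/len(gts):.2f}")
```

In practice such matching is typically swept over multiple IoU thresholds and aggregated into average precision and recall.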
Miscellaneous Information:
The challenge dataset includes diverse audio-visual deepfake scenarios for testing and improving detection systems.
Credits:
Datasets Used:
- VoxCeleb2 (source of the real videos from which AV-Deepfake1M is generated)
Speech Synthesis Models Referenced:
- Open-source speech synthesis models VITS and YourTTS, alongside TalkLip for visual generation and ChatGPT for transcript manipulations