AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Authors:
Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, Kalin Stefanov

 

Abstract:
AV-Deepfake1M is a large-scale dataset designed for temporal deepfake localization, containing over 1 million videos in total (real and manipulated). Audio and/or visual content is manipulated within otherwise real videos, creating highly realistic deepfakes. The dataset covers the possible combinations of real and fake audio-visual segments, providing a comprehensive benchmark for state-of-the-art deepfake detection and localization methods.

 

Data Creation Method:
Generated using a pipeline that combines a large language model (ChatGPT) for transcript manipulation with recent open-source audio-visual generation methods. Three word-level manipulation strategies are used to create the deepfakes: replacement, deletion, and insertion.
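As a purely illustrative sketch (not the paper's actual pipeline code), the three strategies amount to word-level edits like the ones below; the example sentence, edit position, and replacement word are hypothetical stand-ins for the LLM-chosen edits:

```python
def manipulate_transcript(words, strategy, position, new_word=None):
    """Apply one word-level manipulation to a tokenized transcript.

    words:     list of words, e.g. "I did not sign the contract".split()
    strategy:  "replace", "delete", or "insert"
    position:  index of the target word
    new_word:  word chosen by the LLM (used for replace/insert)
    """
    words = list(words)  # copy, so the original transcript is untouched
    if strategy == "replace":
        words[position] = new_word
    elif strategy == "delete":
        del words[position]
    elif strategy == "insert":
        words.insert(position, new_word)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return words

# Example of a content-changing edit: deleting "not" flips the meaning.
original = "I did not sign the contract".split()
fake = manipulate_transcript(original, "delete", 2)
print(" ".join(fake))  # -> "I did sign the contract"
```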

 

Number of Speakers:

  • 2,068 unique subjects

Total Size:

  • 1,886 hours of audio-visual data, 1,146,760 total videos

Number of Real Samples:

  • 286,721 real videos

Number of Fake Samples:

  • 860,039 fake videos
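  (Real and fake counts sum to the stated total: 286,721 + 860,039 = 1,146,760 videos.)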

Description of the Dataset:

  • Video length varies, averaging around 9 seconds.

 

Extra Details:
The dataset is specifically designed for benchmarking the detection of partial manipulations embedded in otherwise real content. It is significantly larger and more challenging than previous deepfake datasets, featuring advanced manipulation strategies and high-quality generated content.

 

Data Type:

  • Audio-visual data: video files with corresponding .wav audio tracks and manipulation metadata
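
A minimal loading sketch, assuming one video file with a sibling .wav track; the paths below are hypothetical placeholders, and the actual directory layout may differ:

```python
import wave
import cv2  # pip install opencv-python

video_path = "sample/000001.mp4"  # hypothetical paths
audio_path = "sample/000001.wav"

# Video: frame count and fps, used to map frame indices to timestamps.
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()

# Audio: sample rate and duration read from the wav header.
with wave.open(audio_path, "rb") as wav:
    sr = wav.getframerate()
    audio_seconds = wav.getnframes() / sr

print(f"{n_frames} frames at {fps:.1f} fps; audio {audio_seconds:.2f}s at {sr} Hz")
```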

Average Length:

  • Approximately 9 seconds per video

Keywords:

  • Deepfake, Audio-visual, Temporal localization, LLM, Manipulation strategies, Detection, Benchmark

When Published:

  • 2024

 

Annotation Process:
Videos carry frame-level, segment-level, and video-level annotations based on the manipulation operation (replace, delete, insert) and the manipulated modality (audio, visual, or both). The ground-truth labels are used to evaluate deepfake localization performance.
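
To make the three granularities concrete, the sketch below derives frame-level and video-level labels from segment-level fake intervals; the field names in `meta` are hypothetical, not the dataset's exact schema:

```python
def frame_labels(fake_segments, n_frames, fps):
    """Expand segment-level annotations into per-frame binary labels.

    fake_segments: list of (start_sec, end_sec) manipulated intervals
    Returns a list of 0/1 labels, one per video frame (1 = fake).
    """
    labels = [0] * n_frames
    for start, end in fake_segments:
        first = max(0, int(start * fps))
        last = min(n_frames, int(end * fps) + 1)
        for i in range(first, last):
            labels[i] = 1
    return labels

# Hypothetical annotation: one replaced word between 2.4s and 2.9s.
meta = {"fake_segments": [(2.4, 2.9)], "modality": "both", "operation": "replace"}
labels = frame_labels(meta["fake_segments"], n_frames=225, fps=25.0)  # 9s clip
video_level = int(any(labels))  # video-level label: fake if any segment is fake
```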

 

Usage Scenarios:
Deepfake detection, temporal localization of manipulations, benchmarking deepfake detection research, and training models to handle audio-visual content manipulation.
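
Temporal localization is typically scored by matching predicted segments to ground-truth segments at temporal IoU thresholds (average precision at several thresholds, average recall at fixed numbers of proposals). A minimal temporal-IoU sketch:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive at threshold t if IoU >= t.
print(temporal_iou((2.3, 3.0), (2.4, 2.9)))  # ~0.714
```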

 

Miscellaneous Information:
The dataset also serves as a challenge dataset, covering diverse audio-visual deepfake scenarios for testing and improving detection systems.

 

Credits:
Datasets Used:

  • AV-Deepfake1M (built on real videos sourced from VoxCeleb2)

Speech Synthesis Models Referenced:

  • Open-source generation methods, including VITS and YourTTS for speech synthesis and TalkLip for visual generation, with ChatGPT for transcript manipulations

Dataset Link:

Main Paper Link:

License Link:

Last Accessed: 9/4/2024

NSF Award #2346473