Transferable Adversarial Attacks on Audio Deepfake Detection

Authors:
Muhammad Umar Farooq, Awais Khan, Kutub Uddin, Khalid Mahmood Malik

 

When Published:
2025

 

Description:
This paper tests the robustness of audio deepfake detection (ADD) systems against adversarial attacks rather than proposing a new detector. The authors introduce a transferable GAN-based adversarial attack framework that generates realistic fake audio designed to fool multiple detection models. Their approach uses an ensemble of surrogate ADD models along with a discriminator to craft attacks that generalize across white-box, gray-box, and black-box settings. They also incorporate a self-supervised audio model to preserve transcription accuracy and perceptual quality, so that the adversarial audio remains natural and convincing. Through experiments on the ASVspoof2019, In-the-Wild, and WaveFake datasets, they demonstrate that state-of-the-art ADD systems are highly vulnerable, with detection accuracy dropping sharply under attack. Overall, the study reveals critical weaknesses in existing detectors and highlights the need for more robust and secure audio deepfake detection methods.

 

Training and Data:
The attack framework uses surrogate ADD models trained on ASVspoof2019. Experiments are conducted on ASVspoof2019 LA, WaveFake, and In-the-Wild datasets.

 

Advantages:
The proposed attack preserves both transcription and perceptual integrity, making the adversarial samples more realistic and transferable across white-box, gray-box, and black-box settings.

 

Model Architecture:
The framework consists of a generator, a discriminator, an ensemble of surrogate ADD models, and a transcription model. The generator creates realistic attacked audio, the discriminator judges whether audio is original or generated, the surrogate models guide transferability, and Wav2Vec together with BERT helps preserve transcription integrity.
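
As a rough illustration of how these components could interact, the sketch below combines three signals into a single generator objective: an ensemble term over the surrogate detectors, a GAN term from the discriminator, and a transcription term. This summary does not give the paper's actual loss formulation, so the function names, the binary cross-entropy form, and the weights `w_adv`, `w_gan`, and `w_text` are all assumptions made for illustration.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped to avoid log(0)."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def generator_loss(surrogate_real_probs, disc_real_prob, text_similarity,
                   w_adv=1.0, w_gan=1.0, w_text=1.0):
    """Hypothetical combined objective for the generator (lower is better).

    surrogate_real_probs: each surrogate detector's probability that the
        attacked audio is bona fide.
    disc_real_prob: the discriminator's probability that the audio is original.
    text_similarity: transcription similarity in [0, 1], where 1.0 means the
        transcription is unchanged.
    """
    # Ensemble term: push every surrogate detector toward the "real" label.
    adv = sum(bce(p, 1.0) for p in surrogate_real_probs) / len(surrogate_real_probs)
    # GAN term: the generator wants the discriminator to accept its output.
    gan = bce(disc_real_prob, 1.0)
    # Transcription term: penalize any loss of text similarity.
    text = 1.0 - text_similarity
    return w_adv * adv + w_gan * gan + w_text * text
```

Averaging over several surrogates, rather than optimizing against one, is the part of this sketch that corresponds to the transferability goal: a sample that only fools a single surrogate still incurs a high ensemble term.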

 

Dependencies:
The method depends on surrogate ADD models such as Res-TSSDNet, ResNet, and Inc-TSSDNet for attack generation. For transcription modeling, the authors consider Wav2Vec and Speech2Text and ultimately select Wav2Vec. A BERT encoder with cosine similarity is then applied to preserve transcription integrity.
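
The cosine-similarity check on text embeddings can be sketched as follows. The vectors here are short placeholder lists standing in for BERT sentence embeddings of the original and attacked transcriptions. Producing real embeddings would require the actual encoder, which this sketch leaves out.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Placeholder embeddings (hypothetical values, not real BERT outputs).
original_text_emb = [0.2, 0.7, 0.1]
attacked_text_emb = [0.2, 0.6, 0.2]
similarity = cosine_similarity(original_text_emb, attacked_text_emb)
```

A value close to 1.0 would indicate that the attacked audio still transcribes to text with nearly the same meaning as the original.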

 

Dataset:
ASVspoof2019 LA, WaveFake, In-the-Wild.

 

Evaluation Metrics:
The paper measures attack impact as the drop in detection accuracy of the target ADD systems, and measures attack quality using PSNR, SSIM, and text similarity between original and attacked audio.
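
For reference, PSNR follows the standard definition 10 * log10(peak^2 / MSE). The minimal sketch below applies it to two sample sequences. It assumes samples normalized to [-1, 1] (so `peak=1.0`) and operates on raw sample values. Whether the paper computes PSNR on waveforms or on spectrograms is not stated in this summary.

```python
import math

def psnr(original, attacked, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    if len(original) != len(attacked) or not original:
        raise ValueError("signals must be non-empty and of equal length")
    # Mean squared error between corresponding samples.
    mse = sum((o - a) ** 2 for o, a in zip(original, attacked)) / len(original)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher PSNR means the attacked signal deviates less from the original. SSIM is a structural measure with a more involved definition and is not reproduced here.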

 

Performance:
The attack significantly reduces the performance of state-of-the-art ADD systems. Reported accuracies drop from 98% to 26% in the white-box setting, from 92% to 54% in the gray-box setting, and from 94% to 84% in the black-box setting. Average drops are around 57%, 30.5%, and 6%, respectively.
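
The headline figures can be turned into drops with simple subtraction, as in the short sketch below. It assumes the drops are absolute differences in percentage points, which this summary does not state explicitly.

```python
# Headline accuracies (percent) before and after the attack, as listed above.
headline = {
    "white-box": (98.0, 26.0),
    "gray-box": (92.0, 54.0),
    "black-box": (94.0, 84.0),
}

def accuracy_drop(before, after):
    """Absolute accuracy drop in percentage points (assumed interpretation)."""
    return before - after

drops = {setting: accuracy_drop(b, a) for setting, (b, a) in headline.items()}
```

Under that assumption the headline drops are 72, 38, and 10 points. These are larger than the reported averages, which suggests the headline pairs describe individual cases rather than means across all evaluated systems and datasets.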

 

Contributions:
The paper proposes a transferable adversarial attack preserving transcription/perceptual integrity, uses diverse surrogate models to improve transferability, validates similarity through qualitative and quantitative analysis, and evaluates five SOTA ADD systems on three benchmark datasets.

 

Link to paper


Last Accessed: 06/16/2026

NSF Award #2346473