Retrieval-Based Deep Difference Modeling for Audio Deepfake Detection

Authors:

Rui Liu, Jinhua Zhang, Haizhou Li

Where published:

IEEE Transactions on Audio, Speech and Language Processing, Vol. 34, 2026

 

Abstract:

The paper proposes RDD-ADD (Retrieval-Based Deep Difference Modeling for Audio Deepfake Detection), a retrieval-based detection method. Unlike prior approaches that use only real (bonafide) audio as references, the authors design a dual-database retrieval system that draws on both real and spoofed audio to better capture the differences between the two classes. They introduce anchor-based similarity fusion and GRU-based contextual fusion to combine the retrieved features dynamically, and a graph-based learning module to model interactions between the input audio and the retrieved samples. This lets the system learn fine-grained differences between real and fake audio. Experiments on multiple benchmark datasets show that the approach outperforms existing methods and improves the robustness of retrieval-based audio deepfake detection.

 

Dataset names (used for):

  • ASVspoof2019-LA: main benchmark for training/evaluation and ablation.
  • ASVspoof2021-LA: robustness/generalization under channel and codec variability.
  • ASVspoof2021-DF: deepfake detection under compressed/online-media-like conditions.
  • VCTK: extra bonafide utterances to increase bonafide retrieval database diversity.

 

Some description of the approach:

The paper proposes RDD-ADD, a retrieval-based audio deepfake detection method that uses both bonafide and spoof audio as references. It dynamically fuses retrieved WavLM features through anchor-based and GRU-based fusion, then applies graph-based difference modeling to capture fine-grained differences between the input audio and retrieved real/fake samples.
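
As an illustration of similarity-driven fusion, the sketch below weights each retrieved embedding by its cosine similarity to the query embedding (treated here as the anchor) and returns the weighted sum. This is only one plausible reading of "anchor-based fusion"; the function name, the softmax weighting, and the toy vectors are assumptions made for illustration and are not taken from the paper, whose actual formulation may differ. The GRU-based fusion is not sketched.

```python
import numpy as np


def anchor_similarity_fusion(anchor, retrieved):
    """Hypothetical fusion: softmax over cosine similarity to the anchor.

    anchor:    (D,) query embedding.
    retrieved: (k, D) embeddings of the retrieved reference samples.
    Returns the fused (D,) embedding and the (k,) fusion weights.
    """
    anchor_unit = anchor / np.linalg.norm(anchor)
    retrieved_unit = retrieved / np.linalg.norm(retrieved, axis=1, keepdims=True)
    sims = retrieved_unit @ anchor_unit          # cosine similarities, shape (k,)
    weights = np.exp(sims - sims.max())          # numerically stable softmax
    weights = weights / weights.sum()
    fused = weights @ retrieved                  # weighted sum, shape (D,)
    return fused, weights


# Toy example: three retrieved vectors at decreasing similarity to the anchor.
anchor = np.array([1.0, 0.0])
retrieved = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
fused, weights = anchor_similarity_fusion(anchor, retrieved)
```

Under this assumption, references closer to the query contribute more to the fused feature, which is the sense in which the fusion is "dynamic".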

 

Some description of the data:

The study uses the ASVspoof logical access (LA) and deepfake (DF) subsets. The ASVspoof2019-LA training set has 2,580 bonafide utterances and 22,800 spoofed utterances, the latter generated by six TTS/VC algorithms; ASVspoof2021-LA adds channel/codec variability; ASVspoof2021-DF focuses on deepfake detection under compressed, online-media-like conditions.

 

Keywords:

Audio deepfake detection; dual-database retrieval; feature fusion; graph-based learning; retrieval-based augmentation

Instance Representation:

Each instance is a speech utterance (audio waveform) standardized to a fixed length of 64,600 samples through truncation or zero-padding. During inference, each query utterance is represented as an embedding and used to retrieve the top-3 nearest bonafide and spoof reference samples.
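
The length standardization can be sketched as follows. The 64,600-sample target comes from the description above; the function name, the float32 dtype, and the choice to pad at the end of the waveform are assumptions made for illustration.

```python
import numpy as np

TARGET_LEN = 64_600  # fixed length stated in the description


def fix_length(waveform, target_len=TARGET_LEN):
    """Truncate or zero-pad a 1-D waveform to exactly target_len samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.shape[0] >= target_len:
        # Long input: keep the first target_len samples.
        return waveform[:target_len]
    # Short input: copy the signal and leave the tail as zeros (assumed placement).
    padded = np.zeros(target_len, dtype=np.float32)
    padded[: waveform.shape[0]] = waveform
    return padded


short_fixed = fix_length(np.ones(16_000))    # shorter than the target, gets padded
long_fixed = fix_length(np.ones(100_000))    # longer than the target, gets truncated
```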

Dataset Characteristics:

Binary bonafide/spoof audio deepfake detection datasets based on ASVspoof2019-LA, ASVspoof2021-LA, and ASVspoof2021-DF. They include genuine and TTS/VC-generated spoof utterances, unseen attack types in evaluation, channel/codec variability in 2021 LA, and compressed deepfake audio mimicking online media in 2021 DF.

Subject Area:

Audio deepfake detection; speech anti-spoofing; retrieval-augmented audio classification; graph-based deep learning for fake speech detection.

Associated Tools:

Code/demo: https://github.com/AI-S2-Lab/RDD-ADD. Main tools/models: WavLM feature extractor, AASIST classifier for fine-tuning, RawNet2 feature encoder, FAISS (IndexFlatL2), GRU, Graph Attention Network, Heterogeneous Graph, DDA module.
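
FAISS's IndexFlatL2 performs exact (brute-force) nearest-neighbor search under squared Euclidean distance. The sketch below reproduces that search in plain NumPy so it runs without FAISS installed, and applies it to two separate databases to mimic the dual-database, top-3 retrieval mentioned under the instance representation. The function name, embedding dimension, database sizes, and random data are placeholders, not values from the paper or its repository.

```python
import numpy as np


def top_k_l2(database, query, k=3):
    """Exact k-nearest-neighbor search by squared L2 distance.

    database: (N, D) reference embeddings; query: (D,) query embedding.
    Returns the indices and squared distances of the k closest entries.
    """
    diffs = database - query[None, :]
    dists = np.einsum("nd,nd->n", diffs, diffs)  # squared L2 distance per row
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]


rng = np.random.default_rng(0)
dim = 8  # placeholder embedding size
bonafide_db = rng.normal(size=(50, dim)).astype(np.float32)
spoof_db = rng.normal(size=(50, dim)).astype(np.float32)

# Query built as a near-duplicate of bonafide entry 7, so that entry should rank first.
query = bonafide_db[7] + 0.01

bona_idx, bona_dist = top_k_l2(bonafide_db, query, k=3)
spoof_idx, spoof_dist = top_k_l2(spoof_db, query, k=3)
```

Keeping the bonafide and spoof references in separate indexes means each query always receives neighbors from both classes, regardless of which class it is closer to overall.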

Feature Type:

The paper uses WavLM features as the primary feature type. Specifically, the features are layer-wise, self-supervised representations extracted from WavLM, represented as y ∈ R^(L×T×D), where L is the number of WavLM encoder layers, T is the number of time steps, and D is the feature dimension per time step. These are then compressed via a RawNet2-based encoder (using sinc convolution and residual blocks) into compact embeddings for efficient retrieval and graph-based modeling.
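
To make the tensor layout concrete, the sketch below stacks per-layer outputs into an array of shape (L, T, D) and then reduces it to a single D-dimensional vector. The sizes and random values are placeholders, and the mean over layers and time is only a stand-in for the learned RawNet2-based encoder, which is not reproduced here.

```python
import numpy as np

# Placeholder sizes for illustration only, not the paper's values.
num_layers, num_frames, feat_dim = 4, 10, 16

rng = np.random.default_rng(0)
# One (T, D) matrix per encoder layer, standing in for WavLM hidden states.
layer_outputs = [rng.normal(size=(num_frames, feat_dim)) for _ in range(num_layers)]

# Layer-wise feature tensor y with shape (L, T, D).
y = np.stack(layer_outputs, axis=0)

# Stand-in for the learned encoder: average over layers and time steps
# to obtain one compact (D,) vector per utterance.
embedding = y.mean(axis=(0, 1))
```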

Main Paper Link


Last Accessed: 06/16/2026

NSF Award #2346473