Authors:
Rui Liu, Jinhua Zhang, Haizhou Li
Where published:
IEEE Transactions on Audio, Speech and Language Processing, Vol. 34, 2026
Abstract:
The paper proposes RDD-ADD (Retrieval-based Deep Difference Modeling), a retrieval-based audio deepfake detection method. Unlike prior approaches that use only real (bonafide) audio as references, the authors design a dual-database retrieval system that draws on both real and spoofed audio. Retrieved features are combined dynamically through anchor-based similarity fusion and GRU-based contextual fusion, and a graph-based learning module models interactions between the input audio and the retrieved samples, so the system can learn fine-grained differences between real and fake audio. Experiments on multiple benchmark datasets show that the approach outperforms existing methods and improves the robustness of retrieval-based audio deepfake detection.
Dataset names (used for):
- ASVspoof2019-LA: main benchmark for training/evaluation and ablation.
- ASVspoof2021-LA: robustness/generalization under channel and codec variability.
- ASVspoof2021-DF: deepfake detection under compressed/online-media-like conditions.
- VCTK: extra bonafide utterances to increase bonafide retrieval database diversity.
Some description of the approach:
The paper proposes RDD-ADD, a retrieval-based audio deepfake detection method that uses both bonafide and spoof audio as references. It dynamically fuses retrieved WavLM features through anchor-based and GRU-based fusion, then applies graph-based difference modeling to capture fine-grained differences between the input audio and retrieved real/fake samples.
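The summary does not give the fusion formula, so the following is only an illustrative sketch of what a similarity-weighted fusion around an anchor could look like. The function name `similarity_fusion`, the cosine similarity, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def similarity_fusion(anchor, retrieved, temperature=1.0):
    """Fuse K retrieved embeddings into one vector, weighted by similarity to an anchor.

    anchor:    (D,)   embedding of the query utterance (hypothetical input)
    retrieved: (K, D) embeddings of the retrieved reference utterances
    Returns (fused, weights) with shapes (D,) and (K,).
    Cosine similarity + softmax is an assumed choice, not taken from the paper.
    """
    anchor = np.asarray(anchor, dtype=np.float64)
    retrieved = np.asarray(retrieved, dtype=np.float64)

    # Cosine similarity between the anchor and every retrieved embedding.
    a = anchor / (np.linalg.norm(anchor) + 1e-8)
    r = retrieved / (np.linalg.norm(retrieved, axis=1, keepdims=True) + 1e-8)
    sims = r @ a

    # Numerically stable softmax turns similarities into fusion weights.
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum of the retrieved embeddings.
    fused = weights @ retrieved
    return fused, weights
```

In this sketch, references closer to the anchor contribute more to the fused vector; a GRU-based contextual fusion, as named in the summary, would instead process the retrieved features as a sequence.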
Some description of the data:
The study uses the ASVspoof logical access (LA) and deepfake (DF) subsets. The ASVspoof2019-LA training set has 2,580 bonafide utterances and 22,800 spoof utterances, the latter generated by six TTS/VC algorithms; ASVspoof2021-LA adds channel and codec variability; ASVspoof2021-DF targets deepfake detection under compressed, online-media-like conditions.
Keywords:
Audio deepfake detection; dual-database retrieval; feature fusion; graph-based learning; retrieval-based augmentation
Instance Representation:
Each instance is an audio utterance / speech waveform standardized to a fixed length of 64,600 samples through truncation or zero-padding. During inference, each query audio is represented as an embedding and used to retrieve the top-3 nearest bonafide and spoof reference samples.
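The length standardization described above can be sketched as follows. This is a minimal example, assuming truncation keeps the first 64,600 samples and zero-padding is appended at the end; the summary does not say which side is cut or padded, and the function name `fix_length` is hypothetical.

```python
import numpy as np

TARGET_LEN = 64_600  # fixed length stated in the summary

def fix_length(waveform, target_len=TARGET_LEN):
    """Truncate or zero-pad a 1-D waveform to exactly target_len samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1:
        raise ValueError("expected a 1-D waveform")
    n = waveform.shape[0]
    if n >= target_len:
        # Assumption: keep the first target_len samples.
        return waveform[:target_len]
    # Assumption: append zeros at the end of the signal.
    return np.pad(waveform, (0, target_len - n))
```

For example, `fix_length(np.ones(100))` returns an array of 64,600 samples whose first 100 values are ones and whose remaining values are zeros.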
Dataset Characteristics:
Binary bonafide/spoof audio deepfake detection datasets based on ASVspoof2019-LA, ASVspoof2021-LA, and ASVspoof2021-DF. They include genuine and TTS/VC-generated spoof utterances, unseen attack types in evaluation, channel/codec variability in 2021 LA, and compressed deepfake audio mimicking online media in 2021 DF.
Subject Area:
Audio deepfake detection; speech anti-spoofing; retrieval-augmented audio classification; graph-based deep learning for fake speech detection.
Associated Tools:
Code/demo: https://github.com/AI-S2-Lab/RDD-ADD. Main tools/models: WavLM feature extractor, AASIST classifier for fine-tuning, RawNet2 feature encoder, FAISS (IndexFlatL2), GRU, Graph Attention Network, Heterogeneous Graph, DDA module.
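FAISS `IndexFlatL2` performs exact (brute-force) nearest-neighbor search under squared L2 distance. The sketch below reproduces that behavior in plain NumPy to illustrate a dual-database top-3 lookup; it is not the authors' code. The function names `top_k_l2` and `retrieve_dual`, and the choice to take k neighbors from each database separately, are assumptions.

```python
import numpy as np

def top_k_l2(query, database, k=3):
    """Exact nearest-neighbor search under squared L2 distance.

    query:    (D,)   embedding of the input utterance
    database: (N, D) reference embeddings
    Returns (indices, squared_distances) of the k closest rows, nearest first.
    """
    diffs = database - query[None, :]
    dists = np.einsum("nd,nd->n", diffs, diffs)  # squared L2 per row
    k = min(k, database.shape[0])
    idx = np.argsort(dists, kind="stable")[:k]
    return idx, dists[idx]

def retrieve_dual(query, bonafide_db, spoof_db, k=3):
    """Retrieve k references from each database (assumed: k per database)."""
    b_idx, _ = top_k_l2(query, bonafide_db, k)
    s_idx, _ = top_k_l2(query, spoof_db, k)
    return bonafide_db[b_idx], spoof_db[s_idx]
```

With a real FAISS index, the equivalent step would be building one `IndexFlatL2` per database, adding the reference embeddings, and calling `search` with the query and k.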
Feature Type:
The paper uses WavLM features as the primary feature type. Specifically, the features are layer-wise, self-supervised representations extracted from WavLM, represented as y ∈ R^(L×T×D), where L is the number of WavLM encoder layers, T is the number of time steps, and D is the feature dimension per time step. These are then compressed via a RawNet2-based encoder (using sinc convolution and residual blocks) into compact embeddings for efficient retrieval and graph-based modeling.