Authors:
Zehui Jin, Linlong Lang, and Biao Leng.
Where published:
ICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing.
Abstract:
This paper proposes a novel cross-modal audio deepfake detection framework designed to improve generalization to unseen (out-of-domain) deepfakes. Instead of relying on a single type of feature (only the waveform or only the spectrogram), the authors combine the two modalities and align them to learn richer and more robust representations. They introduce a multi-scale fusion strategy that captures deepfake artifacts at different levels and integrates the heterogeneous features. Additionally, they use a single center loss to cluster real (bonafide) audio embeddings more tightly, which makes fake audio easier to separate, especially in unseen scenarios. In experiments on the ASVspoof2021 Deepfake and In-The-Wild datasets, the authors report that their method outperforms existing state-of-the-art approaches, particularly in generalization and robustness.
Dataset names (used for):
- ASVspoof2019 LA: training set.
- ASVspoof2021 Deepfake and In-The-Wild: evaluation datasets.
Some description of the approach:
The paper proposes WaveSpec, a cross-modal audio deepfake detection framework that combines raw waveform and spectrogram representations. It uses Wav2Vec2 as the waveform encoder, a lightweight U-Net as the spectrogram encoder, and a Cross-Modal Feature Fusion (CMFF) module to align and fuse multi-scale features. It also uses Single Center Loss to make bonafide embeddings more compact and improve detection of unseen fake audio.
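The role of the Single Center Loss can be illustrated with a minimal sketch. It follows one common formulation of the loss, originally proposed for face forgery detection: the mean distance of bonafide embeddings to a single center is minimized, and fake embeddings are pushed at least a margin farther away. Whether WaveSpec uses exactly this form is an assumption, as are the `margin` value, the label convention (0 = bonafide, 1 = fake), and the function name.

```python
import math

def single_center_loss(embeddings, labels, center, margin=0.3):
    """Illustrative single-center loss (assumed formulation, not the paper's code).

    embeddings: list of equal-length float vectors
    labels:     0 for bonafide, 1 for fake (assumed convention)
    center:     the single center that bonafide embeddings are pulled toward
    margin:     hypothetical margin scale; the paper's value is not given here
    """
    real = [math.dist(e, center) for e, y in zip(embeddings, labels) if y == 0]
    fake = [math.dist(e, center) for e, y in zip(embeddings, labels) if y == 1]
    if not real or not fake:
        raise ValueError("batch needs at least one bonafide and one fake sample")

    m_real = sum(real) / len(real)   # mean bonafide-to-center distance
    m_fake = sum(fake) / len(fake)   # mean fake-to-center distance

    # Pull bonafide samples toward the center, and penalize fakes that are
    # not at least margin * sqrt(dim) farther from it than the bonafide ones.
    hinge = max(m_real - m_fake + margin * math.sqrt(len(center)), 0.0)
    return m_real + hinge
```

In a real training loop the center would be a learnable parameter and this term would be added to a classification loss; both details are left out of the sketch.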
Some description of the data:
The model uses each audio utterance in two forms: the raw waveform and its corresponding CQT spectrogram. The spectrogram keeps both phase and magnitude information as a two-channel image. The method is trained on ASVspoof2019 LA and evaluated on ASVspoof2021 Deepfake and In-The-Wild to test out-of-domain generalization.
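As a rough illustration of the two-channel spectrogram input, the sketch below splits a complex-valued CQT matrix into a magnitude channel and a phase channel. The toy coefficients, the function name `to_two_channel`, and the channel order are assumptions made for illustration; in practice the complex CQT would come from a signal-processing library, and any scaling or normalization the authors apply is not covered here.

```python
import cmath

def to_two_channel(cqt):
    """Split a complex CQT matrix into [magnitude, phase] channels.

    cqt: 2-D list indexed as [frequency_bin][frame] of complex coefficients.
    Returns a nested list of shape (2, n_bins, n_frames).
    Channel order (magnitude first) is an assumption.
    """
    magnitude = [[abs(c) for c in row] for row in cqt]
    phase = [[cmath.phase(c) for c in row] for row in cqt]  # radians in [-pi, pi]
    return [magnitude, phase]

# Toy 2-bin, 2-frame example standing in for a real CQT.
toy_cqt = [[1 + 0j, 1j], [-2 + 0j, 3 + 4j]]
image = to_two_channel(toy_cqt)
```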
Keywords:
Audio Deepfake Detection; Multi-Modal Feature Fusion; Forgery Detection.
Instance Representation:
Each instance is an audio utterance represented through two synchronized modalities: a raw waveform and a CQT spectrogram. The CQT spectrogram is represented as a two-channel image containing phase and magnitude, and the final task label is real/fake.
Dataset Characteristics:
The paper focuses on challenging out-of-domain audio deepfake detection. The models are trained on ASVspoof2019 LA and tested on ASVspoof2021 Deepfake and In-The-Wild.
Subject Area:
Audio deepfake detection, spoofing countermeasures, cross-modal feature fusion, waveform/spectrogram representation learning, and out-of-domain forgery detection.
Associated Tools:
WaveSpec, Wav2Vec2 XLSR 300M, CQT (constant-Q transform), lightweight U-Net, CMFF, ResNet18, Adam, CosineAnnealing scheduler, and t-SNE visualization.
Feature Type:
Raw waveform features and CQT spectrogram features.
Number of Features:
The WaveSpec framework uses multi-modal features at 3 scales:
- Shallow blocks → F¹ₐ and F¹ₛ (local textures and detailed information).
- Medium blocks → F²ₐ and F²ₛ (intermediate forgery traces).
- Deep blocks → F³ₐ and F³ₛ (high-level semantic/global information).
This gives 6 feature maps in total (3 from each encoder), fused pairwise at each scale through the CMFF module.
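The pairwise, per-scale structure described above can be sketched as follows. The fusion step is only a placeholder: it concatenates the two feature vectors, whereas the actual CMFF module aligns and fuses them in a way this summary does not specify. The function names and the flat-vector representation of each feature map are assumptions.

```python
def fuse_pair(f_a, f_s):
    """Placeholder for CMFF at one scale: plain concatenation (assumed stand-in)."""
    return f_a + f_s

def fuse_multi_scale(audio_feats, spec_feats):
    """Fuse features scale by scale.

    audio_feats: [F1_a, F2_a, F3_a], one vector per scale from the waveform encoder
    spec_feats:  [F1_s, F2_s, F3_s], one vector per scale from the spectrogram encoder
    Returns one fused vector per scale (shallow, medium, deep).
    """
    if len(audio_feats) != len(spec_feats):
        raise ValueError("both encoders must provide the same number of scales")
    return [fuse_pair(a, s) for a, s in zip(audio_feats, spec_feats)]

# Six input feature vectors (3 per encoder) yield 3 fused outputs.
fused = fuse_multi_scale([[0.1, 0.2], [0.3], [0.4]], [[1.0], [2.0, 3.0], [4.0]])
```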