Wave-Spectrogram Cross-Modal Aggregation for Audio Deepfake Detection

Authors:

Zehui Jin, Linlong Lang, and Biao Leng.

Where published:

ICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing.

 

Abstract:

This paper proposes a cross-modal audio deepfake detection framework designed to improve generalization to unseen (out-of-domain) deepfakes. Instead of relying on a single feature type (waveform only or spectrogram only), the authors combine the waveform and spectrogram modalities and align them to learn richer, more robust representations. A multi-scale fusion strategy captures deepfake artifacts at different levels and integrates the heterogeneous features. In addition, a single center loss clusters real (bonafide) audio embeddings more tightly, which makes fake audio easier to separate, especially in unseen scenarios. Experiments on ASVspoof2021 Deepfake and In-The-Wild show that the method outperforms existing state-of-the-art approaches, particularly in generalization and robustness.

 

Dataset names (used for):

  • ASVspoof2019 LA: training set.
  • ASVspoof2021 Deepfake and In-The-Wild: evaluation datasets.

 

Some description of the approach:

The paper proposes WaveSpec, a cross-modal audio deepfake detection framework that combines raw waveform and spectrogram representations. It uses Wav2Vec2 as the waveform encoder, a lightweight U-Net as the spectrogram encoder, and a Cross-Modal Feature Fusion (CMFF) module to align and fuse multi-scale features. It also uses Single Center Loss to make bonafide embeddings more compact and improve detection of unseen fake audio.
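The compacting effect of Single Center Loss can be illustrated with a small NumPy sketch. It follows the single-center loss formulation known from face forgery detection: the mean distance of bonafide embeddings to one center, plus a hinge term that asks fake embeddings to sit farther from that center by a margin scaled with the square root of the embedding dimension. Whether WaveSpec uses exactly this form is an assumption, and the function name, the label convention (1 = bonafide), and the margin value are illustrative only.

```python
import numpy as np

def single_center_loss(embeddings, labels, center, margin=0.3):
    """Single-center-style loss for one batch (illustrative sketch).

    embeddings: (N, D) array of utterance embeddings
    labels:     (N,) array, 1 = bonafide, 0 = fake (assumed convention)
    center:     (D,) array, the bonafide center (learnable in a real model)
    margin:     hypothetical hyperparameter controlling the required gap
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    real = labels == 1
    fake = ~real
    if not real.any() or not fake.any():
        raise ValueError("batch must contain both bonafide and fake samples")

    # Euclidean distance of every embedding to the single center.
    dists = np.linalg.norm(embeddings - np.asarray(center), axis=1)
    mean_real = dists[real].mean()
    mean_fake = dists[fake].mean()

    # Hinge: penalize fakes that are not at least margin * sqrt(D) farther away.
    dim = embeddings.shape[1]
    gap = mean_real - mean_fake + margin * np.sqrt(dim)
    return float(mean_real + max(gap, 0.0))
```

The loss is zero when bonafide embeddings sit on the center and fakes lie well outside the margin, and it grows both when bonafide embeddings spread out and when fakes move toward the center.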

 

Some description of the data:

The model uses each audio utterance in two forms: the raw waveform and its corresponding CQT spectrogram. The spectrogram keeps both phase and magnitude information as a two-channel image. The method is trained on ASVspoof2019 LA and evaluated on ASVspoof2021 Deepfake and In-The-Wild to test out-of-domain generalization.
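As a rough illustration of the two-channel representation described above, the sketch below stacks the magnitude and phase of a complex spectrogram into one image-like array. It assumes the complex CQT coefficients are already available (for example from a CQT routine such as `librosa.cqt`); a random complex array stands in for them here. The function name, the log compression of the magnitude, the channel order, and the array sizes are assumptions, not details taken from the paper.

```python
import numpy as np

def two_channel_spectrogram(complex_spec):
    """Stack magnitude and phase of a complex spectrogram into shape (2, F, T)."""
    complex_spec = np.asarray(complex_spec)
    magnitude = np.log1p(np.abs(complex_spec))  # log compression is an assumption
    phase = np.angle(complex_spec)              # radians in [-pi, pi]
    # Channel 0 = magnitude, channel 1 = phase (order chosen for illustration).
    return np.stack([magnitude, phase], axis=0).astype(np.float32)

# Random stand-in for a complex CQT: 84 frequency bins x 100 frames (hypothetical sizes).
rng = np.random.default_rng(0)
demo_cqt = rng.standard_normal((84, 100)) + 1j * rng.standard_normal((84, 100))
image = two_channel_spectrogram(demo_cqt)
```

The resulting array can be treated like a two-channel image by a convolutional encoder, while the raw waveform of the same utterance goes to the waveform encoder.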

 

Keywords:

Audio Deepfake Detection; Multi-Modal Feature Fusion; Forgery Detection.

Instance Representation:

Each instance is an audio utterance represented through two synchronized modalities: a raw waveform and a CQT spectrogram. The CQT spectrogram is represented as a two-channel image containing phase and magnitude, and the final task label is real/fake.

Dataset Characteristics:

The paper focuses on challenging out-of-domain audio deepfake detection. The models are trained on ASVspoof2019 LA and tested on ASVspoof2021 Deepfake and In-The-Wild.

Subject Area:

Audio deepfake detection, spoofing countermeasures, cross-modal feature fusion, waveform/spectrogram representation learning, and out-of-domain forgery detection.

Associated Tools:

WaveSpec, Wav2Vec2 XLSR 300M, CQT Transform, lightweight U-Net, CMFF, ResNet18, Adam, CosineAnnealing scheduler, and t-SNE visualization.

Feature Type:

Raw waveform features and CQT spectrogram features.

Number of Features:

The WaveSpec framework uses 3 scales of multi-modal features:

  • Shallow blocks: F¹ₐ and F¹ₛ (local textures and detailed information).
  • Medium blocks: F²ₐ and F²ₛ (intermediate forgery traces).
  • Deep blocks: F³ₐ and F³ₛ (high-level semantic/global information).

This gives 6 feature maps in total (3 from each encoder), fused pairwise at each scale through the CMFF module.
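The pairing of the six feature maps can be sketched as follows. This is only a structural illustration: the internals of the CMFF module are not described in this summary, so a plain stand-in (average pooling followed by concatenation) takes its place. All function names and tensor shapes are hypothetical.

```python
import numpy as np

def fuse_pair(wave_feat, spec_feat):
    """Placeholder fusion for one scale (NOT the CMFF module).

    wave_feat: (T, C_a) sequence features from the waveform encoder
    spec_feat: (C_s, H, W) feature map from the spectrogram encoder
    """
    wave_vec = wave_feat.mean(axis=0)        # pool over time
    spec_vec = spec_feat.mean(axis=(1, 2))   # pool over height and width
    return np.concatenate([wave_vec, spec_vec])

def fuse_multi_scale(wave_feats, spec_feats):
    """Fuse features pairwise at each scale, then join the per-scale results."""
    if len(wave_feats) != len(spec_feats):
        raise ValueError("both encoders must provide the same number of scales")
    fused = [fuse_pair(a, s) for a, s in zip(wave_feats, spec_feats)]
    return np.concatenate(fused)

# Three scales per encoder, six feature maps in total (hypothetical shapes).
rng = np.random.default_rng(0)
wave_feats = [rng.standard_normal((50, 16)) for _ in range(3)]
spec_feats = [rng.standard_normal((c, h, h)) for c, h in [(8, 32), (16, 16), (32, 8)]]
embedding = fuse_multi_scale(wave_feats, spec_feats)
```

Each scale contributes one fused vector, so shallow, medium, and deep cues all reach the final embedding instead of only the deepest layer's output.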

Main Paper Link


Last Accessed: 06/16/2026

NSF Award #2346473