XMAD-Bench

Authors:
Ioan-Paul Ciobanu, Andrei-Iulian Hîji, Nicolae Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu

 

When Published:
2026

 

Description:
The paper introduces XMAD-Bench, a large-scale multilingual benchmark dataset (668.8 hours) designed to evaluate audio deepfake detectors under realistic cross-domain conditions. In most prior work, training and test data come from the same generative models (the in-domain setting). The authors instead deliberately separate speakers, generation methods, and audio sources between training and testing to simulate real-world scenarios. With this setup, they show that detectors achieve near-perfect accuracy in-domain, yet their performance drops drastically in cross-domain evaluation, sometimes to near random guessing. Overall, the results show that current models generalize poorly and underline the need for more robust detection methods that work across languages, speakers, and deepfake techniques.

 

Training and Data:
XMAD-Bench includes real audio from existing speech datasets and fake audio generated from the real clips. The dataset is split into training, in-domain test, and cross-domain test sets. Detectors are trained on each language’s training set and evaluated on in-domain and cross-domain test sets.

 

Advantages:
The benchmark is multilingual, balanced, publicly released, and explicitly designed for cross-domain evaluation. It includes both real and fake samples across seven languages and provides three official splits: a training set, an in-domain test set, and a cross-domain test set. Speakers are distinct across all splits, while the cross-domain test set uses different real-audio sources and different fake-generation methods from the training set.
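
The split constraints described above can be expressed as a small consistency check. The sketch below is illustrative only: the `Clip` record, its field names, and `check_protocol` are hypothetical and do not come from the paper or its released code. It checks only the two properties stated here, namely disjoint speakers across all three splits and disjoint real-audio sources and generation methods between the training and cross-domain sets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Clip:
    # Hypothetical metadata record; the field names are illustrative assumptions.
    speaker: str
    source: str               # corpus the real audio (or reference voice) came from
    generator: Optional[str]  # fake-generation method, or None for a real clip


def check_protocol(train: List[Clip], in_domain: List[Clip],
                   cross_domain: List[Clip]) -> List[str]:
    """Return human-readable violations; an empty list means the splits comply."""
    problems = []
    splits = {"train": train, "in-domain": in_domain, "cross-domain": cross_domain}
    speakers = {name: {c.speaker for c in clips} for name, clips in splits.items()}

    # Speakers must be distinct across every pair of splits.
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = speakers[a] & speakers[b]
            if shared:
                problems.append(f"speakers shared by {a} and {b}: {sorted(shared)}")

    # Cross-domain data must use different real-audio sources than training.
    shared_sources = {c.source for c in train} & {c.source for c in cross_domain}
    if shared_sources:
        problems.append(f"sources shared by train and cross-domain: {sorted(shared_sources)}")

    # Cross-domain fakes must come from generation methods unseen in training.
    train_gens = {c.generator for c in train if c.generator is not None}
    cross_gens = {c.generator for c in cross_domain if c.generator is not None}
    shared_gens = train_gens & cross_gens
    if shared_gens:
        problems.append(f"generators shared by train and cross-domain: {sorted(shared_gens)}")

    return problems
```

Returning a list of violations rather than raising on the first one makes it easier to audit a whole metadata file in a single pass.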

 

Limitations:
The authors had to use different generative methods across languages because many TTS/VC tools do not support all target languages. They also did not attempt domain adaptation for the cross-domain setting.

 

Dependencies:
The study uses Torchvision for ResNet-18/50, Hugging Face for AST and wav2vec 2.0, Coqui TTS for several synthesis models, official repositories for models like KNN-VC, VALL-E-X, and MeloTTS, and Whisper-Large-v3 for frozen feature extraction.

 

Synthesis:
Most fake clips are generated in two stages: the transcript of a real clip is passed to a text-to-speech (TTS) model, and a voice conversion (VC) tool is then applied to the synthesized speech, using the original speaker’s voice as the reference. Some models, such as XTTSv2, YourTTS, and VALL-E-X, generate the fake sample directly from the transcript and the reference voice, with no separate VC stage.
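
The two generation routes can be sketched as plain function composition. Everything in the block below is a hypothetical placeholder: the type aliases and function names are invented for illustration, and no real TTS or VC library is called. The callables stand in for whichever tools are used for a given language.

```python
from typing import Callable, List

Audio = List[float]                                # waveform samples (illustrative type)
TextToSpeech = Callable[[str], Audio]              # hypothetical: transcript -> speech
VoiceConversion = Callable[[Audio, Audio], Audio]  # hypothetical: (speech, reference) -> speech
ZeroShotTTS = Callable[[str, Audio], Audio]        # hypothetical: (transcript, reference) -> speech


def make_fake_two_stage(transcript: str, reference: Audio,
                        tts: TextToSpeech, vc: VoiceConversion) -> Audio:
    """TTS first, then convert the synthetic voice toward the reference speaker."""
    synthetic = tts(transcript)
    return vc(synthetic, reference)


def make_fake_single_stage(transcript: str, reference: Audio,
                           zero_shot_tts: ZeroShotTTS) -> Audio:
    """Models that accept a reference voice produce the fake clip in one step."""
    return zero_shot_tts(transcript, reference)
```

Passing the models in as callables keeps the pipeline independent of any particular toolkit, which matters here because different languages rely on different generators.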

 

Dataset:
XMAD-Bench contains 668.8 hours of real and fake speech, 207K real samples, 207K fake samples, and 4,403 speakers across seven languages.

 

Preprocessing:
Real and fake clips are trimmed for silence and resampled to 16 kHz. For model input, each clip is fixed to 5 seconds (80,000 samples) by random cropping or zero-padding. Spectrograms are computed with a 320-point Short-Time Fourier Transform (STFT) using a Hann window and a hop length of 160 samples, which correspond to 20 ms and 10 ms at 16 kHz. wav2vec 2.0 consumes the raw 16 kHz waveforms directly.
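
A minimal sketch of the length-fixing step and the resulting spectrogram size, using only the Python standard library. The function names are hypothetical. Two details are assumptions not stated above: zeros are appended at the end of short clips, and frames are counted without any padding at the signal edges.

```python
import random

SAMPLE_RATE = 16_000   # Hz, from the text
CLIP_SECONDS = 5       # fixed input duration, from the text
N_FFT = 320            # STFT size, from the text
HOP = 160              # hop length in samples, from the text


def fix_length(samples, target_len=SAMPLE_RATE * CLIP_SECONDS, rng=None):
    """Randomly crop long clips; zero-pad short ones (trailing zeros are an assumption)."""
    rng = rng or random.Random()
    n = len(samples)
    if n >= target_len:
        start = rng.randint(0, n - target_len)  # inclusive bounds
        return list(samples[start:start + target_len])
    return list(samples) + [0.0] * (target_len - n)


def num_frames(num_samples, n_fft=N_FFT, hop=HOP):
    """Number of STFT frames when no edge padding is applied (an assumption)."""
    if num_samples < n_fft:
        return 0
    return 1 + (num_samples - n_fft) // hop


def num_freq_bins(n_fft=N_FFT):
    """One-sided spectrum size for a real-valued signal."""
    return n_fft // 2 + 1
```

Under these assumptions, a 5-second clip yields a spectrogram of 161 frequency bins by 499 frames. Libraries that pad the signal edges by default will report a slightly different frame count.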

 

Evaluation Metrics:
Detection performance is evaluated using accuracy (ACC), area under the ROC curve (AUC), and equal error rate (EER). The audio quality of the dataset is also assessed using SAR, SNR, SIG, BAK, and OVRL.
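
For reference, the three detection metrics can be computed from per-clip scores as sketched below. This is a plain standard-library illustration, not the authors' evaluation code. It assumes label 1 means fake, label 0 means real, and a higher score means "more likely fake". The 0.5 decision threshold used for accuracy is likewise an assumption.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of clips classified correctly at a fixed threshold (assumed 0.5)."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random fake clip outscores a random real clip (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def eer(labels, scores):
    """Error rate at the threshold where false positive and false negative rates are closest."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(n >= t for n in neg) / len(neg)  # real clips flagged as fake
        fnr = sum(p < t for p in pos) / len(pos)   # fake clips missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer
```

The pairwise AUC loop is quadratic in the number of clips, which is fine for a small illustration but too slow for a full test set; a sort-based implementation would be the practical choice there. A detector at chance level scores about 0.5 on both AUC and EER, which is the regime the cross-domain results approach.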

 

Performance:
Models often achieve near-perfect in-domain performance but drop substantially in cross-domain testing. wav2vec 2.0 and Whisper+MLP generally perform best in cross-domain settings.

 

Contributions:
The paper contributes XMAD-Bench and uses it to evaluate state-of-the-art detectors under in-domain and cross-domain conditions, showing poor generalization capacity in cross-domain settings.

 

Link to paper


Last Accessed: 06/16/2026

NSF Award #2346473