Detection Methods

This page lists research that explores the detection of audio deepfakes and voice spoofing using deep learning, machine learning, and linguistic analysis.

AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks

Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-jin Yu, Nicholas W. D. Evans

2022

This study uses a RawNet2-based encoder to extract audio representations. In contrast to RawNet2, the authors treat the output of the sinc-convolution layer as a single-channel 2-dimensional image (like a spectrogram) rather than a 1-dimensional sequence. Residual blocks with pre-activation then extract a high-level representation, which is passed to a graph module built on graph attention networks. The study introduces a heterogeneous stacking graph attention layer, which uses a heterogeneous attention mechanism and a stack node to capture artifacts that span the temporal and spectral domains.

A Deep Learning Framework for Audio DeepFake Detection

Janavi Khochare, Chaitali Joshi, Bakul Yenarkar, Shraddha Suratkar, Faruk Kazi
2021

This study uses a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN) to classify the benchmark Fake-or-Real (FoR) dataset, with Mel spectrograms as the input features. Relying only on the FoR dataset limits the generalizability of the model, since the dataset contains just one sub-type of audio deepfake, text-to-speech.

ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio DeepFake Detection

Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, and Zhengqi Wen
2025

This paper investigates whether Audio Large Language Models (ALLMs) can be used for audio deepfake detection (ADD). The authors first evaluate ALLMs in a zero-shot setting and find that they perform poorly at distinguishing real from fake audio. To address this, they propose a new framework called ALLM4ADD, which reformulates the detection task as an audio question answering problem by prompting the model with queries like “Is this audio fake or real?”. They then apply supervised fine-tuning to train the ALLM to assess audio authenticity. Through extensive experiments, the study demonstrates that this approach significantly improves detection performance, particularly in data-scarce scenarios, and highlights the potential of leveraging ALLMs for more effective deepfake detection systems.

A Review of Modern DeepFake Detection Methods: Challenges and Future Directions

Zaynab Almutairi
2022

The article reviews existing audio deepfake (AD) detection methods and compares faked audio datasets. It introduces types of AD attacks and analyzes detection methods and datasets for imitation and synthetic-based deepfakes.

A Robust Audio DeepFake Detection System via Multi-View Feature

Yujie Yang, Haochen Qin, Hang Zhou, Chengcheng Wang, Tianyu Guo, Kai Han, Yunhe Wang
2024

This study shows that the three pretrained models with the smallest performance drop and the best generalizability are Wav2Vec-XLSR [3], HuBERT [4], and WavLM [5]. It also proposes a feature fusion that incorporates all of these representations in the classifier, and it shows that another popular pre-trained audio representation [6] does not generalize well enough.

A Systematic Review of Audio DeepFake Detection Techniques for Digital Investigation

Mahra Alnaqbi, Richard Adeyemi Ikuesan
2026

This paper conducts a systematic literature review (SLR) of audio deepfake detection research using PRISMA guidelines, analyzing findings from 27 studies. It comprehensively examines detection methods, features, datasets, and evaluation metrics, comparing traditional feature-based approaches (like MFCC and LFCC) with more advanced models such as CNNs, transformers, and multimodal systems. The study finds that while modern deep learning methods achieve higher accuracy, they often fail to generalize well across real-world scenarios and different datasets. It also identifies key limitations in the field, including dataset bias, vulnerability to adversarial attacks, and sensitivity to noise. By synthesizing these insights, the paper highlights a major research gap—the need for a robust and generalizable audio deepfake detection framework that performs consistently across diverse conditions and datasets.

Audio-DeepFake Detection: Adversarial Attacks and Countermeasures

Mouna Rabhi, Spiridon Bakiras, Roberto Di Pietro

2024

The study explores vulnerabilities of audio deepfake detection systems to adversarial attacks, demonstrating that current methods like Deep4SNet can be manipulated to nearly 0% detection accuracy using GAN-based attacks. A new, generalizable defense mechanism is proposed to enhance system resilience.

Audio DeepFake Detection Using Deep Learning

Ousama A. Shaaban, Remzi Yildirim
2025

This study introduces an enhanced Siamese convolutional neural network (Siamese CNN) architecture with a novel StacLoss function and self-attention modules for efficient identification of audio deepfakes. This module directly compares unprocessed original audio with modified audio by initially applying convolutional operations and dual branches to extract complex characteristics from raw audio signals. These operations are followed by residual connections, which enhance the network’s performance.

Audio DeepFake Detection: What Has Been Achieved and What Lies Ahead

Bowen Zhang, Hui Cui, Van Nguyen, and Monica Whitty
2025

This study is a comprehensive survey of audio deepfake detection. Rather than introducing a new model, it organizes, analyzes, and critiques the research landscape so that readers can understand where the field stands and where it should go next.

Audio DeepFake Detectors vs. Real Fraud – The Fall of Benchmarks

Joanna Gajewska, Alicja Martinek, and Ewelina Bartuzi-Trokielewicz
2026

The study evaluates how well current audio deepfake detection models perform in real-world conditions, particularly on recently observed fraud cases such as scam advertisements. The authors compiled a dataset of verified real-world deepfake audio and tested state-of-the-art detectors that were originally trained on standard benchmark datasets. They found that these models perform poorly on real-world samples, revealing a significant generalization gap. To investigate this, they experimented with modifying training data by adding realistic distortions like noise, compression, and signal degradation, which led to only modest improvements. However, incorporating actual incident-based (real-world) deepfake samples into training significantly improved detection performance. They also explored aggregation techniques that combine predictions across audio segments to better detect partially manipulated recordings. Overall, the study highlights that current detectors are not robust enough for real-world use unless they are trained and evaluated on more realistic, up-to-date data.
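The segment-level aggregation mentioned above can be pictured with a minimal Python sketch. It assumes a detector has already produced one fake-probability per fixed-length segment. The function name `aggregate_segment_scores`, the three pooling rules, and the toy scores are illustrative assumptions, not the aggregation techniques evaluated in the paper.

```python
def aggregate_segment_scores(segment_scores, how="max", top_k=2):
    """Combine per-segment fake probabilities into one recording-level score.

    All three rules are hypothetical illustrations of score pooling.
    """
    if not segment_scores:
        raise ValueError("segment_scores must not be empty")
    if how == "mean":
        # Average over all segments; a short fake insert gets diluted.
        return sum(segment_scores) / len(segment_scores)
    if how == "max":
        # The most suspicious segment decides the recording-level score.
        return max(segment_scores)
    if how == "topk":
        # Average of the top_k most suspicious segments.
        top = sorted(segment_scores, reverse=True)[:top_k]
        return sum(top) / len(top)
    raise ValueError(f"unknown aggregation rule: {how}")


# Toy example: only the third segment of the recording looks manipulated.
partially_fake = [0.1, 0.1, 0.9, 0.1]
for rule in ("mean", "max", "topk"):
    print(rule, aggregate_segment_scores(partially_fake, how=rule))
```

In this toy case, mean pooling pulls the score toward the clean segments, while max and top-k pooling keep the suspicious segment visible, which is why the pooling rule matters for partially manipulated recordings.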

Audio Features Investigation for Singing Voice DeepFake Detection

Mahyar Gohari, Davide Salvi, Paolo Bestagini, and Nicola Adami
2025

This paper focuses on the specific problem of singing voice deepfake detection, which differs from standard speech deepfake detection. Instead of proposing a completely new model, the authors systematically investigate which audio features and representations work best for detecting fake singing voices. They evaluate two CNN-based detection systems using a wide range of inputs, including handcrafted features, learned representations, and pre-trained features. Through this analysis, they compare how these features perform and identify key factors that improve detection accuracy in the singing domain. The study also highlights important differences between speech and singing voice detection, showing that methods effective for speech do not always transfer well to singing. Overall, the paper provides insights and practical guidance for designing more effective models specifically tailored to singing voice deepfake detection.

Auto Annotation of Linguistic Features for Audio DeepFake Discernment

Kifekachukwu Nwosu, Chloe Evered, Zahra Khanjani, Noshaba Bhalli, Lavon Davis, Christine Mallinson, Vandana P. Janeja
2023

The study focuses on detecting audio deepfakes through linguistic analysis. It involves analyzing audio samples for specific linguistic features and testing the auto-annotation methodology.

Battling Voice Spoofing: A Review, Comparative Analysis and Generalizability Evaluation of State-of-the-Art Voice Spoofing Countermeasures

Awais Khan, Khalid Mahmood Malik, James Ryan, Mikul Saravanan

2023

The datasets used in the experiments include ASVspoof2019, ASVspoof2021, and VSDC, which are employed to evaluate the performance of various voice spoofing countermeasures.

CLAD: Robust Audio DeepFake Detection Against Manipulation Attacks with Contrastive Learning

Haolin Wu, Jing Chen, Ruiying Du, Cong Wu, Kun He, Xingcan Shang, Hao Ren, and Guowen Xu
2026

This paper studies how audio deepfake detection systems can be fooled by simple audio manipulations (like volume changes, fading, and noise), showing that even minor, human-imperceptible modifications can significantly degrade detection performance. To address this vulnerability, the authors propose a new method called CLAD (Contrastive Learning-based Audio Deepfake Detector). Their approach uses contrastive learning to make the model invariant to such manipulations by learning more stable and robust representations, and introduces an additional length loss to better cluster real (bonafide) audio in the feature space. Through extensive experiments, they demonstrate that existing detectors are highly vulnerable (with high false acceptance rates), while CLAD significantly improves robustness, reducing error rates across various attack scenarios. Overall, the paper both reveals critical weaknesses in current systems and proposes a more resilient detection framework.

Combatting Visual Fake News with a Professional Fact-Checking Tool in Education in France, Romania, Spain, and Sweden

Thomas Nygren, Mona Guath, Carl-Anton Werner Axelsson, Divina Frau-Meigs

2021

The study focuses on media and information literacy, evaluating how pupils can identify and assess the credibility of digital news using the InVID-WeVerify tool. It assesses the effectiveness of this tool in an educational context across multiple schools.

Comprehensive Layer-Wise Analysis of SSL Models for Audio DeepFake Detection

Yassine El Kheir, Younes Samih, Suraj Maharjan, Tim Polzehl, and Sebastian Möller
2025

This paper performs a detailed layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across multiple contexts, including different languages (English, Chinese, Spanish) and scenarios such as partial speech, singing, and real-world scenes. Instead of proposing a new model, the authors investigate which transformer layers contribute most to detection performance. They find that lower layers capture the most useful and discriminative features, while higher layers are less relevant for this task. Importantly, they show that using only a subset of lower layers still achieves competitive performance, which means models can be made more efficient with reduced computation and faster inference. Overall, the study provides insights into how SSL models work internally and offers practical guidance for building more efficient and effective audio deepfake detection systems.

DeepFake Audio Detection in Voice Authentication: A Spectral and CNN-Based Comprehensive Review

Ali Osman Mohammed Salih, Abdelmajid Hassan Mansour Emam, Alwalid Bashier Gism Elseed Ahmed, Mahmoud Khalifa, Abdelrazig Suliman, and Nissrein Babiker Mohammed Babiker
2025

This study presents a comprehensive review of spectral-based techniques for audio deepfake detection in the context of voice authentication systems. It analyzes how features like spectrograms, MFCC, and CQT can reveal time–frequency inconsistencies in synthetic speech and highlights the effectiveness of integrating CNN-based spoof detection modules as a preprocessing step before identity verification to improve system security. Rather than proposing a new model, the paper synthesizes existing research, identifies key limitations—such as vulnerability to advanced generative models, lack of interpretability, and reduced robustness in noisy real-world conditions—and outlines future directions including hybrid models, adversarial training, and better multilingual datasets. Overall, it provides design insights for building more robust, generalizable, and secure voice authentication systems against deepfake threats.

Detect All-Type DeepFake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Yuankun Xie, Ruibo Fu, Xiaopeng Wang, Zhiyong Wang, Songjun Cao, Long Ma, Haonan Cheng, and Long Ye
2026

The paper tackles the challenge of detecting all types of audio deepfakes (speech, sound, singing, and music) instead of focusing on just one category. The authors first build a comprehensive benchmark dataset that enables evaluation across these different audio types, addressing the lack of unified testing in prior work. They then propose a new training approach called prompt tuning self-supervised learning (PT-SSL), which adapts pre-trained models efficiently using far fewer parameters than traditional fine-tuning. To further improve cross-type detection, they introduce wavelet prompt tuning (WPT-SSL), which captures shared frequency-domain patterns across different audio types without adding extra training cost. Finally, they train a universal detector using all types of deepfake audio together and show that their method (WPT-XLSR-AASIST) significantly improves performance, achieving low error rates and better generalization across diverse audio deepfake scenarios.

End-to-End Anti-Spoofing with RawNet2

Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, Anthony Larcher

2021

RawNet2 is a DNN-based model that takes raw audio waveforms as input. It combines the original RawNet1 [1] architecture with SincNet. “The first layer of RawNet2 is essentially the same as that of SincNet”: SincNet processes raw audio waveforms directly, which avoids handcrafted feature extraction, and it employs sinc functions as predefined filter shapes [2]. The upper layers of RawNet2 consist of the same residual blocks and GRU layer as RawNet1. The authors also use “filter-wise feature map scaling (FMS) using a sigmoid function applied to residual block outputs”. FMS acts like an attention mechanism that yields more discriminative representations.
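The idea behind filter-wise feature map scaling can be sketched in a few lines of plain Python. This is a simplified, multiplicative-only illustration under stated assumptions: the feature map is a list of filters, each holding a list of time steps; the scale for each filter comes from average pooling over time followed by a linear layer and a sigmoid; and `weights` and `bias` are hypothetical stand-ins for learned parameters. It is not the reference RawNet2 implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def filter_wise_scaling(feature_map, weights, bias):
    """Scale each filter of a residual block output by a value in (0, 1).

    feature_map: list of F filters, each a list of T time steps.
    weights:     F x F matrix (hypothetical learned linear layer).
    bias:        list of F values (hypothetical learned bias).
    """
    # One summary value per filter: average over the time axis.
    pooled = [sum(channel) / len(channel) for channel in feature_map]
    # Linear layer plus sigmoid gives one scale per filter.
    scales = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]
    # Every time step of a filter is multiplied by that filter's scale.
    return [[s * x for x in channel] for s, channel in zip(scales, feature_map)]


# Toy example: zero weights and bias give a scale of 0.5 for every filter.
features = [[2.0, 4.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(filter_wise_scaling(features, zeros, [0.0, 0.0]))
```

Because each scale depends on the pooled activity of the filters, the layer can emphasize or suppress whole filters, which is the attention-like behavior described above.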

FakeVoiceFinder: An Open-Source Framework for Synthetic and DeepFake Audio Detection

Cesar Pachon and Dora Ballesteros
2026

The paper introduces FakeVoiceFinder, a unified experimental framework for audio deepfake detection that systematically evaluates how different factors affect performance. Instead of focusing only on model architectures (like CNNs or Transformers), the authors design a platform that jointly analyzes model choices, hyperparameters (e.g., learning rate, epochs), and audio representations (e.g., spectrograms, scalograms, CQT). The framework allows researchers to run controlled and reproducible experiments, compare custom models with standard baselines under the same conditions, and study the interaction between model design and data representation. Overall, they provide a comprehensive benchmarking and evaluation tool that helps better understand what combinations of models and features work best for detecting synthetic audio.

Fighting AI with AI: Fake Speech Detection Using Deep Learning

Hafiz Malik, Raghavendar Changalvala

2023

The study evaluates the performance of a deep learning-based fake speech detection method using a dataset of cloned and bona-fide speech samples. It focuses on detecting fake speech through a deep learning model using spectrograms of the audio recordings.

Fooled Twice: People Cannot Detect DeepFakes but Think They Can

Nils C. Köbis, Barbora Dolezalová, Ivan Soraperra

2019

The study investigates human detection abilities for deepfake videos, focusing on detection accuracy, cognitive biases, and overconfidence. Participants view 16 videos (8 authentic and 8 deepfakes) and provide responses on whether the video is a deepfake, along with confidence ratings and demographic information.

Forensic DeepFake Audio Detection Using Segmental Speech Features

Tianle Yang, Chengzhe Sun, Siwei Lyu, and Phil Rose
2026

This paper explores a new, interpretable approach to audio deepfake detection using segmental speech features—i.e., fine-grained acoustic properties tied to how humans physically produce sounds (articulation). The authors show that these segment-level features (commonly used in forensic voice comparison) are effective at distinguishing real from fake audio, while broader global features are less useful. Based on this, they argue that deepfake detection should not rely solely on traditional feature sets. Additionally, they propose a speaker-specific detection framework, which focuses on identifying deepfakes for a particular individual rather than generalizing across all speakers. This approach is especially valuable in forensic and security applications, where interpretability and sensitivity to individual voice characteristics are critical.

Hybrid Feature-Based Two-Stage Framework for Audio DeepFake Detection and Generative Model Attribution

Bahriye Isgor, Cengizhan Tumer, and Murat Koklu
2026

The paper proposes a two-stage hierarchical system for audio deepfake analysis that goes beyond simple real vs. fake classification by also identifying the source of the fake audio (attribution). In the first stage, the system detects whether an audio sample is real or fake, and in the second stage, it determines which generation model produced the fake voice. To support this, the authors use a hybrid feature representation (2189 features) combining embeddings from models like YAMNet, Wav2Vec 2.0, and Resemblyzer, along with 17 data augmentation techniques to handle class imbalance and improve robustness. They evaluate their approach on a dataset containing real audio and fakes generated by multiple models (e.g., FlashSpeech, VALL-E). Through experiments with several machine learning algorithms, they show that SVM (RBF kernel) performs best for detection, while Logistic Regression works best for attribution, achieving high overall accuracy. The study demonstrates that combining detection and source attribution in a hierarchical framework improves effectiveness for forensic analysis and cybersecurity applications.
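The control flow of such a two-stage hierarchy can be sketched as follows. The sketch assumes two already-trained models passed in as plain callables; the names `analyze`, `detector`, and `attributor`, and the toy stand-in models, are hypothetical and do not reproduce the paper's feature extraction or classifiers.

```python
def analyze(features, detector, attributor):
    """Run detection first and attribution only for samples flagged as fake.

    detector:   callable returning True when the sample is judged fake.
    attributor: callable returning the name of the suspected generator.
    """
    if not detector(features):
        return {"label": "real", "source": None}
    return {"label": "fake", "source": attributor(features)}


# Toy stand-ins for trained models (illustrative only).
def toy_detector(features):
    return features["fake_score"] > 0.5


def toy_attributor(features):
    return features["closest_generator"]


print(analyze({"fake_score": 0.2, "closest_generator": "A"}, toy_detector, toy_attributor))
print(analyze({"fake_score": 0.9, "closest_generator": "B"}, toy_detector, toy_attributor))
```

Keeping the stages separate means the attribution model only ever sees samples the detector has flagged, so each stage can use a different classifier, as the paper reports.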

Human Perception of Audio DeepFakes

Nicolas M. Müller, Karla Pizzi, Jennifer Williams

2022

The dataset is used in a gamified online experiment where participants distinguish between real and fake audio samples. It includes both bona-fide and deepfake audio samples, with users’ classifications and AI model predictions recorded.

Interpretable All-Type Audio DeepFake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou, Tao Wang, Jian Liu, Ruibo Fu, Xiaopeng Wang, Haonan Cheng, and Long Ye
2026

The paper explores using Audio Large Language Models (ALLMs) for all-type audio deepfake detection (speech, music, environmental sounds, etc.) while also improving interpretability. The authors first evaluate standard training approaches—supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT)—and show their limitations: SFT leads to black-box predictions, while RFT can produce unreliable or hallucinated explanations. To address this, they build a large annotated dataset (~340K samples) with frequency-time structured chain-of-thought (CoT) explanations, generated through an automatic annotation pipeline. They then propose a new training method called FT-GRPO (Frequency Time–Group Relative Policy Optimization), which combines SFT with reinforcement learning guided by frequency-time constraints to ensure meaningful and grounded reasoning. Their approach achieves strong performance across different types of audio deepfakes while also producing interpretable, physically grounded explanations for its decisions.

Investigating Voiced and Unvoiced Regions of Speech for Audio DeepFake Detection

Ganesh Sivaraman, Hemlata Tak, and Elie Khoury
2025

This paper investigates how voiced and unvoiced parts of speech contribute to audio deepfake detection, with the goal of improving both performance and interpretability. Instead of treating the audio as a whole, the authors separate speech into voiced (periodic, vowel-like sounds) and unvoiced (noise-like sounds such as fricatives) using a signal periodicity measure. They then train a graph attention-based detection model (AASIST) independently on each component and compare their effectiveness. Their findings show that unvoiced regions are more informative for detecting deepfakes, likely because synthetic models struggle to accurately reproduce subtle noise-like speech characteristics. Finally, they combine both components using score-level fusion, achieving improved performance (lower EER) compared to using full audio alone. Overall, the study demonstrates that analyzing specific speech components can enhance detection accuracy and provide more interpretable insights into how deepfake detection works.
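Score-level fusion and the equal error rate (EER) can be illustrated with a small Python sketch. It assumes that higher scores mean "more likely bona fide", uses a weighted average as the fusion rule, and approximates the EER by sweeping thresholds over the observed scores. The function names, the fusion weight, and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over the observed scores.

    A sample is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted average of two detectors' scores for the same utterances."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]


# Toy scores from two hypothetical detectors (voiced-only and unvoiced-only).
voiced_bonafide, voiced_spoof = [0.9, 0.4, 0.8], [0.5, 0.2, 0.1]
unvoiced_bonafide, unvoiced_spoof = [0.7, 0.9, 0.3], [0.2, 0.4, 0.3]

fused_bonafide = fuse_scores(voiced_bonafide, unvoiced_bonafide)
fused_spoof = fuse_scores(voiced_spoof, unvoiced_spoof)

print("voiced EER:", equal_error_rate(voiced_bonafide, voiced_spoof))
print("unvoiced EER:", equal_error_rate(unvoiced_bonafide, unvoiced_spoof))
print("fused EER:", equal_error_rate(fused_bonafide, fused_spoof))
```

The toy numbers are chosen so that the two detectors make different mistakes, which is the situation in which averaging their scores can lower the EER.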

Learning a Self-Supervised Domain-Invariant Feature Representation for Generalized Audio DeepFake Detection

Yuankun Xie, Haonan Cheng, Yutian Wang, Long Ye

2023

This study uses Wav2Vec-XLSR [3] to obtain domain-invariant feature representations before feeding the embedding to the classifier. After the Wav2Vec-XLSR front-end, the authors use a Light Convolutional Neural Network (LCNN) followed by a transformer block as the back-end. This additional step yields a feature space in which real audio samples stay together in a single cluster, while all other audio types (attacks of any kind) scatter across the feature space.

Learning to Listen and Listening to Learn: Spoofed Audio Detection Through Linguistic Data Augmentation

Zahra Khanjani, Lavon Davis, Anna Tuz, Kifekachukwu Nwosu, Christine Mallinson, Vandana P. Janeja

2023

The study utilizes a hybrid dataset composed of various spoofed audio samples, including replay attacks, Text-to-Speech, Voice Conversion, and mimicry, as well as genuine samples. This dataset aims to facilitate the development and testing of spoofed audio detection techniques.

LHCC: Cross-Domain Audio DeepFake Detection via Consistency-Aware Fusion of Acoustic and Semantic Representations

Guofu Zhang, Ming Fang, Zhaopin Su, Kejiang Chen, Weiming Zhang, and Yaofei Wang
2026

The paper proposes a new audio deepfake detection framework (LHCC: Low-High Consistency Checker) designed to improve generalization to unseen deepfake methods. Instead of relying on superficial artifacts, the authors introduce the idea that deepfakes can be detected by inconsistencies between different levels of audio representation. They split a pre-trained model into two parts: a low-level component that captures acoustic details and a high-level component that captures semantic or speaker identity information. They then design an Asymmetric Fusion Module (AFM) to combine these two representations and explicitly measure their consistency, followed by a Dependency Capture Module (DCM) to model temporal patterns of forgery. By focusing on mismatches between low- and high-level features, their method can better detect unseen manipulations. Experiments on multiple cross-domain datasets show that this approach significantly improves generalization and achieves strong performance compared to existing methods.

Measuring the Robustness of Audio DeepFake Detectors

Xiang Li, Pin-Yu Chen, Wenqi Wei
2025

This paper systematically evaluates how robust audio deepfake detection (ADD) models are under real-world audio distortions. Instead of proposing a new detection method, the authors test 10 different models (including both traditional deep learning models and modern foundation models) against 16 types of audio corruptions, such as noise, audio modifications, and compression. Their experiments reveal that while most models handle noise relatively well, they are highly vulnerable to audio modifications and compression, especially with neural codecs. The study also shows that foundation models outperform traditional models, likely due to large-scale self-supervised pretraining, and that increasing model size improves robustness but with diminishing benefits. Additionally, they demonstrate that targeted data augmentation during training can improve resilience to unseen distortions. A real-world case study on political speech deepfakes further validates their findings. Overall, the paper highlights critical weaknesses in current systems and emphasizes the need for more robust and deployment-ready audio deepfake detection frameworks.
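Additive noise at a chosen signal-to-noise ratio (SNR) is one of the simplest corruptions that can also be used for augmentation. The sketch below is a minimal standard-library illustration, assuming white Gaussian noise and a waveform stored as a list of floats. The function names and the 440 Hz test tone are assumptions; the paper's 16 corruption types and its augmentation recipe are not reproduced here.

```python
import math
import random


def mean_power(samples):
    """Average power of a waveform given as a list of floats."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, rng):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db`."""
    if not signal:
        raise ValueError("signal must not be empty")
    # SNR in dB is 10 * log10(signal_power / noise_power).
    noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Toy example: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)

# Measure the SNR that was actually achieved.
noise = [b - a for a, b in zip(clean, noisy)]
measured_snr = 10 * math.log10(mean_power(clean) / mean_power(noise))
print(round(measured_snr, 2))
```

The measured SNR differs slightly from the target because the noise is random; passing a seeded `random.Random` keeps the augmentation reproducible.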

Region-Based Optimization in Continual Learning for Audio DeepFake Detection

Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang
2025

This paper proposes a new continual learning method called Region-Based Optimization (RegO) to improve audio deepfake detection in dynamic, real-world environments where new types of deepfakes keep emerging. Instead of training a model once, their approach enables the system to adapt over time without forgetting previous knowledge. They use the Fisher information matrix to identify and divide model neurons into regions based on their importance for detecting real and fake audio. Each region is then updated differently—for example, less important regions are freely fine-tuned for fast adaptation, while more critical regions are updated carefully using constrained or orthogonal gradients to preserve prior knowledge. They also introduce an Ebbinghaus forgetting mechanism to remove redundant neurons from older tasks, improving efficiency and generalization. Through experiments, they show that RegO significantly outperforms existing continual learning methods (e.g., achieving a 21.3% improvement in EER), demonstrating better adaptability and robustness for evolving deepfake detection scenarios.
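The role of Fisher information in splitting parameters into regions can be illustrated with a deliberately simplified sketch. It assumes per-sample gradients are already available as plain lists, estimates the diagonal of the Fisher information matrix as the mean squared gradient, and assigns each parameter to one of four hypothetical regions using a single threshold. The region names, the threshold rule, and the toy gradients are assumptions; RegO's actual region definitions and update rules are described in the paper.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal Fisher information as the mean squared gradient.

    per_sample_grads: list of gradient vectors, one per training sample.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def assign_regions(fisher_real, fisher_fake, threshold):
    """Label each parameter by which class of audio it appears important for."""
    regions = []
    for f_real, f_fake in zip(fisher_real, fisher_fake):
        if f_real >= threshold and f_fake >= threshold:
            regions.append("shared")  # important for both classes
        elif f_real >= threshold:
            regions.append("real")    # important mainly for real audio
        elif f_fake >= threshold:
            regions.append("fake")    # important mainly for fake audio
        else:
            regions.append("free")    # unimportant, can be fine-tuned freely
    return regions


# Toy gradients for a model with four parameters (illustrative values).
grads_on_real = [[1.0, 0.0, 2.0, 0.1], [1.0, 0.0, 2.0, 0.1]]
grads_on_fake = [[0.0, 3.0, 1.0, 0.1], [0.0, 3.0, 1.0, 0.1]]

regions = assign_regions(
    diagonal_fisher(grads_on_real), diagonal_fisher(grads_on_fake), threshold=0.5
)
print(regions)
```

A continual learner could then apply a different update rule per label, for example unconstrained fine-tuning for "free" parameters and constrained updates for the others.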

Retrieval-Based Deep Difference Modeling for Audio DeepFake Detection

Rui Liu, Jinhua Zhang, Haizhou Li
2026

The paper proposes a new retrieval-based audio deepfake detection method called RDD-ADD (Retrieval-based Deep Difference Modeling). Unlike prior approaches that only use real (bonafide) audio as references, the authors design a dual-database retrieval system that uses both real and spoofed audio to better capture differences. They introduce anchor-based similarity fusion and GRU-based contextual fusion to combine retrieved features dynamically, and a graph-based learning module to model interactions between the input audio and retrieved samples. This allows the system to learn fine-grained differences between real and fake audio more effectively. Experiments on multiple benchmark datasets show that their approach outperforms existing methods, improving the robustness of retrieval-based audio deepfake detection.

Superlatives, Clickbaits, Appeals to Authority, Poor Grammar, or Boldface: Is Editorial Style Related to the Credibility of Online Health Messages?

Katarína Greškovicová, Radomír Masaryk, Nikola Synak, Vladimíra Cavojová

2022

The study analyzes how different editorial styles affect the perceived credibility of health messages among adolescents. It explores factors like media literacy and scientific reasoning.

Towards Generalisable and Calibrated Audio DeepFake Detection with Self-Supervised Representations

Octavian Pascu, Adriana Stan, Dan Oneata, Elisabeta Oneata, Horia Cucu

2024

This study adopts a model previously used for generalization in image deepfake detection [7] as its representation learner, with the aim of achieving generalization in spoofed audio detection (SAD). It also proposes a “direct method of estimating the uncertainty from the output probabilities of the detector, by computing the entropy over the outputs” as its calibration technique. The study is considered state of the art in terms of generalization and calibration performance.
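The entropy-based uncertainty estimate quoted above can be written down directly. The sketch below assumes the detector outputs a probability vector over its classes (for example, real and fake) and uses the Shannon entropy in nats; the normalization by the maximum entropy and the function names are illustrative choices, not details taken from the paper.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a detector's output probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_uncertainty(probs):
    """Entropy divided by its maximum, log(K), giving a value in [0, 1]."""
    return predictive_entropy(probs) / math.log(len(probs))


# Toy outputs over (real, fake): one confident and one undecided prediction.
confident = [0.99, 0.01]
undecided = [0.5, 0.5]
print(normalized_uncertainty(confident))
print(normalized_uncertainty(undecided))
```

A prediction close to 0 or 1 yields low entropy, while an even split yields the maximum, so the value can be used to flag samples on which the detector should not be trusted.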

Wave-Spectrogram Cross-Modal Aggregation for Audio DeepFake Detection

Zehui Jin, Linlong Lang, and Biao Leng
2025

This paper proposes a novel cross-modal audio deepfake detection framework designed to improve generalization to unseen (out-of-domain) deepfakes. Instead of relying on a single type of feature (like only waveform or only spectrogram), the authors combine multiple modalities and align them to learn richer and more robust representations. They introduce a multi-scale fusion strategy to capture deepfake artifacts at different levels and effectively integrate heterogeneous features. Additionally, they use a single center loss to cluster real (bonafide) audio embeddings more tightly, making it easier to distinguish fake audio, especially in unseen scenarios. Through experiments on datasets like ASVspoof2021 and In-The-Wild, they show that their method outperforms existing state-of-the-art approaches, particularly in terms of generalization and robustness.

Where Are We in Audio DeepFake Detection? A Systematic Analysis over Generative and Detection Models

Xiang Li, Pin-Yu Chen, and Wenqi Wei
2025

This paper introduces SONAR, a comprehensive framework and benchmark for detecting AI-synthesized audio, and focuses on evaluating how well current detection methods generalize across diverse and realistic scenarios. The authors build a new dataset collected from 9 different audio synthesis platforms (including modern TTS systems) and use it to uniformly evaluate both traditional deep learning models and newer foundation models. Through extensive experiments, they show that existing detection methods suffer from poor generalization, while foundation models perform better, especially due to large-scale pretraining. They also demonstrate that these models have strong cross-lingual capabilities, maintaining good performance even when trained only on English data, suggesting that detection challenges are more related to audio realism than language differences. Additionally, the paper explores few-shot fine-tuning as an efficient way to adapt models for specific use cases. Overall, the study provides a standardized benchmark, reveals key limitations in current systems, and highlights promising directions for improving generalizable audio deepfake detection.

Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction

Logan Blue, Kevin Warren, Hadi Abdullah, Cassidy Gibson, Luis Vargas, Jessica O’Dell, Kevin Butler, Patrick Traynor

2022

The dataset utilized in this study is the TIMIT Acoustic-Phonetic Continuous Speech Corpus. It is used to test the deepfake audio detection method described.

NSF Award #2346473
