AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks
Jee-weon Jung and Hee-Soo Heo and Hemlata Tak and Hye-jin Shim and Joon Son Chung and Bong-Jin Lee and Ha-jin Yu and Nicholas W. D. Evans
2022
This study uses a RawNet2-based encoder to extract audio representations. In contrast to RawNet2, it treats the output of the sinc-convolution layer as a two-dimensional, single-channel image (akin to a spectrogram) rather than a one-dimensional sequence, and applies residual blocks with pre-activation to extract high-level representations. The representation then passes to a graph module built on graph attention networks. The authors introduce a heterogeneous stacking graph attention layer that combines a heterogeneous attention mechanism with a stack node to capture artifacts across both temporal and spectral domains.
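The reinterpretation of the sinc-convolution output can be sketched as a simple reshape. The filter and sample counts below are hypothetical, chosen only to illustrate the change of view, assuming the layer produces one time series per sinc filter:

```python
import numpy as np

# Hypothetical sizes: 70 sinc filters over 16000 waveform samples.
n_filters, n_samples = 70, 16000

# Output of a sinc-convolution layer: one time series per filter.
sinc_out = np.random.randn(n_filters, n_samples)

# RawNet2 treats this as a 1-D sequence with n_filters channels:
rawnet2_view = sinc_out                    # shape (C=70, T=16000)

# AASIST instead views the same tensor as a single-channel 2-D
# "spectrogram" image, so the following residual blocks convolve
# over both the spectral (filter) and the temporal axis:
aasist_view = sinc_out[np.newaxis, :, :]   # shape (1, F=70, T=16000)

print(rawnet2_view.shape, aasist_view.shape)
```

The data are identical in both views; only the convolution geometry of the layers that follow changes.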
A Deep Learning Framework for Audio Deepfake Detection
Janavi Khochare, Chaitali Joshi, Bakul Yenarkar, Shraddha Suratkar, Faruk Kazi
2021
This study uses a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN) to classify the benchmark Fake or Real (FoR) dataset, using mel spectrograms as the input audio features. Being limited to the FoR dataset reduces the generalizability of the model, since it covers only one sub-type of audio deepfake, text-to-speech.
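A defining property of a TCN is that stacked dilated convolutions give it an exponentially growing temporal context over the spectrogram frames. The kernel size and dilation schedule below are illustrative assumptions, not values from the paper:

```python
# Receptive field of a stack of dilated 1-D convolutions, as used in a
# Temporal Convolutional Network (TCN). Kernel size and dilations here
# are illustrative assumptions, not the paper's configuration.
def tcn_receptive_field(kernel_size, dilations):
    """Each layer adds (kernel_size - 1) * dilation frames of context."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, ...) give exponential context growth:
rf = tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16])
print(rf)  # 63 mel-spectrogram frames
```

Five such layers already span 63 frames, which is why TCNs can model long-range temporal artifacts without recurrence.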
A Review of Modern Deepfake Detection Methods: Challenges and Future Directions
Zaynab Almutairi
2022
The article reviews existing audio deepfake (AD) detection methods and compares faked audio datasets. It introduces types of AD attacks and analyzes detection methods and datasets for imitation and synthetic-based deepfakes.
A Robust Audio Deepfake Detection System via Multi-view Feature
Yang, Yujie and Qin, Haochen and Zhou, Hang and Wang, Chengcheng and Guo, Tianyu and Han, Kai and Wang, Yunhe
2024
This study shows that the three pretrained models with the smallest performance drop and the greatest generalizability are Wav2Vec-XLSR [3], HuBERT [4], and WavLM [5]. It also proposes a feature fusion approach to incorporate all of these representations in the classifier, and shows that another popular pretrained audio representation [6] does not generalize well enough.
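The multi-view idea can be sketched as combining per-frame embeddings from the three front-ends before classification. The embedding dimensions below are assumptions (they depend on the checkpoints used), and plain concatenation is only a baseline stand-in for the paper's fusion module:

```python
import numpy as np

# Hypothetical per-frame embeddings from the three pretrained front-ends.
T = 100  # number of frames
wav2vec_xlsr = np.random.randn(T, 1024)
hubert       = np.random.randn(T, 768)
wavlm        = np.random.randn(T, 768)

# Simple fusion baseline: concatenate the views along the feature axis
# so the classifier sees all three representations at once. (The
# paper's actual fusion module may differ; this is only a sketch.)
fused = np.concatenate([wav2vec_xlsr, hubert, wavlm], axis=-1)
print(fused.shape)  # (100, 2560)
```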
Auto Annotation of Linguistic Features for Audio Deepfake Discernment
Kifekachukwu Nwosu, Chloe Evered, Zahra Khanjani, Noshaba Bhalli, Lavon Davis, Christine Mallinson, Vandana P. Janeja
2023
The study focuses on detecting audio deepfakes through linguistic analysis. It involves analyzing audio samples for specific linguistic features and testing the auto-annotation methodology.
Audio-deepfake detection: Adversarial attacks and countermeasures
Mouna Rabhi, Spiridon Bakiras, Roberto Di Pietro
2024
The study explores vulnerabilities of audio deepfake detection systems to adversarial attacks, demonstrating that current methods like Deep4SNet can be manipulated to nearly 0% detection accuracy using GAN-based attacks. A new, generalizable defense mechanism is proposed to enhance system resilience.
Awais Khan, Khalid Mahmood Malik, James Ryan, Mikul Saravanan
2023
The datasets used in the experiments include ASVspoof2019, ASVspoof2021, and VSDC, which are employed to evaluate the performance of various voice spoofing countermeasures.
Thomas Nygren, Mona Guath, Carl-Anton Werner Axelsson, Divina Frau-Meigs
2021
The study focuses on media and information literacy, evaluating how pupils can identify and assess the credibility of digital news using the InVID-WeVerify tool. It assesses the effectiveness of this tool in an educational context across multiple schools.
End-to-End anti-spoofing with RawNet2
Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony
2021
RawNet2 is a DNN-based model that takes raw audio waveforms as input. It combines the original RawNet1 [1] architecture with SincNet. “The first layer of RawNet2 is essentially the same as that of SincNet”: SincNet processes raw waveforms directly, avoiding handcrafted feature extraction, by employing sinc functions as predefined filter shapes [2]. The upper layers of RawNet2 reuse the residual blocks and GRU layer of RawNet1. The authors also add “filter-wise feature map scaling (FMS) using a sigmoid function applied to residual block outputs”; FMS acts like an attention mechanism to produce more discriminative representations.
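FMS can be sketched from that description: derive one sigmoid-squashed scale per filter (channel) and multiply it back onto that filter's feature map. The mean-pooling step below is an assumption about how the per-filter statistic is obtained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fms(residual_out):
    """Filter-wise feature map scaling (FMS), sketched from the paper's
    description: one sigmoid-squashed scale per filter (channel),
    multiplied back onto that filter's map, acting like a lightweight
    attention over filters."""
    # residual_out: (channels, time); mean pooling is an assumption here.
    per_filter = residual_out.mean(axis=1)       # (channels,)
    scale = sigmoid(per_filter)[:, np.newaxis]   # (channels, 1)
    return residual_out * scale

x = np.random.randn(128, 200)  # hypothetical residual block output
y = fms(x)
print(y.shape)  # (128, 200)
```

Because the sigmoid output lies in (0, 1), FMS can only attenuate filters, emphasizing the most discriminative ones relative to the rest.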
Fighting AI with AI: Fake Speech Detection using Deep Learning
Hafiz Malik, Raghavendar Changalvala
2023
The study evaluates the performance of a deep learning-based fake speech detection method using a dataset of cloned and bona-fide speech samples. It focuses on detecting fake speech through a deep learning model using spectrograms of the audio recordings.
Fooled twice: People cannot detect deepfakes but think they can
Nils C. Köbis, Barbora Dolezalová, Ivan Soraperra
2021
The study investigates human detection abilities for deepfake videos, focusing on detection accuracy, cognitive biases, and overconfidence. Participants view 16 videos (8 authentic and 8 deepfakes) and provide responses on whether the video is a deepfake, along with confidence ratings and demographic information.
Human Perception of Audio Deepfakes
Nicolas M. Müller, Karla Pizzi, Jennifer Williams
2022
The dataset is used in a gamified online experiment where participants distinguish between real and fake audio samples. It includes both bona-fide and deepfake audio samples, with users’ classifications and AI model predictions recorded.
Xie, Yuankun and Cheng, Haonan and Wang, Yutian and Ye, Long
2023
This study uses Wav2Vec-XLSR [3] to obtain domain-invariant feature representations before feeding the embeddings to the classifier. After the Wav2Vec-XLSR front-end, a Light Convolutional Neural Network (LCNN) followed by a transformer block serves as the back-end. This additional step yields a feature space in which real audio samples cluster together, while all other audio types (any kind of attack) scatter across the feature space.
Zahra Khanjani, Lavon Davis, Anna Tuz, Kifekachukwu Nwosu, Christine Mallinson, Vandana P. Janeja
2023
The study utilizes a hybrid dataset composed of various spoofed audio samples, including replay attacks, Text-to-Speech, Voice Conversion, and mimicry, as well as genuine samples. This dataset aims to facilitate the development and testing of spoofed audio detection techniques.
Katarína Greškovicová, Radomír Masaryk, Nikola Synak, Vladimíra Cavojová
2022
The study analyzes how different editorial styles affect the perceived credibility of health messages among adolescents. It explores factors like media literacy and scientific reasoning.
Towards Generalisable and Calibrated Audio Deepfake Detection with Self-supervised Representations
Pascu, Octavian and Stan, Adriana and Oneata, Dan and Oneata, Elisabeta and Cucu, Horia
2024
This study adopts a model previously used for generalization in image deepfake detection [7] as its representation learner to achieve generalization in spoofed audio detection (SAD). It also proposes a “direct method of estimating the uncertainty from the output probabilities of the detector, by computing the entropy over the outputs” as its calibration technique. The study is considered state-of-the-art in terms of generalization and calibration performance.
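The quoted uncertainty estimate is just the Shannon entropy of the detector's output distribution, which can be sketched in a few lines (binary real/fake outputs assumed for illustration):

```python
import math

def predictive_entropy(probs):
    """Entropy over the detector's output probabilities: low entropy
    means a confident prediction, high entropy means an uncertain one."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A confident real/fake prediction has low entropy...
print(predictive_entropy([0.99, 0.01]))  # ~0.056
# ...while a maximally uncertain one reaches log(2) ≈ 0.693.
print(predictive_entropy([0.5, 0.5]))
```

Thresholding this entropy gives a simple way to flag inputs the detector should not be trusted on.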
Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction
Logan Blue, Kevin Warren, Hadi Abdullah, Cassidy Gibson, Luis Vargas, Jessica O’Dell, Kevin Butler, Patrick Traynor
2022
The dataset utilized in this study is the TIMIT Acoustic-Phonetic Continuous Speech Corpus. It is used to test the deepfake audio detection method described.