AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks
Jee-weon Jung and Hee-Soo Heo and Hemlata Tak and Hye-jin Shim and Joon Son Chung and Bong-Jin Lee and Ha-jin Yu and Nicholas W. D. Evans
2022
This study uses a RawNet2-based encoder to extract audio representations. In contrast to RawNet2, it treats the output of the sinc-convolution layer as a two-dimensional, single-channel image (akin to a spectrogram) rather than a one-dimensional sequence, and applies residual blocks with pre-activation to extract high-level representations. The representation then passes to a graph module built on graph attention networks. The authors introduce a heterogeneous stacking graph attention layer that combines a heterogeneous attention mechanism with a stack node to capture artifacts across both temporal and spectral domains.
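The reinterpretation of the sinc-convolution output can be sketched as a simple reshape. The filter and sample counts below are hypothetical, chosen only to illustrate the change of view, assuming the layer produces one time series per sinc filter:

```python
import numpy as np

# Hypothetical sizes: 70 sinc filters over 16000 waveform samples.
n_filters, n_samples = 70, 16000

# Output of a sinc-convolution layer: one time series per filter.
sinc_out = np.random.randn(n_filters, n_samples)

# RawNet2 treats this as a 1-D sequence with n_filters channels:
rawnet2_view = sinc_out                    # shape (C=70, T=16000)

# AASIST instead views the same tensor as a single-channel 2-D
# "spectrogram" image, so the following residual blocks convolve
# over both the spectral (filter) and the temporal axis:
aasist_view = sinc_out[np.newaxis, :, :]   # shape (1, F=70, T=16000)

print(rawnet2_view.shape, aasist_view.shape)
```

The data are identical in both views; only the convolution geometry of the layers that follow changes.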
A Deep Learning Framework for Audio Deepfake Detection
Janavi Khochare, Chaitali Joshi, Bakul Yenarkar, Shraddha Suratkar, Faruk Kazi
2021
This study uses a Temporal Convolutional Network (TCN) and a Spatial Transformer Network (STN) to classify the benchmark Fake or Real (FoR) dataset, using mel spectrograms as the input audio features. Being limited to the FoR dataset reduces the generalizability of the model, since it covers only one sub-type of audio deepfake, text-to-speech.
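A defining property of a TCN is that stacked dilated convolutions give it an exponentially growing temporal context over the spectrogram frames. The kernel size and dilation schedule below are illustrative assumptions, not values from the paper:

```python
# Receptive field of a stack of dilated 1-D convolutions, as used in a
# Temporal Convolutional Network (TCN). Kernel size and dilations here
# are illustrative assumptions, not the paper's configuration.
def tcn_receptive_field(kernel_size, dilations):
    """Each layer adds (kernel_size - 1) * dilation frames of context."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, ...) give exponential context growth:
rf = tcn_receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16])
print(rf)  # 63 mel-spectrogram frames
```

Five such layers already span 63 frames, which is why TCNs can model long-range temporal artifacts without recurrence.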
A Review of Modern Deepfake Detection Methods: Challenges and Future Directions
Zaynab Almutairi
2022
The article reviews existing audio deepfake (AD) detection methods and compares faked audio datasets. It introduces types of AD attacks and analyzes detection methods and datasets for imitation and synthetic-based deepfakes.
A Robust Audio Deepfake Detection System via Multi-view Feature
Yang, Yujie and Qin, Haochen and Zhou, Hang and Wang, Chengcheng and Guo, Tianyu and Han, Kai and Wang, Yunhe
2024
This study shows that the three pretrained models with the smallest performance drop and the greatest generalizability are Wav2Vec-XLSR [3], HuBERT [4], and WavLM [5]. It also proposes a feature fusion approach to incorporate all of these representations in the classifier, and shows that another popular pretrained audio representation [6] does not generalize well enough.
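The multi-view idea can be sketched as combining per-frame embeddings from the three front-ends before classification. The embedding dimensions below are assumptions (they depend on the checkpoints used), and plain concatenation is only a baseline stand-in for the paper's fusion module:

```python
import numpy as np

# Hypothetical per-frame embeddings from the three pretrained front-ends.
T = 100  # number of frames
wav2vec_xlsr = np.random.randn(T, 1024)
hubert       = np.random.randn(T, 768)
wavlm        = np.random.randn(T, 768)

# Simple fusion baseline: concatenate the views along the feature axis
# so the classifier sees all three representations at once. (The
# paper's actual fusion module may differ; this is only a sketch.)
fused = np.concatenate([wav2vec_xlsr, hubert, wavlm], axis=-1)
print(fused.shape)  # (100, 2560)
```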
Auto Annotation of Linguistic Features for Audio Deepfake Discernment
Kifekachukwu Nwosu, Chloe Evered, Zahra Khanjani, Noshaba Bhalli, Lavon Davis, Christine Mallinson, Vandana P. Janeja
2023
The study focuses on detecting audio deepfakes through linguistic analysis. It involves analyzing audio samples for specific linguistic features and testing the auto-annotation methodology.
Audio-deepfake detection: Adversarial attacks and countermeasures
Mouna Rabhi, Spiridon Bakiras, Roberto Di Pietro
2024
The study explores vulnerabilities of audio deepfake detection systems to adversarial attacks, demonstrating that current methods like Deep4SNet can be manipulated to nearly 0% detection accuracy using GAN-based attacks. A new, generalizable defense mechanism is proposed to enhance system resilience.
Awais Khan, Khalid Mahmood Malik, James Ryan, Mikul Saravanan
2023
The datasets used in the experiments include ASVspoof2019, ASVspoof2021, and VSDC, which are employed to evaluate the performance of various voice spoofing countermeasures.
Thomas Nygren, Mona Guath, Carl-Anton Werner Axelsson, Divina Frau-Meigs
2021
The study focuses on media and information literacy, evaluating how pupils can identify and assess the credibility of digital news using the InVID-WeVerify tool. It assesses the effectiveness of this tool in an educational context across multiple schools.
End-to-End anti-spoofing with RawNet2
Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony
2021
RawNet2 is a DNN-based model that takes raw audio waveforms as input. It combines the original RawNet1 [1] architecture with SincNet. “The first layer of RawNet2 is essentially the same as that of SincNet”: SincNet processes raw waveforms directly, avoiding handcrafted feature extraction, by employing sinc functions as predefined filter shapes [2]. The upper layers of RawNet2 reuse the residual blocks and GRU layer of RawNet1. The authors also add “filter-wise feature map scaling (FMS) using a sigmoid function applied to residual block outputs”; FMS acts like an attention mechanism to produce more discriminative representations.
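FMS can be sketched from that description: derive one sigmoid-squashed scale per filter (channel) and multiply it back onto that filter's feature map. The mean-pooling step below is an assumption about how the per-filter statistic is obtained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fms(residual_out):
    """Filter-wise feature map scaling (FMS), sketched from the paper's
    description: one sigmoid-squashed scale per filter (channel),
    multiplied back onto that filter's map, acting like a lightweight
    attention over filters."""
    # residual_out: (channels, time); mean pooling is an assumption here.
    per_filter = residual_out.mean(axis=1)       # (channels,)
    scale = sigmoid(per_filter)[:, np.newaxis]   # (channels, 1)
    return residual_out * scale

x = np.random.randn(128, 200)  # hypothetical residual block output
y = fms(x)
print(y.shape)  # (128, 200)
```

Because the sigmoid output lies in (0, 1), FMS can only attenuate filters, emphasizing the most discriminative ones relative to the rest.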
Fighting AI with AI: Fake Speech Detection using Deep Learning
Hafiz Malik, Raghavendar Changalvala
2023
The study evaluates the performance of a deep learning-based fake speech detection method using a dataset of cloned and bona-fide speech samples. It focuses on detecting fake speech through a deep learning model using spectrograms of the audio recordings.
Fooled twice: People cannot detect deepfakes but think they can
Nils C. Köbis, Barbora Dolezalová, Ivan Soraperra
2021
The study investigates human detection abilities for deepfake videos, focusing on detection accuracy, cognitive biases, and overconfidence. Participants view 16 videos (8 authentic and 8 deepfakes) and provide responses on whether the video is a deepfake, along with confidence ratings and demographic information.
Human Perception of Audio Deepfakes
Nicolas M. Müller, Karla Pizzi, Jennifer Williams
2022
The dataset is used in a gamified online experiment where participants distinguish between real and fake audio samples. It includes both bona-fide and deepfake audio samples, with users’ classifications and AI model predictions recorded.
Xie, Yuankun and Cheng, Haonan and Wang, Yutian and Ye, Long
2023
This study uses Wav2Vec-XLSR [3] to obtain domain-invariant feature representations before feeding the embeddings to the classifier. After the Wav2Vec-XLSR front-end, a Light Convolutional Neural Network (LCNN) followed by a transformer block serves as the back-end. This additional step yields a feature space in which real audio samples cluster together, while all other audio types (any kind of attack) scatter across the feature space.
Zahra Khanjani, Lavon Davis, Anna Tuz, Kifekachukwu Nwosu, Christine Mallinson, Vandana P. Janeja
2023
The study utilizes a hybrid dataset composed of various spoofed audio samples, including replay attacks, Text-to-Speech, Voice Conversion, and mimicry, as well as genuine samples. This dataset aims to facilitate the development and testing of spoofed audio detection techniques.
Katarína Greškovicová, Radomír Masaryk, Nikola Synak, Vladimíra Cavojová
2022
The study analyzes how different editorial styles affect the perceived credibility of health messages among adolescents. It explores factors like media literacy and scientific reasoning.
Towards Generalisable and Calibrated Audio Deepfake Detection with Self-supervised Representations
Pascu, Octavian and Stan, Adriana and Oneata, Dan and Oneata, Elisabeta and Cucu, Horia
2024
This study adopts a model previously used for generalization in image deepfake detection [7] as its representation learner to achieve generalization in spoofed audio detection (SAD). It also proposes a “direct method of estimating the uncertainty from the output probabilities of the detector, by computing the entropy over the outputs” as its calibration technique. The study is considered state-of-the-art in terms of generalization and calibration performance.
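The quoted uncertainty estimate is just the Shannon entropy of the detector's output distribution, which can be sketched in a few lines (binary real/fake outputs assumed for illustration):

```python
import math

def predictive_entropy(probs):
    """Entropy over the detector's output probabilities: low entropy
    means a confident prediction, high entropy means an uncertain one."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A confident real/fake prediction has low entropy...
print(predictive_entropy([0.99, 0.01]))  # ~0.056
# ...while a maximally uncertain one reaches log(2) ≈ 0.693.
print(predictive_entropy([0.5, 0.5]))
```

Thresholding this entropy gives a simple way to flag inputs the detector should not be trusted on.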
Who Are You (I Really Wanna Know)? Detecting Audio DeepFakes Through Vocal Tract Reconstruction
Logan Blue, Kevin Warren, Hadi Abdullah, Cassidy Gibson, Luis Vargas, Jessica O’Dell, Kevin Butler, Patrick Traynor
2022
The dataset utilized in this study is the TIMIT Acoustic-Phonetic Continuous Speech Corpus. It is used to test the deepfake audio detection method described.