Authors:
Ousama A. Shaaban, Remzi Yildirim
Where published:
Engineering Reports, 2025, volume 7.
Published:
2025
Abstract:
This study introduces an enhanced Siamese convolutional neural network (Siamese CNN) architecture with a novel StacLoss function and self-attention modules for efficient identification of audio deepfakes. This module directly compares unprocessed original audio with modified audio by initially applying convolutional operations and dual branches to extract complex characteristics from raw audio signals. These operations are followed by residual connections, which enhance the network’s performance.
Dataset names (used for):
- ASVspoof2019, FakeAVCeleb.
Some description of the approach:
This study introduces an enhanced Siamese convolutional neural network (Siamese CNN) architecture with a novel StacLoss function and self-attention modules for efficient identification of audio deepfakes.
Some description of the data:
ASVspoof2019: main benchmark for evaluation of the proposed Siamese CNN model. FakeAVCeleb: used to test generalization/robustness, split into training, validation, and test sets.
Keywords:
audio deepfake; deep learning; deepfake; machine learning; Siamese CNN.
Instance Represent:
Each instance is an audio sample, and the Siamese training setup represents data as triplets: Anchor, Positive, and Negative audio samples. During preprocessing, two audio signals are transformed into Mel-Frequency Cepstral Coefficients (MFCC) to capture their time-frequency representations before being compared by the Siamese CNN.
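As a rough illustration of how Anchor/Positive/Negative triplets are typically scored, the sketch below computes the standard triplet margin loss on toy embeddings. This is not the paper's StacLoss, whose definition is not given here. The function name, the margin value, and the embedding values are all assumptions made for illustration.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss over embeddings of shape [batch, dim].

    Assumption: Euclidean distance and margin=1.0 are illustrative choices,
    not values taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-to-negative distance
    # Penalize triplets where the negative is not at least `margin` farther than the positive.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Hypothetical 2-D embeddings for two triplets.
anchor = np.array([[0.0, 0.0], [1.0, 1.0]])
positive = np.array([[0.0, 1.0], [1.0, 1.0]])
negative = np.array([[3.0, 4.0], [1.0, 1.5]])

loss = triplet_margin_loss(anchor, positive, negative)
print(loss)
```

The first triplet is already well separated and contributes zero loss; the second has a negative that sits too close to the anchor, so it is the only one that is penalized.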
Dataset Characteristics:
The ASVspoof2019 dataset includes Logical Access (LA) and Physical Access (PA) partitions. Each partition has training, development, and evaluation sets. Audio files are sampled at 16 kHz, and the paper uses MFCC and Log-Spectrogram features.
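For orientation, a log-spectrogram of a 16 kHz signal can be sketched with NumPy alone. The frame length (400 samples, 25 ms), hop (160 samples, 10 ms), Hann window, and the synthetic 1 kHz test tone are assumptions for illustration; the settings actually used in the paper are not stated here.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Log-power spectrogram of a 1-D signal, shape [n_frames, frame_len // 2 + 1].

    Assumption: 25 ms frames with a 10 ms hop at 16 kHz, a common speech
    front-end configuration, not one confirmed by the source.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)  # eps avoids log(0) on silent bins

sr = 16000                                # sampling rate from the dataset description
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)       # synthetic 1 kHz tone standing in for real audio
spec = log_spectrogram(tone)
print(spec.shape)
```

With these settings each frequency bin spans 40 Hz (16000 / 400), so the 1 kHz tone concentrates its energy in bin 25.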
Subject Area:
Audio deepfake detection, deep learning, digital media security, spoofing detection, and audio forensics.
Associated Tools:
Development utilized open-source frameworks such as TensorFlow, PyTorch, and Librosa. Training was conducted on an NVIDIA Tesla V100 GPU (16 GB) via cloud services such as AWS or GCP.
Feature Type:
MFCC, Log-Spectrogram
Number of Instances:
376,161 samples
Number of Features:
The paper uses two features in combination: LFCC (Linear Frequency Cepstral Coefficients) and MFCC (Mel-Frequency Cepstral Coefficients). The proposed enhanced Siamese CNN uses this LFCC-MFCC combination, as seen in Tables 7–9, where the feature column lists “LFCC MFCC.”