Authors:
Yujie Chen, Jiangyan Yi, Cunhang Fan, Jianhua Tao, Yong Ren, Siding Zeng, Chu Yuan Zhang, Xinrui Yan, Hao Gu, Jun Xue, Chenglong Wang, Zhao Lv, Xiaohui Zhang
Where published:
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
Abstract:
This paper proposes Region-Based Optimization (RegO), a continual learning method that improves audio deepfake detection in dynamic, real-world environments where new types of deepfakes keep emerging. Rather than training a model once, the approach lets the detector adapt over time without forgetting previously learned knowledge. The authors use the Fisher information matrix to measure how important each neuron is for detecting real and fake audio, and divide the neurons into regions accordingly. Each region is then updated differently: less important regions are fine-tuned freely for fast adaptation, while more critical regions are updated cautiously with constrained or orthogonal gradients to preserve prior knowledge. An Ebbinghaus forgetting mechanism additionally releases redundant neurons from older tasks, improving efficiency and generalization. In experiments, RegO outperforms existing continual learning methods (e.g., a 21.3% improvement in EER), showing better adaptability and robustness for evolving deepfake detection scenarios.
Dataset names (used for):
- EVDA benchmark: main continual learning benchmark for audio deepfake detection. It includes FMFCC, In the Wild, ADD 2022, ASVspoof2015, ASVspoof2019, ASVspoof2021, FoR, HAD.
Some description of the approach:
The paper proposes Region-Based Optimization (RegO), a continual learning method for audio deepfake detection. It uses the Fisher Information Matrix to divide neurons into four regions based on importance for real/fake audio detection, then applies different gradient update strategies to different regions. It also uses an Ebbinghaus forgetting mechanism to release redundant neurons from old tasks.
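The region idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the median threshold, the 0..3 region encoding, the damping factor, the per-region projection, and the retention rule are all assumptions chosen for illustration, and every function name is hypothetical. The retention curve uses the commonly quoted Ebbinghaus form exp(-t / S).

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    return np.mean(np.square(per_sample_grads), axis=0)

def assign_regions(fisher_real, fisher_fake, quantile=0.5):
    """Label each parameter with a hypothetical region code:
    0 = unimportant for both classes, 1 = important for real only,
    2 = important for fake only, 3 = important for both."""
    imp_real = fisher_real > np.quantile(fisher_real, quantile)
    imp_fake = fisher_fake > np.quantile(fisher_fake, quantile)
    return imp_real.astype(int) + 2 * imp_fake.astype(int)

def project_out(g, ref):
    """Remove from g its component along ref (no-op if ref is zero or empty)."""
    denom = np.dot(ref, ref)
    if denom == 0:
        return g
    return g - (np.dot(g, ref) / denom) * ref

def region_update(grad_new, grad_old, regions, damping=0.1):
    """Illustrative per-region rule (an assumption, not the paper's exact rule)."""
    out = np.empty_like(grad_new)
    free = regions == 0
    single = (regions == 1) | (regions == 2)
    both = regions == 3
    out[free] = grad_new[free]                                      # free fine-tuning
    out[single] = project_out(grad_new[single], grad_old[single])   # orthogonal step
    out[both] = damping * grad_new[both]                            # constrained step
    return out

def release_stale(regions, task_age, strength=2.0, threshold=0.5):
    """Move parameters whose retention exp(-age / strength) has decayed
    below the threshold back to the freely updated region 0."""
    retention = np.exp(-np.asarray(task_age, dtype=float) / strength)
    released = regions.copy()
    released[retention < threshold] = 0
    return released
```

In this sketch the orthogonality holds only within the parameters of the single-class regions; a full implementation would need to decide over which parameter subset, and against which stored directions, the projection is taken.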
Some description of the data:
The main audio experiments use EVDA, a benchmark composed of eight public datasets for audio deepfake detection under incrementally arriving synthesis algorithms. From each dataset in the EVDA baseline, 2,000 samples are randomly selected for training and 5,000 for testing. The configuration covers cross-lingual (Chinese/English) and cross-task (whole-segment/partial-segment) fake detection.
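The per-dataset sampling described above can be sketched as follows. This is an illustration under assumptions: the function name, the seed, and the idea of drawing train and test subsets from separate pools are hypothetical, and the list order simply follows the dataset listing in this card rather than any task order.

```python
import random

# Dataset names as listed in this card (order is not claimed to be the task order).
EVDA_DATASETS = ["FMFCC", "In the Wild", "ADD 2022", "ASVspoof2015",
                 "ASVspoof2019", "ASVspoof2021", "FoR", "HAD"]
N_TRAIN, N_TEST = 2000, 5000

def sample_subset(pool, n, seed=0):
    """Draw n distinct items at random from a pool of utterance IDs."""
    return random.Random(seed).sample(pool, n)

# Total number of sampled instances across the eight datasets.
TOTAL_INSTANCES = len(EVDA_DATASETS) * (N_TRAIN + N_TEST)
print(TOTAL_INSTANCES)  # 56000
```

With 8 datasets and 2,000 + 5,000 samples each, the baseline setup amounts to 56,000 sampled instances.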
Keywords:
audio deepfake detection; continual learning; Region-Based Optimization.
Instance Representation:
Each instance is a real or fake audio recording, represented as a raw waveform processed by the Wav2vec 2.0 / XLSR-53 feature extractor into 1024-dimensional speech features, then passed through a 5-layer SimpleMlp backend for binary classification as real or fake audio.
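To make the shapes concrete, here is a minimal NumPy sketch of a five-layer MLP that maps one 1024-dimensional feature vector per utterance to two logits (real vs. fake). The hidden widths, ReLU activations, and initialization are assumptions, not details taken from the paper, and the feature extractor itself is replaced by random vectors.

```python
import numpy as np

FEATURE_DIM = 1024               # feature size stated in this card
HIDDEN = [512, 256, 128, 64]     # hypothetical hidden widths
NUM_CLASSES = 2                  # real vs. fake

def init_mlp(rng, sizes):
    """Create one (weight, bias) pair per consecutive pair of layer sizes."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply the linear layers, with ReLU after every layer except the last."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
params = init_mlp(rng, [FEATURE_DIM, *HIDDEN, NUM_CLASSES])  # 5 linear layers
features = rng.normal(size=(4, FEATURE_DIM))  # stand-in for extracted features
logits = forward(params, features)            # shape (4, 2)
pred = logits.argmax(axis=1)                  # class index per utterance
```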
Dataset Characteristics:
Audio deepfake detection benchmark; incremental / continual learning setup; eight public datasets; cross-lingual Chinese and English; cross-task whole-segment and partial-segment fake detection; 2,000 train and 5,000 test samples per dataset in the EVDA baseline.
Subject Area:
Audio deepfake detection; fake audio detection; continual learning; machine learning security.
Associated Tools:
Code: https://github.com/cyjie429/RegO. Model components: Wav2vec 2.0 / XLSR-53 feature extractor, 5-layer SimpleMlp backend. Compared methods: Finetune, EWC, GDumb, GEM, CWRStar, SI, OWM, RAWM, RWM.
Feature Type:
Speech features extracted by Wav2vec 2.0 / XLSR-53.
Number of Instances:
56,000 sampled audio instances in the EVDA baseline setup: 8 datasets × (2,000 training + 5,000 test samples).
Number of Features:
1024-dimensional speech features from the Wav2vec 2.0 / XLSR-53 feature extractor.