Authors:
Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun
When Published:
2025
Description:
The paper addresses the detection of audio deepfakes generated by Audio Language Models (ALMs), whose output is highly realistic and diverse. The authors first build Codecfake, a large-scale dataset of over 1 million real and fake audio samples in English and Chinese, targeting the neural codec-to-waveform step that ALMs use to produce audio. They then propose CSAM, a co-training variant of Sharpness-Aware Minimization (SAM), to improve generalization and reduce domain ascent bias. Detectors trained on Codecfake identify codec-based audio far better than vocoder-trained detectors, and W2V2-AASIST trained with CSAM reaches the lowest overall average equal error rate (EER), 0.616%.
Training and Data:
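Neural codec models were trained on LibriTTS, then used to encode and decode real VCTK and AISHELL3 audio, which yields the fake samples. Audio deepfake detection (ADD) models were trained under three settings: vocoder-trained, codec-trained, and co-trained on both vocoder- and codec-based data.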
Advantages:
Codecfake improves detection of ALM-based audio by training detectors on neural codec-based fake audio. It covers a broad set of test conditions, two languages (English and Chinese), and over 1 million samples.
Limitations:
The A3 condition (non-speech audio events) remains difficult to detect.
Model Architecture:
The paper uses seven neural codec architectures for fake-audio generation and four ADD baseline models for detection: Mel-LCNN, W2V2-LCNN, WavLM-AASIST, and W2V2-AASIST. It also proposes CSAM, a co-training SAM strategy to reduce domain ascent bias.
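Since CSAM builds on SAM, a sketch of one plain SAM update may help show what "sharpness-aware" means. This is not the authors' CSAM: the co-training changes that address domain ascent bias are not described in this summary and are not reproduced here. The function names, the toy loss, and the `lr` and `rho` values are all illustrative assumptions.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One plain SAM update on a list of float weights.

    grad_fn is a hypothetical callable returning the loss gradient at the
    given weights. lr and rho are illustrative values, not the paper's.
    """
    grads = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grads))
    scale = rho / (norm + 1e-12)  # small constant guards against a zero gradient
    # Ascent step: move to the approximate worst-case point within radius rho.
    perturbed = [w + scale * g for w, g in zip(weights, grads)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) weights.
    sharp_grads = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grads)]


def toy_grad(weights):
    """Gradient of the toy loss sum(w_i ** 2), used only for illustration."""
    return [2.0 * w for w in weights]


updated = sam_step([1.0, -2.0], toy_grad)
```

In a co-training setup, `grad_fn` would be evaluated on a mini-batch drawn from both vocoder- and codec-based data; how CSAM balances the two domains during the ascent step is left to the paper.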
Dependencies:
The implementation uses Wav2Vec-XLS-R and WavLM-large as feature extractors, the Adam optimizer, a weighted cross-entropy loss, the official EER calculation code, and scikit-learn for confusion matrices.
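To make the weighted cross-entropy concrete, here is a dependency-free sketch of the loss for a two-class (fake vs. real) detector. It normalizes by the sum of the per-sample weights. The class indices and the weight values in the usage line are assumptions for illustration; this summary does not state which weights the authors used.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean cross-entropy over a batch.

    logits: list of per-class score lists, one list per sample.
    targets: list of integer class indices.
    class_weights: one weight per class (hypothetical values).
    """
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)
        # Numerically stable log-sum-exp of the logits.
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-likelihood of the true class
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum


# Assumed convention: index 0 = fake, index 1 = real; the rarer real class
# gets the larger (illustrative) weight.
loss = weighted_cross_entropy([[2.0, 0.0], [0.0, 0.0]], [0, 1], [1.0, 7.0])
```

Up-weighting the minority class keeps the abundant fake samples from dominating the gradient.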
Synthesis:
Fake audio is generated by seven neural codec methods: SoundStream, SpeechTokenizer, FunCodec, EnCodec, AudioDec, AcademicCodec, and DAC. Additional ALM test conditions use VALL-E, VALL-E X, and AudioGen.
Dataset:
Codecfake contains 1,058,216 audio samples, including 132,277 real samples and 925,939 fake samples. It includes English and Chinese data from VCTK and AISHELL3.
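The counts above are internally consistent, and the fake samples number exactly seven times the real ones. That matches one fake version per codec for each real sample, although this reading is an inference from the numbers rather than a statement quoted from the paper. A quick check, using a hypothetical helper name:

```python
def class_summary(real, fake):
    """Summarize class balance from raw sample counts."""
    total = real + fake
    return {
        "total": total,
        "real_fraction": real / total,
        "fake_per_real": fake / real,
    }


summary = class_summary(real=132_277, fake=925_939)
```

The 1:7 imbalance is one plausible motivation for a weighted loss.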
Preprocessing:
Audio samples were downsampled to 16 kHz and trimmed or padded to 4 seconds. For self-supervised features, frozen Wav2Vec-XLS-R and WavLM-large models were used to extract 1024-dimensional hidden-state representations.
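The fixed-length step can be sketched as follows. The summary does not say how short clips are padded, so repeating the clip is an assumption here (zero-padding is the other common choice), as is keeping the first 4 seconds of long clips. Resampling to 16 kHz and feature extraction are outside this sketch.

```python
from itertools import cycle, islice

SAMPLE_RATE = 16_000                       # Hz, from the summary
TARGET_SECONDS = 4                         # seconds, from the summary
TARGET_LEN = SAMPLE_RATE * TARGET_SECONDS  # 64,000 samples


def pad_or_trim(samples, target_len=TARGET_LEN):
    """Return a list of exactly target_len samples.

    Assumption: long clips keep their first target_len samples and short
    clips are padded by repeating the clip from the start.
    """
    samples = list(samples)
    if not samples:
        return [0.0] * target_len          # empty input becomes silence
    if len(samples) >= target_len:
        return samples[:target_len]
    return list(islice(cycle(samples), target_len))


clip = pad_or_trim([0.0] * 100_000)        # a clip longer than 4 s is trimmed
```

A fixed length lets clips be batched without per-batch padding logic.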
Evaluation Metrics:
The main evaluation metric is the equal error rate (EER); confusion matrices are computed using a 0.5 threshold.
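For readers unfamiliar with the metric, the sketch below approximates the EER by sweeping thresholds and picking the point where the false-acceptance and false-rejection rates are closest. It is a simplified stand-in, not the official EER code the authors used. It assumes each score is the detector's probability that a clip is real, with label 1 for real and 0 for fake. The confusion-matrix helper uses the same row and column layout that scikit-learn uses for sorted labels.

```python
def compute_eer(scores, labels):
    """Approximate EER. Assumes scores = P(real), labels 1 = real, 0 = fake."""
    real = [s for s, y in zip(scores, labels) if y == 1]
    fake = [s for s, y in zip(scores, labels) if y == 0]
    if not real or not fake:
        raise ValueError("need at least one real and one fake sample")
    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in fake) / len(fake)  # fakes accepted as real
        frr = sum(s < t for s in real) / len(real)   # reals rejected as fake
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def confusion_matrix(scores, labels, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] with 'real' as the positive class."""
    matrix = [[0, 0], [0, 0]]
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        matrix[y][pred] += 1  # rows = true label, columns = prediction
    return matrix


scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
eer = compute_eer(scores, labels)
matrix = confusion_matrix(scores, labels)
```

Unlike the fixed 0.5 threshold used for the confusion matrix, the EER threshold is chosen from the score distribution itself.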
Performance:
Vocoder-trained models perform poorly on Codecfake, while codec-trained models improve greatly. The best codec-trained model, W2V2-AASIST, achieves a 0.177% average EER across the codec conditions C1–C7. The final model, W2V2-AASIST trained with CSAM, achieves the lowest overall average EER across all test conditions, 0.616%.
Contributions:
The paper contributes the Codecfake dataset, validates that Codecfake-trained ADD models detect codec-based audio better than vocoder-trained models, and proposes CSAM for generalized detection across vocoder- and codec-based audio.