Codecfake

Authors:
Yuankun Xie; Yi Lu; Ruibo Fu; Zhengqi Wen; Zhiyong Wang; Jianhua Tao; Xin Qi; Xiaopeng Wang; Yukun Liu; Haonan Cheng; Long Ye; Yi Sun

When Published:
2025

Description:
The paper addresses the detection of audio deepfakes generated by audio language models (ALMs), which produce highly realistic and diverse audio. The authors first build Codecfake, a large-scale dataset of over 1 million real and fake audio samples in English and Chinese that specifically targets ALM-based audio, which is generated through a neural codec-to-waveform process. They then propose CSAM, a co-training variant of sharpness-aware minimization (SAM), to improve generalization and avoid domain bias. Detectors trained with this dataset and strategy identify ALM-generated deepfakes more reliably than vocoder-trained detectors, reaching an overall average equal error rate (EER) of 0.616% with W2V2-AASIST + CSAM.

Training and Data:
The neural codec models were trained on LibriTTS and then used to re-encode and decode VCTK and AISHELL3 audio. Audio deepfake detection (ADD) models were trained under three settings: vocoder-trained, codec-trained, and co-trained on both.

Advantages:
Codecfake supplies neural codec-based fake audio for training detectors of ALM-based audio. It covers a broad range of test conditions and two languages, with over 1 million samples.

Limitations:
The A3 condition (non-speech audio events) remains difficult to detect.

Model Architecture:
The paper uses seven neural codec architectures for fake-audio generation and four ADD baseline models for detection: Mel-LCNN, W2V2-LCNN, WavLM-AASIST, and W2V2-AASIST. It also proposes CSAM, a co-training SAM strategy to reduce domain ascent bias.
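
For readers unfamiliar with sharpness-aware minimization, the following is a minimal sketch of one generic SAM update on a toy quadratic loss. It is not the paper's CSAM: the co-training and domain-balancing parts are not reproduced here, and the loss, the function names, and the values of lr and rho are all assumptions chosen for illustration.

```python
import math

def loss(w, curv):
    """Toy quadratic loss 0.5 * sum(c_i * w_i^2); a stand-in for a detector's loss."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(curv, w))

def grad(w, curv):
    """Analytic gradient of the toy loss."""
    return [c * wi for c, wi in zip(curv, w)]

def sam_step(w, curv, lr=0.1, rho=0.05):
    """One generic SAM update (lr and rho are illustrative values)."""
    g = grad(w, curv)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad(w_adv, curv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w0 = [1.0, -2.0]
curv = [1.0, 4.0]
w1 = sam_step(w0, curv)
```

The key difference from plain gradient descent is that the gradient used for the update is evaluated at the perturbed weights, which favors flatter minima.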

Dependencies:
The implementation uses Wav2Vec-XLS-R, WavLM-large, the Adam optimizer, a weighted cross-entropy loss, the official EER calculation code, and scikit-learn for confusion matrices.
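
As an illustration of the weighted cross-entropy loss, here is a minimal standard-library sketch. The class weights, the label convention (0 = fake, 1 = real), and the function name are assumptions; this summary does not state the weights the authors used. The 1:7 weighting below simply mirrors the real-to-fake imbalance in the dataset counts.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample negative log-likelihoods.

    logits: list of per-class score lists; labels: list of class indices.
    The result is normalized by the sum of the weights of the target classes.
    """
    total, weight_sum = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        nll = log_norm - z[y]
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum

# Hypothetical weights: up-weight the rarer real class (index 1).
weights = [1.0, 7.0]
batch_logits = [[2.0, -1.0], [0.5, 0.3], [-1.0, 1.5]]
batch_labels = [0, 1, 1]
batch_loss = weighted_cross_entropy(batch_logits, batch_labels, weights)
```

Up-weighting the minority class makes each misclassified real sample cost more, which counteracts the tendency to predict the majority class.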

Synthesis:
Fake audio is generated by seven neural codec methods: SoundStream, SpeechTokenizer, FunCodec, EnCodec, AudioDec, AcademicCodec, and DAC. Additional ALM test conditions use VALL-E, VALL-E X, and AudioGen.

Dataset:
Codecfake contains 1,058,216 audio samples, including 132,277 real samples and 925,939 fake samples. It includes English and Chinese data from VCTK and AISHELL3.
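
The counts above can be checked with a few lines of arithmetic. The fake count is exactly seven times the real count, which is consistent with each real sample being re-synthesized once by each of the seven codecs; that reading is an inference from the numbers, not something stated in this summary.

```python
real = 132_277
fake = 925_939
total = 1_058_216

# The two parts add up to the reported total.
counts_consistent = (real + fake == total)

# Ratio of fake to real samples, and the share of real samples overall.
fake_per_real = fake / real
real_fraction = real / total
```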

Preprocessing:
Audio samples were downsampled to 16 kHz and trimmed or padded to 4 seconds. For self-supervised features, frozen Wav2Vec-XLS-R and WavLM-large models were used to extract 1024-dimensional hidden-state representations.
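
The length normalization can be sketched as follows. At 16 kHz, a 4-second clip is 64,000 samples. The summary does not say how clips are padded or which part is kept when trimming, so zero-padding at the end and keeping the first 4 seconds are assumptions, as is the function name. Resampling is not shown.

```python
TARGET_SR = 16_000                       # sampling rate after downsampling
CLIP_SECONDS = 4
TARGET_LEN = TARGET_SR * CLIP_SECONDS    # 64,000 samples

def fix_length(samples, target_len=TARGET_LEN, pad_value=0.0):
    """Trim or pad a 1-D sequence of samples to exactly target_len."""
    if len(samples) >= target_len:
        # Assumption: keep the beginning of the clip.
        return list(samples[:target_len])
    # Assumption: pad with silence at the end.
    return list(samples) + [pad_value] * (target_len - len(samples))

short_clip = [0.1] * 10_000
long_clip = [0.1] * 70_000
fixed_short = fix_length(short_clip)
fixed_long = fix_length(long_clip)
```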

Evaluation Metrics:
The main evaluation metric is the equal error rate (EER). Confusion matrices are computed at a decision threshold of 0.5.
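
To make the two measurements concrete, here is a simplified standard-library sketch. It is not the official EER code referenced under Dependencies: it scans only the observed scores as thresholds and takes the point where the false acceptance and false rejection rates are closest. Treating higher scores as "more likely real" and real as the positive class are assumptions, as are the function names and the toy scores.

```python
def compute_eer(real_scores, fake_scores):
    """Approximate EER: returns (eer, threshold) where FAR and FRR are closest."""
    thresholds = sorted(set(real_scores) | set(fake_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best = None
    for t in thresholds:
        far = sum(s >= t for s in fake_scores) / len(fake_scores)  # fakes accepted
        frr = sum(s < t for s in real_scores) / len(real_scores)   # reals rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

def confusion_at(real_scores, fake_scores, threshold=0.5):
    """Confusion counts at a fixed threshold, with real as the positive class."""
    tp = sum(s >= threshold for s in real_scores)
    fp = sum(s >= threshold for s in fake_scores)
    return {"tp": tp, "fn": len(real_scores) - tp,
            "fp": fp, "tn": len(fake_scores) - fp}

real_scores = [0.9, 0.8, 0.7, 0.4]
fake_scores = [0.1, 0.2, 0.3, 0.6]
eer, eer_threshold = compute_eer(real_scores, fake_scores)
cm = confusion_at(real_scores, fake_scores)
```

In this toy example one real sample scores below one fake sample, so the two error rates meet at 25%. The EER is threshold-free, while the confusion matrix depends on the fixed 0.5 cut-off.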

Performance:
Vocoder-trained models perform poorly on Codecfake, while codec-trained models improve greatly. The best codec-trained model, W2V2-AASIST, achieves 0.177% average EER across C1–C7. The final W2V2-AASIST + CSAM achieves the lowest overall average EER of 0.616%.

Contributions:
The paper contributes the Codecfake dataset, validates that Codecfake-trained ADD models detect codec-based audio better than vocoder-trained models, and proposes CSAM for generalized detection across vocoder- and codec-based audio.

Link to paper

Last Accessed: 06/16/2026

NSF Award #2346473
