Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection

Authors:

Xie, Yuankun and Cheng, Haonan and Wang, Yutian and Ye, Long

Where published:

INTERSPEECH

 

Dataset names (used for):

  • ASVspoof 2019 LA
  • WaveFake
  • FakeAVCeleb

 

Some description of the approach:

This approach uses Wav2Vec-XLSR [3] to obtain domain-invariant feature representations before feeding the embedding to the classifier. After the Wav2Vec-XLSR front-end, a Light Convolutional Neural Network (LCNN) followed by a transformer block serves as the back-end. This additional step yields a feature space in which real audio stays together in a single cluster, while fake audio of any attack type scatters across the feature space. A minimal sketch of this pipeline appears below.
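
The following PyTorch sketch is illustrative rather than a reproduction of the authors' implementation: the checkpoint name (facebook/wav2vec2-xls-r-300m), layer widths, pooling choices, and the decision to freeze the front-end are assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model


    class MFM(nn.Module):
        """Max-Feature-Map activation used in LCNNs: split the channel axis
        in half and keep the element-wise maximum of the two halves."""
        def forward(self, x):
            a, b = x.chunk(2, dim=1)
            return torch.max(a, b)


    class DeepfakeDetector(nn.Module):
        def __init__(self, ssl_name="facebook/wav2vec2-xls-r-300m",
                     d_model=64, num_classes=2):
            super().__init__()
            # Self-supervised front-end (frozen here for simplicity; the
            # paper may fine-tune it jointly with the back-end).
            self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
            for p in self.ssl.parameters():
                p.requires_grad = False
            # LCNN back-end: MFM conv blocks over the (time x feature) map.
            self.lcnn = nn.Sequential(
                nn.Conv2d(1, 2 * 16, kernel_size=3, padding=1), MFM(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 2 * d_model, kernel_size=3, padding=1), MFM(),
                nn.AdaptiveAvgPool2d((None, 1)),  # collapse the feature axis
            )
            # Transformer block to contextualize the frame sequence.
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                                   batch_first=True)
            self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
            feats = self.ssl(waveform).last_hidden_state  # (B, T, 1024)
            x = self.lcnn(feats.unsqueeze(1))             # (B, d_model, T', 1)
            x = x.squeeze(-1).transpose(1, 2)             # (B, T', d_model)
            x = self.transformer(x)
            emb = x.mean(dim=1)                           # utterance embedding
            return self.classifier(emb), emb              # logits + embedding


    # Usage: two 1-second clips of random noise standing in for 16 kHz audio.
    model = DeepfakeDetector()
    logits, emb = model(torch.randn(2, 16000))

Returning the pooled embedding alongside the logits makes it easy to inspect whether real utterances actually cluster together in the learned feature space, as the paper claims.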

 

Some description of the data (number of data points, any other features that describe the data):

Across the training datasets, there are 26,065 real utterances and 212,035 fake utterances in total.
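
Those counts imply a roughly 8:1 fake-to-real imbalance, which matters when training a binary detector. The quick check below only restates the figures quoted above; nothing here comes from the paper's code.

    real, fake = 26_065, 212_035
    print(f"total = {real + fake:,} utterances")          # total = 238,100 utterances
    print(f"fake-to-real ratio = {fake / real:.1f} : 1")  # 8.1 : 1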

 

Keywords:

Audio deepfake detection, self-supervised representation, domain generalization, feature space

Instances Represent:

Real and fake audio utterances across multiple domains.

Dataset Characteristics:

Raw audio waveforms drawn from multiple source domains to provide domain diversity

Subject Area:

Security of audio authentication systems


Main Paper Link:


License Link:


Last Accessed: 11/26/2024

NSF Award #2346473