Authors:
Xie, Yuankun and Cheng, Haonan and Wang, Yutian and Ye, Long
Where published:
INTERSPEECH
Dataset names (used for):
- ASVspoof 2019 LA
- WaveFake
- FakeAVCeleb
Some description of the approach:
This approach uses Wav2Vec-XLSR [3] as a front-end to extract domain-invariant feature representations, which are then fed to the classifier. After the Wav2Vec-XLSR front-end, a Light Convolutional Neural Network (LCNN) followed by a transformer block serves as the back-end. This design yields a feature space in which real audio utterances cluster together, while all other audio types (any kind of attack) scatter across the feature space.
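The back-end described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's exact configuration: the Wav2Vec-XLSR front-end is stood in for by a random feature tensor, and all layer sizes (`d_model`, number of heads, pooling choices) are assumptions. The LCNN block is represented here by a single convolution with the Max-Feature-Map (MFM) activation that characterizes LCNNs.

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Max-Feature-Map: split channels in half, take the elementwise max."""
    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        return torch.max(a, b)

class BackEnd(nn.Module):
    """Hypothetical LCNN + transformer back-end over frame-level embeddings."""
    def __init__(self, feat_dim=1024, d_model=128, n_classes=2):
        super().__init__()
        self.lcnn = nn.Sequential(
            nn.Conv2d(1, 2 * d_model, kernel_size=3, padding=1),
            MFM(),                             # halves channels back to d_model
            nn.AdaptiveAvgPool2d((None, 1)),   # pool over the feature axis
        )
        enc = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=1)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        x = feats.unsqueeze(1)                 # (batch, 1, frames, feat_dim)
        x = self.lcnn(x)                       # (batch, d_model, frames, 1)
        x = x.squeeze(-1).transpose(1, 2)      # (batch, frames, d_model)
        x = self.transformer(x)
        return self.classifier(x.mean(dim=1))  # utterance-level logits

# Stand-in for Wav2Vec-XLSR output: 2 utterances, 50 frames, 1024-dim features
xlsr_feats = torch.randn(2, 50, 1024)
logits = BackEnd()(xlsr_feats)
print(logits.shape)  # torch.Size([2, 2])
```

In the paper's setting the binary logits would separate real utterances (one tight cluster in the learned feature space) from all spoofed ones.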
Some description of the data (number of data points, any other features that describe the data):
Across the training datasets, there are 26,065 real utterances and 212,035 fake utterances in total.
Keywords:
Audio deepfake detection, self-supervised representation, domain generalization, feature space
Instance Represent:
Real and fake audio utterances across multiple domains.
Dataset Characteristics:
Raw waveforms processed for domain diversity
Subject Area:
Security of audio authentication systems