Authors:
Vamshi Nallaguntla, Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila
When Published:
2026
Description:
The paper develops a phoneme-level framework for analyzing and detecting audio deepfakes. The authors create a new dataset, PhonemeDF, which pairs real and synthetic speech aligned at the phoneme level; the real samples come from LibriSpeech and the synthetic audio is generated by multiple text-to-speech (TTS) and voice conversion (VC) systems. They segment speech into phonemes with forced alignment using the Montreal Forced Aligner (MFA), then compute the Kullback–Leibler divergence (KLD) to measure how far the phoneme distributions of synthetic speech are from those of real speech. Based on this, they rank generation models by how closely they mimic natural speech. Their results show that phoneme-level KLD is strongly correlated with how well classifiers distinguish real from fake audio, which indicates that KLD can help identify the most informative phonemes for deepfake detection.
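The ranking step can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes each generation model already has one KLD value per phoneme and ranks models by the mean of those values, reading a lower mean as closer to natural speech. The function name `rank_models_by_kld`, the choice of the mean as the aggregate, the model names, and all numbers are hypothetical.

```python
from statistics import mean

def rank_models_by_kld(kld_by_model):
    """Return (model, mean KLD) pairs sorted from lowest to highest mean KLD."""
    averaged = {model: mean(per_phoneme.values())
                for model, per_phoneme in kld_by_model.items()}
    return sorted(averaged.items(), key=lambda item: item[1])

# Placeholder values for illustration only; these are not results from the paper.
example_kld = {
    "model_a": {"AA": 0.10, "S": 0.30, "T": 0.20},
    "model_b": {"AA": 0.40, "S": 0.60, "T": 0.50},
    "model_c": {"AA": 0.05, "S": 0.10, "T": 0.15},
}

ranking = rank_models_by_kld(example_kld)
for name, score in ranking:
    print(f"{name}: mean KLD = {score:.3f}")
```

Under this assumption the model listed first is the one whose phoneme distributions sit closest to real speech.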
Training and Data:
PhonemeDF is built from a 100-hour LibriSpeech subset containing 28,539 real utterances, which are used as the source data to generate synthetic counterparts with seven TTS/VC systems. The resulting PhonemeDF dataset contains 199,773 synthetic files, with an average duration of 11.37 seconds and a total duration of approximately 730 hours.
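The figures above can be cross-checked with a few lines of arithmetic. The file count follows directly from 28,539 utterances and seven systems. The 11.37-second average on its own gives about 631 hours of synthetic audio, so the stated total of roughly 730 hours is consistent with also counting the 100 hours of real speech. That reading is an assumption; the summary does not say what the total covers.

```python
REAL_UTTERANCES = 28_539   # real LibriSpeech utterances, from the summary
NUM_SYSTEMS = 7            # four TTS systems plus three VC systems
AVG_DURATION_S = 11.37     # average file duration in seconds, from the summary
REAL_HOURS = 100           # size of the LibriSpeech subset, from the summary

# One synthetic file per real utterance per synthesis system.
synthetic_files = REAL_UTTERANCES * NUM_SYSTEMS
synthetic_hours = synthetic_files * AVG_DURATION_S / 3600

print(synthetic_files)                       # 199773
print(round(synthetic_hours))                # 631
# Assumption: the ~730-hour total also includes the real speech.
print(round(synthetic_hours + REAL_HOURS))   # 731
```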
Advantages:
PhonemeDF enables phoneme-level comparison between real and synthetic speech, which helps reveal which phonemes are easier or harder for synthesis systems to reproduce. It also supports more fine-grained deepfake detection than utterance-level or frame-level datasets.
Limitations:
The dataset covers only English speech and a small set of seven synthesis models. Forced alignment may introduce small segmentation errors, and the evaluation relies mainly on statistical divergence and simple classifiers rather than perceptual validation.
Dependencies:
The main tools and models are the Montreal Forced Aligner (MFA) with a pretrained American English ARPAbet model, TextGrid files for the alignments, the handcrafted LogSpec and LFCC features, and the self-supervised (SSL) models WavLM and wav2vec 2.0.
Synthesis:
Synthetic speech was generated using four TTS systems (MeloTTS, XTTS v2, Chatterbox TTS, and VITS TTS) and three VC models (Chatterbox VC, FreeVC, and StarGAN VC). The TTS models synthesized speech from transcripts, while the VC models converted the original real audio.
Dataset:
PhonemeDF contains approximately 730 hours of speech, comprising 199,773 synthetic speech samples with their respective TextGrid files, derived from 28,539 real utterances (100 hours) from the LibriSpeech corpus.
Preprocessing:
LibriSpeech files were converted to WAV, transcripts were separated so that their filenames align with the audio files, synthetic files were resampled to 16 kHz, stress markers were removed from the phoneme labels, and MFA generated phoneme-aligned TextGrid files.
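Two of these steps are easy to illustrate: removing ARPAbet stress markers and mapping aligned intervals onto 16 kHz sample indices. The sketch below assumes the phone tier of a TextGrid has already been read into `(start, end, label)` tuples; the helper names and the example intervals are hypothetical, and the paper's own scripts may differ.

```python
import re

SAMPLE_RATE = 16_000  # Hz, the rate the synthetic files were resampled to

def strip_stress(label):
    """Remove a trailing ARPAbet stress digit (0, 1, or 2), e.g. 'AH0' -> 'AH'."""
    return re.sub(r"[012]$", "", label)

def to_sample_range(start_s, end_s, sample_rate=SAMPLE_RATE):
    """Convert an interval in seconds to (first_sample, end_sample) indices."""
    return round(start_s * sample_rate), round(end_s * sample_rate)

# Hypothetical intervals in the (start, end, label) form of a phone tier.
intervals = [
    (0.00, 0.08, "HH"),
    (0.08, 0.15, "AH0"),
    (0.15, 0.23, "L"),
    (0.23, 0.40, "OW1"),
]

# Each segment becomes (stress-free label, first sample, end sample).
segments = [(strip_stress(label), *to_sample_range(start, end))
            for start, end, label in intervals]
print(segments)
```

Stripping the digit merges stressed and unstressed variants of a vowel into one class, so each phoneme is represented by a single label.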
Evaluation Metrics:
The paper uses KLD, classification accuracy, and Pearson correlation. KLD measures distributional differences between real and synthetic phoneme embeddings; accuracy measures detectability; Pearson correlation is used to assess the relationship between phoneme-level KLD values and classification accuracy across phonemes.
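A minimal sketch of how KLD and Pearson correlation fit together is given below. The summary does not say how the distributions are estimated, so this example assumes a diagonal Gaussian fitted to each set of phoneme embeddings and computes KL(real || synthetic) in closed form. That estimator, the direction of the divergence, the function names, and all numbers are assumptions made for illustration.

```python
import math

def fit_diag_gaussian(vectors):
    """Per-dimension mean and variance of equal-length feature vectors."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    # A small constant keeps the variance strictly positive.
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / n + 1e-8
                 for d in range(dims)]
    return means, variances

def kld_diag_gaussian(real, synth):
    """KL(real || synth) between diagonal Gaussians fitted to each set."""
    mu_p, var_p = fit_diag_gaussian(real)
    mu_q, var_q = fit_diag_gaussian(synth)
    return sum(
        0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy 2-D "embeddings" for one phoneme; placeholders, not data from the paper.
real = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
synth_same = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
synth_shifted = [[3.0, 1.0], [4.0, 2.0], [5.0, 3.0]]

kld_same = kld_diag_gaussian(real, synth_same)
kld_shifted = kld_diag_gaussian(real, synth_shifted)

# Placeholder per-phoneme KLD values and accuracies, perfectly linear by design.
r = pearson([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])
print(kld_same, kld_shifted, r)
```

In this toy setup, identical sets give a divergence of zero and a shifted set gives a positive one. A correlation near 1 across phonemes would mean that phonemes with higher KLD are also the ones classifiers separate most accurately.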
Performance:
Handcrafted features such as LogSpec and LFCC show stronger acoustic mismatch between real and synthetic speech, while SSL embeddings capture subtler phonetic differences. Diphthongs, fricatives, and plosives are the most discriminative phoneme groups.
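To compare phoneme groups like these, per-phoneme scores can be averaged within each group. The sketch below uses a conventional ARPAbet grouping (stress markers already removed); the paper's exact group membership may differ, and the scores are invented placeholders rather than reported results.

```python
from statistics import mean

# Conventional ARPAbet groupings; an assumption, not taken from the paper.
PHONEME_GROUPS = {
    "diphthong": {"AY", "AW", "OY", "EY", "OW"},
    "fricative": {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"},
    "plosive": {"P", "B", "T", "D", "K", "G"},
}

def group_of(phoneme):
    """Return the group name for a phoneme, or 'other' if it is in none."""
    for group, members in PHONEME_GROUPS.items():
        if phoneme in members:
            return group
    return "other"

def mean_by_group(score_by_phoneme):
    """Average a per-phoneme score (e.g. accuracy or KLD) within each group."""
    buckets = {}
    for phoneme, score in score_by_phoneme.items():
        buckets.setdefault(group_of(phoneme), []).append(score)
    return {group: mean(scores) for group, scores in buckets.items()}

# Placeholder scores for illustration only.
scores = {"AY": 0.9, "OY": 0.8, "S": 0.7, "T": 0.6, "IY": 0.5}
group_means = mean_by_group(scores)
print(group_means)
```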
Contributions:
The paper introduces PhonemeDF, provides phoneme-level TextGrid annotations, develops a KLD/classifier-based phoneme-level detection framework, and compares handcrafted and SSL representations for deepfake detection.