Forensic deepfake audio detection using segmental speech features

Authors:

Tianle Yang, Chengzhe Sun, Siwei Lyu, and Phil Rose

Where published:

Forensic Science International, Volume 379, 2026.

Abstract:

This paper explores a new, interpretable approach to audio deepfake detection using segmental speech features—i.e., fine-grained acoustic properties tied to how humans physically produce sounds (articulation). The authors show that these segment-level features (commonly used in forensic voice comparison) are effective at distinguishing real from fake audio, while broader global features are less useful. Based on this, they argue that deepfake detection should not rely solely on traditional feature sets. Additionally, they propose a speaker-specific detection framework, which focuses on identifying deepfakes for a particular individual rather than generalizing across all speakers. This approach is especially valuable in forensic and security applications, where interpretability and sensitivity to individual voice characteristics are critical.

Description of the approach:

The study uses a speaker-specific framework that extracts segmental (phoneme-level) acoustic features from speech and compares real versus synthetic audio using Gaussian Mixture Models (GMMs) and likelihood ratios. For each speaker, two separate GMMs are trained — one on real speech tokens and one on deepfake speech tokens — and a log-likelihood ratio is computed for each token. Performance is evaluated using Cllr (log-likelihood ratio cost) and EER (equal error rate). The features examined include vowel formant midpoints (MF, specifically F1/F2/F3), long-term formant distributions (LTFD), long-term fundamental frequency (LTF0), and MFCCs. This design contrasts with speaker-independent benchmark systems by focusing on individual cases with interpretable, phonetically grounded features.
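The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes natural-log likelihood ratios in which positive values support the "real" hypothesis; the sign convention, the function names, and the example scores are illustrative assumptions, not taken from the paper.

```python
import math


def cllr(real_llrs, fake_llrs):
    """Log-likelihood-ratio cost for natural-log LLRs (positive = supports 'real')."""
    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    # Real tokens are penalised for low LLRs, fake tokens for high LLRs.
    penalty_real = sum(softplus(-s) for s in real_llrs) / len(real_llrs)
    penalty_fake = sum(softplus(s) for s in fake_llrs) / len(fake_llrs)
    # Dividing by log(2) converts the penalties from nats to bits.
    return 0.5 * (penalty_real + penalty_fake) / math.log(2)


def eer(real_llrs, fake_llrs):
    """Approximate equal error rate found by sweeping the observed scores as thresholds."""
    thresholds = sorted(set(real_llrs) | set(fake_llrs)) + [math.inf]
    best_gap, best_rate = math.inf, None
    for t in thresholds:
        miss = sum(s < t for s in real_llrs) / len(real_llrs)          # real rejected
        false_alarm = sum(s >= t for s in fake_llrs) / len(fake_llrs)  # fake accepted
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, best_rate = gap, (miss + false_alarm) / 2
    return best_rate


# Hypothetical scores, for illustration only.
real_scores = [2.1, 3.4, 0.8, -0.3]
fake_scores = [-2.5, -1.1, 0.5, -3.0]
print(cllr(real_scores, fake_scores), eer(real_scores, fake_scores))
```

Under this convention, a system that always outputs an LLR of 0 has a Cllr of exactly 1, so values below 1 indicate that the scores carry useful information.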

Description of the data:

The dataset includes recordings from six real speakers (3 female, 3 male) drawn from two sources: (1) YouTube interview recordings of native US English speakers across multiple sessions spanning several years, and (2) open-access datasets (LJ Speech and M-AILABS). Deepfake audio was generated using ElevenLabs (Multilingual v2) and Parrot AI, producing over 250 synthetic sentences per speaker and more than 1,500 total synthetic sentences. Audio was transcribed using OpenAI Whisper and force-aligned using the Montreal Forced Aligner (MFA), with approximately 8.9% of word tokens excluded after quality control.

Dataset names (used for):

  • LJ Speech and M-AILABS Speech Dataset are used as open-access real speech datasets;
  • YouTube interview recordings are collected as real-world speech data;
  • ElevenLabs and Parrot AI are used to generate synthetic/deepfake audio.

Keywords:

Deepfake audio detection; Deepfake speech; Forensic voice comparison; Likelihood ratio.

Instances Represent:

The recordings are organized as audio sentences (utterances), but the main unit of analysis is the speaker-specific speech token, in particular vowel segments. Segmental features are extracted from the force-aligned tokens, and each token is scored under the real and fake models.
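Token-level scoring under the two models can be sketched as follows. The sketch assumes the two GMMs have already been fitted and have diagonal covariances; the covariance type, the (weights, means, variances) layout, and every parameter value are hypothetical and chosen only for illustration (two dimensions standing in for F1 and F2 in Hz).

```python
import math


def gmm_logpdf(x, weights, means, variances):
    """Log density of feature vector x under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components, shifted by the maximum for stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def token_llr(x, real_gmm, fake_gmm):
    """Natural-log likelihood ratio for one token: positive favours 'real'."""
    return gmm_logpdf(x, *real_gmm) - gmm_logpdf(x, *fake_gmm)


# Hypothetical, already-fitted parameters: (weights, means, variances).
real_gmm = ([0.6, 0.4], [[500.0, 1500.0], [550.0, 1600.0]],
            [[2500.0, 10000.0], [2500.0, 10000.0]])
fake_gmm = ([1.0], [[650.0, 1300.0]], [[2500.0, 10000.0]])

print(token_llr([500.0, 1500.0], real_gmm, fake_gmm))  # positive: closer to the real model
print(token_llr([650.0, 1300.0], real_gmm, fake_gmm))  # negative: closer to the fake model
```

In practice the parameters would come from fitting each model to one speaker's real and deepfake tokens, for example with an EM-based library implementation; that step is omitted here.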

Dataset Characteristics:

The dataset is speaker-specific and designed for forensic comparison. It combines clean, studio-style recordings with real-world YouTube recordings that contain noise, overlapping speech, and other uncontrolled conditions. It covers six real speakers and six corresponding fake (hypothetical) speakers, with more than 1,500 synthetic audio sentences.

Subject Area:

Forensic deepfake audio detection, forensic voice comparison, speaker-specific synthetic speech detection, likelihood-ratio evidence evaluation, and interpretable acoustic phonetics.

Associated Tools:

ElevenLabs, Parrot AI, OpenAI Whisper (medium.en model), Montreal Forced Aligner (MFA), Praat, Parselmouth, the librosa Python library, and two-class Gaussian Mixture Models (GMMs).

Feature Type:

The paper studies interpretable segmental phonetic features and compares them with global acoustic features. Main features include vowel formant midpoints (MF), long-term formant distribution (LTFD), long-term fundamental frequency (LTF0), and MFCCs.

Number of Instances:

The paper reports more than 1,500 synthetic audio sentences (over 250 per speaker), generated from the voices of six real speakers.

Number of Features:

The core feature groups are MF, LTFD, LTF0, and MFCCs. Formant and F0 trajectories are sampled at 15 equidistant points. MFCC extraction uses the first 13 MFCCs plus their delta and delta-delta coefficients, giving 39 coefficients per frame in total.
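The two extraction details above can be illustrated with a short, dependency-free sketch. It assumes the 15 points include both endpoints of the trajectory and uses a plain frame-to-frame difference for the deltas; both are simplifying assumptions, and the function names are hypothetical. The paper's toolchain lists librosa, whose delta computation is smoother than a first difference.

```python
def sample_equidistant(track, n_points=15):
    """Linearly interpolate a trajectory at n_points equally spaced positions."""
    last = len(track) - 1
    samples = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into the track
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        samples.append(track[lo] * (1 - frac) + track[hi] * frac)
    return samples


def simple_delta(frames):
    """First-order difference between consecutive frames, padded to keep the length."""
    diffs = [[b - a for a, b in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]
    if not diffs:
        return [[0.0] * len(f) for f in frames]
    return [diffs[0]] + diffs


def stack_static_delta_delta2(mfcc_frames):
    """Concatenate static, delta, and delta-delta coefficients frame by frame."""
    d1 = simple_delta(mfcc_frames)
    d2 = simple_delta(d1)
    return [c + a + b for c, a, b in zip(mfcc_frames, d1, d2)]


# Hypothetical inputs: a 29-frame formant track and five 13-coefficient MFCC frames.
formant_track = [float(i) for i in range(29)]
mfcc_frames = [[float(i + j) for j in range(13)] for i in range(5)]

points = sample_equidistant(formant_track)
stacked = stack_static_delta_delta2(mfcc_frames)
print(len(points), len(stacked[0]))
```

Each stacked frame holds 13 static, 13 delta, and 13 delta-delta values, which is where the 39-coefficient total comes from.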

Main Paper Link:


License Link:


Last Accessed: 06/16/2026

NSF Award #2346473