Generation Methods

This page collects research on the generation of audio deepfakes and spoofed voices, using methods such as text-to-speech (TTS) and voice conversion (VC).



AdaSpeech


AdaSpeech is an adaptive text-to-speech (TTS) system designed to customize new voices efficiently and with high quality. It addresses the challenges of diverse acoustic conditions and memory efficiency in custom voice adaptation.

AI Got Your Tongue? Analysing the Sounds of Audio Deepfake Generation Methods


Karla Schäfer
2025


The study investigates how different audio deepfake generation methods leave detectable artifacts in speech. The authors created a controlled test dataset where spoofed audio (generated using four voice conversion (VC) and four text-to-speech (TTS) models) matched the same linguistic content and speakers as real (bona-fide) recordings. They then applied various acoustic feature analysis techniques to compare (1) real vs. fake audio, (2) VC vs. TTS outputs, and (3) differences across specific generation models. Their findings show that spoofed audio exhibits measurable deviations in waveform and spectral characteristics, with clear distinctions between VC and TTS methods. Some models (like XTTS and kNN-VC) produced more noticeable spectral differences, while others (like RVC) were closer to real speech. Overall, they demonstrate that feature representations such as MFCC and LFCC are effective in capturing these artifacts, making them useful for audio deepfake detection.
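
As a rough illustration of the cepstral features mentioned above, the sketch below computes linear-frequency cepstral coefficients (LFCC) for a single frame using only the Python standard library. It is not the study's pipeline: the 256-sample frame, the 20 filters, and the 13 coefficients are assumed values chosen for the example. MFCC follows the same steps but spaces the filter centres on the mel scale instead of linearly.

```python
import cmath
import math

def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a direct (slow) DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def linear_filterbank(n_filters, n_bins):
    """Triangular filters whose centres are evenly spaced in linear frequency."""
    points = [i * (n_bins - 1) / (n_filters + 1) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = points[m - 1], points[m], points[m + 1]
        row = []
        for k in range(n_bins):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_coeffs):
    """Unnormalised DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_coeffs)]

def lfcc(frame, n_filters=20, n_coeffs=13, eps=1e-10):
    """LFCC for one frame: spectrum -> linear filterbank -> log -> DCT."""
    spectrum = power_spectrum(frame)
    bank = linear_filterbank(n_filters, len(spectrum))
    log_energies = [math.log(sum(w * s for w, s in zip(row, spectrum)) + eps)
                    for row in bank]
    return dct2(log_energies, n_coeffs)

# Toy input (assumed): a pure tone standing in for one frame of speech.
frame = [math.sin(2 * math.pi * 10 * t / 256) for t in range(256)]
coeffs = lfcc(frame)
```

Comparing such coefficient vectors between bona-fide and spoofed frames with matched content is one simple way to look for the spectral deviations the study describes.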

Audio Deepfake Detection in the Age of Advanced Text-to-Speech models


Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
2026


The paper conducts a comparative study of audio deepfake detection against different TTS architectures. The authors generate a dataset of 12,000 synthetic speech samples using three modern TTS systems—Dia2, Maya1, and MeloTTS—based on the DailyDialog corpus. They then evaluate four types of detection frameworks that analyze audio from semantic, structural, and signal-level perspectives. Their experiments show that detectors often perform inconsistently: a method that works well for one type of TTS (e.g., streaming or non-autoregressive) may fail on others, especially LLM-based speech generation. To address this, they test a multi-view detection approach that combines multiple analysis levels, which proves to be more robust across all models. Overall, the study demonstrates that single-method detectors are insufficient, and integrating multiple perspectives is essential for reliable audio deepfake detection.

Audio-deepfake: Generation Methods, Legitimate Applications and the Potential for Misuse


Gueltoum Bendiab; Kamel Zeltni; Mohamed Bader-El-Den; Stavros Shiaeles
2025


The paper provides a comprehensive survey of audio deepfake technology by reviewing how modern models generate highly realistic synthetic speech, analyzing both beneficial applications (such as virtual assistants, accessibility, and entertainment) and the associated risks like fraud, misinformation, and impersonation. It examines the evolution of deepfake generation techniques, highlights why these systems are increasingly difficult to detect, and discusses the challenges they pose to existing security mechanisms. Additionally, the authors identify gaps in current detection approaches and outline future research directions aimed at improving robustness, interpretability, and ethical governance in audio deepfake systems.

Char2Wav


The paper presents Char2Wav, an end-to-end model for speech synthesis with two main components: a reader (encoder-decoder model with attention) and a neural vocoder (conditional SampleRNN). The reader converts text or phonemes to vocoder acoustic features, and the neural vocoder generates raw waveform samples.

CycleGAN-VC


CycleGAN-VC is a voice conversion method that does not rely on parallel data. It uses cycle-consistent adversarial networks (CycleGANs) with gated convolutional neural networks (CNNs) and an identity-mapping loss.
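
The cycle-consistency and identity-mapping terms can be sketched with toy stand-ins. In the example below, `to_target` and `to_source` are hypothetical placeholder functions rather than trained generators, and plain lists of floats stand in for acoustic feature frames. The adversarial loss and the gated CNNs are omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def to_target(features):
    """Hypothetical stand-in for the source-to-target generator."""
    return [2.0 * f for f in features]

def to_source(features):
    """Hypothetical stand-in for the target-to-source generator."""
    return [0.5 * f for f in features]

def cycle_consistency_loss(x, y):
    """Converting to the other domain and back should recover the input."""
    forward = l1_distance(to_source(to_target(x)), x)
    backward = l1_distance(to_target(to_source(y)), y)
    return forward + backward

def identity_mapping_loss(x, y):
    """A generator fed a sample already in its output domain should not change it."""
    return l1_distance(to_target(y), y) + l1_distance(to_source(x), x)

# Toy, non-parallel "source" and "target" features (made-up values).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
cycle = cycle_consistency_loss(x, y)
identity = identity_mapping_loss(x, y)
```

With these stand-ins the cycle term is zero because the two functions are exact inverses, while the identity term is not, which shows that the two losses constrain the generators in different ways.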

Deep Voice 3


Deep Voice 3 is a fully-convolutional, attention-based neural text-to-speech (TTS) system designed to match state-of-the-art naturalness in synthesized speech while training significantly faster than other models. It scales to large datasets and can handle multiple speakers.

DenoiSpeech


DenoiSpeech is a text-to-speech (TTS) system designed to synthesize clean speech from noisy speech data. It uses a fine-grained frame-level noise condition module that models noise at a detailed level, outperforming previous methods that use coarse-grained or pre-enhanced speech data.

FastSpeech 2


FastSpeech 2 is a non-autoregressive text-to-speech (TTS) model designed to synthesize speech faster and with higher quality than previous models. It addresses the limitations of FastSpeech by using ground-truth data for training and incorporating additional speech variation information such as pitch, energy, and accurate duration.
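
To show why duration matters in a non-autoregressive model, here is a minimal sketch of a FastSpeech-style length regulator: each phoneme representation is repeated for its number of output frames, so the full output length is known before any frame is generated. The phoneme labels and durations are made-up values, and strings stand in for hidden vectors.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state for its duration, given in frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames

# Hypothetical phonemes with durations in frames.
expanded = length_regulate(["HH", "AH", "L", "OW"], [2, 3, 1, 4])
```

Because the expanded sequence has one entry per output frame, all frames can then be predicted in parallel.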

Glow-TTS


Glow-TTS is a flow-based generative model for parallel text-to-speech (TTS) synthesis that eliminates the need for external aligners by internally learning the alignment between text and the latent representation of speech. It combines the properties of flows and dynamic programming to search for the most probable monotonic alignment, which allows for robust, diverse, and controllable speech synthesis.
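
The dynamic-programming search for the most probable monotonic alignment can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes a precomputed matrix of log-likelihoods (rows are text tokens, columns are speech frames) and the numbers are toy values.

```python
def monotonic_alignment_search(log_lik):
    """Return (alignment, score): the token index assigned to each frame.

    The alignment starts at the first token, ends at the last one, never
    moves backwards, and never skips a token.
    """
    n_text, n_frames = len(log_lik), len(log_lik[0])
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")
    neg_inf = float("-inf")
    # best[i][j]: best total log-likelihood with frame j assigned to token i.
    best = [[neg_inf] * n_frames for _ in range(n_text)]
    best[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = best[i][j - 1]
            advance = best[i - 1][j - 1] if i > 0 else neg_inf
            best[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and best[i - 1][j - 1] >= best[i][j - 1]:
            i -= 1
    return alignment, best[n_text - 1][n_frames - 1]

# Toy log-likelihoods for 3 text tokens and 5 frames (made-up values).
log_lik = [
    [0.0, 0.0, -5.0, -5.0, -5.0],
    [-5.0, -5.0, 0.0, -5.0, -5.0],
    [-5.0, -5.0, -5.0, 0.0, 0.0],
]
alignment, score = monotonic_alignment_search(log_lik)
```

Here the first two frames map to token 0, the third to token 1, and the last two to token 2, so the per-token frame counts double as durations.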

HiFi-GAN


HiFi-GAN is a generative adversarial network for speech synthesis that achieves high-fidelity audio generation with efficient computational performance. It leverages periodic pattern modeling to enhance sample quality and demonstrates significant improvements over previous models like WaveNet and WaveGlow.

LightSpeech


LightSpeech is a lightweight and fast text-to-speech (TTS) model designed using neural architecture search (NAS) to achieve smaller memory usage and lower inference latency, suitable for deployment in resource-constrained devices like mobile phones and embedded systems.

MelGAN


MelGAN is a non-autoregressive feed-forward convolutional architecture designed for audio waveform generation in a GAN setup, aimed at generating high-quality coherent waveforms for tasks such as speech synthesis and music domain translation.

MelNet


The paper presents MelNet, a generative model for audio in the frequency domain. MelNet leverages time-frequency representations (spectrograms) instead of time-domain waveforms, using a highly expressive probabilistic model and a multiscale generation procedure to capture both local and global audio structures.

Parallel WaveGAN


Parallel WaveGAN is a distillation-free, fast, and small-footprint waveform generation method based on a GAN. It is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which lets it capture the time-frequency distribution of realistic speech waveforms. The model is compact, with only 1.44M parameters, and generates 24 kHz speech 28.68 times faster than real time on a single GPU.
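
A multi-resolution spectrogram loss can be sketched as below, using a common formulation: spectral convergence plus a log-magnitude distance, averaged over several STFT settings. The FFT sizes and hop lengths are tiny assumed values so that a slow, direct DFT in pure Python stays fast; they are not the paper's settings. The adversarial term is omitted.

```python
import cmath
import math

def stft_magnitude(signal, fft_size, hop):
    """Magnitude STFT via a direct (slow) DFT with a Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / fft_size)
              for n in range(fft_size)]
    frames = []
    for start in range(0, len(signal) - fft_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(fft_size)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / fft_size)
                    for n in range(fft_size)))
            for k in range(fft_size // 2 + 1)
        ])
    return frames

def stft_loss(reference, estimate, fft_size, hop, eps=1e-7):
    """Spectral convergence plus mean log-magnitude distance at one resolution."""
    ref = [m for f in stft_magnitude(reference, fft_size, hop) for m in f]
    est = [m for f in stft_magnitude(estimate, fft_size, hop) for m in f]
    diff_norm = math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)))
    ref_norm = math.sqrt(sum(r ** 2 for r in ref))
    spectral_convergence = diff_norm / max(ref_norm, eps)
    log_magnitude = sum(abs(math.log(r + eps) - math.log(e + eps))
                        for r, e in zip(ref, est)) / len(ref)
    return spectral_convergence + log_magnitude

def multi_resolution_stft_loss(reference, estimate,
                               resolutions=((32, 8), (64, 16), (128, 32))):
    """Average the single-resolution loss over (fft_size, hop) pairs."""
    losses = [stft_loss(reference, estimate, n, h) for n, h in resolutions]
    return sum(losses) / len(losses)

# Toy signals (assumed): two tones of different frequency.
reference = [math.sin(2 * math.pi * 5 * t / 256) for t in range(256)]
estimate = [math.sin(2 * math.pi * 7 * t / 256) for t in range(256)]
loss_same = multi_resolution_stft_loss(reference, reference)
loss_diff = multi_resolution_stft_loss(reference, estimate)
```

Using several resolutions means neither time detail nor frequency detail dominates: short windows resolve timing, long windows resolve pitch.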

PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation


Vamshi Nallaguntla, Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

2026


The paper develops a phoneme-level framework for analyzing and detecting audio deepfakes. The authors create a new dataset called PhonemeDF, which contains paired real and synthetic speech aligned at the phoneme level, using real samples from LibriSpeech and synthetic audio generated by multiple TTS and VC systems. They use forced alignment (via MFA) to segment speech into phonemes and then compute Kullback–Leibler Divergence (KLD) to measure how different the phoneme distributions of synthetic speech are from real speech. Based on this, they rank generation models by how closely they mimic natural speech. Their results show that phoneme-level differences are strongly correlated with how well classifiers can distinguish real vs. fake audio, demonstrating that KLD can help identify the most informative phonemes for deepfake detection.
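
The per-phoneme comparison can be sketched as a KL divergence between smoothed histograms of some per-phoneme measurement. Everything concrete here is an assumption for illustration: the measurement (segment duration in seconds), the bin edges, the smoothing constant, and the toy values. The paper's actual features and estimator may differ.

```python
import math

def smoothed_histogram(values, edges, eps=1e-6):
    """Normalised histogram over fixed bin edges with additive smoothing.

    Values outside the edge range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical per-phoneme durations (seconds) for real and synthetic speech.
edges = [0.0, 0.05, 0.10, 0.15, 0.20]
real = {"AA": [0.06, 0.07, 0.08, 0.11, 0.12, 0.13],
        "S": [0.06, 0.07, 0.08, 0.09, 0.11, 0.12]}
synthetic = {"AA": [0.06, 0.07, 0.09, 0.11, 0.12, 0.14],
             "S": [0.11, 0.12, 0.13, 0.14, 0.16, 0.17]}

kld = {ph: kl_divergence(smoothed_histogram(real[ph], edges),
                         smoothed_histogram(synthetic[ph], edges))
       for ph in real}
# Phonemes with the largest divergence come first.
ranking = sorted(kld, key=lambda ph: kld[ph], reverse=True)
```

In this toy data the "S" durations shift noticeably in the synthetic set, so "S" ranks as the more informative phoneme. The same ranking idea applies to generation models when the divergence is aggregated per model.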

StarGAN-VC


StarGAN-VC is a method for non-parallel many-to-many voice conversion (VC) that uses a variant of generative adversarial networks (GANs) called StarGAN.

Tacotron 2


Tacotron 2 is a neural network architecture for speech synthesis directly from text. It combines a recurrent sequence-to-sequence feature prediction network with a modified WaveNet vocoder to generate time-domain waveforms from mel-spectrograms.

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio


Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

2025


The paper focuses on detecting audio deepfakes generated by Audio Language Models (ALMs), which produce highly realistic and diverse audio. The authors first create a large-scale dataset called Codecfake, containing over 1 million audio samples (real and fake in English and Chinese), specifically targeting ALM-based audio generated through neural codec-to-waveform processes. They then propose a new training strategy called CSAM (a modified Sharpness-Aware Minimization) to improve model generalization and avoid domain bias. Using this dataset and training approach, they show that detectors can better identify ALM-generated deepfakes and achieve very low error rates, significantly outperforming existing methods.

Transferable Adversarial Attacks on Audio Deepfake Detection


Muhammad Umar Farooq, Awais Khan, Kutub Uddin, Khalid Mahmood Malik

2025


This paper focuses on testing the robustness of audio deepfake detection (ADD) systems against adversarial attacks rather than proposing a new detector. The authors introduce a transferable GAN-based adversarial attack framework that generates highly realistic fake audio designed to fool multiple detection models. Their approach uses an ensemble of surrogate ADD models along with a discriminator to craft attacks that can generalize across different systems (white-box, gray-box, and black-box settings). Additionally, they incorporate a self-supervised audio model to preserve transcription accuracy and perceptual quality, ensuring the adversarial audio remains natural and convincing. Through experiments on datasets like ASVspoof, In-the-Wild, and WaveFake, they demonstrate that state-of-the-art ADD systems are highly vulnerable, with detection accuracy dropping dramatically under attack. Overall, the study reveals critical weaknesses in existing detectors and highlights the urgent need for more robust and secure audio deepfake detection methods.

VAE-VC


VAE-VC is a voice conversion (VC) method that uses a Conditional Deep Hierarchical Variational Autoencoder (CDHVAE) to improve the naturalness and similarity of converted speech without requiring parallel corpora or text transcriptions.

VoCo


VoCo is a text-based audio editing tool that allows users to replace or insert words in audio narrations seamlessly. The system synthesizes new words by stitching together snippets of audio from elsewhere in the narration, making the edited audio sound natural and consistent.

WaveGAN


WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using GANs, generating one-second audio slices with global coherence, suitable for sound effects.

WaveGlow


WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without auto-regression. It combines insights from Glow and WaveNet to achieve this goal.

WaveNet


WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, conditioning the predictive distribution of each audio sample on all previous ones. It achieves state-of-the-art performance in text-to-speech (TTS) and can also model music and other audio modalities.
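
Written out, the autoregressive factorization described above expresses the probability of a waveform x = (x_1, ..., x_T) as a product of per-sample conditionals:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Sampling from this model proceeds one audio sample at a time, since each sample depends on all the samples before it.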

XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark


Ioan-Paul Ciobanu, Andrei-Iulian Hîji, Nicolae Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu

2025


The paper introduces XMAD-Bench, a large-scale multilingual benchmark dataset (668.8 hours) designed to evaluate audio deepfake detectors in realistic, cross-domain conditions. Unlike prior work where training and test data come from the same generative models (in-domain), they deliberately separate speakers, generation methods, and audio sources between training and testing to simulate real-world scenarios. Using this setup, they show that although detectors achieve near-perfect accuracy in in-domain settings, their performance drops drastically—sometimes close to random guessing—in cross-domain evaluations. Overall, they demonstrate that current models lack generalization and emphasize the need for more robust detection methods that work across different languages, speakers, and deepfake techniques.


NSF Award #2346473