Generation Methods

This page lists research work that explores the generation of audio deepfakes and voice spoofing, using methods that include text-to-speech (TTS) and voice conversion (VC).

Acoustic Feature Analysis of Audio Deepfake Generation Methods

Analyzes acoustic artifacts in spoofed audio by comparing bona fide recordings with outputs from four VC and four TTS models, using time-domain and spectral features (waveform, RMS energy, spectral centroid, MFCC, LFCC, etc.) on a controlled, matched dataset. Finds that MFCC and LFCC are the most effective features for distinguishing real from fake speech, with notable differences between VC and TTS methods and across individual generation models.
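
As a rough illustration of two of the simpler features named above, the sketch below computes frame-level RMS energy and spectral centroid with NumPy. The frame length, hop size, Hann window, and function names are assumptions made for this example, not settings taken from the paper; MFCC and LFCC need a filterbank and a DCT and are left out.

```python
import numpy as np

def rms_energy(signal, frame_len=1024, hop=512):
    """Root-mean-square energy of each frame (time-domain feature)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

def spectral_centroid(signal, sample_rate, frame_len=1024, hop=512):
    """Magnitude-weighted mean frequency of each frame (spectral feature)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    centroids = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        centroids[i] = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return centroids

# Toy input: one second of a 1 kHz tone sampled at 16 kHz (stand-in for speech).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
energy = rms_energy(tone)
centroid = spectral_centroid(tone, sr)
```

In a comparison like the one described, the same functions would be applied to a bona fide recording and its spoofed counterpart, and the per-frame values compared.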

AdaSpeech

AdaSpeech is an adaptive text-to-speech (TTS) system designed to customize new voices efficiently and with high quality. It addresses the challenges of diverse acoustic conditions and memory efficiency in custom voice adaptation.

Audio Deepfake Detection Against Modern TTS Architectures

Generates 12,000 synthetic speech samples using three 2024–2025 TTS systems (Dia2, Maya1, MeloTTS) from DailyDialog text, then benchmarks four detection frameworks — Whisper-MesoNet, SSL-AASIST, XLS-R-SLS, and UncovAI’s proprietary model — across semantic, structural, and signal-level perspectives to assess cross-architecture generalization.

Audio Deepfakes: Generation Methods, Applications, and Misuse Risks

Systematic review of audio deepfake generation techniques (TTS and voice conversion), covering deep learning architectures including autoregressive, GAN-based, transformer-based, and diffusion models, alongside an analysis of legitimate applications and malicious misuse scenarios.

Char2Wav

The paper presents Char2Wav, an end-to-end model for speech synthesis with two main components: a reader (encoder-decoder model with attention) and a neural vocoder (conditional SampleRNN). The reader converts text or phonemes to vocoder acoustic features, and the neural vocoder generates raw waveform samples.

Codecfake

Codecfake is a large-scale dataset of over 1 million audio samples (real and fake, in English and Chinese) that specifically targets audio language model (ALM)-based audio generated through neural codec-to-waveform processes.

CycleGAN-VC

A voice conversion method that does not rely on parallel data, using Cycle-Consistent Adversarial Networks (CycleGANs) with gated convolutional neural networks (CNNs) and an identity-mapping loss.
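
The two loss terms mentioned above can be sketched as follows. This is a minimal NumPy illustration using an L1 distance; the generators `g_xy` and `g_yx` are hypothetical callables standing in for the gated-CNN generators, and the adversarial loss and loss weights are omitted.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two feature arrays."""
    return float(np.mean(np.abs(a - b)))

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Converting to the other speaker and back should recover the input."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)

def identity_mapping_loss(x, y, g_xy, g_yx):
    """A generator fed features already in its target domain should leave them unchanged."""
    return l1(g_xy(y), y) + l1(g_yx(x), x)

# Toy example: "speakers" differ by a constant offset in feature space.
x = np.zeros((4, 3))           # source-speaker features (frames x dims)
y = np.ones((4, 3))            # target-speaker features
to_y = lambda f: f + 1.0       # hypothetical source-to-target generator
to_x = lambda f: f - 1.0       # hypothetical target-to-source generator
cycle = cycle_consistency_loss(x, y, to_y, to_x)
identity = identity_mapping_loss(x, y, to_y, to_x)
```

The toy generators are perfect inverses, so the cycle loss is zero, while the identity loss is nonzero because each generator still shifts inputs that are already in its target domain.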

Deep Voice 3

Deep Voice 3 is a fully-convolutional, attention-based neural text-to-speech (TTS) system designed to match state-of-the-art naturalness in synthesized speech while training significantly faster than other models. It scales to large datasets and can handle multiple speakers.

DenoiSpeech

DenoiSpeech is a text-to-speech (TTS) system designed to synthesize clean speech from noisy speech data. It uses a fine-grained frame-level noise condition module that models noise at a detailed level, outperforming previous methods that use coarse-grained or pre-enhanced speech data.

FastSpeech 2

FastSpeech 2 is a non-autoregressive text-to-speech (TTS) model designed to synthesize speech faster and with higher quality than previous models. It addresses the limitations of FastSpeech by using ground-truth data for training and incorporating additional speech variation information such as pitch, energy, and accurate duration.

Glow-TTS

Glow-TTS is a flow-based generative model for parallel text-to-speech (TTS) synthesis that eliminates the need for external aligners by internally learning the alignment between text and the latent representation of speech. It combines the properties of flows and dynamic programming to search for the most probable monotonic alignment, which allows for robust, diverse, and controllable speech synthesis.
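
The dynamic-programming search for the most probable monotonic alignment can be sketched as below. This is a simplified NumPy version under stated assumptions: `log_p[i, j]` is taken to be the log-likelihood of speech frame `j` under text token `i`, every token must cover at least one frame, and no token may be skipped. The function name and toy values are illustrative only.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Return, for each frame j, the token index on the best monotonic path."""
    n_text, n_frames = log_p.shape
    assert n_frames >= n_text, "each token needs at least one frame"
    # q[i, j]: best cumulative log-likelihood of a path ending at token i, frame j.
    q = np.full((n_text, n_frames), -np.inf)
    q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = q[i, j - 1]                                 # same token as previous frame
            advance = q[i - 1, j - 1] if i > 0 else -np.inf    # move on to the next token
            q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the last token at the last frame.
    path = np.zeros(n_frames, dtype=int)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1, j - 1] >= q[i, j - 1]:
            i -= 1
    return path

# Toy case: token 0 explains the first two frames well, token 1 the last two.
log_p = np.log(np.array([[0.9, 0.9, 0.1, 0.1],
                         [0.1, 0.1, 0.9, 0.9]]))
alignment = monotonic_alignment_search(log_p)
```

In the toy case the search assigns the first two frames to token 0 and the last two to token 1.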

HiFi-GAN

HiFi-GAN is a generative adversarial network for speech synthesis that achieves high-fidelity audio generation with efficient computational performance. It leverages periodic pattern modeling to enhance sample quality and demonstrates significant improvements over previous models like WaveNet and WaveGlow.

LightSpeech

LightSpeech is a lightweight and fast text-to-speech (TTS) model designed using neural architecture search (NAS) to achieve smaller memory usage and lower inference latency, suitable for deployment in resource-constrained devices like mobile phones and embedded systems.

MelGAN

MelGAN is a non-autoregressive feed-forward convolutional architecture designed for audio waveform generation in a GAN setup, aimed at generating high-quality coherent waveforms for tasks such as speech synthesis and music domain translation.

MelNet

The paper presents MelNet, a generative model for audio in the frequency domain. MelNet leverages time-frequency representations (spectrograms) instead of time-domain waveforms, using a highly expressive probabilistic model and a multiscale generation procedure to capture both local and global audio structures.

Parallel WaveGAN

Proposes Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method based on a GAN. It jointly optimizes multi-resolution spectrogram and adversarial loss functions, which lets it capture the time-frequency distribution of realistic speech waveforms. The model is compact, with only 1.44M parameters, and generates 24 kHz speech 28.68 times faster than real time on a single GPU.
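
A multi-resolution spectrogram loss of the kind described can be sketched as follows. This NumPy version averages a spectral-convergence term and a log-magnitude term over several STFT settings; the specific FFT sizes, hops, window lengths, and the Hann window are illustrative assumptions, and the adversarial term is not shown.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop, win_len):
    """Magnitude spectrogram from Hann-windowed frames (no padding at the edges)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=1)) + 1e-7  # floor keeps log finite

def stft_loss(pred, target, fft_size, hop, win_len):
    """Spectral convergence plus mean log-magnitude distance at one resolution."""
    p = stft_magnitude(pred, fft_size, hop, win_len)
    t = stft_magnitude(target, fft_size, hop, win_len)
    spectral_convergence = np.linalg.norm(t - p) / np.linalg.norm(t)
    log_magnitude = np.mean(np.abs(np.log(t) - np.log(p)))
    return float(spectral_convergence + log_magnitude)

def multi_resolution_stft_loss(pred, target,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average the single-resolution loss over (fft_size, hop, win_len) settings."""
    return float(np.mean([stft_loss(pred, target, *r) for r in resolutions]))

# Toy waveforms standing in for generated and reference speech.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4800)
generated = rng.standard_normal(4800)
loss_same = multi_resolution_stft_loss(reference, reference)
loss_diff = multi_resolution_stft_loss(generated, reference)
```

Combining several resolutions keeps the generator from fitting a single fixed time-frequency trade-off.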

PhonemeDF

PhonemeDF pairs real LibriSpeech speech with synthetic versions generated by four TTS and three voice conversion systems (~730 hours, ~200k utterances). Phoneme boundaries are extracted using the Montreal Forced Aligner (MFA), and each phoneme segment is independently parameterized using handcrafted features (Log-Mel Spectrograms, LFCC) and self-supervised embeddings (WavLM, wav2vec 2.0). Symmetric KLD measures how closely synthetic phonemes match real ones, while Logistic Regression and SVM classifiers evaluate per-phoneme detection accuracy.
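
As a sketch of the symmetric KLD idea, the example below compares two discrete distributions, such as normalized histograms of a feature for one phoneme. It assumes the symmetric form is the sum of the two directed divergences; how the dataset authors estimate the distributions and whether they sum or average is not stated here, so the function names and toy histograms are illustrative only.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with smoothing to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kld(p, q):
    """Sum of both directed divergences, so the order of arguments does not matter."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Toy histograms of one feature for a real and a synthetic phoneme segment.
real_hist = [0.1, 0.4, 0.4, 0.1]
fake_hist = [0.3, 0.2, 0.2, 0.3]
distance = symmetric_kld(real_hist, fake_hist)
```

A value near zero would indicate that the synthetic phoneme's feature distribution closely matches the real one.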

StarGAN-VC

A method that allows non-parallel many-to-many voice conversion (VC) using Star Generative Adversarial Networks (StarGAN).

Tacotron 2

Tacotron 2 is a neural network architecture for speech synthesis directly from text. It combines a recurrent sequence-to-sequence feature prediction network with a modified WaveNet vocoder to generate time-domain waveforms from mel-spectrograms.

Transferable GAN-Based Adversarial Attacks on Audio Deepfake Detection

A transferable GAN-based adversarial attack framework that uses an ensemble of surrogate audio deepfake detection (ADD) models and a discriminator to generate realistic adversarial audio. A Wav2Vec transcription model with a BERT-based cosine similarity loss preserves semantic and perceptual integrity, enabling attacks that generalize across white-box, gray-box, and black-box detection systems.

VAE-VC

Novel voice conversion (VC) method using a Conditional Deep Hierarchical Variational Autoencoder (CDHVAE) to improve the naturalness and similarity of converted speech without requiring parallel corpora or text transcriptions.

VoCo

VoCo is a text-based audio editing tool that allows users to replace or insert words in audio narrations seamlessly. The system synthesizes new words by stitching together snippets of audio from elsewhere in the narration, making the edited audio sound natural and consistent.

WaveGAN

WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using GANs, generating one-second audio slices with global coherence, suitable for sound effects.

WaveGlow

WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without auto-regression. It combines insights from Glow and WaveNet to achieve this goal.

WaveNet

WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, conditioning the predictive distribution of each audio sample on all previous ones. It achieves state-of-the-art performance in text-to-speech (TTS) and can also model music and other audio modalities.
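
The autoregressive conditioning described above corresponds to the standard chain-rule factorization of the joint probability of a waveform. Here the symbols $x_1, \dots, x_T$ denote the individual audio samples, a notation introduced for this illustration:

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor is the predictive distribution of one sample given all earlier ones, which is why generation proceeds one sample at a time.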

XMAD-Bench

XMAD-Bench is a large-scale multilingual benchmark dataset (668.8 hours) designed to evaluate audio deepfake detectors in realistic, cross-domain conditions.

NSF Award #2346473
