Generation Methods


AdaSpeech


AdaSpeech is an adaptive text-to-speech (TTS) system designed to customize new voices efficiently and with high quality. It addresses the challenges of diverse acoustic conditions and memory efficiency in custom voice adaptation.

Char2Wav


Char2Wav is an end-to-end model for speech synthesis with two main components: a reader (an encoder-decoder model with attention) and a neural vocoder (a conditional SampleRNN). The reader converts text or phonemes to vocoder acoustic features, and the neural vocoder generates raw waveform samples from those features.


CycleGAN-VC


CycleGAN-VC is a voice conversion method that does not rely on parallel data, using Cycle-Consistent Adversarial Networks (CycleGANs) with gated convolutional neural networks (CNNs) and an identity-mapping loss.
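
As a rough illustration of the training objective, here is a minimal PyTorch sketch of the cycle-consistency and identity-mapping terms; the generator names (G_xy, G_yx) and the loss weights are assumptions for illustration, not values from the paper.

```python
import torch.nn.functional as F

def cyclegan_vc_losses(G_xy, G_yx, x, y, lambda_cyc=10.0, lambda_id=5.0):
    """Sketch of the CycleGAN-VC generator-side losses (adversarial term omitted).
    x, y: acoustic feature tensors from the source and target speakers."""
    fake_y = G_xy(x)                    # translate source -> target
    cycle_x = G_yx(fake_y)              # translate back to the source domain
    loss_cyc = F.l1_loss(cycle_x, x)    # a round trip should recover the input
    loss_id = F.l1_loss(G_xy(y), y)     # target input should pass through unchanged
    return lambda_cyc * loss_cyc + lambda_id * loss_id
```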

Deep Voice 3


Deep Voice 3 is a fully-convolutional, attention-based neural text-to-speech (TTS) system designed to match state-of-the-art naturalness in synthesized speech while training significantly faster than other models. It scales to large datasets and can handle multiple speakers.

DenoiSpeech


DenoiSpeech is a text-to-speech (TTS) system designed to synthesize clean speech from noisy speech data. It uses a fine-grained frame-level noise condition module that models noise at a detailed level, outperforming previous methods that use coarse-grained or pre-enhanced speech data.

FastSpeech 2


FastSpeech 2 is a non-autoregressive text-to-speech (TTS) model designed to synthesize speech faster and with higher quality than previous models. It addresses the limitations of FastSpeech by training directly on ground-truth mel-spectrograms rather than on a teacher model's distilled outputs, and by conditioning on additional speech variation information such as pitch, energy, and more accurate duration.
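
A minimal PyTorch sketch of the kind of variance predictor FastSpeech 2 uses for pitch, energy, and duration; the layer sizes and the omission of layer normalization and dropout are simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar (e.g., pitch or energy) per input position from the
    encoder's hidden states, in the spirit of FastSpeech 2's variance adaptor."""
    def __init__(self, dim=256, kernel=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(dim, 1, kernel, padding=kernel // 2)

    def forward(self, h):  # h: (batch, dim, length)
        return self.conv2(torch.relu(self.conv1(h))).squeeze(1)  # (batch, length)
```

During training the predictors are supervised with ground-truth pitch, energy, and duration values; at inference their predictions are embedded and added back into the hidden sequence before decoding.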

Glow-TTS


Glow-TTS is a flow-based generative model for parallel text-to-speech (TTS) synthesis that eliminates the need for external aligners by internally learning the alignment between text and the latent representation of speech. It combines the properties of flows and dynamic programming to search for the most probable monotonic alignment, which allows for robust, diverse, and controllable speech synthesis.
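The alignment search can be sketched as a simple dynamic program over a grid of frame-wise log-likelihoods; the NumPy code below is an illustrative, unoptimized version of that idea, not the paper's implementation.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Find the most probable monotonic alignment over a (text_len, mel_len)
    grid of log-likelihoods via dynamic programming (assumes mel_len >= text_len)."""
    T, S = log_p.shape
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for s in range(1, S):
        for t in range(T):
            stay = Q[t, s - 1]                            # keep the same token
            move = Q[t - 1, s - 1] if t > 0 else -np.inf  # advance to the next token
            Q[t, s] = log_p[t, s] + max(stay, move)
    align = np.zeros(S, dtype=int)  # token index assigned to each mel frame
    t = T - 1                       # the path must end on the last token
    for s in range(S - 1, -1, -1):
        align[s] = t
        if t > 0 and s > 0 and Q[t - 1, s - 1] >= Q[t, s - 1]:
            t -= 1
    return align
```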

HiFi-GAN


HiFi-GAN is a generative adversarial network for speech synthesis that achieves high-fidelity audio generation with efficient computational performance. It leverages periodic pattern modeling to enhance sample quality and demonstrates significant improvements over previous models like WaveNet and WaveGlow.
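The periodic pattern modeling can be illustrated by the input reshaping used in front of HiFi-GAN's multi-period discriminators; the PyTorch sketch below shows only that reshaping, with simplified padding.

```python
import torch
import torch.nn.functional as F

def to_period_grid(wav, period):
    """Reshape a waveform (batch, samples) into (batch, 1, frames, period) so a
    2-D discriminator sees samples spaced `period` steps apart in each column,
    in the spirit of HiFi-GAN's multi-period discriminators."""
    b, n = wav.shape
    pad = (period - n % period) % period  # pad so the length divides evenly
    wav = F.pad(wav, (0, pad))
    return wav.view(b, 1, -1, period)
```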

LightSpeech


LightSpeech is a lightweight and fast text-to-speech (TTS) model designed using neural architecture search (NAS) to achieve smaller memory usage and lower inference latency, suitable for deployment in resource-constrained devices like mobile phones and embedded systems.

MelGAN


MelGAN is a non-autoregressive feed-forward convolutional architecture designed for audio waveform generation in a GAN setup, aimed at generating high-quality coherent waveforms for tasks such as speech synthesis and music domain translation.
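To make the feed-forward design concrete, here is a heavily simplified PyTorch generator in the MelGAN style: transposed convolutions upsample an 80-band mel-spectrogram to the waveform sampling rate. The channel counts, kernel sizes, upsampling factors, and the missing residual blocks are assumptions for brevity, not the paper's configuration.

```python
import torch.nn as nn

class TinyMelGANGenerator(nn.Module):
    """Illustrative MelGAN-style generator: maps a mel-spectrogram of shape
    (batch, 80, frames) to a waveform of shape (batch, 1, frames * 256)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(80, 256, kernel_size=7, padding=3),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):
        return self.net(mel)
```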

MelNet


MelNet is a generative model for audio in the frequency domain. It operates on time-frequency representations (spectrograms) instead of time-domain waveforms, using a highly expressive probabilistic model and a multiscale generation procedure to capture both local and global audio structure.

Parallel WaveGAN


Parallel WaveGAN is a distillation-free, fast, and small-footprint waveform generation method based on a GAN. It jointly optimizes multi-resolution spectrogram and adversarial loss functions, capturing the time-frequency distribution of realistic speech waveforms. The model is compact, with only 1.44M parameters, and generates 24 kHz speech 28.68 times faster than real time on a single GPU.
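
The multi-resolution spectrogram objective can be sketched as an average of STFT-domain losses at several analysis settings; the PyTorch code below is illustrative, and the particular FFT, hop, and window sizes are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def stft_loss(x, y, fft_size, hop, win):
    """Spectral convergence + log-magnitude L1 at one STFT resolution (sketch)."""
    window = torch.hann_window(win, device=x.device)
    X = torch.stft(x, fft_size, hop, win, window=window, return_complex=True).abs()
    Y = torch.stft(y, fft_size, hop, win, window=window, return_complex=True).abs()
    sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")
    mag = F.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))
    return sc + mag

def multi_resolution_stft_loss(x, y):
    """Average the single-resolution loss over several FFT configurations."""
    configs = [(1024, 256, 1024), (2048, 512, 2048), (512, 128, 512)]
    return sum(stft_loss(x, y, f, h, w) for f, h, w in configs) / len(configs)
```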

StarGAN-VC


StarGAN-VC is a method that enables non-parallel many-to-many voice conversion (VC) using a Star Generative Adversarial Network (StarGAN).
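
The many-to-many design rests on conditioning a single generator on a target-speaker code. The PyTorch sketch below shows one common way to do that, broadcasting a one-hot code over time; this is an illustrative scheme, not necessarily the paper's exact conditioning mechanism.

```python
import torch
import torch.nn.functional as F

def condition_on_speaker(features, speaker_id, n_speakers):
    """Concatenate a one-hot target-speaker code to the acoustic features so one
    generator can map any source speaker to any target speaker."""
    b, c, t = features.shape
    code = F.one_hot(speaker_id, n_speakers).float()    # (batch, n_speakers)
    code = code.unsqueeze(-1).expand(b, n_speakers, t)  # repeat along time
    return torch.cat([features, code], dim=1)           # (batch, c + n_speakers, t)
```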

Tacotron 2


Tacotron 2 is a neural network architecture for speech synthesis directly from text. It combines a recurrent sequence-to-sequence feature prediction network with a modified WaveNet vocoder to generate time-domain waveforms from mel-spectrograms.

VAE-VC


VAE-VC is a voice conversion (VC) method that uses a Conditional Deep Hierarchical Variational Autoencoder (CDHVAE) to improve the naturalness and similarity of converted speech without requiring parallel corpora or text transcriptions.
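
At its core the approach optimizes a variational objective; the sketch below shows a plain VAE loss (reconstruction plus KL divergence), leaving out the speaker conditioning and hierarchical latents that the full CDHVAE adds.

```python
import torch
import torch.nn.functional as F

def vae_vc_loss(x, x_hat, mu, log_var):
    """Plain VAE objective: reconstruction error plus a KL term that keeps the
    approximate posterior close to a standard normal prior."""
    recon = F.l1_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```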

VoCo


VoCo is a text-based audio editing tool that allows users to replace or insert words in audio narrations seamlessly. The system synthesizes new words by stitching together snippets of audio from elsewhere in the narration, making the edited audio sound natural and consistent.

WaveGAN


WaveGAN is an approach to unsupervised synthesis of raw-waveform audio using GANs, generating one-second audio slices with global coherence, suitable for sound effects.

WaveGlow


WaveGlow is a flow-based neural network designed to generate high-quality speech from mel-spectrograms efficiently and without autoregression. It combines insights from Glow and WaveNet to achieve this goal.
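
The building block that makes the flow invertible is the affine coupling layer; here is an illustrative PyTorch version, where `wn` stands in for WaveGlow's WaveNet-like transform and the tensor shapes are left generic.

```python
import torch

def affine_coupling(x_a, x_b, wn):
    """One affine coupling step (sketch): x_a passes through untouched and
    parameterizes an invertible scale/shift of x_b. `wn` is any network
    returning a (log_s, t) pair of tensors shaped like x_b."""
    log_s, t = wn(x_a)
    y_b = x_b * torch.exp(log_s) + t
    log_det = log_s.sum()  # contribution to the flow's log-determinant
    return x_a, y_b, log_det
```

Because x_a is unchanged, the transform can be inverted exactly by recomputing (log_s, t) from x_a and solving for x_b, which is what allows both fast sampling and exact likelihood training.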

WaveNet


WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, conditioning the predictive distribution of each audio sample on all previous ones. It achieves state-of-the-art performance in text-to-speech (TTS) and can also model music and other audio modalities.
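The autoregressive structure comes from stacks of dilated causal convolutions; the PyTorch sketch below shows that core idea with plain ReLU activations in place of WaveNet's gated units and skip connections, and the channel and layer counts are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedStack(nn.Module):
    """Core of a WaveNet-style model (sketch): dilated causal convolutions whose
    receptive field doubles per layer, so each output depends only on the past."""
    def __init__(self, channels=32, layers=8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]  # left-pad only, to stay causal
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x
```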

NSF Award #2346473