VAE-VC

Authors:
Kei Akuzawa, Kotaro Onishi, Keisuke Takiguchi, Kohki Mametani, Koichiro Mori

 

Description:
Novel voice conversion (VC) method using a Conditional Deep Hierarchical Variational Autoencoder (CDHVAE) to improve the naturalness and similarity of converted speech without requiring parallel corpora or text transcriptions.

 

Training and Data:
Trained on the VCTK corpus using utterances from 20 speakers (10 female, 10 male). Mel-spectrograms are used as acoustic features, with MelGAN as the vocoder. Batch size: 8; epochs: 200; β values tested: 1, 10, and 50.
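
A minimal configuration sketch summarizing the training setup described above; the dictionary and its field names are illustrative and not taken from the authors' code.

```python
# Hypothetical summary of the training setup; field names are illustrative.
train_config = {
    "corpus": "VCTK",
    "num_speakers": 20,            # 10 female, 10 male
    "acoustic_features": "mel-spectrogram",
    "vocoder": "MelGAN",           # trained on the same corpus
    "batch_size": 8,
    "epochs": 200,
    "beta_values": [1, 10, 50],    # β-VAE weights compared in the paper
}
```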

 

Advantages:
Requires no parallel corpora or text transcriptions, offers fast conversion, and has high model expressiveness, which leads to improved naturalness and similarity of the converted speech.

 

Limitations:
Performance could likely be improved further by integrating auxiliary losses, and the hyperparameters (e.g., β) may need tuning for optimal performance in different scenarios.

 

Model Architecture:
The model is a deep hierarchical VAE (DHVAE) with hierarchical latent variables, adapted from the NVAE architecture by incorporating Conditional Instance Normalization (CIN) to condition it on speaker labels.
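
As a rough illustration of how CIN conditions the network on speaker identity, a minimal PyTorch sketch is shown below; the layer names and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance normalization whose scale and bias are predicted from a
    speaker embedding (hedged sketch; dimensions are illustrative)."""

    def __init__(self, num_channels: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Per-channel gamma/beta predicted from the speaker embedding.
        self.to_gamma = nn.Linear(speaker_dim, num_channels)
        self.to_beta = nn.Linear(speaker_dim, num_channels)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; spk: (B, speaker_dim) speaker embedding
        gamma = self.to_gamma(spk)[:, :, None, None]
        beta = self.to_beta(spk)[:, :, None, None]
        return gamma * self.norm(x) + beta
```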

 

Dependencies:
Utilizes the β-VAE objective, Conditional Instance Normalization (CIN), and a MelGAN vocoder trained on the same corpus to convert mel-spectrograms to audio.
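
For context, a common form of the β-VAE objective (reconstruction term plus β-weighted KL divergence) is sketched below; the diagonal-Gaussian KL and MSE reconstruction terms are assumptions about the exact implementation.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=10.0):
    """β-VAE objective: reconstruction error + β * KL(q(z|x) || p(z)).
    Assumes a diagonal-Gaussian posterior and an MSE reconstruction term."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```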

 

Voice Conversion:
Voice conversion is performed by encoding the source speech into speaker-invariant latent variables and decoding them with the target speaker's characteristics using a non-autoregressive decoder.
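
A hedged sketch of this conversion step is shown below; `model.encode` and `model.decode` are hypothetical method names standing in for the encoder and the speaker-conditioned, non-autoregressive decoder.

```python
import torch

@torch.no_grad()
def convert(model, src_mel: torch.Tensor, tgt_speaker_id: int) -> torch.Tensor:
    """Encode the source utterance into (ideally speaker-invariant) latents,
    then decode conditioned on the target speaker. Method names are
    hypothetical; the real model uses hierarchical latent variables."""
    latents = model.encode(src_mel)
    target = torch.tensor([tgt_speaker_id], device=src_mel.device)
    converted_mel = model.decode(latents, speaker=target)  # non-autoregressive
    return converted_mel  # a MelGAN vocoder then synthesizes the waveform
```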

 

Dataset:
A 20-speaker subset of the VCTK corpus (10 female and 10 male speakers), with mel-spectrograms extracted from the 48 kHz audio used as acoustic features.
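
A minimal librosa-based sketch of mel-spectrogram extraction is shown below; the FFT, hop, and mel-bin settings are assumptions rather than the authors' exact configuration.

```python
import librosa

def extract_mel(wav_path: str, n_mels: int = 80):
    """Load a VCTK utterance and compute a log mel-spectrogram.
    n_fft/hop_length are illustrative defaults, not the paper's settings."""
    wav, sr = librosa.load(wav_path, sr=None)  # keep the native sample rate
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    return librosa.power_to_db(mel)  # shape: (n_mels, num_frames)
```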

 

Preprocessing:
The input mel-spectrograms are resized to fit the model's input dimensions (C = 1, H = 80, W = T = 40), and speaker labels are embedded with a linear transformation before being fed to CIN.
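
The sketch below shows one way this preprocessing could look in PyTorch: slicing the mel-spectrogram into fixed-width chunks of shape (1, 80, 40) and mapping speaker labels to embeddings via a linear transformation. The chunking strategy and embedding dimension are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def to_model_inputs(mel: torch.Tensor, width: int = 40):
    """Slice a (80, T) mel-spectrogram into chunks of shape (1, 80, 40).
    Slicing is one plausible 'resize'; the paper may use a different scheme."""
    chunks = [
        mel[:, i:i + width].unsqueeze(0)  # add the channel dimension (C = 1)
        for i in range(0, mel.size(1) - width + 1, width)
    ]
    return torch.stack(chunks) if chunks else torch.empty(0, 1, 80, width)

# Speaker labels embedded by a linear transformation of one-hot vectors,
# equivalent to nn.Embedding; the dimension 128 is illustrative.
speaker_embed = nn.Embedding(num_embeddings=20, embedding_dim=128)
```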

 

Evaluation Metrics:
Mean Opinion Scores (MOS) for naturalness and similarity, rated by human subjects on Amazon Mechanical Turk, along with measurements of KL divergence and reconstruction error.

 

Performance:
The proposed CDHVAE achieved MOS above 3.5 for both naturalness and similarity in inter-gender settings, outperforming existing autoencoder-based VC methods.

 

Contributions:
Introduced CDHVAE, a highly expressive VAE model for voice conversion; demonstrated the importance of model expressiveness and the β-VAE objective; and provided a thorough evaluation of the model's performance.
Link to paper


Last Accessed: 7/18/2024

NSF Award #2346473