Baidu Silicon Valley AI Lab cloned audio (Neural Voice Cloning with a Few Samples)

Authors:
Sercan Ö. Arık (sercanarik@baidu.com)
Jitong Chen (chenjitong01@baidu.com)
Kainan Peng (pengkainan@baidu.com)
Wei Ping (pingwei01@baidu.com)
Yanqi Zhou (yanqiz@baidu.com)

Institution:
Baidu Research, 1195 Bordeaux Dr. Sunnyvale, CA 94089

 

Abstract:
Voice cloning is a highly desired feature for personalized speech interfaces. This paper introduces a neural voice cloning system that learns to synthesize a person’s voice from only a few audio samples. Two approaches are studied: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a multi-speaker generative model, while speaker encoding trains a separate model to infer a new speaker embedding directly from the cloning audio. Both approaches achieve good naturalness and similarity to the original speaker, even with only a few cloning samples. Speaker adaptation offers slightly better naturalness and similarity, but speaker encoding is more favorable for low-resource deployment because it requires less cloning time and memory.
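
For illustration, the sketch below (Python/PyTorch) contrasts the control flow of the two cloning strategies described above. The model objects, the reconstruction_loss method, and the mean pooling of per-sample embeddings are hypothetical placeholders, not the paper's actual architecture; this is a minimal sketch of the idea only.

import torch

def clone_by_adaptation(tts_model, optimizer, cloning_batch, n_iters=100):
    # Speaker adaptation: fine-tune the pre-trained multi-speaker generative
    # model (or only its speaker embedding) on the few cloning samples.
    tts_model.train()
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = tts_model.reconstruction_loss(cloning_batch)  # hypothetical API
        loss.backward()
        optimizer.step()
    return tts_model

def clone_by_encoding(speaker_encoder, cloning_audios):
    # Speaker encoding: infer a speaker embedding directly from the cloning
    # audios with a separately trained encoder; no fine-tuning is required.
    with torch.no_grad():
        per_sample = [speaker_encoder(a) for a in cloning_audios]
        return torch.stack(per_sample).mean(dim=0)  # simple pooling, for illustration only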

 

Data Creation Method:
The dataset is built from two multi-speaker corpora, LibriSpeech and VCTK, which are used to train a generative model capable of synthesizing speech in the voice of any speaker from a few audio samples. For each experiment, the cloning audio samples are randomly drawn from these corpora.
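
As a minimal sketch of the cloning-sample selection, the snippet below randomly draws a fixed number of audio files per speaker. The directory layout (one sub-directory of .wav files per speaker) and the sample count are assumptions for illustration, not the actual organization of LibriSpeech or VCTK.

import random
from pathlib import Path

def sample_cloning_audios(dataset_root, n_samples=10, seed=0):
    # Return a mapping: speaker id -> n_samples randomly chosen cloning files.
    rng = random.Random(seed)
    cloning_sets = {}
    for speaker_dir in sorted(Path(dataset_root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        wavs = sorted(speaker_dir.glob("*.wav"))
        if len(wavs) >= n_samples:
            cloning_sets[speaker_dir.name] = rng.sample(wavs, n_samples)
    return cloning_sets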

 

Number of Speakers:

  • LibriSpeech: 2484 speakers
  • VCTK: 108 speakers

Total Size:

  • 6 hours

Number of Real Samples:

  • 10

Number of Fake Samples:

  • 120

Description of the Dataset:

  • The dataset consists of audio recordings and their corresponding transcriptions used to train and evaluate the neural voice cloning system. Given a few audio samples from a previously unseen speaker, the system synthesizes speech that mimics that speaker's voice. The dataset includes both training data for the multi-speaker generative model and cloning samples for evaluation.

 

Extra Details:
The cloning system supports two techniques: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a pre-trained multi-speaker model on the cloning samples, while speaker encoding estimates the speaker embedding directly from the cloning audio. The models are evaluated for naturalness and speaker similarity using both subjective human evaluations and automated metrics such as speaker classification accuracy and equal error rate (EER).
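
For the EER metric mentioned above, the following is a minimal NumPy sketch that estimates the equal error rate from similarity scores of same-speaker ("genuine") and different-speaker ("impostor") verification trials; the scores in the example are made up.

import numpy as np

def equal_error_rate(genuine, impostor):
    # EER: the operating point where false-accept and false-reject rates are equal.
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Example: well-separated scores give a low EER.
print(equal_error_rate([0.9, 0.8, 0.85], [0.2, 0.3, 0.1]))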

 

Data Type:

  • Audio recordings and transcriptions

Average Length:

  • 2

Keywords:

  • Voice Cloning, Speech Synthesis, Speaker Adaptation, Speaker Encoding, Neural Networks

When Published:

  • 20th February 2018

 

Annotation Process:
Annotations consist of the transcriptions paired with the audio used to train the multi-speaker generative model, along with the cloned audio samples produced by the system. These annotations support evaluation of the naturalness and speaker similarity of the synthesized speech.
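
A minimal sketch of how audio files and transcriptions might be paired for training and evaluation is given below; the pipe-separated manifest format ("path|transcription") is an assumption for illustration, not the native annotation format of LibriSpeech or VCTK.

from dataclasses import dataclass
from pathlib import Path

@dataclass
class Utterance:
    audio_path: Path
    transcription: str

def load_manifest(manifest_path):
    # Each non-empty manifest line pairs one audio file with its transcription.
    utterances = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        path, sep, text = line.partition("|")
        if not sep or not path.strip():
            continue  # skip blank or malformed lines
        utterances.append(Utterance(Path(path.strip()), text.strip()))
    return utterances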

 

Usage Scenarios:
Voice cloning for personalized speech interfaces, creating expressive audiobooks, generating voices for virtual assistants, and research in speech synthesis and speaker adaptation.

 

Data Accessibility:
Publicly accessible

 

Miscellaneous Information:
The dataset supports advanced voice cloning techniques such as speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a pre-trained model, while speaker encoding estimates speaker embeddings directly from audio samples. The models are evaluated using mean opinion scores (MOS) and speaker verification metrics.
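
As a small worked example of the MOS evaluation, the sketch below averages listener ratings on the usual 1-5 scale and reports a normal-approximation 95% confidence interval; the ratings here are illustrative, not results from the paper.

import numpy as np

def mos_with_ci(ratings, z=1.96):
    # Mean opinion score with a normal-approximation confidence half-width.
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return mean, half_width

mean, hw = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
print(f"MOS = {mean:.2f} +/- {hw:.2f}")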

 

Credits:
Datasets Used:

  • LibriSpeech
  • VCTK

Speech Synthesis Models Referenced:

  • Deep Voice 3

Dataset Link

Main Paper Link

License Link

Last Accessed: 6/20/2024

NSF Award #2346473