Authors:
Sercan Ö. Arık (sercanarik@baidu.com)
Jitong Chen (chenjitong01@baidu.com)
Kainan Peng (pengkainan@baidu.com)
Wei Ping (pingwei01@baidu.com)
Yanqi Zhou (yanqiz@baidu.com)
Institution:
Baidu Research, 1195 Bordeaux Dr., Sunnyvale, CA 94089
Abstract:
Voice cloning is a highly desired feature for personalized speech interfaces. This paper introduces a neural voice cloning system that learns to synthesize a person’s voice from only a few audio samples. Two approaches are studied: speaker adaptation and speaker encoding. Speaker adaptation fine-tunes a multi-speaker generative model on the cloning samples, while speaker encoding trains a separate model to infer a new speaker embedding directly from the cloning audio. Both approaches achieve good naturalness and similarity to the original speaker, even with very few cloning samples. Speaker adaptation offers slightly better naturalness and similarity, but speaker encoding is more favorable for low-resource deployment because it requires less cloning time and memory.
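As a rough illustration of the two approaches, the sketch below contrasts speaker adaptation (gradient fine-tuning of a pretrained model) with speaker encoding (a separate network that predicts the embedding with no fine-tuning at cloning time). All names and architecture details here (MultiSpeakerTTS, SpeakerEncoder, adapt, loss_fn, the embedding size) are illustrative assumptions, not the paper's actual implementation, which builds on Deep Voice 3.

```python
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    """Stand-in for a multi-speaker generative model.

    Speaker identity enters through a learned embedding; the text encoder,
    attention, and audio decoder are omitted for brevity.
    """
    def __init__(self, num_speakers: int, embedding_dim: int = 512):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, embedding_dim)

def adapt(model: MultiSpeakerTTS, cloning_batch, loss_fn, steps: int = 100):
    """Speaker adaptation: fine-tune on the few cloning samples.

    The paper considers adapting either the whole model or only the speaker
    embedding; this sketch fine-tunes just the embedding table. `loss_fn` is
    a placeholder for the TTS reconstruction loss.
    """
    opt = torch.optim.Adam(model.speaker_embedding.parameters(), lr=1e-4)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model, cloning_batch).backward()
        opt.step()

class SpeakerEncoder(nn.Module):
    """Speaker encoding: map cloning audio directly to a speaker embedding."""
    def __init__(self, n_mels: int = 80, embedding_dim: int = 512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, embedding_dim)

    def forward(self, mels):                 # mels: (num_clones, frames, n_mels)
        _, h = self.rnn(mels)                # final hidden state per sample
        e = self.proj(h[-1])                 # one embedding per cloning sample
        return e.mean(dim=0, keepdim=True)   # average across cloning samples
```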
Data Creation Method:
The multi-speaker generative model is trained on the LibriSpeech and VCTK multi-speaker datasets and learns to synthesize speech in the voice of a target speaker from only a few audio samples. The cloning audios for each experiment are randomly sampled from these datasets.
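For concreteness, drawing cloning samples for a held-out speaker might look like the sketch below. The directory layout and function name are assumptions for illustration (VCTK ships audio as wav48/&lt;speaker&gt;/&lt;utt&gt;.wav; LibriSpeech is organized differently), and the sample counts mirror the paper's sweeps over small numbers of cloning audios.

```python
import random
from pathlib import Path

def sample_cloning_audios(dataset_root: str, speaker_id: str,
                          num_samples: int, seed: int = 0) -> list[Path]:
    """Randomly draw `num_samples` cloning audios for one held-out speaker.

    Assumes a <root>/<speaker_id>/*.wav layout, a simplification of how
    VCTK/LibriSpeech are actually organized on disk.
    """
    files = sorted(Path(dataset_root, speaker_id).glob("*.wav"))
    rng = random.Random(seed)          # fixed seed -> reproducible experiments
    return rng.sample(files, num_samples)

# e.g. draw 5 cloning samples for one VCTK speaker (paths are illustrative)
clones = sample_cloning_audios("VCTK-Corpus/wav48", "p225", num_samples=5)
```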
Number of Speakers:
- LibriSpeech: 2484 speakers
- VCTK: 108 speakers
Total Size:
- 6 hours
Number of Real Samples:
- 10
Number of Fake Samples:
- 120
Description of the Dataset:
- The dataset consists of audio recordings and their corresponding transcriptions, used to train and evaluate the neural voice cloning system. Given a few audio samples from an unseen speaker, the system synthesizes speech that mimics that speaker's voice. The dataset includes both training data for the multi-speaker generative model and cloning samples for evaluation.
Extra Details:
The system supports two cloning techniques: speaker adaptation, which fine-tunes a pre-trained multi-speaker model, and speaker encoding, which estimates the speaker embedding directly from cloning audios. Models are evaluated on naturalness and speaker similarity using both subjective human evaluations and automated metrics such as speaker classification accuracy and equal error rate (EER).
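EER can be computed from same-speaker/different-speaker trial scores using the standard ROC-based definition, as in this sketch; the trial labels and scores below are made-up toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept rate equals false-reject rate.

    `labels` are 1 for same-speaker trials, 0 for different-speaker trials;
    `scores` are similarity scores from a speaker verification model.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy trial list: e.g. cosine similarities between cloned and real embeddings
labels = [1, 1, 1, 0, 0, 0]
scores = [0.82, 0.74, 0.61, 0.40, 0.55, 0.33]
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```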
Data Type:
- Audio recordings and transcriptions
Average Length:
- 2
Keywords:
- Voice Cloning, Speech Synthesis, Speaker Adaptation, Speaker Encoding, Neural Networks
When Published:
- 20th February 2018
Annotation Process:
Annotations consist of the transcriptions paired with the audio used to train the multi-speaker generative model and with the cloning audios. They are required for training and for evaluating the naturalness and speaker similarity of the synthesized speech.
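A minimal sketch of assembling audio/transcription pairs, assuming a VCTK-style layout where wav48/p225/p225_001.wav is paired with txt/p225/p225_001.txt; LibriSpeech groups its transcripts differently, so this helper is illustrative only.

```python
from pathlib import Path

def load_pairs(wav_dir: str, txt_dir: str):
    """Pair each audio file with its transcription (VCTK-style layout assumed)."""
    pairs = []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        txt = Path(txt_dir, wav.parent.name, wav.stem + ".txt")
        if txt.exists():                      # skip utterances without a transcript
            pairs.append((wav, txt.read_text().strip()))
    return pairs
```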
Usage Scenarios:
Voice cloning for personalized speech interfaces, creating expressive audiobooks, generating voices for virtual assistants, and research in speech synthesis and speaker adaptation.
Data Accessibility:
Publicly accessible
Miscellaneous Information:
The dataset supports advanced voice cloning techniques, namely speaker adaptation and speaker encoding: speaker adaptation fine-tunes a pre-trained model, while speaker encoding estimates speaker embeddings directly from audio samples. Models are evaluated using mean opinion scores (MOS) and speaker verification metrics.
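A MOS value for one system is the mean of 1-5 listener ratings, typically reported with a 95% confidence interval; the helper below sketches that computation with a t-distribution interval and made-up ratings.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score with a t-distribution confidence interval.

    `ratings` are 1-5 listener scores for one system; papers commonly report
    MOS +/- 95% CI for both naturalness and speaker similarity.
    """
    r = np.asarray(ratings, dtype=float)
    half = stats.sem(r) * stats.t.ppf((1 + confidence) / 2, len(r) - 1)
    return r.mean(), half

ratings = [4, 3, 4, 5, 3, 4, 4, 2, 5, 4]   # toy crowdsourced 5-point scores
mean, half = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half:.2f}")
```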