The LJ Speech Dataset

Authors:
Keith Ito, Linda Johnson

Institution:
LibriVox project

Authors’ Contact Information:

Keith Ito: kito@kito.us

 

Abstract:
The LJ Speech Dataset is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from seven non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

 

Description of the Dataset:
The dataset contains 13,100 short audio clips of a single speaker reading passages from seven non-fiction books. Each clip is accompanied by a transcription and a normalized transcription. The audio clips range from 1 to 10 seconds in length, with a total duration of approximately 24 hours. The texts and audio recordings are in the public domain.

 

Data Creation Method:
The audio was recorded for the LibriVox project in 2016-17 and segmented automatically based on silences in the recording. The text was matched to the audio manually, and a QA pass was done to ensure accuracy. The original LibriVox recordings were distributed as 128 kbps MP3 files; for this dataset, the audio was converted to single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
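The stated target format (single channel, 16-bit PCM, 22050 Hz) can be checked per clip with Python's standard wave module. The following is a minimal sketch, not part of the dataset's tooling: the function names check_wav_format and make_silent_wav are hypothetical, and the silent clip is generated in memory only so the example runs without the dataset.

```python
import io
import wave

# Format stated in the text: mono, 16-bit PCM, 22050 Hz.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample (16 bits)
EXPECTED_SAMPLE_RATE = 22050


def check_wav_format(source):
    """Return (matches_expected, duration_seconds) for a WAV path string or file object."""
    with wave.open(source, "rb") as wav:
        matches = (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
            and wav.getframerate() == EXPECTED_SAMPLE_RATE
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration


def make_silent_wav(seconds, sample_rate=EXPECTED_SAMPLE_RATE):
    """Build an in-memory mono 16-bit WAV of silence (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        # Each 16-bit frame of silence is two zero bytes.
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    ok, duration = check_wav_format(make_silent_wav(2))
    print(ok, duration)
```

For a real clip, pass the path to the WAV file as a string instead of the in-memory buffer.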

 

Number of Speakers:

  • 1

Total Size:

  • 2.6 GB

Number of Real Samples:

  • 13,100 audio clips

Number of Fake Samples:

  • N/A

 

Extra Details:
The dataset includes metadata in a file named transcripts.csv (the changelog refers to this file as metadata.csv), which provides an ID, a transcription, and a normalized transcription for each audio clip. The audio was recorded by Linda Johnson for the LibriVox project, and the texts were published between 1884 and 1964.

 

Data Type:

  • 16-bit PCM WAV files

Average Length:

  • 6.57 seconds per clip

Keywords:

  • Speech Recognition, Training Data, Public Domain, Speech Dataset, Audio Clips

When Published:

  • 2017

 

Annotation Process:
Metadata is provided in transcripts.csv, which consists of one record per line, with fields delimited by the pipe character (0x7c). Each record has three fields: ID, Transcription, and Normalized Transcription. The text was matched to the audio manually, and a QA pass was done to ensure that the text accurately matched the words spoken in the audio.
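The record layout described above (one record per line, three pipe-delimited fields) can be read with Python's standard csv module. This is a minimal sketch under stated assumptions: read_metadata is a hypothetical helper name, and the sample rows are invented for illustration and are not actual dataset records.

```python
import csv
import io

# Invented sample rows in the described layout: ID|Transcription|Normalized Transcription
SAMPLE = (
    "CLIP-0001|Chapter 2 begins here.|Chapter two begins here.\n"
    'CLIP-0002|He said "stop".|He said "stop".\n'
)


def read_metadata(lines):
    """Parse pipe-delimited records into dicts with id, transcription, normalized."""
    # QUOTE_NONE keeps double quotes in the text as ordinary characters.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        clip_id, transcription, normalized = row
        records.append(
            {"id": clip_id, "transcription": transcription, "normalized": normalized}
        )
    return records


if __name__ == "__main__":
    for record in read_metadata(io.StringIO(SAMPLE)):
        print(record["id"], "->", record["normalized"])
```

With the real file, the same function accepts an open file handle, for example one opened with newline="" and encoding="utf-8". The UTF-8 encoding is an assumption here, since the text only states that some transcriptions contain non-ASCII characters.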

 

Usage Scenarios:
Training and evaluation of speech synthesis and recognition models, research in natural language processing, and other applications in audio processing.

 

Data Accessibility:
Publicly accessible

 

Statistics:

  • Total Clips: 13,100
  • Total Words: 225,715
  • Total Characters: 1,308,678
  • Total Duration: 23:55:17
  • Mean Clip Duration: 6.57 sec
  • Min Clip Duration: 1.11 sec
  • Max Clip Duration: 10.10 sec
  • Mean Words per Clip: 17.23
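The two mean values can be cross-checked against the totals listed above. The short sketch below only repeats that arithmetic; the helper name hms_to_seconds is hypothetical, and all input figures are copied from the statistics in this card.

```python
def hms_to_seconds(text):
    """Convert an H:MM:SS string into a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Figures copied from the statistics above.
TOTAL_CLIPS = 13_100
TOTAL_WORDS = 225_715
TOTAL_DURATION = "23:55:17"

total_seconds = hms_to_seconds(TOTAL_DURATION)
mean_clip_duration = total_seconds / TOTAL_CLIPS
mean_words_per_clip = TOTAL_WORDS / TOTAL_CLIPS

print(total_seconds)                   # 86117
print(round(mean_clip_duration, 2))    # 6.57
print(round(mean_words_per_clip, 2))   # 17.23
```

Both results agree with the listed mean clip duration (6.57 sec) and mean words per clip (17.23).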

 

Miscellaneous Information:
The audio clips were segmented based on silences, generally aligning with sentence or clause boundaries. Some clips may contain artifacts from the original MP3 encoding. The dataset includes non-ASCII characters in 19 of the transcriptions.
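Pipelines that assume ASCII input may want to locate the transcriptions with non-ASCII characters before training. The following is a minimal sketch of one way to do so; non_ascii_entries is a hypothetical helper name, and the sample strings are invented rather than taken from the dataset.

```python
def non_ascii_entries(transcriptions):
    """Return (index, sorted non-ASCII characters) for each text that contains any."""
    flagged = []
    for index, text in enumerate(transcriptions):
        if not text.isascii():  # str.isascii() requires Python 3.7+
            chars = sorted({ch for ch in text if ord(ch) > 127})
            flagged.append((index, chars))
    return flagged


if __name__ == "__main__":
    # Invented sample texts, not dataset records.
    samples = ["plain text", "a café on the corner", "résumé"]
    print(non_ascii_entries(samples))
```

Applied to the Transcription field of every record, the length of the returned list would be the number of affected transcriptions.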

 

Credits:

  • Books Included:
    • Morris, William, et al. Arts and Crafts Essays. 1893.
    • Griffiths, Arthur. The Chronicles of Newgate, Vol. 2. 1884.
    • Roosevelt, Franklin D. The Fireside Chats of Franklin Delano Roosevelt. 1933-42.
    • Harland, Marion. Marion Harland’s Cookery for Beginners. 1893.
    • Rolt-Wheeler, Francis. The Science – History of the Universe, Vol. 5: Biology. 1910.
    • Banks, Edgar J. The Seven Wonders of the Ancient World. 1916.
    • President’s Commission on the Assassination of President Kennedy. Report of the President’s Commission on the Assassination of President Kennedy. 1964.
  • Recording by: Linda Johnson from LibriVox
  • Alignment and Annotation by: Keith Ito

 

Changelog:

  • Version 1.1 (current release):
    • Removed 30 .wav files without corresponding annotations in metadata.csv.
  • Version 1.0:
    • Initial release.

Dataset Link


Main Paper Link:

N/A


License Link


Last Accessed: 6/19/2024

NSF Award #2346473