Ar-DAD: Arabic Diversified Audio Dataset

Authors:
Mohammed Lataifeh (University of Sharjah)
Ashraf Elnagar (University of Sharjah)

 

Description of the Dataset:

The Ar-DAD dataset contains 15,810 audio clips of 30 popular reciters cantillating verses from the Holy Quran (chapters 78-114). In addition, the dataset includes 379 imitation audio clips of 12 skilled imitators. The dataset is structured into three directories: one for reciters’ audio clips, one for imitators, and a directory containing textual materials for the Quranic verses with and without vowelization. This dataset can be used for research on speaker identification, imitation detection, and speech analysis in Arabic.
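The directory layout described above lends itself to a simple labeling pass. The following is a minimal sketch, not the authors' tooling: the directory names `reciters` and `imitators` and the function `label_clips` are hypothetical placeholders, since this card does not state the actual folder names.

```python
from pathlib import Path


def label_clips(root, real_dir="reciters", fake_dir="imitators"):
    """Return sorted (relative_path, label) pairs for every WAV file found.

    `real_dir` and `fake_dir` are assumed names; replace them with the
    actual directory names of the downloaded dataset.
    """
    root = Path(root)
    pairs = []
    for dirname, label in ((real_dir, "real"), (fake_dir, "imitation")):
        base = root / dirname
        if not base.is_dir():
            continue  # skip silently if the assumed directory is absent
        for path in base.rglob("*"):
            # Match the extension case-insensitively (.wav, .WAV, ...)
            if path.is_file() and path.suffix.lower() == ".wav":
                pairs.append((path.relative_to(root).as_posix(), label))
    return sorted(pairs)
```

Files outside the two audio directories, such as the text materials, are ignored by this pass.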

 

Data Creation Method:
Data was scraped and extracted from a Quranic audio portal and various websites. Audio files of 30 reciters cantillating chapters from the Quran were collected, along with clips from 12 imitators who were selected for their popularity on social media.

 

Number of Speakers:

  • 30 reciters
  • 12 imitators

Total Size:

  • 15,810 audio clips for reciters
  • 379 audio clips for imitators

Number of Real Samples:

  • 15,810 real audio clips of recitations

Number of Fake Samples:

  • 379 imitation audio clips
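The two counts above imply a strong class imbalance: roughly 42 real clips for every imitation clip, with imitations making up about 2.3% of the total. The sketch below works through that arithmetic and derives inverse-frequency class weights, one common way to compensate during training. The weighting scheme is an illustrative choice, not something the dataset authors prescribe, and `inverse_frequency_weights` is a hypothetical helper.

```python
REAL_CLIPS = 15_810       # from this card
IMITATION_CLIPS = 379     # from this card


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


counts = {"real": REAL_CLIPS, "imitation": IMITATION_CLIPS}
weights = inverse_frequency_weights(counts)
imitation_share = IMITATION_CLIPS / sum(counts.values())

print(f"imitation share: {imitation_share:.3%}")
print(f"class weights:   {weights}")
```

With these weights the rare imitation class contributes as much to a weighted loss, in aggregate, as the real class does.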

 

Extra Details:
The dataset is described as the first of its kind, providing a large corpus of Arabic Quranic recitations together with imitation clips for speaker identification and verification. It can be used to train and evaluate machine learning and deep learning models for audio processing and speaker recognition.

 

Data Type:

  • WAV audio files
  • Plain text files for Quranic text

Average Length:

  • Approximately 10 seconds per audio clip
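The stated average length can be checked on a local copy. Here is a minimal sketch using only Python's standard `wave` module; it assumes the clips are uncompressed PCM WAV files, which that module requires, and the helper names are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def mean_duration_seconds(paths):
    """Average duration over an iterable of WAV paths (0.0 if empty)."""
    durations = [wav_duration_seconds(p) for p in paths]
    return sum(durations) / len(durations) if durations else 0.0
```

If `wave.open` raises `wave.Error` on a file, that clip is likely stored in a non-PCM encoding and would need a third-party audio library instead.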

Keywords:

  • Arabic audio clips, Speaker identification, Machine learning, Deep learning, Quran recitations, Imitators, Cantillations

When Published:

  • 2020

 

Annotation Process:
Audio clips are organized by reciter, chapter, and verse. File names follow a structured scheme that identifies the reciter or imitator and the verse number. The Quranic text is provided with and without vowelization (diacritics) for further analysis.
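To illustrate how the two text variants relate, the sketch below strips the common Arabic short-vowel marks from a string. It is an assumption-laden example, not the procedure used to build the dataset: it covers only the Unicode range U+064B to U+0652 plus the superscript alef U+0670, and Quranic text in Uthmani script can contain further annotation marks that this pattern leaves untouched. The function name `strip_vowelization` is hypothetical.

```python
import re

# Fathatan through sukun (U+064B-U+0652) and superscript alef (U+0670).
_HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_vowelization(text):
    """Remove common Arabic diacritics, leaving the base letters."""
    return _HARAKAT.sub("", text)


# "bismi" with diacritics: beh+kasra, seen+sukun, meem+kasra
vowelized = "\u0628\u0650\u0633\u0652\u0645\u0650"
plain = strip_vowelization(vowelized)
```

Comparing the output of such a function against the dataset's own unvowelized file is a quick way to see which additional marks, if any, the authors removed.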

 

Usage Scenarios:

  • Speaker identification
  • Imitation detection
  • Machine learning and deep learning models for Arabic audio processing
  • Speech recognition and classification

 

Data Accessibility:
The dataset is publicly available under the Creative Commons Attribution 4.0 International License on Mendeley.

 

Miscellaneous Information:
The dataset is intended to support the development of machine learning and deep learning models for Arabic speech analysis, in particular speaker identification and imitation detection.

Datasets Used:

  • Quranic audio from various websites

Speech Synthesis Models Referenced:

  • None specified

Dataset Link


Main Paper Link


License Link


Last Accessed: 6/24/2024

NSF Award #2346473