| Dataset Name | Data Size (Hours) | Source | Description |
| --- | --- | --- | --- |
| LibriSpeech | 960 | LibriSpeech | A corpus of read English speech from the LibriVox project, segmented and aligned, 16 kHz sampling rate. |
| Fisher Corpus | Part 1: 984; total: >2000 | Fisher Corpus Part 1 Transcripts | Spontaneous telephone conversations in English, recorded for linguistic research. |
| Switchboard-1 Dataset | 260 | Switchboard-1 | English telephone conversations, collected under DARPA sponsorship. |
| WSJ-0 and WSJ-1 | 80 | WSJ0 | Read speech from Wall Street Journal news text, for large-vocabulary CSR systems. |
| National Speech Corpus | Not specified (about 1.2 TB) | National Speech Corpus | Singapore English corpus for ASR research, to improve accuracy for locally accented English. |
| VCTK | 44 (109 speakers, about 400 sentences each) | VCTK | Text-to-speech research, audio of speakers with various accents reading different texts. |
| VoxPopuli (EN) | 543 (part of 1.8K transcribed) | VoxPopuli | Multilingual corpus for unsupervised and semi-supervised learning. |
| Europarl-ASR (EN) | 1300 | Europarl-ASR | Parliamentary debates for ASR training, with official transcripts from the European Parliament. |
| Multilingual LibriSpeech (MLS EN) | 2,000 (subset of 44.5K) | MLS EN | Derived from LibriVox project audiobooks, for speech research in multiple languages. |
| Mozilla Common Voice (v8.0) | 16,000 (ongoing project) | Mozilla Common Voice | Multilingual read speech corpus for building voice technologies, contributed by volunteers. |
| People's Speech | 12,000 | Not found | English speech corpus for ASR model training. |
| TED-LIUM v3 | 452 | TED-LIUM 3 Dataset | Audio from TED Talks, including talks from TED-LIUM 2, with automatic transcripts. |
| AMI | 100 | AMI Corpus | Meeting recordings with various synchronized signals including video and projector outputs. |
| English Broadcast News2 | 140 (plus 9000 hours of TV shows) | English Broadcast News Speech Recognition by Humans and Machines | Wide-band signals from various speakers, different background noises, and news topics, with lightly supervised transcripts. |
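
To get a feel for the combined scale of these corpora, the hour figures listed above can be tallied with a short script. This is a minimal sketch, not part of any dataset's tooling: the names `LISTED_HOURS` and `total_listed_hours` are made up for illustration, Fisher is entered as 2000 because only ">2000" is given (so the result is a lower bound), the National Speech Corpus is left out because no hour count is listed, and the 9000 hours of closed-captioned TV shows are not counted.

```python
from typing import Dict, Optional

# Hour figures copied from the table above. None marks an entry whose
# size is given only in terabytes, so it cannot be added to the total.
# Fisher is listed as ">2000", so 2000 is used as a lower bound.
LISTED_HOURS: Dict[str, Optional[int]] = {
    "LibriSpeech": 960,
    "Fisher Corpus": 2000,
    "Switchboard-1 Dataset": 260,
    "WSJ-0 and WSJ-1": 80,
    "National Speech Corpus": None,
    "VCTK": 44,
    "VoxPopuli (EN)": 543,
    "Europarl-ASR (EN)": 1300,
    "Multilingual LibriSpeech (MLS EN)": 2000,
    "Mozilla Common Voice (v8.0)": 16000,
    "People's Speech": 12000,
    "TED-LIUM v3": 452,
    "AMI": 100,
    "English Broadcast News": 140,
}


def total_listed_hours(hours: Dict[str, Optional[int]]) -> int:
    """Sum the hour figures, skipping entries with no hour count."""
    return sum(h for h in hours.values() if h is not None)


if __name__ == "__main__":
    missing = [name for name, h in LISTED_HOURS.items() if h is None]
    print(f"Total listed hours (lower bound): {total_listed_hours(LISTED_HOURS)}")
    print(f"Entries without an hour figure: {missing}")
```

With the figures as listed, the lower-bound total is 35,879 hours, of which Common Voice and People's Speech together account for 28,000.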

  1. LibriSpeech
    • Data Size: 960 Hours
    • Source: LibriSpeech
    • Description: A corpus of read English speech derived from LibriVox project audiobooks, carefully segmented and aligned, with a sampling rate of 16 kHz.
  2. Fisher Corpus
    • Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
    • Source: Fisher Corpus Part 1 Transcripts
    • Description: A collection of spontaneous telephone conversations in English between native speakers, recorded for linguistic research.
  3. Switchboard-1 Dataset
    • Data Size: 260 Hours
    • Source: Switchboard-1
    • Description: A corpus of English telephone conversations, collected under DARPA sponsorship and released by NIST and the LDC.
  4. WSJ-0 and WSJ-1
    • Data Size: 80 Hours
    • Source: WSJ0
    • Description: A corpus of read speech with texts drawn from Wall Street Journal news text, known as WSJ0 and WSJ1, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
  5. National Speech Corpus (Part 1, Part 6)
    • Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
    • Source: National Speech Corpus
    • Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
  6. VCTK
    • Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
    • Source: VCTK
    • Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
  7. VoxPopuli (EN)
    • Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
    • Source: VoxPopuli
    • Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
  8. Europarl-ASR (EN)
    • Data Size: 1300 hours of English-language annotated speech data.
    • Source: Europarl-ASR
    • Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
  9. Multilingual LibriSpeech (MLS EN) - 2,000 hrs subset
    • Data Size: 2,000 hours subset of a larger corpus with 44.5K hours of English.
    • Source: MLS EN
    • Description: A corpus derived from read audiobooks from the LibriVox project, suitable for speech research in multiple languages.
  10. Mozilla Common Voice (v8.0)
    • Data Size: 16,000 Hours (This figure is not confirmed for v8.0 specifically; the project is ongoing and grows with volunteer contributions.)
    • Source: Mozilla Common Voice
    • Description: A multilingual corpus of read speech collected from volunteers across the globe for building voice-enabled technologies.
  11. People's Speech
    • Data Size: 12,000 Hours
    • Source: A specific link for the 12,000 hours subset was not found during the search.
    • Description: A large and diverse English speech corpus aimed at training ASR models.
  12. TED-LIUM v3
    • Data Size: 452 hours of audio
    • Source: TED-LIUM 3 Dataset
    • Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
  13. AMI
    • Data Size: 100 hours of meeting recordings
    • Source: AMI Corpus
    • Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
  14. English Broadcast News2
    • Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
    • Source: English Broadcast News Speech Recognition by Humans and Machines
    • Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.
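
The hour counts in the list above can be turned into rough planning numbers with simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: it assumes uncompressed 16-bit mono PCM audio, which the list does not state (only LibriSpeech's 16 kHz sampling rate is given), and it treats VCTK's "about 400 sentences" per speaker as exactly 400. The function names `pcm_size_gb` and `mean_utterance_seconds` are hypothetical helpers, not part of any dataset's tooling.

```python
def pcm_size_gb(hours: float, sample_rate_hz: int = 16_000,
                bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Approximate uncompressed PCM size in decimal gigabytes.

    Assumes 16-bit mono samples by default; real distributions are
    often compressed (for example FLAC) and will be smaller.
    """
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels / 1e9


def mean_utterance_seconds(total_hours: float, speakers: int,
                           sentences_per_speaker: int) -> float:
    """Average utterance length implied by total hours and sentence count."""
    return total_hours * 3600 / (speakers * sentences_per_speaker)


if __name__ == "__main__":
    # LibriSpeech: 960 hours at 16 kHz.
    print(f"LibriSpeech as raw PCM: about {pcm_size_gb(960):.1f} GB")
    # VCTK: 44 hours, 109 speakers, roughly 400 sentences each.
    print(f"VCTK mean utterance: about {mean_utterance_seconds(44, 109, 400):.1f} s")
```

Under these assumptions LibriSpeech comes to roughly 110.6 GB of raw audio, and VCTK's utterances average a little over 3.6 seconds. The same conversion should not be run in reverse on the National Speech Corpus's 1.2 TB figure without knowing its actual audio format and channel layout.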
