Dataset Name | Data Size (Hours) | Source | Description |
---|---|---|---|
LibriSpeech | 960 | LibriSpeech | A corpus of read English speech from the LibriVox project, segmented and aligned, 16 kHz sampling rate. |
Fisher Corpus | Part 1: 984, Total: >2000 | Fisher Corpus Part 1 Transcripts | Spontaneous telephone conversations in English, recorded for linguistic research. |
Switchboard-1 Dataset | 260 | Switchboard-1 | English telephone conversations, collected under DARPA sponsorship. |
WSJ-0 and WSJ-1 | 80 | WSJ0 | Read speech from Wall Street Journal news text, for large-vocabulary CSR systems. |
National Speech Corpus | Not specified (about 1.2 TB in total) | National Speech Corpus | Singapore English corpus for ASR research, to improve accuracy for locally accented English. |
VCTK | 44 (109 speakers, about 400 sentences each) | VCTK | Text-to-speech research, audio of speakers with various accents reading different texts. |
VoxPopuli (EN) | 543 (part of 1.8K transcribed) | VoxPopuli | Multilingual corpus for unsupervised and semi-supervised learning. |
Europarl-ASR (EN) | 1300 | Europarl-ASR | Parliamentary debates for ASR training, with official transcripts from the European Parliament. |
Multilingual LibriSpeech (MLS EN) | 2,000 (subset of 44.5K) | MLS EN | Derived from LibriVox project audiobooks, for speech research in multiple languages. |
Mozilla Common Voice (v8.0) | 16,000 (ongoing project) | Mozilla Common Voice | Multilingual read speech corpus for building voice technologies, contributed by volunteers. |
People's Speech | 12,000 | No link found for the 12,000-hour subset | English speech corpus for ASR model training. |
TED-LIUM v3 | 452 | TED-LIUM 3 Dataset | Audio from TED Talks, including talks from TED-LIUM 2, with automatic transcripts. |
AMI | 100 | AMI Corpus | Meeting recordings with various synchronized signals including video and projector outputs. |
English Broadcast News2 | 140 (plus 9000 hours of TV shows) | English Broadcast News Speech Recognition by Humans and Machines | Wide-band signals from various speakers, different background noises, and news topics, with lightly supervised transcripts. |
- LibriSpeech
- Data Size: 960 Hours
- Source: LibriSpeech
- Description: A corpus of read English speech derived from LibriVox audiobooks, carefully segmented and aligned, with a sampling rate of 16 kHz.
- Fisher Corpus
- Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
- Source: Fisher Corpus Part 1 Transcripts
- Description: A collection of spontaneous telephone conversations in English between native speakers, recorded for linguistic research.
- Switchboard-1 Dataset
- Data Size: 260 Hours
- Source: Switchboard-1
- Description: A corpus of English telephone conversations, collected under DARPA sponsorship and released by NIST and the LDC.
- WSJ-0 and WSJ-1
- Data Size: 80 Hours
- Source: WSJ0
- Description: A corpus of read speech with texts drawn from Wall Street Journal news text, known as WSJ0 and WSJ1, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
- National Speech Corpus (Part 1, Part 6)
- Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
- Source: National Speech Corpus
- Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
- VCTK
- Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
- Source: VCTK
- Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
- VoxPopuli (EN)
- Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
- Source: VoxPopuli
- Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
- Europarl-ASR (EN)
- Data Size: 1300 hours of English-language annotated speech data.
- Source: Europarl-ASR
- Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
- Multilingual LibriSpeech (MLS EN) - 2,000 hrs subset
- Data Size: 2,000 hours subset of a larger corpus with 44.5K hours of English.
- Source: MLS EN
- Description: A corpus derived from read audiobooks from the LibriVox project, suitable for speech research in multiple languages.
- Mozilla Common Voice (v8.0)
- Data Size: 16,000 Hours (listed figure; the source does not state an exact size for v8.0, and the corpus keeps growing with volunteer contributions.)
- Source: Mozilla Common Voice
- Description: A multilingual corpus of read speech collected from volunteers across the globe for building voice-enabled technologies.
- People's Speech
- Data Size: 12,000 Hours
- Source: No specific link for the 12,000-hour subset was found.
- Description: A large and diverse English speech corpus aimed at training ASR models.
- TED-LIUM v3
- Data Size: 452 hours of audio
- Source: TED-LIUM 3 Dataset
- Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
- AMI
- Data Size: 100 hours of meeting recordings
- Source: AMI Corpus
- Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
- English Broadcast News2
- Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
- Source: English Broadcast News Speech Recognition by Humans and Machines
- Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.
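When planning storage for any of these corpora, it can help to turn the listed hours into a rough uncompressed size. The sketch below is only an estimate: it assumes 16-bit mono PCM at 16 kHz (LibriSpeech is listed above at 16 kHz; the bit depth and channel count are assumptions, and the telephone corpora are typically sampled at 8 kHz instead). The helper name `pcm_size_bytes` is made up for this example.

```python
def pcm_size_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Estimate the raw PCM payload size in bytes (file headers ignored).

    Defaults assume 16 kHz, 16-bit, mono audio; adjust them per corpus.
    """
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(seconds * bytes_per_second)


# LibriSpeech: 960 hours at 16 kHz (bit depth and mono are assumed here).
librispeech_bytes = pcm_size_bytes(960)
print(f"{librispeech_bytes / 1e9:.1f} GB")  # prints "110.6 GB"
```

Actual downloads are often smaller than this estimate because many corpora ship in compressed formats such as FLAC, so treat the number as an upper bound for decoded audio rather than the archive size.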