Public Speech Datasets for ASR (details)

2023. 11. 18. 15:24

Dataset Name	Data Size (Hours)	Source	Description
LibriSpeech	960	LibriSpeech	A corpus of read English speech from LibriVox project, segmented and aligned, 16 kHz sampling rate.
Fisher Corpus	Part 1: 984, Total: >2000	Fisher Corpus Part 1 Transcripts	Spontaneous telephone conversations in English, recorded for linguistic research.
Switchboard-1 Dataset	260	Switchboard-1	English telephone conversations, collected under DARPA sponsorship.
WSJ-0 and WSJ-1	80	WSJ0	Read speech from Wall Street Journal news text, for large-vocabulary CSR systems.
National Speech Corpus	Not specified, 1.2 TB	National Speech Corpus	Singapore English corpus for ASR research, to improve accuracy for locally accented English.
VCTK	44 (total across 109 speakers)	VCTK	Text-to-speech research, audio of speakers with various accents reading different texts.
VoxPopuli (EN)	543 (part of 1.8K transcribed)	VoxPopuli	Multilingual corpus for unsupervised and semi-supervised learning.
Europarl-ASR (EN)	1300	Europarl-ASR	Parliamentary debates for ASR training, with official transcripts from the European Parliament.
Multilingual LibriSpeech (MLS EN)	2,000 (subset of 44.5K)	MLS EN	Derived from LibriVox project audiobooks, for speech research in multiple languages.
Mozilla Common Voice (v8.0)	16,000 (ongoing project)	Mozilla Common Voice	Multilingual read speech corpus for building voice technologies, contributed by volunteers.
People's Speech	12,000	Not found	English speech corpus for ASR model training.
TED-LIUM v3	452	TED-LIUM 3 Dataset	Audio from TED Talks, including talks from TED-LIUM 2, with automatic transcripts.
AMI	100	AMI Corpus	Meeting recordings with various synchronized signals including video and projector outputs.
English Broadcast News2	140 (plus 9000 hours of TV shows)	English Broadcast News Speech Recognition by Humans and Machines	Wide-band signals from various speakers, different background noises, and news topics, with lightly supervised transcripts.

LibriSpeech
- Data Size: 960 Hours
- Source: LibriSpeech
- Description: A corpus of read English speech derived from LibriVox audiobooks, carefully segmented and aligned, with a sampling rate of 16 kHz.
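The 16 kHz sampling rate makes it easy to estimate how much raw audio 960 hours represents. The sketch below assumes 16-bit mono PCM, which is not stated above, and the function name pcm_size_bytes is only illustrative. Compressed distributions of the corpus will take less space on disk.

```python
def pcm_size_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio in bytes (file header overhead ignored).

    The 16-bit depth and single channel are assumptions, not dataset facts.
    """
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600 * bytes_per_second)


librispeech_bytes = pcm_size_bytes(960)
print(f"{librispeech_bytes / 1e9:.1f} GB")  # 110.6 GB
```

Under these assumptions one hour of audio is 115.2 MB, so the same function can be reused to size any of the other 16 kHz corpora listed here.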
Fisher Corpus
- Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
- Source: Fisher Corpus Part 1 Transcripts
- Description: A collection of spontaneous telephone conversations in English between native speakers, recorded for linguistic research.
Switchboard-1 Dataset
- Data Size: 260 Hours
- Source: Switchboard-1
- Description: A corpus of English telephone conversations, collected under DARPA sponsorship and released by NIST and the LDC.
WSJ-0 and WSJ-1
- Data Size: 80 Hours
- Source: WSJ0
- Description: A read-speech corpus, known as WSJ0 and WSJ1, with texts drawn from Wall Street Journal news, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
National Speech Corpus (Part 1, Part 6)
- Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
- Source: National Speech Corpus
- Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
VCTK
- Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
- Source: VCTK
- Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
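As a sanity check on these figures, 44 hours is the total across all speakers, not a per-speaker amount. A minimal sketch using only the numbers quoted above (the variable names are illustrative, and 400 sentences per speaker is the approximate figure given):

```python
TOTAL_HOURS = 44
SPEAKERS = 109
SENTENCES_PER_SPEAKER = 400  # approximate, as quoted in the Data Size line

# Average recording time per speaker, in minutes.
minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

# Average duration of one recorded sentence, in seconds.
seconds_per_sentence = TOTAL_HOURS * 3600 / (SPEAKERS * SENTENCES_PER_SPEAKER)

print(f"{minutes_per_speaker:.1f} min per speaker")   # 24.2 min per speaker
print(f"{seconds_per_sentence:.1f} s per sentence")   # 3.6 s per sentence
```

Roughly 24 minutes per speaker and a few seconds per sentence are consistent with short read prompts.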
VoxPopuli (EN)
- Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
- Source: VoxPopuli
- Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
Europarl-ASR (EN)
- Data Size: 1300 hours of English-language annotated speech data.
- Source: Europarl-ASR
- Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
Multilingual LibriSpeech (MLS EN) - 2,000 hrs subset
- Data Size: 2,000 hours subset of a larger corpus with 44.5K hours of English.
- Source: MLS EN
- Description: A corpus derived from LibriVox audiobooks, suitable for speech research in multiple languages.
Mozilla Common Voice (v8.0)
- Data Size: About 16,000 hours (the exact size of v8.0 is not specified; the project is ongoing and grows with volunteer contributions).
- Source: Mozilla Common Voice
- Description: A multilingual corpus of read speech collected from volunteers across the globe for building voice-enabled technologies.
People's Speech
- Data Size: 12,000 Hours
- Source: Not found (no specific link for the 12,000-hour subset was located).
- Description: A large and diverse English speech corpus aimed at training ASR models.
TED-LIUM v3
- Data Size: 452 hours of audio
- Source: TED-LIUM 3 Dataset
- Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
AMI
- Data Size: 100 hours of meeting recordings
- Source: AMI Corpus
- Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
English Broadcast News2
- Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
- Source: English Broadcast News Speech Recognition by Humans and Machines
- Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.
