This paper presents a method for jointly pre-training speech and text in an encoder-decoder framework to improve performance in speech translation and recognition tasks. 

 

 

Key Takeaways:

  1. Architecture: The method uses an Attention-based Encoder-Decoder (AED) framework to integrate data from different modalities (speech and text) for representation learning.
    • Shared Encoder and Decoder: The STPT framework uses a shared encoder and decoder for both the speech and text modalities, which allows the model to integrate knowledge from both domains.
  2. Acoustic and Linguistic Representation Learning: The STPT framework is designed to learn both acoustic features from speech and linguistic features from text during the pre-training stage. This is crucial for speech translation models, which must understand the sounds of speech as well as the meaning of words.
  3. Joint Pre-Training via Multi-Task Learning: The framework integrates different pre-training tasks to build a robust model capable of handling multiple aspects of speech and language. The proposed Speech and Text joint Pre-Training (STPT) framework incorporates four self-supervised and supervised subtasks designed for cross-modality learning.
    • Text-to-Text (T2T): This self-supervised task helps the model learn linguistic patterns in the text. It's similar to how models like BERT learn by predicting masked words in a sentence.
    • Speech SSL learning (SSL): This is another self-supervised task focused on learning from the speech data alone, likely involving predicting some masked or hidden parts of the speech input.
    • Speech-to-Phoneme (S2P): A supervised task where the model is trained to predict phoneme units from speech data. Phonemes are the smallest units of sound in a language, so this task helps the model learn the sounds that make up speech.
    • Speech-to-Subword (S2T): Also a supervised task, where the model learns to predict subword units from the speech input. Subwords are larger than phonemes and can carry more linguistic information, like syllables or parts of words.
  4. Loss Functions: Pretraining is guided by different loss functions corresponding to the various tasks:
    • LT2T: The loss for the Text-to-Text task.
    • LSSL: The loss for the Speech SSL learning task, which involves masked prediction.
    • LS2P: The loss for the Speech-to-Phoneme task, which involves phoneme-unit sequence classification.
    • LS2T: The loss for the Speech-to-Subword task, involving sequential prediction of subword tokens.
    • Final Loss: The overall objective for the pre-training phase is a combination of these losses, guiding the model to learn both modality-specific and cross-modal representations (one possible formulation is sketched after this list).
  5. Improved Performance: The STPT method effectively fuses speech and text information into one model, leading to significant improvements in performance. It achieves 1.7 to 2.3 BLEU score improvements on the MuST-C speech translation dataset and comparable word error rates (WERs) to the wav2vec 2.0 model on the LibriSpeech speech recognition task.
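Putting the takeaways above together, the overall pre-training objective can be written as a weighted sum of the four subtask losses. The formulation below is only a sketch based on this summary; whether the paper uses explicit task weights or a plain sum is an assumption:

$$\mathcal{L}_{\text{pre-train}} \;=\; \lambda_{1}\,\mathcal{L}_{\text{T2T}} \;+\; \lambda_{2}\,\mathcal{L}_{\text{SSL}} \;+\; \lambda_{3}\,\mathcal{L}_{\text{S2P}} \;+\; \lambda_{4}\,\mathcal{L}_{\text{S2T}}$$

where the \(\lambda_{k}\) are task-weighting hyperparameters (setting them all to 1 recovers a simple sum).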

 

 

 

This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.

 

 

Key Takeaways:

  1. Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
  2. Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs (a rough sketch of this decomposition follows the list).
  3. Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed model is designed to connect the representations of speech and text through a shared unit encoder. It allows for pre-training with unpaired speech and text data, which can be beneficial for tasks like automatic speech recognition (ASR) and speech translation (ST). SpeechUT is a new pre-training method using hidden-unit representations to connect speech encoders and text decoders.
  4. Discrete Representation (Units): SpeechUT leverages hidden-unit representations as an interface to align speech and text. This is done by decomposing the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately with large amounts of unpaired data. The model uses discrete unit sequences produced by off-line generators, allowing for the pre-training of large-scale unpaired speech and text.
  5. Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
  6. Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained with the mentioned tasks and fine-tuned for specific ASR and ST tasks.
    1. Pre-Training Tasks: SpeechUT is pre-trained with the three unsupervised tasks introduced above (S2U, MUM, and U2T).
    2. Fine-Tuning: For downstream tasks like ASR and ST, SpeechUT is fine-tuned without introducing new parameters, utilizing the pre-trained modules.
  7. Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
  8. Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
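The sketch below is an illustrative PyTorch rendering of the decomposition described above: a speech pre-net/encoder, a shared unit encoder, and a text decoder, with the discrete units acting as the interface. It is not the authors' implementation; layer sizes, the number of units, and the two forward paths are assumptions, and masked unit modeling and embedding mixing are omitted.

```python
# Illustrative sketch of a SpeechUT-style decomposition: a speech-to-unit
# path and a unit-to-text path that share the same unit encoder.
import torch
import torch.nn as nn

class SpeechUnitTextSketch(nn.Module):
    def __init__(self, n_units=500, vocab_size=10000, d_model=256):
        super().__init__()
        # Speech pre-net + speech encoder: filterbank frames -> continuous states
        self.speech_prenet = nn.Sequential(nn.Linear(80, d_model), nn.ReLU())
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Unit pre-net (embedding) + shared unit encoder: the speech/text interface
        self.unit_embedding = nn.Embedding(n_units, d_model)
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Text decoder predicting subword tokens
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.unit_head = nn.Linear(d_model, n_units)     # speech-to-unit (S2U) targets
        self.text_head = nn.Linear(d_model, vocab_size)  # unit-to-text (U2T) targets

    def speech_to_unit(self, fbank):
        """S2U path: predict offline-generated discrete units from speech frames."""
        h = self.speech_encoder(self.speech_prenet(fbank))
        h = self.unit_encoder(h)              # shared encoder sees the speech side
        return self.unit_head(h)              # (B, T, n_units) per-frame unit logits

    def unit_to_text(self, units, text_in):
        """U2T path: encode a discrete unit sequence, then decode subword tokens."""
        memory = self.unit_encoder(self.unit_embedding(units))  # shared encoder sees the unit side
        dec = self.text_decoder(self.text_embedding(text_in), memory)
        return self.text_head(dec)            # (B, L, vocab_size) token logits
```

Fine-tuning for ASR/ST would then reuse the speech encoder, unit encoder, and text decoder end to end, which is consistent with the "no new parameters at fine-tuning" point above.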

 

 




  1. OWSM v1, v2, and v3: refer to the paper for details
    • OWSM v1
      • AISHELL-1 [23],
      • CoVoST2 [24],
      • GigaSpeech [25],
      • LibriSpeech [26],
      • MuST-C [27],
      • SPGISpeech [28]
      • TEDLIUM3 [29].
    • OWSM v2
      • builds upon v1 and includes additional datasets:
      • GigaST [30]
      • Multilingual LibriSpeech [31]
      • WenetSpeech [32].
    • OWSM v3
      • extends v2 with even more datasets:
      • AIDATATANG [33],
      • AMI [34],
      • Babel [35],
      • Common Voice [36],
      • Fisher (Switchboard) [37],
      • Fisher Callhome Spanish [38],
      • FLEURS [39],
      • Google i18n,
      • KsponSpeech [40],
      • MagicData [41],
      • ReazonSpeech [42],
      • Russian Open STT [43],
      • VCTK [44],
      • VoxForge [45],
      • VoxPopuli [46],
      • WSJ [47].
  2. NeMo-Public dataset
    • Librispeech
    • Fisher Corpus
    • Switchboard-1 Dataset
    • WSJ-0 and WSJ-1
    • National Speech Corpus (Part 1, Part 6)
    • VCTK
    • VoxPopuli (EN)
    • Europarl-ASR (EN)
    • Multilingual Librispeech (MLS EN) - 2,000 hrs subset
    • Mozilla Common Voice (v8.0)
    • People's Speech - 12,000 hrs subset
  3. SpeechStew
    • Librispeech
    • Common Voice v8.0
    • TED-LIUM v3
    • AMI
    • English Broadcast News
    • WSJ0 and WSJ1

 

 

| Dataset Name | Data Size (Hours) | Source | Description |
| --- | --- | --- | --- |
| LibriSpeech | 960 | LibriSpeech | A corpus of read English speech from the LibriVox project, segmented and aligned, 16 kHz sampling rate. |
| Fisher Corpus | Part 1: 984, Total: >2000 | Fisher Corpus Part 1 Transcripts | Spontaneous telephone conversations in English, recorded for linguistic research. |
| Switchboard-1 Dataset | 260 | Switchboard-1 | English telephone conversations, collected under DARPA sponsorship. |
| WSJ-0 and WSJ-1 | 80 | WSJ0 | Read speech from Wall Street Journal news text, for large-vocabulary CSR systems. |
| National Speech Corpus | Not specified (1.2 TB) | National Speech Corpus | Singapore English corpus for ASR research, to improve accuracy for locally accented English. |
| VCTK | 44 (109 speakers) | VCTK | Text-to-speech research; audio of speakers with various accents reading different texts. |
| VoxPopuli (EN) | 543 (part of 1.8K transcribed) | VoxPopuli | Multilingual corpus for unsupervised and semi-supervised learning. |
| Europarl-ASR (EN) | 1300 | Europarl-ASR | Parliamentary debates for ASR training, with official transcripts from the European Parliament. |
| Multilingual LibriSpeech (MLS EN) | 2,000 (subset of 44.5K) | MLS EN | Derived from LibriVox project audiobooks, for speech research in multiple languages. |
| Mozilla Common Voice (v8.0) | 16,000 (ongoing project) | Mozilla Common Voice | Multilingual read speech corpus for building voice technologies, contributed by volunteers. |
| People's Speech | 12,000 | Not found | English speech corpus for ASR model training. |
| TED-LIUM v3 | 452 | TED-LIUM 3 Dataset | Audio from TED Talks, including talks from TED-LIUM 2, with automatic transcripts. |
| AMI | 100 | AMI Corpus | Meeting recordings with various synchronized signals, including video and projector outputs. |
| English Broadcast News | 140 (plus 9,000 hours of TV shows) | English Broadcast News Speech Recognition by Humans and Machines | Wide-band signals from various speakers, different background noises, and news topics, with lightly supervised transcripts. |

  1. LibriSpeech
    • Data Size: 960 Hours
    • Source: LibriSpeech
    • Description: A corpus of read English speech derived from read audiobooks from the LibriVox project, carefully segmented and aligned, with a sampling rate of 16 kHz.
  2. Fisher Corpus
    • Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
    • Source: Fisher Corpus Part 1 Transcripts
    • Description: A collection of spontaneous telephone conversations in English between native speakers, recorded for linguistic research.
  3. Switchboard-1 Dataset
    • Data Size: 260 Hours
    • Source: Switchboard-1
    • Description: A corpus of English telephone conversations, collected under DARPA sponsorship and released by NIST and the LDC.
  4. WSJ-0 and WSJ-1
    • Data Size: 80 Hours
    • Source: WSJ0
    • Description: A corpus of read speech with texts drawn from Wall Street Journal news text, known as WSJ0 and WSJ1, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
  5. National Speech Corpus (Part 1, Part 6)
    • Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
    • Source: National Speech Corpus
    • Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
  6. VCTK
    • Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
    • Source: VCTK
    • Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
  7. VoxPopuli (EN)
    • Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
    • Source: VoxPopuli
    • Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
  8. Europarl-ASR (EN)
    • Data Size: 1300 hours of English-language annotated speech data.
    • Source: Europarl-ASR
    • Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
  9. Multilingual LibriSpeech (MLS EN) - 2,000 hrs subset
    • Data Size: 2,000 hours subset of a larger corpus with 44.5K hours of English.
    • Source: MLS EN
    • Description: A corpus derived from read audiobooks from the LibriVox project, suitable for speech research in multiple languages.
  10. Mozilla Common Voice (v8.0)
    • Data Size: 16,000 Hours (The size for v8.0 is not specified, but the project is ongoing with contributions from volunteers.)
    • Source: Mozilla Common Voice
    • Description: A multilingual corpus of read speech collected from volunteers across the globe for building voice-enabled technologies.
  11. People's Speech
    • Data Size: 12,000 Hours
    • Source: A specific link for the 12,000 hours subset was not found during the search.
    • Description: A large and diverse English speech corpus aimed at training ASR models.
  12. TED-LIUM v3
    • Data Size: 452 hours of audio
    • Source: TED-LIUM 3 Dataset
    • Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
  13. AMI
    • Data Size: 100 hours of meeting recordings
    • Source: AMI Corpus
    • Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
  14. English Broadcast News
    • Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
    • Source: English Broadcast News Speech Recognition by Humans and Machines
    • Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.

 

Model Overview

  • Whisper is a Transformer-based encoder-decoder model.

Training Data

  • Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.

Whisper ASR V1 and V2

  • Trained on 680,000 hours of audio and corresponding transcripts from the internet.
  • Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.

Whisper ASR V3

  • Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
  • V3 shows a 10% to 20% reduction in errors compared to V2.

Training Details

  • Initial models were trained with the AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period (a rough training-step sketch follows this list).
  • No data augmentation or regularization was used initially due to the diversity and size of the dataset.
  • For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
  • Different max learning rates were used for different model sizes.
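The bullets above describe a fairly standard large-scale recipe. The sketch below shows what such a setup looks like in PyTorch (AdamW, gradient-norm clipping, linear decay after warmup); the model, loss, and every hyperparameter value here are placeholders, not the values used for any particular Whisper model.

```python
# Hedged sketch of the reported optimization recipe: AdamW + gradient-norm
# clipping + linear learning-rate decay after a warmup period.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual encoder-decoder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

warmup_steps, total_steps = 2_048, 1_000_000  # placeholder schedule lengths

def lr_lambda(step):
    # Linear warmup to the max LR, then linear decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()
    return loss.item()
```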

Hyperparameters

General Hyperparameters

Hyperparameters for Whisper Large V2

Model Learning Rates

 

 

Summary

SPE-54: Keyword Spotting


Unified Speculation, Detection, and Verification Keyword Spotting

Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang, Amazon Alexa Science

 


Problem

 

- Accurate and timely recognition of the trigger keyword is vital.

- There is a trade-off between accuracy and latency.

 

Proposed method

 

- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword detection model.

- We propose a latency-aware max-pooling loss, and show empirically that it teaches a model to maximize accuracy under a latency constraint.

- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves different accuracy-latency trade-offs across the three tasks.

 

 

 

1. Unified speculation, detection, and verification model

- Speculation makes an early decision, which can be used to give a head-start to downstream processes on the device.

- Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context.

- Verification verifies the previous decision by observing additional audio after the keyword span.

 

2. Model architecture and training strategy

- CRNN architecture

- Multi-task learning with different target latencies using the newly proposed latency-aware max-pooling loss (a rough sketch of such a loss follows).
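The sketch below illustrates one plausible reading of a latency-aware max-pooling loss: for positive examples, cross-entropy is applied only at the best-scoring frame inside an allowed latency window around the end of the keyword, and for negative examples at the most keyword-like frame. The window definition, reduction, and tensor layout are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a latency-aware max-pooling loss for keyword spotting.
import torch
import torch.nn.functional as F

def latency_aware_max_pool_loss(logits, is_keyword, keyword_end, max_latency):
    """
    logits:      (B, T, 2) per-frame scores for [non-keyword, keyword]
    is_keyword:  (B,) bool, whether the clip contains the keyword
    keyword_end: (B,) long, frame index where the keyword ends
    max_latency: int, frames after keyword_end within which the model may fire
    """
    B, T, _ = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)
    losses = []
    for b in range(B):
        if bool(is_keyword[b]):
            start = int(keyword_end[b])
            end = min(T, start + max_latency + 1)
            window = log_probs[b, start:end, 1]       # keyword log-probs inside the latency window
            losses.append(-window.max())              # reward one confident frame within the window
        else:
            losses.append(-log_probs[b, :, 0].min())  # penalize the most keyword-like frame
    return torch.stack(losses).mean()
```

Training the same network with several target latencies (e.g., an earlier window for speculation, the keyword end for detection, and a later window for verification) would give the multi-task accuracy/latency behaviour described above.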


Temporal Early Exiting for Streaming Speech Commands Recognition

Comcast Applied AI, University of Waterloo

 


Problem

 

Voice queries take time to process: 

 

Stage 1: The user is speaking (seconds). 

Stage 2: Finish ASR transcription (~50ms). 

Stage 3: Information retrieval (~500ms).

 

 

 

Proposed method

 

- Use a streaming speech commands model for the top-K voice queries.

- Apply a training objective that enables better early exiting across time, i.e., return a prediction before the entire audio is observed.

- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off.

 

Model

- GRU Model

- Per-frame output probability distribution over K commands (classes).

 

Early-Exiting Objectives

 

Connectionist temporal classification (CTC): the standard CTC loss over the per-frame outputs.

Last-frame cross entropy (LF): cross-entropy applied only to the prediction at the final frame.

All-frame cross entropy (AF): cross-entropy applied to the prediction at every frame (a sketch of these objectives follows).
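The sketch below is an illustrative PyTorch rendering of the setup above: a streaming GRU that emits a per-frame distribution over K commands, the LF and AF objectives, and confidence-threshold early exiting at inference. Layer sizes, feature dimensions, and the threshold value are assumptions.

```python
# Per-frame GRU classifier with last-frame (LF) and all-frame (AF) objectives
# and a simple confidence-threshold early exit at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingGRUClassifier(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_commands=35):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_commands)

    def forward(self, feats):                 # feats: (B, T, n_mels)
        h, _ = self.gru(feats)
        return self.head(h)                   # (B, T, n_commands) per-frame logits

def lf_loss(logits, labels):                  # supervise only the final frame
    return F.cross_entropy(logits[:, -1, :], labels)

def af_loss(logits, labels):                  # supervise every frame equally
    B, T, K = logits.shape
    return F.cross_entropy(logits.reshape(B * T, K), labels.repeat_interleave(T))

def early_exit(model, feats, threshold=0.9):
    """Return (prediction, exit frame) as soon as confidence exceeds the threshold."""
    probs = F.softmax(model(feats), dim=-1)   # (1, T, K); a real system would run frame by frame
    for t in range(probs.shape[1]):
        conf, cls = probs[0, t].max(dim=-1)
        if conf >= threshold:
            return int(cls), t
    return int(probs[0, -1].argmax()), probs.shape[1] - 1
```

Supervising every frame (AF) is what makes intermediate predictions reliable enough for early exiting, which matches Finding 1 below.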

 

Findings

 

1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].

2. The observed exit indices correlate with the optimal exit indices across all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one.


Self-supervised Learning for Speech and Audio Processing I

Technical Program Session MLSP-3

 


UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS

 

Verily Life Sciences, Boston, USA and Mountain View, California, USA

 


Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96% the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.

 

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747197
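As a rough illustration of the evaluation protocol described in the abstract (simple linear classifiers on top of a time-averaged representation), one could do something like the sketch below. The feature extractor, dimensions, and task labels are placeholders; the actual representations come from the 600M+ parameter Conformer model.

```python
# Hedged sketch: time-average pretrained frame-level features, then fit a
# linear classifier per downstream paralinguistic task.
import numpy as np
from sklearn.linear_model import LogisticRegression

def time_averaged_embedding(frame_features):      # (T, D) -> (D,)
    return frame_features.mean(axis=0)

# Placeholder data: one (T, D) feature matrix per utterance plus task labels.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(rng.integers(50, 200), 1024)) for _ in range(100)]
labels = rng.integers(0, 4, size=100)             # e.g., 4 emotion classes

X = np.stack([time_averaged_embedding(u) for u in utterances])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```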

-

 

Proposed method

-

Key Findings

 


 

A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION

NEL-SLIP, University of Science and Technology of China (USTC), Hefei, China

 


Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It was shown that wav2vec2.0 has a good robustness against the domain shift, while the noise robustness is still unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve the ASR performance on the noisy test set, which however brings a performance degradation on the clean test set. To avoid this issue, in this work we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and the corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method can not only improve the ASR performance on the noisy test set which surpasses the original wav2vec2.0, but also ensure a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.

 

https://ieeexplore.ieee.org/document/9747379
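As a rough illustration of the idea described in the abstract above (the noisy utterance and its clean counterpart pass through the same feature encoder, with the clean speech providing the training targets), one could write a consistency-style loss as below. The regression loss and stop-gradient are assumptions; the paper's actual target construction (e.g., for the contrastive task) may differ.

```python
# Hedged sketch: clean-speech encoder outputs serve as targets for the noisy branch.
import torch
import torch.nn.functional as F

def noise_robust_consistency_loss(encoder, noisy_wave, clean_wave):
    noisy_feat = encoder(noisy_wave)          # gradients flow through the noisy branch
    with torch.no_grad():
        clean_feat = encoder(clean_wave)      # same encoder; clean branch provides targets
    return F.mse_loss(noisy_feat, clean_feat)
```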

 


AN ADAPTER BASED PRE-TRAINING FOR EFFICIENT AND SCALABLE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING

Huawei R&D UK, University of Oxford

 


https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747374

 


CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS

University of Wroclaw, Poland, NavAlgo, France, NVIDIA, Poland, Universite de Toulon, France


We identify a performance trade-off between the tasks of phoneme categorization and phoneme and word segmentation in several self-supervised learning algorithms based on Contrastive Predictive Coding (CPC). Our experiments suggest that context building networks, albeit necessary for high performance on categorization tasks, harm segmentation performance by causing a temporal shift on the learned representations. Aiming to tackle this trade-off, we take inspiration from the leading approaches on segmentation and propose multi-level Aligned CPC (mACPC). It builds on Aligned CPC (ACPC), a variant of CPC which exhibits the best performance on categorization tasks, and incorporates multi-level modeling and optimization for detection of spectral changes. Our methods improve in all tested categorization metrics and achieve state-of-the-art performance in word segmentation.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746102

 


 

CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING

National Taiwan University, The Chinese University of Hong Kong


SUPERB

 

A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747242


 


There have been various generic and language-specific approaches to sub-word segmentation for handling the OOV problem in machine translation and ASR tasks. 

 

Various subword units, such as phonemes, syllables, characters, morphemes, and their combinations, have been used in different approaches to subword modelling, and both generic and language-specific approaches exist. Listed below are some of the major sub-word segmentation approaches. One of the earlier approaches for ASR was Korean syllable-based segmentation [8]. Other early language-specific approaches targeted German LVCSR [10] and Polish [11]. A morpheme-based OOV handling approach was applied to the Turkish ASR keyword spotting task [9] and to multiple languages [12]. 

 

Popular recent approaches to unsupervised segmentation:

Both the Byte Pair Encoding (BPE) and WordPiece algorithms work by merging adjacent characters.

 

BPE: the merge pair is chosen based on the frequency of the adjacent symbol pair.

WordPiece: the merge is chosen to maximize the likelihood of the training data.

Unigram and BPE-Dropout [14] are sub-word segmentation regularization techniques.
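The snippet below illustrates the frequency-based merge rule that characterizes BPE (WordPiece would instead pick the merge that maximizes training-data likelihood). It is a toy illustration in the spirit of the original BPE algorithm, not any particular library's implementation.

```python
# Toy BPE: repeatedly merge the most frequent pair of adjacent symbols.
import re
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                           # perform 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                               # the learned merge rules, in order
```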

 

Libraries implementing segmentation algorithms

SentencePiece,

subword-nmt [16],

Morfessor [17],

MorphAGram [15].

 

 

[1] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages, ICASSP 2022

[7] M. Huck, S. Riess, and A. Fraser, “Target-side Word Segmentation Strategies for Neural Machine Translation”, in Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 56–67, Copenhagen, Denmark, 2017.

[8] D. Kiecza, T. Schultz and A. Waibel, “Data-Driven Determination of Appropriate Dictionary Units for Korean LVCSR”, in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[9] Y. He, B. Hutchinson, P. Baumann, M. Ostendorf, E. Fosler-Lussier, and J. Pierrehumbert, “Subword-Based Modeling For Handling OOV Words In Keyword Spotting”, in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Italy, 2014.

[10] A. El-Desoky Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney, “Sub-Lexical Language Models For German LVCSR”, in Proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), 2010.

[11] M.A.B. Shaik, A.E.-D. Mousa, R. Schluter, and H. Ney, “Using morpheme and syllable based sub-words for Polish LVCSR”, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4680–4683, 2011.

[12] M. Creutz, T. Hirsimäki, M. Kurimo, A. Puurula, “Morph-based speech recognition and modeling of out-of-vocabulary words across languages” in ACM Transactions on Speech and Language Processing (TSLP). 5(1):3, 2007

[13] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[14] I. Provilkov, D. Emelianenko and E. Voita, “BPE-Dropout: Simple and Effective Subword Regularization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, 2020.

[15] R. Eskander, F. Callejas, E. Nichols, J. Klavans, and S. Muresan, “MorphAGram: Evaluation and Framework for Unsupervised Morphological Segmentation”, in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7112–7122, 2020.

[16] “Subword-nmt”, Available at: https://github.com/rsennrich/subword-nmt [Accessed : 10 January, 2021]

[17] “Morfessor”, Available at: https://github.com/aaltospeech/morfessor [Accessed : 10 January, 2021].

 

 
