There are generally three ways to perform text-only adaptation:
Injecting synthesized speech data into the model
generate audio for the training texts via TTS and inject it into the model
LM fusion
Fusion and biasing (shallow fusion):
during decoding, interpolate the posterior word probabilities with text priors from an external LM (see the scoring sketch after this list)
a more recent approach estimates the internal LM probabilities and discounts them, i.e., rescores with the ratio of external to internal LM probabilities
Rescoring and reranking
after decoding, use a powerful external LM to update the scores and rerank the n-best results or the recognition lattice
These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight used for the external LM.
Explicit separation of internal LMs
force the E2E decoder/predictor to behave more like a language model (e.g., the hybrid autoregressive transducer (HAT), the modular hybrid autoregressive transducer, and the factorized transducer)
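As a rough sketch of the two fusion variants above (not taken from [1] or [2]; the weights λ_ext and λ_int are the interpolation hyperparameters mentioned earlier), the decoding scores can be written as:

```latex
% Shallow fusion: interpolate E2E posteriors with an external LM prior
\hat{y} = \arg\max_{y} \Big[ \log P_{\mathrm{E2E}}(y \mid x) + \lambda_{\mathrm{ext}} \log P_{\mathrm{ext}}(y) \Big]

% Internal LM estimation/subtraction: additionally discount the E2E model's implicit text prior
\hat{y} = \arg\max_{y} \Big[ \log P_{\mathrm{E2E}}(y \mid x) - \lambda_{\mathrm{int}} \log P_{\mathrm{int}}(y) + \lambda_{\mathrm{ext}} \log P_{\mathrm{ext}}(y) \Big]
```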
Reference
[1] External Language Model Integration for Factorized Neural Transducers
[2] In-situ Text-only Adaptation of Speech Models with Low-overhead Speech Imputations
This paper presents a method for jointly pre-training speech and text in an encoder-decoder framework to improve performance in speech translation and recognition tasks.
Key Takeaways:
Architecture: The method utilizes an Attention-based Encoder-Decoder (AED) framework to integrate data from different modalities (speech and text) for representation learning.
Shared Encoder and Decoder: The STPT framework uses a shared encoder and decoder for both the speech and text modalities, which allows the model to integrate knowledge from both domains.
Acoustic and Linguistic Representation Learning: The STPT framework is designed to learn both acoustic features from speech and linguistic features from text during the pre-training stage. This is crucial for speech translation models, which must understand the sounds of speech as well as the meaning of words.
Joint Pre-Training Phase (Multi-Task Learning): The proposed Speech and Text joint Pre-Training (STPT) framework integrates different pre-training tasks to build a robust model capable of handling multiple aspects of speech and language; it incorporates four self-supervised and supervised subtasks designed for cross-modality learning.
Text-to-Text (T2T): This self-supervised task helps the model learn linguistic patterns in the text. It's similar to how models like BERT learn by predicting masked words in a sentence.
Speech SSL learning (SSL): This is another self-supervised task focused on learning from the speech data alone, likely involving predicting some masked or hidden parts of the speech input.
Speech-to-Phoneme (S2P): A supervised task where the model is trained to predict phoneme units from speech data. Phonemes are the smallest units of sound in a language, so this task helps the model learn the sounds that make up speech.
Speech-to-Subword (S2T): Also a supervised task, where the model learns to predict subword units from the speech input. Subwords are larger than phonemes and can carry more linguistic information, like syllables or parts of words.
Loss Functions: Pre-training is guided by a loss function for each of the four tasks:
L_T2T: The loss for the Text-to-Text task.
L_SSL: The loss for the speech SSL task, which involves masked prediction.
L_S2P: The loss for the Speech-to-Phoneme task, which involves phoneme-unit sequence classification.
L_S2T: The loss for the Speech-to-Subword task, involving sequential prediction of subword tokens.
Final Loss: The overall pre-training objective is a combination of these losses, guiding the model to learn both modality-specific and cross-modal representations (see the sketch after this list).
Improved Performance: The STPT method effectively fuses speech and text information into one model, leading to significant improvements in performance. It achieves 1.7 to 2.3 BLEU score improvements on the MUST-C speech translation dataset and comparable word error rates (WERs) to the wav2vec 2.0 model on the LibriSpeech speech recognition task.
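A hedged sketch of the combined pre-training objective; the per-task weights α_i are placeholders, and the paper's exact weighting may differ:

```latex
\mathcal{L}_{\mathrm{final}} = \alpha_1 \mathcal{L}_{\mathrm{T2T}} + \alpha_2 \mathcal{L}_{\mathrm{SSL}} + \alpha_3 \mathcal{L}_{\mathrm{S2P}} + \alpha_4 \mathcal{L}_{\mathrm{S2T}}
```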
This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.
Key Takeaways:
Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs.
Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed model is designed to connect the representations of speech and text through a shared unit encoder. It allows for pre-training with unpaired speech and text data, which can be beneficial for tasks like automatic speech recognition (ASR) and speech translation (ST). SpeechUT is a new pre-training method using hidden-unit representations to connect speech encoders and text decoders.
Discrete Representation (Units): SpeechUT leverages hidden-unit representations as an interface to align speech and text. This is done by decomposing the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately with large amounts of unpaired data. The model uses discrete unit sequences produced by off-line generators, allowing pre-training on large-scale unpaired speech and text (see the sketch after this list).
Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained with the mentioned tasks and fine-tuned for specific ASR and ST tasks.
Pre-Training Tasks: SpeechUT includes three unsupervised pre-training tasks: speech-to-unit, masked unit modeling, and unit-to-text.
Fine-Tuning: For downstream tasks like ASR and ST, SpeechUT is fine-tuned without introducing new parameters, utilizing the pre-trained modules.
Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
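To make the discrete-unit interface concrete, here is a minimal sketch of an off-line unit generator of the kind described above: cluster frame-level speech features with k-means and map each frame to its cluster ID. The random features, the number of clusters, and the de-duplication step are illustrative assumptions, not the SpeechUT recipe.

```python
# Hedged sketch of an off-line unit generator: k-means over frame-level features,
# then each frame is mapped to its cluster ID to form a discrete "unit" sequence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))   # stand-in for frame-level SSL features (T, D)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
unit_ids = kmeans.predict(features)       # (T,) discrete unit sequence

# Collapse consecutive repeats, as is common when treating units as pseudo-text.
deduped = [int(u) for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]
print(deduped[:20])
```

In practice the features would come from a pre-trained speech encoder rather than random numbers, and the resulting unit vocabulary is what the shared unit encoder consumes.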
LibriSpeech
Description: A corpus of read English speech derived from audiobooks from the LibriVox project, carefully segmented and aligned, with a sampling rate of 16 kHz.
Fisher Corpus
Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
Wall Street Journal (WSJ0 and WSJ1)
Description: A corpus of read speech with texts drawn from Wall Street Journal news text, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
National Speech Corpus (Part 1, Part 6)
Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
VCTK
Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
VoxPopuli (EN)
Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
Europarl-ASR (EN)
Data Size: 1300 hours of English-language annotated speech data.
Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
TED-LIUM 3
Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
AMI Meeting Corpus
Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
English Broadcast News
Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.
Whisper is a Transformer-based encoder-decoder model.
Training Data
Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.
Whisper ASR V1 and V2
Trained on 680,000 hours of audio and corresponding transcripts from the internet.
Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.
Whisper ASR V3
Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
V3 shows a 10% to 20% reduction in errors compared to V2.
Training Details
Initial models were trained with the AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period (see the sketch below).
No data augmentation or regularization was used initially due to the diversity and size of the dataset.
For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
Different max learning rates were used for different model sizes.
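A minimal, hedged sketch of that optimization recipe in PyTorch (AdamW, gradient norm clipping, linear LR decay after warmup). The toy model, data, peak learning rate, and step counts are placeholders, not the actual Whisper training configuration.

```python
import torch

model = torch.nn.Linear(80, 512)                      # stand-in for the real encoder-decoder
peak_lr, warmup_steps, total_steps = 1e-3, 100, 1000  # placeholder schedule values

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step):
    if step < warmup_steps:                           # linear warmup to the peak LR
        return (step + 1) / warmup_steps
    # then linear decay from the peak LR down to zero
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
loss_fn = torch.nn.MSELoss()

for step in range(200):                               # dummy training loop with random data
    x, target = torch.randn(16, 80), torch.randn(16, 512)
    loss = loss_fn(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()
```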
- Accurate and timely recognition of the trigger keyword is vital.
- A trade-off must be made between accuracy and latency.
Proposed method
- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword detection model.
- We propose a latency-aware max-pooling loss and show empirically that it teaches the model to maximize accuracy under a latency constraint.
- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves different accuracy-latency trade-offs across the three tasks.
1. Unified speculation, detection, and verification model
- Speculation makes an early decision, which can be used to give a head-start to downstream processes on the device.
- Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context.
- Verification verifies the previous decision by observing even more audio after the keyword span.
2. Model architecture and training strategy
- CRNN architecture
- Multi-task learning with different target latencies using the newly proposed latency-aware max-pooling loss (sketched below).
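A hedged sketch of what a latency-aware max-pooling loss could look like, based only on the description above; the two-class per-frame output, shapes, and windowing rule are illustrative assumptions, not the authors' implementation.

```python
# For positive utterances, apply cross entropy at the frame with the highest keyword
# posterior within a latency budget after the keyword ends; for negative utterances,
# apply it at the frame most likely to false-trigger.
import torch
import torch.nn.functional as F

def latency_aware_max_pooling_loss(logits, is_keyword, keyword_end, max_latency_frames):
    """logits: (T, 2) per-frame logits, class 1 = keyword."""
    posteriors = logits.softmax(dim=-1)[:, 1]
    if is_keyword:
        limit = min(len(posteriors), keyword_end + max_latency_frames + 1)
        frame = torch.argmax(posteriors[:limit])   # best frame within the latency budget
        target = torch.tensor([1])
    else:
        frame = torch.argmax(posteriors)           # frame most likely to false-trigger
        target = torch.tensor([0])
    return F.cross_entropy(logits[frame].unsqueeze(0), target)

# Toy usage: a positive clip whose keyword ends at frame 60, with a 10-frame budget.
logits = torch.randn(100, 2, requires_grad=True)
loss = latency_aware_max_pooling_loss(logits, is_keyword=True, keyword_end=60, max_latency_frames=10)
loss.backward()
```

One plausible way to realize the multi-task setup above is to train the three task heads with different latency budgets: a zero or negative budget for speculation, a small positive budget for detection, and a larger one for verification.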
Temporal Early Exiting for Streaming Speech Commands Recognition
Comcast Applied AI, University of Waterloo
Problem
Voice queries take time to process:
Stage 1: The user is speaking (seconds).
Stage 2: Finish ASR transcription (~50ms).
Stage 3: Information retrieval (~500ms).
Proposed method
- Use a streaming speech commands model for the top-K voice queries.
- Apply a training objective that enables better early exiting across time, i.e., returning a prediction before the entire audio is observed.
- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off.
Model
- GRU Model
- Per-frame output probability distribution over K commands (classes).
Early-Exiting Objectives
Connectionist temporal classification (CTC): the standard alignment-free sequence loss over the per-frame outputs.
Last-frame cross entropy (LF): cross entropy between the utterance label and the prediction at the final frame only.
All-frame cross entropy (AF): cross entropy between the utterance label and the prediction at every frame, averaged over time.
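A minimal sketch (not the authors' code) of a per-frame GRU classifier with the LF and AF objectives and confidence-threshold early exiting; the feature dimensions, number of classes, and the 0.9 threshold are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

class StreamingGRUClassifier(torch.nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=10):
        super().__init__()
        self.gru = torch.nn.GRU(n_mels, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x):                         # x: (B, T, n_mels)
        h, _ = self.gru(x)
        return self.head(h)                       # per-frame logits: (B, T, K)

def last_frame_loss(logits, labels):              # LF: supervise only the final frame
    return F.cross_entropy(logits[:, -1], labels)

def all_frame_loss(logits, labels):               # AF: supervise every frame with the utterance label
    B, T, K = logits.shape
    return F.cross_entropy(logits.reshape(B * T, K), labels.repeat_interleave(T))

def early_exit(frame_logits, threshold=0.9):      # frame_logits: (T, K) for one utterance
    conf, pred = frame_logits.softmax(-1).max(-1)
    for t in range(conf.shape[0]):
        if conf[t] >= threshold:                  # exit at the first sufficiently confident frame
            return t, pred[t].item()
    return conf.shape[0] - 1, pred[-1].item()

model = StreamingGRUClassifier()
x, y = torch.randn(4, 50, 40), torch.randint(0, 10, (4,))
logits = model(x)
print(last_frame_loss(logits, y).item(), all_frame_loss(logits, y).item())
print(early_exit(logits[0]))
```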
Findings
1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].
2. The observed indices correlate with the optimal indices for all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one does.
Self-supervised Learning for Speech and Audio Processing I
Technical Program Session MLSP-3
UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS
Verily Life Sciences, Boston, USA and Mountain View, California, USA
Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION
NEL-SLIP, University of Science and Technology of China (USTC), Hefei, China
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It has been shown that wav2vec2.0 is robust to domain shift, but its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve ASR performance on a noisy test set, but at the cost of performance degradation on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and its corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method not only improves ASR performance on the noisy test set, surpassing the original wav2vec2.0, but also incurs only a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS
University of Wroclaw, Poland; NavAlgo, France; NVIDIA, Poland; Université de Toulon, France
We identify a performance trade-off between the tasks of phoneme categorization and phoneme and word segmentation in several self-supervised learning algorithms based on Contrastive Predictive Coding (CPC). Our experiments suggest that context building networks, albeit necessary for high performance on categorization tasks, harm segmentation performance by causing a temporal shift on the learned representations. Aiming to tackle this trade-off, we take inspiration from the leading approaches on segmentation and propose multi-level Aligned CPC (mACPC). It builds on Aligned CPC (ACPC), a variant of CPC which exhibits the best performance on categorization tasks, and incorporates multi-level modeling and optimization for detection of spectral changes. Our methods improve in all tested categorization metrics and achieve state-of-the-art performance in word segmentation.
CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING
National Taiwan University, The Chinese University of Hong Kong
SUPERB
A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data, has fueled the research for speech representation learning. SUPERB demonstrates that speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of a self-supervised upstream model followed by downstream tasks attracts more attention in the speech community, characterizing the adversarial robustness of such a paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of this paradigm under attacks from both zero-knowledge and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and that the attacks generated by zero-knowledge adversaries are transferable. An XAB test verifies the imperceptibility of the crafted adversarial attacks.