This paper presents a method for jointly pre-training speech and text in an encoder-decoder framework to improve performance in speech translation and recognition tasks. 

 

 

Key Takeaways:

  1. Architecture: The method uses an Attention-based Encoder-Decoder (AED) framework to integrate data from different modalities (speech and text) for representation learning.
    • Shared Encoder and Decoder: The STPT framework uses a shared encoder and decoder for both the speech and text modalities, which allows the model to integrate knowledge from both domains.
  2. Acoustic and Linguistic Representation Learning: The STPT framework is designed to learn both acoustic features from speech and linguistic features from text during the pre-training stage. This is crucial for speech translation models, which must understand the sounds of speech as well as the meaning of words.
  3. Joint Pre-Training via Multi-Task Learning: The framework integrates different pre-training tasks to build a robust model capable of handling multiple aspects of speech and language. The proposed Speech and Text joint Pre-Training (STPT) framework incorporates four self-supervised and supervised subtasks designed for cross-modality learning.
    • Text-to-Text (T2T): This self-supervised task helps the model learn linguistic patterns in the text. It's similar to how models like BERT learn by predicting masked words in a sentence.
    • Speech SSL learning (SSL): This is another self-supervised task focused on learning from the speech data alone, likely involving predicting some masked or hidden parts of the speech input.
    • Speech-to-Phoneme (S2P): A supervised task where the model is trained to predict phoneme units from speech data. Phonemes are the smallest units of sound in a language, so this task helps the model learn the sounds that make up speech.
    • Speech-to-Subword (S2T): Also a supervised task, where the model learns to predict subword units from the speech input. Subwords are larger than phonemes and can carry more linguistic information, like syllables or parts of words.
  4. Loss Functions: Pretraining is guided by different loss functions corresponding to the various tasks:
    • LT2T: The loss for the Text-to-Text task.
    • LSSL: The loss for the Speech SSL learning task, which involves masked prediction.
    • LS2P: The loss for the Speech-to-Phoneme task, which involves phoneme-unit sequence classification.
    • LS2T: The loss for the Speech-to-Subword task, involving sequential prediction of subword tokens.
    • Final Loss: The overall objective for the pre-training phase is a combination of these losses, guiding the model to learn both modality-specific and cross-modal representations (one possible formulation is sketched after this list).
  5. Improved Performance: The STPT method effectively fuses speech and text information into one model, leading to significant improvements in performance. It achieves 1.7 to 2.3 BLEU score improvements on the MuST-C speech translation dataset and comparable word error rates (WERs) to the wav2vec 2.0 model on the LibriSpeech speech recognition task.
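Putting the takeaways above together, the overall pre-training objective can be written as a weighted sum of the four subtask losses. The formulation below is only a sketch based on this summary; whether the paper uses explicit task weights or a plain sum is an assumption:

$$\mathcal{L}_{\text{pre-train}} \;=\; \lambda_{1}\,\mathcal{L}_{\text{T2T}} \;+\; \lambda_{2}\,\mathcal{L}_{\text{SSL}} \;+\; \lambda_{3}\,\mathcal{L}_{\text{S2P}} \;+\; \lambda_{4}\,\mathcal{L}_{\text{S2T}}$$

where the \(\lambda_{k}\) are task-weighting hyperparameters (setting them all to 1 recovers a simple sum).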

 

 

 

This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.

 

 

Key Takeaways:

  1. Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
  2. Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs (a rough sketch of this decomposition follows the list).
  3. Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed model is designed to connect the representations of speech and text through a shared unit encoder. It allows for pre-training with unpaired speech and text data, which can be beneficial for tasks like automatic speech recognition (ASR) and speech translation (ST). SpeechUT is a new pre-training method using hidden-unit representations to connect speech encoders and text decoders.
  4. Discrete Representation (Units): SpeechUT leverages hidden-unit representations as an interface to align speech and text. This is done by decomposing the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately with large amounts of unpaired data. The model uses discrete unit sequences produced by off-line generators, allowing for the pre-training of large-scale unpaired speech and text.
  5. Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
  6. Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained with the mentioned tasks and fine-tuned for specific ASR and ST tasks.
    1. Pre-Training Tasks: SpeechUT is pre-trained with the three unsupervised tasks introduced above (S2U, MUM, and U2T).
    2. Fine-Tuning: For downstream tasks like ASR and ST, SpeechUT is fine-tuned without introducing new parameters, utilizing the pre-trained modules.
  7. Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
  8. Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
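The sketch below is an illustrative PyTorch rendering of the decomposition described above: a speech pre-net/encoder, a shared unit encoder, and a text decoder, with the discrete units acting as the interface. It is not the authors' implementation; layer sizes, the number of units, and the two forward paths are assumptions, and masked unit modeling and embedding mixing are omitted.

```python
# Illustrative sketch of a SpeechUT-style decomposition: a speech-to-unit
# path and a unit-to-text path that share the same unit encoder.
import torch
import torch.nn as nn

class SpeechUnitTextSketch(nn.Module):
    def __init__(self, n_units=500, vocab_size=10000, d_model=256):
        super().__init__()
        # Speech pre-net + speech encoder: filterbank frames -> continuous states
        self.speech_prenet = nn.Sequential(nn.Linear(80, d_model), nn.ReLU())
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Unit pre-net (embedding) + shared unit encoder: the speech/text interface
        self.unit_embedding = nn.Embedding(n_units, d_model)
        self.unit_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Text decoder predicting subword tokens
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.unit_head = nn.Linear(d_model, n_units)     # speech-to-unit (S2U) targets
        self.text_head = nn.Linear(d_model, vocab_size)  # unit-to-text (U2T) targets

    def speech_to_unit(self, fbank):
        """S2U path: predict offline-generated discrete units from speech frames."""
        h = self.speech_encoder(self.speech_prenet(fbank))
        h = self.unit_encoder(h)              # shared encoder sees the speech side
        return self.unit_head(h)              # (B, T, n_units) per-frame unit logits

    def unit_to_text(self, units, text_in):
        """U2T path: encode a discrete unit sequence, then decode subword tokens."""
        memory = self.unit_encoder(self.unit_embedding(units))  # shared encoder sees the unit side
        dec = self.text_decoder(self.text_embedding(text_in), memory)
        return self.text_head(dec)            # (B, L, vocab_size) token logits
```

Fine-tuning for ASR/ST would then reuse the speech encoder, unit encoder, and text decoder end to end, which is consistent with the "no new parameters at fine-tuning" point above.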

 

 




  1. OWSM v1, v2, and v3: refer to the paper for details
    • OWSM v1
      • AISHELL-1 [23],
      • CoVoST2 [24],
      • GigaSpeech [25],
      • LibriSpeech [26],
      • MuST-C [27],
      • SPGISpeech [28]
      • TEDLIUM3 [29].
    • OWSM v2
      • builds upon v1 and includes additional datasets:
      • GigaST [30]
      • Multilingual LibriSpeech [31]
      • WenetSpeech [32].
    • OWSM v3
      • extends v2 with even more datasets:
      • AIDATATANG [33],
      • AMI [34],
      • Babel [35],
      • Common Voice [36],
      • Fisher (Switchboard) [37],
      • Fisher Callhome Spanish [38],
      • FLEURS [39],
      • Google i18n,
      • KsponSpeech [40],
      • MagicData [41],
      • ReazonSpeech [42],
      • Russian Open STT [43],
      • VCTK [44],
      • VoxForge [45],
      • VoxPopuli [46],
      • WSJ [47].
  2. NeMo-Public dataset
    • Librispeech
    • Fisher Corpus
    • Switchboard-1 Dataset
    • WSJ-0 and WSJ-1
    • National Speech Corpus (Part 1, Part 6)
    • VCTK
    • VoxPopuli (EN)
    • Europarl-ASR (EN)
    • Multilingual Librispeech (MLS EN) - 2,000 hrs subset
    • Mozilla Common Voice (v8.0)
    • People's Speech - 12,000 hrs subset
  3. SpeechStew
    • Librispeech
    • Common Voice v8.0
    • TED-LIUM v3
    • AMI
    • English Broadcast News
    • WSJ0 and WSJ1

 

 

| Dataset Name | Data Size (Hours) | Source | Description |
| --- | --- | --- | --- |
| LibriSpeech | 960 | LibriSpeech | A corpus of read English speech from the LibriVox project, segmented and aligned, 16 kHz sampling rate. |
| Fisher Corpus | Part 1: 984, Total: >2000 | Fisher Corpus Part 1 Transcripts | Spontaneous telephone conversations in English, recorded for linguistic research. |
| Switchboard-1 Dataset | 260 | Switchboard-1 | English telephone conversations, collected under DARPA sponsorship. |
| WSJ-0 and WSJ-1 | 80 | WSJ0 | Read speech from Wall Street Journal news text, for large-vocabulary CSR systems. |
| National Speech Corpus | Not specified (1.2 TB) | National Speech Corpus | Singapore English corpus for ASR research, to improve accuracy for locally accented English. |
| VCTK | 44 (109 speakers) | VCTK | Text-to-speech research; audio of speakers with various accents reading different texts. |
| VoxPopuli (EN) | 543 (part of 1.8K transcribed) | VoxPopuli | Multilingual corpus for unsupervised and semi-supervised learning. |
| Europarl-ASR (EN) | 1300 | Europarl-ASR | Parliamentary debates for ASR training, with official transcripts from the European Parliament. |
| Multilingual LibriSpeech (MLS EN) | 2,000 (subset of 44.5K) | MLS EN | Derived from LibriVox project audiobooks, for speech research in multiple languages. |
| Mozilla Common Voice (v8.0) | 16,000 (ongoing project) | Mozilla Common Voice | Multilingual read speech corpus for building voice technologies, contributed by volunteers. |
| People's Speech | 12,000 | Not found | English speech corpus for ASR model training. |
| TED-LIUM v3 | 452 | TED-LIUM 3 Dataset | Audio from TED Talks, including talks from TED-LIUM 2, with automatic transcripts. |
| AMI | 100 | AMI Corpus | Meeting recordings with various synchronized signals, including video and projector outputs. |
| English Broadcast News | 140 (plus 9,000 hours of TV shows) | English Broadcast News Speech Recognition by Humans and Machines | Wide-band signals from various speakers, different background noises, and news topics, with lightly supervised transcripts. |

  1. LibriSpeech
    • Data Size: 960 Hours
    • Source: LibriSpeech
    • Description: A corpus of read English speech derived from read audiobooks from the LibriVox project, carefully segmented and aligned, with a sampling rate of 16 kHz.
  2. Fisher Corpus
    • Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
    • Source: Fisher Corpus Part 1 Transcripts
    • Description: A collection of spontaneous telephone conversations in English between native speakers, recorded for linguistic research.
  3. Switchboard-1 Dataset
    • Data Size: 260 Hours
    • Source: Switchboard-1
    • Description: A corpus of English telephone conversations, collected under DARPA sponsorship and released by NIST and the LDC.
  4. WSJ-0 and WSJ-1
    • Data Size: 80 Hours
    • Source: WSJ0
    • Description: A corpus of read speech with texts drawn from Wall Street Journal news text, known as WSJ0 and WSJ1, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
  5. National Speech Corpus (Part 1, Part 6)
    • Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
    • Source: National Speech Corpus
    • Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
  6. VCTK
    • Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
    • Source: VCTK
    • Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
  7. VoxPopuli (EN)
    • Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
    • Source: VoxPopuli
    • Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
  8. Europarl-ASR (EN)
    • Data Size: 1300 hours of English-language annotated speech data.
    • Source: Europarl-ASR
    • Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
  9. Multilingual LibriSpeech (MLS EN) - 2,000 hrs subset
    • Data Size: 2,000 hours subset of a larger corpus with 44.5K hours of English.
    • Source: MLS EN
    • Description: A corpus derived from read audiobooks from the LibriVox project, suitable for speech research in multiple languages.
  10. Mozilla Common Voice (v8.0)
    • Data Size: 16,000 Hours (The size for v8.0 is not specified, but the project is ongoing with contributions from volunteers.)
    • Source: Mozilla Common Voice
    • Description: A multilingual corpus of read speech collected from volunteers across the globe for building voice-enabled technologies.
  11. People's Speech
    • Data Size: 12,000 Hours
    • Source: A specific link for the 12,000 hours subset was not found during the search.
    • Description: A large and diverse English speech corpus aimed at training ASR models.
  12. TED-LIUM v3
    • Data Size: 452 hours of audio
    • Source: TED-LIUM 3 Dataset
    • Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
  13. AMI
    • Data Size: 100 hours of meeting recordings
    • Source: AMI Corpus
    • Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
  14. English Broadcast News
    • Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
    • Source: English Broadcast News Speech Recognition by Humans and Machines
    • Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.

 

Model Overview

  • Whisper is a Transformer-based encoder-decoder model.

Training Data

  • Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.

Whisper ASR V1 and V2

  • Trained on 680,000 hours of audio and corresponding transcripts from the internet.
  • Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.

Whisper ASR V3

  • Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
  • V3 shows a 10% to 20% reduction in errors compared to V2.

Training Details

  • Initial models were trained with the AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period (a rough training-step sketch follows this list).
  • No data augmentation or regularization was used initially due to the diversity and size of the dataset.
  • For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
  • Different max learning rates were used for different model sizes.
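The bullets above describe a fairly standard large-scale recipe. The sketch below shows what such a setup looks like in PyTorch (AdamW, gradient-norm clipping, linear decay after warmup); the model, loss, and every hyperparameter value here are placeholders, not the values used for any particular Whisper model.

```python
# Hedged sketch of the reported optimization recipe: AdamW + gradient-norm
# clipping + linear learning-rate decay after a warmup period.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the actual encoder-decoder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

warmup_steps, total_steps = 2_048, 1_000_000  # placeholder schedule lengths

def lr_lambda(step):
    # Linear warmup to the max LR, then linear decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()
    return loss.item()
```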

Hyperparameters

General Hyperparameters

Hyperparameters for Whisper Large V2

Model Learning Rates

 

 

Summary

SPE-54: Keyword Spotting


Unified Speculation, Detection, and Verification Keyword Spotting

Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang, Amazon Alexa Science

 


Problem

 

- Accurate and timely recognition of the trigger keyword is vital.

- There is a trade-off between accuracy and latency.

 

Proposed method

 

- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword detection model.

- We propose a latency-aware max-pooling loss, and show empirically that it teaches a model to maximize accuracy under a latency constraint.

- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves different accuracy-latency trade-offs across the three tasks.

 

 

 

1. Unified speculation, detection, and verification model

- Speculation makes an early decision, which can be used to give a head-start to downstream processes on the device.

- Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context.

- Verification verifies the previous decision by observing additional audio after the keyword span.

 

2. Model architecture and training strategy

- CRNN architecture

- Multi-task learning with different target latencies using the newly proposed latency-aware max-pooling loss (a rough sketch of such a loss follows).
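The sketch below illustrates one plausible reading of a latency-aware max-pooling loss: for positive examples, cross-entropy is applied only at the best-scoring frame inside an allowed latency window around the end of the keyword, and for negative examples at the most keyword-like frame. The window definition, reduction, and tensor layout are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a latency-aware max-pooling loss for keyword spotting.
import torch
import torch.nn.functional as F

def latency_aware_max_pool_loss(logits, is_keyword, keyword_end, max_latency):
    """
    logits:      (B, T, 2) per-frame scores for [non-keyword, keyword]
    is_keyword:  (B,) bool, whether the clip contains the keyword
    keyword_end: (B,) long, frame index where the keyword ends
    max_latency: int, frames after keyword_end within which the model may fire
    """
    B, T, _ = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)
    losses = []
    for b in range(B):
        if bool(is_keyword[b]):
            start = int(keyword_end[b])
            end = min(T, start + max_latency + 1)
            window = log_probs[b, start:end, 1]       # keyword log-probs inside the latency window
            losses.append(-window.max())              # reward one confident frame within the window
        else:
            losses.append(-log_probs[b, :, 0].min())  # penalize the most keyword-like frame
    return torch.stack(losses).mean()
```

Training the same network with several target latencies (e.g., an earlier window for speculation, the keyword end for detection, and a later window for verification) would give the multi-task accuracy/latency behaviour described above.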


Temporal Early Exiting for Streaming Speech Commands Recognition

Comcast Applied AI, University of Waterloo

 


Problem

 

Voice queries take time to process: 

 

Stage 1: The user is speaking (seconds). 

Stage 2: Finish ASR transcription (~50ms). 

Stage 3: Information retrieval (~500ms).

 

 

 

Proposed method

 

- Use a streaming speech commands model for the top-K voice queries.

- Apply a training objective that enables better early exiting across time, i.e., return a prediction before the entire audio is observed.

- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off.

 

Model

- GRU Model

- Per-frame output probability distribution over K commands (classes).

 

Early-Exiting Objectives

 

Connectionist temporal classification (CTC): the standard CTC loss over the per-frame outputs.

Last-frame cross entropy (LF): cross-entropy applied only to the prediction at the final frame.

All-frame cross entropy (AF): cross-entropy applied to the prediction at every frame (a sketch of these objectives follows).
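The sketch below is an illustrative PyTorch rendering of the setup above: a streaming GRU that emits a per-frame distribution over K commands, the LF and AF objectives, and confidence-threshold early exiting at inference. Layer sizes, feature dimensions, and the threshold value are assumptions.

```python
# Per-frame GRU classifier with last-frame (LF) and all-frame (AF) objectives
# and a simple confidence-threshold early exit at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingGRUClassifier(nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_commands=35):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_commands)

    def forward(self, feats):                 # feats: (B, T, n_mels)
        h, _ = self.gru(feats)
        return self.head(h)                   # (B, T, n_commands) per-frame logits

def lf_loss(logits, labels):                  # supervise only the final frame
    return F.cross_entropy(logits[:, -1, :], labels)

def af_loss(logits, labels):                  # supervise every frame equally
    B, T, K = logits.shape
    return F.cross_entropy(logits.reshape(B * T, K), labels.repeat_interleave(T))

def early_exit(model, feats, threshold=0.9):
    """Return (prediction, exit frame) as soon as confidence exceeds the threshold."""
    probs = F.softmax(model(feats), dim=-1)   # (1, T, K); a real system would run frame by frame
    for t in range(probs.shape[1]):
        conf, cls = probs[0, t].max(dim=-1)
        if conf >= threshold:
            return int(cls), t
    return int(probs[0, -1].argmax()), probs.shape[1] - 1
```

Supervising every frame (AF) is what makes intermediate predictions reliable enough for early exiting, which matches Finding 1 below.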

 

Findings

 

1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].

2. The observed exit indices correlate with the optimal exit indices across all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one.


Self-supervised Learning for Speech and Audio Processing I

Technical Program Session MLSP-3

 


UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS

 

Verily Life Sciences, Boston, USA and Mountain View, California, USA

 


Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96% the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.

 

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747197
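As a rough illustration of the evaluation protocol described in the abstract (simple linear classifiers on top of a time-averaged representation), one could do something like the sketch below. The feature extractor, dimensions, and task labels are placeholders; the actual representations come from the 600M+ parameter Conformer model.

```python
# Hedged sketch: time-average pretrained frame-level features, then fit a
# linear classifier per downstream paralinguistic task.
import numpy as np
from sklearn.linear_model import LogisticRegression

def time_averaged_embedding(frame_features):      # (T, D) -> (D,)
    return frame_features.mean(axis=0)

# Placeholder data: one (T, D) feature matrix per utterance plus task labels.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(rng.integers(50, 200), 1024)) for _ in range(100)]
labels = rng.integers(0, 4, size=100)             # e.g., 4 emotion classes

X = np.stack([time_averaged_embedding(u) for u in utterances])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```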

-

 

Proposed method

-

Key Findings

 


 

A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION

NEL-SLIP, University of Science and Technology of China (USTC), Hefei, China

 


Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It was shown that wav2vec2.0 has a good robustness against the domain shift, while the noise robustness is still unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve the ASR performance on the noisy test set, which however brings a performance degradation on the clean test set. To avoid this issue, in this work we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and the corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method can not only improve the ASR performance on the noisy test set which surpasses the original wav2vec2.0, but also ensure a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.

 

https://ieeexplore.ieee.org/document/9747379
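As a rough illustration of the idea described in the abstract above (the noisy utterance and its clean counterpart pass through the same feature encoder, with the clean speech providing the training targets), one could write a consistency-style loss as below. The regression loss and stop-gradient are assumptions; the paper's actual target construction (e.g., for the contrastive task) may differ.

```python
# Hedged sketch: clean-speech encoder outputs serve as targets for the noisy branch.
import torch
import torch.nn.functional as F

def noise_robust_consistency_loss(encoder, noisy_wave, clean_wave):
    noisy_feat = encoder(noisy_wave)          # gradients flow through the noisy branch
    with torch.no_grad():
        clean_feat = encoder(clean_wave)      # same encoder; clean branch provides targets
    return F.mse_loss(noisy_feat, clean_feat)
```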

 


AN ADAPTER BASED PRE-TRAINING FOR EFFICIENT AND SCALABLE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING

Huawei R&D UK, University of Oxford

 


https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747374

 


CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS

University of Wroclaw, Poland, NavAlgo, France, NVIDIA, Poland, Universite de Toulon, France


We identify a performance trade-off between the tasks of phoneme categorization and phoneme and word segmentation in several self-supervised learning algorithms based on Contrastive Predictive Coding (CPC). Our experiments suggest that context building networks, albeit necessary for high performance on categorization tasks, harm segmentation performance by causing a temporal shift on the learned representations. Aiming to tackle this trade-off, we take inspiration from the leading approaches on segmentation and propose multi-level Aligned CPC (mACPC). It builds on Aligned CPC (ACPC), a variant of CPC which exhibits the best performance on categorization tasks, and incorporates multi-level modeling and optimization for detection of spectral changes. Our methods improve in all tested categorization metrics and achieve state-of-the-art performance in word segmentation.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746102

 


 

CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING

National Taiwan University, The Chinese University of Hong Kong


SUPERB

 

A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747242


 


There have been various generic and language-specific approaches to sub-word segmentation for handling the OOV problem in machine translation and ASR tasks. 

 

Various subword units, such as phonemes, syllables, characters, morphemes, and their combinations, have been used in different approaches to subword modelling, and both generic and language-specific approaches exist. Listed below are some of the major sub-word segmentation approaches. One of the earlier approaches for ASR was Korean syllable-based segmentation [8]. Other early language-specific approaches targeted German LVCSR [10] and Polish [11]. A morpheme-based OOV handling approach was applied to the Turkish ASR keyword spotting task [9] and to multiple languages [12]. 

 

Popular recent approaches to unsupervised segmentation:

Both the Byte Pair Encoding (BPE) and WordPiece algorithms work by merging adjacent characters.

 

BPE: the merge pair is chosen based on the frequency of the adjacent symbol pair.

WordPiece: the merge is chosen to maximize the likelihood of the training data.

Unigram and BPE-Dropout [14] are sub-word segmentation regularization techniques.
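The snippet below illustrates the frequency-based merge rule that characterizes BPE (WordPiece would instead pick the merge that maximizes training-data likelihood). It is a toy illustration in the spirit of the original BPE algorithm, not any particular library's implementation.

```python
# Toy BPE: repeatedly merge the most frequent pair of adjacent symbols.
import re
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                           # perform 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                               # the learned merge rules, in order
```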

 

Libraries implementing segmentation algorithms

SentencePiece,

subword-nmt [16],

Morfessor [17],

MorphAGram [15].

 

 

[1] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages, ICASSP 2022

[7] M. Huck, S. Riess, and A. Fraser, “Target-side Word Segmentation Strategies for Neural Machine Translation”, in Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 56–67, Copenhagen, Denmark, 2017.

[8] D. Kiecza, T. Schultz and A. Waibel, “Data-Driven Determination of Appropriate Dictionary Units for Korean LVCSR”, in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[9] Y. He, B. Hutchinson, P. Baumann, M. Ostendorf, E. Fosler-Lussier, and J. Pierrehumbert, “Subword-Based Modeling For Handling OOV Words In Keyword Spotting”, in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Italy, 2014.

[10] A. El-Desoky Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney, “Sub-Lexical Language Models For German LVCSR”, in Proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), 2010.

[11] M.A.B. Shaik, A.E.-D. Mousa, R. Schluter, and H. Ney, “Using morpheme and syllable based sub-words for Polish LVCSR”, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4680–4683, 2011.

[12] M. Creutz, T. Hirsimäki, M. Kurimo, A. Puurula, “Morph-based speech recognition and modeling of out-of-vocabulary words across languages” in ACM Transactions on Speech and Language Processing (TSLP). 5(1):3, 2007

[13] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[14] I. Provilkov, D. Emelianenko and E. Voita, “BPE-Dropout: Simple and Effective Subword Regularization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, 2020.

[15] R. Eskander, F. Callejas, E. Nichols, J. Klavans, and S. Muresan, “MorphAGram: Evaluation and Framework for Unsupervised Morphological Segmentation”, in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7112–7122, 2020.

[16] “Subword-nmt”, Available at: https://github.com/rsennrich/subword-nmt [Accessed : 10 January, 2021]

[17] “Morfessor”, Available at: https://github.com/aaltospeech/morfessor [Accessed : 10 January, 2021].

 

 
