This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.

 

 

Key Takeaways:

  1. Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
  2. Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs.
  3. Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed model connects the representations of a speech encoder and a text decoder through a shared unit encoder. This allows pre-training with unpaired speech and text data, which can benefit tasks such as automatic speech recognition (ASR) and speech translation (ST).
  4. Discrete Representation (Units): SpeechUT uses hidden-unit representations as an interface to align speech and text. The speech-to-text model is decomposed into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately on large amounts of unpaired data. The discrete unit sequences are produced by offline generators, so both unpaired speech and unpaired text can be converted to units at scale before pre-training.
  5. Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
  6. Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained and then fine-tuned for ASR and ST.
    1. Pre-Training: The model is trained jointly on the three tasks listed in item 1 (S2U, MUM, and U2T).
    2. Fine-Tuning: For downstream tasks such as ASR and ST, SpeechUT is fine-tuned without introducing new parameters, reusing the pre-trained modules.
  7. Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
  8. Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
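
The embedding mixing idea in item 5 can be illustrated with a small sketch. The summary above does not spell out the mechanism, so this example assumes one plausible form: at selected frame positions, the speech representation is replaced by the embedding of the unit aligned to that frame, so the shared unit encoder sees both modalities in one sequence. The function names, the per-frame alignment, and the `mix_prob` parameter are all hypothetical and are not taken from the paper's code.

```python
import random
from typing import List, Sequence

Vector = List[float]


def sample_positions(num_frames: int, mix_prob: float, rng: random.Random) -> List[int]:
    """Pick frame indices to mix, each independently with probability mix_prob.

    Assumption: independent per-frame sampling; the paper may use a different scheme.
    """
    return [t for t in range(num_frames) if rng.random() < mix_prob]


def mix_embeddings(
    speech_states: Sequence[Vector],
    unit_ids: Sequence[int],
    unit_embedding_table: Sequence[Vector],
    mix_positions: Sequence[int],
) -> List[Vector]:
    """Return a copy of speech_states where the frames at mix_positions are
    replaced by the embedding of the unit aligned to that frame.

    Assumption: speech frames and units are aligned one-to-one.
    """
    if len(speech_states) != len(unit_ids):
        raise ValueError("speech_states and unit_ids must be frame-aligned")
    mixed = [list(state) for state in speech_states]  # copy, leave input untouched
    for pos in mix_positions:
        mixed[pos] = list(unit_embedding_table[unit_ids[pos]])
    return mixed


if __name__ == "__main__":
    # Toy data: 3 frames, 2-dimensional states, 3 possible units.
    speech_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    unit_ids = [2, 0, 1]
    unit_embedding_table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

    mixed = mix_embeddings(speech_states, unit_ids, unit_embedding_table, [0, 2])
    print(mixed)  # [[3.0, 3.0], [0.3, 0.4], [2.0, 2.0]]
```

In a real model the states would be tensors and the mixed sequence would be fed to the unit encoder; plain lists are used here only to keep the sketch dependency-free.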

 

 



