[ASR/ST/PT/2023.10] SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

2023. 11. 18. 17:13

This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.

Key Takeaways:

Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs.
Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed pre-training method connects the speech encoder and the text decoder through a shared unit encoder that operates on hidden-unit representations. This allows pre-training with unpaired speech and text data, which can benefit tasks such as automatic speech recognition (ASR) and speech translation (ST).
Discrete Representation (Units): SpeechUT uses hidden-unit representations as the interface that aligns speech and text. The speech-to-text model is decomposed into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately. The unit sequences are discrete and are produced by off-line generators, so both models can be pre-trained on large amounts of unpaired speech and text data.
Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained and then fine-tuned for ASR and ST.
1. Pre-Training: The three tasks listed above (S2U, MUM, and U2T) are used to pre-train the model.
2. Fine-Tuning: For downstream tasks such as ASR and ST, SpeechUT reuses the pre-trained modules and is fine-tuned without introducing new parameters.
Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
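
To make the embedding mixing item above more concrete, here is a minimal sketch in plain Python. It assumes that mixing means replacing a random subset of frame-level speech representations with the embeddings of their aligned units, so the unit encoder sees both modalities in one sequence. The function name `mix_embeddings`, the `mix_prob` parameter, and the list-based data layout are hypothetical and chosen only for illustration. They are not taken from the paper or its released code.

```python
import random


def mix_embeddings(speech_feats, unit_ids, unit_table, mix_prob=0.5, seed=None):
    """Swap some speech frames for the embedding of their aligned unit.

    speech_feats: list of frame vectors (lists of floats) from the speech side.
    unit_ids:     one discrete unit id per frame (assumed frame-aligned).
    unit_table:   mapping from unit id to embedding vector (hypothetical).
    mix_prob:     per-frame swap probability (an assumed hyperparameter).
    """
    if len(speech_feats) != len(unit_ids):
        raise ValueError("expected exactly one unit id per speech frame")

    rng = random.Random(seed)
    mixed, swapped = [], []
    for frame, uid in zip(speech_feats, unit_ids):
        use_unit = rng.random() < mix_prob
        # Copy the chosen vector so the caller's data is never mutated.
        mixed.append(list(unit_table[uid]) if use_unit else list(frame))
        swapped.append(use_unit)
    return mixed, swapped


# Toy data: 4 frames of dimension 2, with 2 distinct units.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
units = [0, 0, 1, 1]
table = {0: [1.0, 1.0], 1: [2.0, 2.0]}

mixed, swapped = mix_embeddings(frames, units, table, mix_prob=0.5, seed=0)
```

Each output position holds either the original speech frame or the embedding of its unit, and `swapped` records which one was used. Which positions are eligible for swapping, and how often, are details of the paper that this sketch does not attempt to reproduce.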
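
The masked unit modeling (MUM) task from the list above can be illustrated in a similar way. The sketch below assumes the simplest possible scheme: each unit is masked independently with a fixed probability, and the model would then be trained to predict the original unit at the masked positions. The `MASK_ID` value, the `mask_units` name, and the independent per-position masking are all assumptions for illustration. The masking strategy actually used in the paper may differ.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask symbol


def mask_units(units, mask_prob=0.3, seed=None):
    """Return a masked copy of a unit sequence plus the prediction targets.

    units:     list of discrete unit ids.
    mask_prob: per-position masking probability (an assumed hyperparameter).
    The targets dict maps each masked position to its original unit id.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, unit in enumerate(units):
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            targets[pos] = unit  # the loss would be computed only here
        else:
            masked.append(unit)
    return masked, targets


unit_seq = [12, 12, 7, 31, 31, 5]
masked_seq, targets = mask_units(unit_seq, mask_prob=0.3, seed=0)
```

Only the positions recorded in `targets` would contribute to the training loss, while the unmasked positions serve as context.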
