There are generally three ways to perform text-only adaptation:
- Injecting synthesized speech data into the model
  - generate audio for the adaptation texts via TTS and inject the synthetic audio-text pairs into training (a sketch follows after this list)
- LM fusion
  - Fusion and biasing (shallow fusion)
    - during decoding, interpolate the posterior word probabilities with text priors from an external LM (scoring sketch after this list)
    - a more recent approach estimates the internal LM probabilities and discounts them, scoring with the ratio of external to internal LM probabilities
  - Rescoring and reranking
    - after decoding, use a more powerful external LM to update the scores and rerank the n-best list or the recognition lattice (rescoring sketch after this list)
  - Both fusion approaches incur significant overhead at inference time because of the external LM, and they require careful tuning of the external LM interpolation weight.
- Explicit separation of internal LMs
  - force the E2E decoder/predictor to behave more like a language model, e.g. the Hybrid Autoregressive Transducer (HAT), the Modular Hybrid Autoregressive Transducer, and the Factorized Transducer (a HAT-style sketch follows after this list)
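
For the TTS-based injection, a minimal Python sketch of the idea: synthesize audio for the adaptation texts and mix the synthetic pairs with the original paired data before fine-tuning. The `synthesize` call, `build_adaptation_set`, and `real_per_synth` are hypothetical placeholders for illustration, not a specific toolkit's API.

```python
def synthesize(tts_model, text):
    """Hypothetical TTS call returning a waveform for `text`."""
    return tts_model.generate(text)

def build_adaptation_set(tts_model, adaptation_texts, original_pairs, real_per_synth=1.0):
    """Mix synthetic (audio, text) pairs with real pairs for fine-tuning;
    keeping some real data helps avoid overfitting to TTS artifacts."""
    synthetic_pairs = [(synthesize(tts_model, text), text) for text in adaptation_texts]
    n_real = min(len(original_pairs), int(len(synthetic_pairs) * real_per_synth))
    return synthetic_pairs + original_pairs[:n_real]

# usage (hypothetical): fine_tune(asr_model, build_adaptation_set(tts, domain_texts, train_pairs))
```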
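The shallow-fusion and internal-LM-discount scoring above boils down to combining log-probabilities per hypothesis (or per token) during beam search. A minimal sketch, assuming the ASR decoder, the external LM, and the internal LM estimate each expose a log-probability; the function names and default weights are illustrative and need tuning per setup.

```python
def shallow_fusion_score(asr_logp, ext_lm_logp, lm_weight=0.3):
    """Shallow fusion: interpolate (in log space) the ASR posterior
    with the text prior from an external LM."""
    return asr_logp + lm_weight * ext_lm_logp

def ilm_discount_score(asr_logp, ext_lm_logp, int_lm_logp,
                       ext_weight=0.3, int_weight=0.2):
    """Internal LM discounting: add the external LM and subtract an
    estimate of the model's internal LM, i.e. score with the log of the
    external-to-internal LM probability ratio."""
    return asr_logp + ext_weight * ext_lm_logp - int_weight * int_lm_logp
```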
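Rescoring/reranking can be sketched as follows, assuming the decoder returns an n-best list of (text, score) pairs and `external_lm_logprob` is a hypothetical callable wrapping the external LM; the weights are illustrative.

```python
def rescore_nbest(nbest, external_lm_logprob, lm_weight=0.5, length_bonus=0.0):
    """Rerank an n-best list with an external LM.

    `nbest`: list of (text, asr_score), where asr_score is the decoder's
    log-probability for that hypothesis.
    `external_lm_logprob(text)`: hypothetical callable returning the external
    LM's log-probability of the text.
    """
    rescored = []
    for text, asr_score in nbest:
        total = (asr_score
                 + lm_weight * external_lm_logprob(text)
                 + length_bonus * len(text.split()))
        rescored.append((total, text))
    rescored.sort(reverse=True)  # highest combined score first
    return [text for _, text in rescored]
```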
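For the explicit internal-LM separation, the sketch below shows a HAT-style joint network in PyTorch: the blank decision is modelled with a sigmoid, the label distribution with a separate softmax, and an internal LM estimate is obtained by zeroing the encoder contribution so it can be discounted or swapped for an external LM at inference. Dimensions and layer choices are illustrative, not the exact architectures of the cited models.

```python
import torch
import torch.nn as nn

class HATJoint(nn.Module):
    """Minimal HAT-style joint network sketch (dimensions are illustrative)."""

    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=1000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.blank_head = nn.Linear(joint_dim, 1)            # P(blank)
        self.label_head = nn.Linear(joint_dim, vocab_size)   # P(label | not blank)

    def forward(self, enc_out, pred_out):
        joint = torch.tanh(self.enc_proj(enc_out) + self.pred_proj(pred_out))
        blank_logprob = nn.functional.logsigmoid(self.blank_head(joint))
        label_logprobs = torch.log_softmax(self.label_head(joint), dim=-1)
        return blank_logprob, label_logprobs

    def internal_lm_logprobs(self, pred_out):
        # Internal LM estimate: same label head, encoder contribution zeroed out,
        # so the prediction network alone acts like a language model.
        joint = torch.tanh(self.pred_proj(pred_out))
        return torch.log_softmax(self.label_head(joint), dim=-1)
```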