There are generally three ways to perform text-only adaptation:
- Injecting synthesized speech data into the model
  - generate audio for the adaptation texts via TTS and inject the synthetic audio-text pairs into training (a sketch follows after this list)
- LM fusion
  - Fusion and biasing (shallow fusion)
    - during decoding, interpolate the posterior word probabilities with text priors from an external LM (scoring sketch after this list)
    - a more recent approach estimates the internal LM probabilities and discounts them, scoring with the ratio of external to internal LM probabilities
  - Rescoring and reranking
    - after decoding, use a more powerful external LM to update the scores and rerank the n-best list or the recognition lattice (rescoring sketch after this list)
  - Both fusion approaches incur significant overhead at inference time because of the external LM, and they require careful tuning of the external LM interpolation weight.
- Explicit separation of internal LMs
  - force the E2E decoder/predictor to behave more like a language model, e.g. the Hybrid Autoregressive Transducer (HAT), the Modular Hybrid Autoregressive Transducer, and the Factorized Transducer (a HAT-style sketch follows after this list)
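
For the TTS-based injection, a minimal Python sketch of the idea: synthesize audio for the adaptation texts and mix the synthetic pairs with the original paired data before fine-tuning. The `synthesize` call, `build_adaptation_set`, and `real_per_synth` are hypothetical placeholders for illustration, not a specific toolkit's API.

```python
def synthesize(tts_model, text):
    """Hypothetical TTS call returning a waveform for `text`."""
    return tts_model.generate(text)

def build_adaptation_set(tts_model, adaptation_texts, original_pairs, real_per_synth=1.0):
    """Mix synthetic (audio, text) pairs with real pairs for fine-tuning;
    keeping some real data helps avoid overfitting to TTS artifacts."""
    synthetic_pairs = [(synthesize(tts_model, text), text) for text in adaptation_texts]
    n_real = min(len(original_pairs), int(len(synthetic_pairs) * real_per_synth))
    return synthetic_pairs + original_pairs[:n_real]

# usage (hypothetical): fine_tune(asr_model, build_adaptation_set(tts, domain_texts, train_pairs))
```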
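The shallow-fusion and internal-LM-discount scoring above boils down to combining log-probabilities per hypothesis (or per token) during beam search. A minimal sketch, assuming the ASR decoder, the external LM, and the internal LM estimate each expose a log-probability; the function names and default weights are illustrative and need tuning per setup.

```python
def shallow_fusion_score(asr_logp, ext_lm_logp, lm_weight=0.3):
    """Shallow fusion: interpolate (in log space) the ASR posterior
    with the text prior from an external LM."""
    return asr_logp + lm_weight * ext_lm_logp

def ilm_discount_score(asr_logp, ext_lm_logp, int_lm_logp,
                       ext_weight=0.3, int_weight=0.2):
    """Internal LM discounting: add the external LM and subtract an
    estimate of the model's internal LM, i.e. score with the log of the
    external-to-internal LM probability ratio."""
    return asr_logp + ext_weight * ext_lm_logp - int_weight * int_lm_logp
```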
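Rescoring/reranking can be sketched as follows, assuming the decoder returns an n-best list of (text, score) pairs and `external_lm_logprob` is a hypothetical callable wrapping the external LM; the weights are illustrative.

```python
def rescore_nbest(nbest, external_lm_logprob, lm_weight=0.5, length_bonus=0.0):
    """Rerank an n-best list with an external LM.

    `nbest`: list of (text, asr_score), where asr_score is the decoder's
    log-probability for that hypothesis.
    `external_lm_logprob(text)`: hypothetical callable returning the external
    LM's log-probability of the text.
    """
    rescored = []
    for text, asr_score in nbest:
        total = (asr_score
                 + lm_weight * external_lm_logprob(text)
                 + length_bonus * len(text.split()))
        rescored.append((total, text))
    rescored.sort(reverse=True)  # highest combined score first
    return [text for _, text in rescored]
```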
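For the explicit internal-LM separation, the sketch below shows a HAT-style joint network in PyTorch: the blank decision is modelled with a sigmoid, the label distribution with a separate softmax, and an internal LM estimate is obtained by zeroing the encoder contribution so it can be discounted or swapped for an external LM at inference. Dimensions and layer choices are illustrative, not the exact architectures of the cited models.

```python
import torch
import torch.nn as nn

class HATJoint(nn.Module):
    """Minimal HAT-style joint network sketch (dimensions are illustrative)."""

    def __init__(self, enc_dim=512, pred_dim=512, joint_dim=640, vocab_size=1000):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.blank_head = nn.Linear(joint_dim, 1)            # P(blank)
        self.label_head = nn.Linear(joint_dim, vocab_size)   # P(label | not blank)

    def forward(self, enc_out, pred_out):
        joint = torch.tanh(self.enc_proj(enc_out) + self.pred_proj(pred_out))
        blank_logprob = nn.functional.logsigmoid(self.blank_head(joint))
        label_logprobs = torch.log_softmax(self.label_head(joint), dim=-1)
        return blank_logprob, label_logprobs

    def internal_lm_logprobs(self, pred_out):
        # Internal LM estimate: same label head, encoder contribution zeroed out,
        # so the prediction network alone acts like a language model.
        joint = torch.tanh(self.pred_proj(pred_out))
        return torch.log_softmax(self.label_head(joint), dim=-1)
```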