There are generally three ways to perform text-only adaptation:

 

  • Injecting synthesized speech data into the model
    • generate audio for the training texts via TTS and feed it to the model as training data
  • LM fusion
    • Fusion and biasing (shallow fusion):
      • during decoding, interpolate the model's posterior word probabilities with text priors from external LMs
      • a more recent approach estimates the E2E model's internal LM probabilities and rescales the output scores by the ratio of external to internal LM probabilities
    • Rescoring and reranking
      • after decoding, use a powerful external LM to update the scores and rerank the n-best results or the recognition lattice
    • These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight used for the external LM.
  • Explicit separation of internal LMs
    • force the E2E decoder/predictor to behave more like a language model (e.g. Hybrid autoregressive transducer (HAT), Modular hybrid autoregressive transducer, and Factorized transducer)
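
As a rough illustration of the fusion and rescoring items above, the sketch below combines scores in the log domain. Everything in it is an assumption made for illustration: the function names (fused_score, rescore_nbest, toy_ext_lm_logprob), the weights, the toy unigram LM, and the example scores are hypothetical and not taken from the referenced papers. Setting int_weight to 0 gives plain shallow fusion; a positive value discounts an estimated internal LM score.

```python
import math

def fused_score(log_p_asr, log_p_ext, log_p_int=0.0,
                ext_weight=0.3, int_weight=0.0):
    """Log-linear fusion of scores for one candidate token.

    log_p_asr: log-probability from the E2E model
    log_p_ext: log-probability from the external LM
    log_p_int: estimated internal LM log-probability (assumed to be given)
    """
    return log_p_asr + ext_weight * log_p_ext - int_weight * log_p_int

def rescore_nbest(nbest, ext_lm_logprob, ext_weight=0.3):
    """Rerank (text, asr_log_score) pairs with an external LM after decoding."""
    rescored = [
        (text, score + ext_weight * ext_lm_logprob(text))
        for text, score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy unigram "external LM" standing in for a real one (hypothetical values).
TOY_UNIGRAM = {
    "recognize": 0.02, "speech": 0.03,
    "wreck": 0.001, "a": 0.05, "nice": 0.01, "beach": 0.005,
}

def toy_ext_lm_logprob(text, floor=1e-6):
    """Sum of unigram log-probabilities; unseen words get a small floor."""
    return sum(math.log(TOY_UNIGRAM.get(word, floor)) for word in text.split())

# Hypothetical first-pass hypotheses with their ASR log scores.
nbest = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]
reranked = rescore_nbest(nbest, toy_ext_lm_logprob)
```

In this toy setup the external LM moves "recognize speech" ahead of the hypothesis that the first pass preferred. A unigram LM also favors shorter hypotheses, so a real system would normally add length normalization and tune ext_weight on held-out data, which is the tuning cost the list above mentions.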

 

References

[1] External Language Model Integration for Factorized Neural Transducers

[2] In-situ Text-only Adaptation of Speech Models with Low-overhead Speech Imputations

 
