Posts tagged 'e2e asr'

Text-only adaptation for E2E ASR models 2024.04.04
[E2E ASR] RNN-Transducer for ASR 2020.06.13

Text-only adaptation for E2E ASR models

2024. 4. 4. 23:46

There are generally three ways to perform text-only adaptation:

Injecting synthesized speech data into the model
- generate audio for the training texts via TTS and inject it into the model
LM fusion
- Fusion and biasing (shallow fusion):
  - during decoding, interpolate the posterior word probabilities with text priors from external LMs
  - another recent approach is to estimate the internal LM probabilities and discount the scores with the ratio of external to internal LM probabilities
- Rescoring and reranking
  - after decoding, use a powerful external LM to update the scores and rerank the n-best results or the recognition lattice
- These techniques incur significant overhead at inference time because of the external LM, and they also require careful tuning of the interpolation weight used for the external LM.
Explicit separation of internal LMs
- force the E2E decoder/predictor to behave more like a language model (e.g. Hybrid autoregressive transducer (HAT), Modular hybrid autoregressive transducer, and Factorized transducer)
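
To make the LM fusion options above concrete, here is a minimal sketch of how the scores are typically combined. The function names, the weights, and the toy external LM are all hypothetical and chosen only for illustration; real systems apply the fusion per token inside beam search and tune the weights on a development set.

```python
def shallow_fusion_score(log_p_asr, log_p_ext_lm, lm_weight=0.3):
    """Shallow fusion: log P_asr(y|x) + lambda * log P_ext(y)."""
    return log_p_asr + lm_weight * log_p_ext_lm


def ilm_corrected_score(log_p_asr, log_p_ext_lm, log_p_int_lm,
                        ext_weight=0.3, int_weight=0.2):
    """Add the external LM score and subtract the estimated internal LM score.

    In the log domain this is the ratio of external to internal LM
    probabilities, each raised to its own (assumed) weight.
    """
    return log_p_asr + ext_weight * log_p_ext_lm - int_weight * log_p_int_lm


def rescore_nbest(nbest, ext_lm_logprob, lm_weight=0.3):
    """Rerank an n-best list of (text, asr_log_prob) pairs with an external LM.

    ext_lm_logprob is any callable mapping a text to its LM log-probability.
    Returns (text, fused_score) pairs sorted from best to worst.
    """
    rescored = [
        (text, shallow_fusion_score(log_p, ext_lm_logprob(text), lm_weight))
        for text, log_p in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy example: the external LM strongly prefers the second hypothesis.
toy_lm = {"wreck a nice beach": -6.0, "recognize speech": -2.0}
nbest = [("wreck a nice beach", -3.8), ("recognize speech", -4.0)]
print(rescore_nbest(nbest, toy_lm.get, lm_weight=0.5)[0][0])  # recognize speech
```

Both the external LM call and the weight search happen outside the ASR model, which is where the inference overhead and tuning cost mentioned above come from.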

Reference

[1] External Language Model Integration for Factorized Neural Transducers

[2] In-situ text-only adaptation of speech models with low-overhead speech imputations

[E2E ASR] RNN-Transducer for ASR

2020. 6. 13. 18:23

An RNN-T for ASR consists of three main components: an Audio Encoder, a Text Predictor, and a Joiner.

1) The Audio Encoder takes the audio frames up to time t as input and encodes them into a high-level acoustic feature a_t.
2) The Text Predictor takes the past text history up to index h and encodes it into a high-level lexical feature t_h.
3) These high-level acoustic and lexical features are passed through the Joiner module, which combines the two features and outputs a probability distribution over the output units, y_{t,h}.
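
The Joiner step can be sketched as follows. This assumes an additive joint with a tanh non-linearity followed by a linear projection and softmax, which is one common choice and not necessarily what any particular implementation uses; the weight matrix `W`, bias `b`, and the feature values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joiner(a_t, t_h, W, b):
    """Combine acoustic feature a_t and lexical feature t_h into y_{t,h}.

    a_t, t_h: lists of equal length (the joint dimension).
    W: one row of weights per output unit; b: one bias per output unit.
    """
    hidden = [math.tanh(a + p) for a, p in zip(a_t, t_h)]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(W, b)
    ]
    return softmax(logits)


# Toy setup: 2-dimensional features, 3 output units (index 0 as blank).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
y_th = joiner([0.5, -0.5], [0.1, 0.2], W, b)
print(len(y_th), round(sum(y_th), 6))  # 3 1.0
```

Evaluating the joiner for every pair (t, h) yields the lattice of distributions that the loss and the decoder operate on.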

Unlike CTC-based models, RNN-T uses both audio and text information to produce the probabilities over output symbols, which lets it overcome the conditional independence assumption of CTC models.

The loss is computed with the RNN-Transducer forward-backward algorithm; see [1] for the details.
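
As a rough illustration of the forward half of that algorithm, the sketch below computes log P(y|x) from a precomputed lattice of joiner outputs, following the alpha recursion in [1]. The layout `log_probs[t][u][k]` and the choice of blank index 0 are assumptions made here; the backward pass and gradients are omitted, and the training loss would be the negative of the returned value.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    peak = max(a, b)
    return peak + math.log(math.exp(a - peak) + math.exp(b - peak))


def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Forward (alpha) recursion over the T x (U+1) transducer lattice.

    log_probs[t][u][k]: log-probability of output unit k at frame t after
    u labels have been emitted. labels: target sequence of length U.
    """
    T, U = len(log_probs), len(labels)
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at (t-1, u): move one frame forward.
            stay = alpha[t - 1][u] + log_probs[t - 1][u][blank] if t > 0 else NEG_INF
            # Arrive by emitting label u at (t, u-1): move one label forward.
            emit = alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(stay, emit)
    # A final blank at the last lattice node terminates the sequence.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]


# Toy lattice: T=2 frames, U=1 label, uniform over {blank, label 1}.
uniform = [math.log(0.5), math.log(0.5)]
lattice = [[uniform, uniform], [uniform, uniform]]
print(round(math.exp(rnnt_log_likelihood(lattice, [1])), 6))  # 0.25
```

In the toy case there are two valid alignments of three steps each, so the total probability is 2 x 0.5^3 = 0.25.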

At test time a decoding procedure is required; see the notes in [2, 3].
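
For orientation only, the simplest decoding strategy is greedy search, sketched below; the notes in [2, 3] cover beam search instead. The interfaces `predictor_step(token, state)` and `joiner(a_t, pred_out)` are hypothetical, and the toy predictor and joiner exist only to make the example runnable.

```python
def greedy_decode(encoder_out, predictor_step, joiner, blank=0,
                  max_symbols_per_frame=10):
    """Greedy RNN-T decoding: take the argmax unit at each lattice node.

    A blank moves on to the next audio frame; any other unit is appended
    to the hypothesis and fed back into the predictor.
    """
    hyp = []
    pred_out, state = predictor_step(None, None)  # start-of-sequence step
    for a_t in encoder_out:
        for _ in range(max_symbols_per_frame):
            probs = joiner(a_t, pred_out)
            best = max(range(len(probs)), key=probs.__getitem__)
            if best == blank:
                break
            hyp.append(best)
            pred_out, state = predictor_step(best, state)
    return hyp


def toy_predictor_step(token, state):
    """Toy predictor: the 'lexical feature' is just the last emitted token."""
    return (0 if token is None else token), state


def toy_joiner(a_t, pred_out):
    """Toy joiner over 3 units: emit the frame's token once, otherwise blank."""
    probs = [0.0, 0.0, 0.0]
    target = a_t if (a_t != 0 and a_t != pred_out) else 0
    probs[target] = 1.0
    return probs


# Each toy "frame" is the token it wants to emit (0 means nothing).
print(greedy_decode([1, 0, 2, 2], toy_predictor_step, toy_joiner))  # [1, 2]
```

The cap on symbols per frame is a common safeguard against a model that never predicts blank; its value here is arbitrary.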

[1] Alex Graves, "Sequence Transduction with Recurrent Neural Networks", 2012

[2] https://sequencedata.tistory.com/3?category=1129285

[3] https://sequencedata.tistory.com/4?category=1129285
