SPE-54: Keyword Spotting
Unified Speculation, Detection, and Verification Keyword Spotting
Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang, Amazon Alexa Science
Problem
- Accurate and timely recognition of the trigger keyword is vital.
- There is an inherent trade-off between accuracy and latency.
Proposed method
- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword spotting model.
- We propose a latency-aware max-pooling loss and show empirically that it teaches the model to maximize accuracy under a given latency constraint.
- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves a different accuracy-latency trade-off on each of the three tasks.
1. Unified speculation, detection, and verification model
- Speculation makes an early decision, which can be used to give a head start to downstream processes on the device.
- Detection mimics the traditional keyword-trigger task and gives a more accurate decision by observing the full keyword context.
- Verification confirms the previous decision by observing even more audio after the keyword span (see the toy sketch below).
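The three decision points can be pictured as thresholding the model's streaming output at successively later horizons. Below is a toy Python sketch assuming a single per-frame keyword posterior and a shared threshold; in the actual USDV model the three tasks are separate outputs trained with different target latencies, and the frame horizons, threshold, and names here are illustrative assumptions only.

```python
import numpy as np

# Toy sketch of the three USDV decision points (horizons, threshold, and
# names are illustrative assumptions, not values from the paper).
SPECULATE_AT = 40  # frames: early decision, before the keyword ends
DETECT_AT = 60     # frames: full keyword span observed
VERIFY_AT = 80     # frames: extra audio after the keyword span

def usdv_decisions(posteriors: np.ndarray, threshold: float = 0.5):
    """Return (speculation, detection, verification) booleans from a
    stream of per-frame keyword posteriors."""
    speculate = posteriors[:SPECULATE_AT].max() > threshold
    detect = posteriors[:DETECT_AT].max() > threshold
    verify = posteriors[:VERIFY_AT].max() > threshold
    return speculate, detect, verify
```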
2. Model architecture and training strategy
- CRNN architecture
- Multi-task learning with different target latencies, using the newly proposed latency-aware max-pooling loss (sketched below).
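A minimal PyTorch sketch of a latency-aware max-pooling loss, as I read the idea: for a positive example, pool only over frames within the target latency after the keyword end, so the model is rewarded for firing confidently inside the allowed window; for a negative example, penalize the hardest frame. The function name, arguments, and two-class layout are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def latency_aware_max_pooling_loss(logits, is_positive, kw_end=0, max_latency=20):
    """Sketch of a latency-aware max-pooling loss (illustrative).

    logits:      (T, 2) per-frame logits for [non-keyword, keyword]
    is_positive: True if the utterance contains the keyword
    kw_end:      frame index where the keyword ends (positives only)
    max_latency: the decision must fire within this many frames of kw_end
    """
    log_probs = F.log_softmax(logits, dim=-1)
    if is_positive:
        # Pool only over frames the model is allowed to fire in: from the
        # keyword end up to the target latency after it.
        window = log_probs[kw_end : kw_end + max_latency, 1]
        return -window.max()  # reward one confident frame inside the window
    # For negatives, penalize the frame most confident in the keyword.
    hardest = log_probs[:, 1].argmax()
    return -log_probs[hardest, 0]
```

Shrinking `max_latency` tightens the window the model may fire in, which is how the loss trades accuracy for lower latency.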
Temporal Early Exiting for Streaming Speech Commands Recognition
Comcast Applied AI, University of Waterloo
Problem
Voice queries take time to process:
Stage 1: The user is speaking (seconds).
Stage 2: Finalizing the ASR transcription (~50 ms).
Stage 3: Information retrieval (~500 ms).
Proposed method
- Use a streaming speech commands model for the top-K voice queries.
- Apply a training objective that enables better early exiting across time, returning a prediction before the entire audio is observed.
- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off (see the inference sketch after the model description below).
Model
- GRU Model
- Per-frame output probability distribution over K commands (classes).
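A minimal PyTorch sketch of this setup, with illustrative layer sizes: a GRU emits a probability distribution over the K commands at every frame, and inference exits at the first frame whose top-class confidence clears the threshold. For clarity the sketch scores all frames at once and then scans; a real streaming system would process frame by frame.

```python
import torch
import torch.nn as nn

class StreamingCommands(nn.Module):
    """Per-frame GRU classifier over K commands (illustrative sizes)."""
    def __init__(self, n_feats: int = 40, hidden: int = 128, num_classes: int = 35):
        super().__init__()
        self.gru = nn.GRU(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):                # feats: (B, T, n_feats)
        h, _ = self.gru(feats)
        return self.head(h).softmax(dim=-1)  # (B, T, K) per-frame probs

@torch.no_grad()
def early_exit(model, feats, threshold=0.5):
    """Return (prediction, exit_frame): stop at the first frame whose top
    class probability exceeds the confidence threshold."""
    probs = model(feats.unsqueeze(0))[0]     # (T, K)
    for t, p in enumerate(probs):
        conf, pred = p.max(dim=-1)
        if conf >= threshold:
            return pred.item(), t
    return probs[-1].argmax().item(), probs.shape[0] - 1
```

Raising the threshold makes the model wait for more confident frames before exiting, trading latency for accuracy.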
Early-Exiting Objectives
Connectionist temporal classification (CTC): the standard CTC loss, marginalizing over all blank-augmented frame alignments of the command label.
Last-frame cross entropy (LF): cross entropy applied only at the final frame, $\mathcal{L}_{\text{LF}} = -\log p_T(y)$.
All-frame cross entropy (AF): cross entropy applied at every frame, $\mathcal{L}_{\text{AF}} = -\frac{1}{T}\sum_{t=1}^{T} \log p_t(y)$, where $p_t(y)$ is the model's probability for the true command $y$ at frame $t$.
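The two cross-entropy objectives are simple to express in PyTorch; a minimal sketch, assuming per-utterance logits of shape (T, K) and the label as a scalar class index (CTC itself is available as `torch.nn.CTCLoss`):

```python
import torch
import torch.nn.functional as F

def lf_loss(logits: torch.Tensor, target: torch.Tensor):
    """Last-frame cross entropy: supervise only the final frame.
    logits: (T, K) per-frame logits; target: scalar class index."""
    return F.cross_entropy(logits[-1:], target.view(1))

def af_loss(logits: torch.Tensor, target: torch.Tensor):
    """All-frame cross entropy: supervise every frame with the same
    label, averaged across time."""
    T = logits.shape[0]
    return F.cross_entropy(logits, target.repeat(T))
```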
Findings
1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].
2. The observed exit indices correlate with the optimal ones across all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one.