SPE-54: Keyword Spotting
Unified Speculation, Detection, and Verification Keyword Spotting
Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang, Amazon Alexa Science
Problem
- Accurate and timely recognition of the trigger keyword is vital.
- There is an inherent trade-off between accuracy and latency.
Proposed method
- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword spotting model.
- We propose a latency-aware max-pooling loss and show empirically that it teaches the model to maximize accuracy under a given latency constraint.
- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves a different accuracy-latency trade-off on each of the three tasks.
1. Unified speculation, detection, and verification model
- Speculation makes an early decision, which can be used to give a head start to downstream processes on the device.
- Detection mimics the traditional keyword-trigger task and gives a more accurate decision by observing the full keyword context.
- Verification confirms the previous decision by observing even more audio after the keyword span (see the toy sketch below).
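The three decision points can be pictured as thresholding the model's streaming output at successively later horizons. Below is a toy Python sketch assuming a single per-frame keyword posterior and a shared threshold; in the actual USDV model the three tasks are separate outputs trained with different target latencies, and the frame horizons, threshold, and names here are illustrative assumptions only.

```python
import numpy as np

# Toy sketch of the three USDV decision points (horizons, threshold, and
# names are illustrative assumptions, not values from the paper).
SPECULATE_AT = 40  # frames: early decision, before the keyword ends
DETECT_AT = 60     # frames: full keyword span observed
VERIFY_AT = 80     # frames: extra audio after the keyword span

def usdv_decisions(posteriors: np.ndarray, threshold: float = 0.5):
    """Return (speculation, detection, verification) booleans from a
    stream of per-frame keyword posteriors."""
    speculate = posteriors[:SPECULATE_AT].max() > threshold
    detect = posteriors[:DETECT_AT].max() > threshold
    verify = posteriors[:VERIFY_AT].max() > threshold
    return speculate, detect, verify
```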
2. Model architecture and training strategy
- CRNN architecture
- Multi-task learning with different target latencies, using the newly proposed latency-aware max-pooling loss (sketched below).
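A minimal PyTorch sketch of a latency-aware max-pooling loss, as I read the idea: for a positive example, pool only over frames within the target latency after the keyword end, so the model is rewarded for firing confidently inside the allowed window; for a negative example, penalize the hardest frame. The function name, arguments, and two-class layout are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def latency_aware_max_pooling_loss(logits, is_positive, kw_end=0, max_latency=20):
    """Sketch of a latency-aware max-pooling loss (illustrative).

    logits:      (T, 2) per-frame logits for [non-keyword, keyword]
    is_positive: True if the utterance contains the keyword
    kw_end:      frame index where the keyword ends (positives only)
    max_latency: the decision must fire within this many frames of kw_end
    """
    log_probs = F.log_softmax(logits, dim=-1)
    if is_positive:
        # Pool only over frames the model is allowed to fire in: from the
        # keyword end up to the target latency after it.
        window = log_probs[kw_end : kw_end + max_latency, 1]
        return -window.max()  # reward one confident frame inside the window
    # For negatives, penalize the frame most confident in the keyword.
    hardest = log_probs[:, 1].argmax()
    return -log_probs[hardest, 0]
```

Shrinking `max_latency` tightens the window the model may fire in, which is how the loss trades accuracy for lower latency.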
Temporal Early Exiting for Streaming Speech Commands Recognition
Comcast Applied AI, University of Waterloo
Problem
Voice queries take time to process:
Stage 1: The user is speaking (seconds).
Stage 2: Finalizing the ASR transcription (~50 ms).
Stage 3: Information retrieval (~500 ms).
Proposed method
- Use a streaming speech commands model for the top-K voice queries.
- Apply a training objective that enables better early exiting across time, returning a prediction before the entire audio is observed.
- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off (see the inference sketch after the model description below).
Model
- GRU Model
- Per-frame output probability distribution over K commands (classes).
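A minimal PyTorch sketch of this setup, with illustrative layer sizes: a GRU emits a probability distribution over the K commands at every frame, and inference exits at the first frame whose top-class confidence clears the threshold. For clarity the sketch scores all frames at once and then scans; a real streaming system would process frame by frame.

```python
import torch
import torch.nn as nn

class StreamingCommands(nn.Module):
    """Per-frame GRU classifier over K commands (illustrative sizes)."""
    def __init__(self, n_feats: int = 40, hidden: int = 128, num_classes: int = 35):
        super().__init__()
        self.gru = nn.GRU(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):                # feats: (B, T, n_feats)
        h, _ = self.gru(feats)
        return self.head(h).softmax(dim=-1)  # (B, T, K) per-frame probs

@torch.no_grad()
def early_exit(model, feats, threshold=0.5):
    """Return (prediction, exit_frame): stop at the first frame whose top
    class probability exceeds the confidence threshold."""
    probs = model(feats.unsqueeze(0))[0]     # (T, K)
    for t, p in enumerate(probs):
        conf, pred = p.max(dim=-1)
        if conf >= threshold:
            return pred.item(), t
    return probs[-1].argmax().item(), probs.shape[0] - 1
```

Raising the threshold makes the model wait for more confident frames before exiting, trading latency for accuracy.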
Early-Exiting Objectives
Connectionist temporal classification (CTC): the standard CTC loss, marginalizing over all blank-augmented frame alignments of the command label.
Last-frame cross entropy (LF): cross entropy applied only at the final frame, $\mathcal{L}_{\text{LF}} = -\log p_T(y)$.
All-frame cross entropy (AF): cross entropy applied at every frame, $\mathcal{L}_{\text{AF}} = -\frac{1}{T}\sum_{t=1}^{T} \log p_t(y)$, where $p_t(y)$ is the model's probability for the true command $y$ at frame $t$.
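The two cross-entropy objectives are simple to express in PyTorch; a minimal sketch, assuming per-utterance logits of shape (T, K) and the label as a scalar class index (CTC itself is available as `torch.nn.CTCLoss`):

```python
import torch
import torch.nn.functional as F

def lf_loss(logits: torch.Tensor, target: torch.Tensor):
    """Last-frame cross entropy: supervise only the final frame.
    logits: (T, K) per-frame logits; target: scalar class index."""
    return F.cross_entropy(logits[-1:], target.view(1))

def af_loss(logits: torch.Tensor, target: torch.Tensor):
    """All-frame cross entropy: supervise every frame with the same
    label, averaged across time."""
    T = logits.shape[0]
    return F.cross_entropy(logits, target.repeat(T))
```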
Findings
1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].
2. The observed exit indices correlate with the optimal ones across all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one.