Let's organize the algorithms needed to build a smart speaker or voice assistant. Sketched as a simple flow of the required components:
Mic -> Audio Processing -> KWS -> ASR -> NLU -> knowledge/Skill/Action -> TTS -> Speaker
A brief description of each module:
- Audio Processing includes Acoustic Echo Cancellation (AEC), Beamforming, and Noise Suppression (NS).
- Keyword Spotting (KWS) detects a wake word (e.g. "OK Google") to start a conversation.
- Speech To Text (STT, also known as ASR) transcribes the spoken audio into text.
- Natural Language Understanding (NLU) converts raw text into structured data such as intents and slots.
- Knowledge/Skill/Action - a knowledge-based model provides an answer or carries out the requested action.
- Text To Speech (TTS) synthesizes the response back into audio.
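The flow above can be sketched as a chain of function calls. Every function below is a hypothetical stub standing in for a real component, not an actual API:

```python
# Minimal sketch of the voice-assistant pipeline. All functions are stubs;
# in a real system each would wrap a KWS, ASR, NLU, skill, or TTS engine.

def keyword_spotter(audio):
    """Return True when the wake word is detected (stubbed)."""
    return b"wake" in audio

def asr(audio):
    """Speech-to-text (stubbed)."""
    return "what time is it"

def nlu(text):
    """Map raw text to a structured intent (stubbed)."""
    return {"intent": "get_time", "slots": {}}

def skill(intent):
    """Produce an answer for the intent (stubbed)."""
    return "It is 9 am." if intent["intent"] == "get_time" else "Sorry."

def tts(text):
    """Text-to-speech (stubbed)."""
    return text.encode()

def pipeline(audio):
    # Stay idle until the wake word is heard, then run the full chain.
    if not keyword_spotter(audio):
        return None
    return tts(skill(nlu(asr(audio))))
```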
Let's briefly survey the algorithms for each module, focusing on open-source implementations.
Audio Processing
Several Basic Filters for sound and speech processing
https://github.com/voidqk/sndfilter
reverb, dynamic range compression, lowpass, highpass, notch
Automatic Gain Control
TF AGC: https://github.com/jorgehatccrma/pyagc
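As a rough illustration, AGC can be sketched as a smoothed per-frame gain that drives the signal toward a target RMS. This is a toy version; pyagc and production AGCs are considerably more sophisticated:

```python
import math

def agc(frames, target_rms=0.1, alpha=0.9):
    """Toy automatic gain control.

    frames: list of frames, each a list of float samples in [-1, 1].
    Per frame, nudge the gain toward target_rms / frame_rms, smoothing
    with factor alpha so the gain does not jump between frames.
    """
    gain = 1.0
    out = []
    for frame in frames:
        rms = math.sqrt(sum(x * x for x in frame) / len(frame)) or 1e-9
        desired = target_rms / rms
        gain = alpha * gain + (1 - alpha) * desired  # smooth gain changes
        out.append([max(-1.0, min(1.0, gain * x)) for x in frame])
    return out
```

A quiet input is gradually amplified until its level sits near the target RMS, while the clipping guard keeps samples inside [-1, 1].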
Acoustic Echo Cancellation
Removes echoes that can occur when a microphone picks up audio from a speaker, preventing feedback loops.
SpeexDSP
https://github.com/xiph/speexdsp
Daemon based on SpeexDSP AEC for the devices running Linux. https://github.com/voice-engine/ec
Residual Echo Suppression (RES) - also implemented in SpeexDSP
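The core of an AEC is an adaptive filter that estimates the echo path from the far-end (loudspeaker) signal and subtracts that estimate from the microphone signal. A minimal NLMS sketch — not the SpeexDSP implementation, which works in the frequency domain and adds residual echo suppression:

```python
import numpy as np

def nlms_aec(far, mic, taps=64, mu=0.5, eps=1e-8):
    """Sketch of an NLMS adaptive filter for acoustic echo cancellation.

    far: far-end (loudspeaker) signal.
    mic: microphone signal containing an echo of `far`.
    Returns the echo-cancelled (error) signal.
    """
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far[n - taps:n][::-1]           # most recent far-end samples
        y = w @ x                           # estimated echo
        e = mic[n] - y                      # residual after echo removal
        w += mu * e * x / (x @ x + eps)     # normalized LMS update
        out[n] = e
    return out
```

After the filter converges, the residual energy is a small fraction of the original echo energy; near-end speech passes through as part of the error signal.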
Direction Of Arrival (DOA) - the most widely used DOA algorithm is GCC-PHAT
DOA (SRP-PHAT and GCC-PHAT)
https://github.com/wangwei2009/DOA
TDOA
https://github.com/xiongyihui/tdoa
ODAS
https://github.com/introlab/odas
ODAS stands for Open embeddeD Audition System. It is a library dedicated to sound source localization, tracking, separation and post-filtering.
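GCC-PHAT itself is short: whiten the cross-spectrum of two microphone signals so only phase remains, then take the lag of the correlation peak as the time difference of arrival. A minimal sketch (parameter names are illustrative):

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """GCC-PHAT time-delay estimate between two microphone signals.

    Returns the delay of `sig` relative to `ref` in samples; a positive
    value means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    # Rearrange so index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return np.argmax(np.abs(cc)) - max_shift
```

Given the delay in samples, the microphone spacing and the speed of sound convert it to an arrival angle.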
Beamforming
Involves using multiple microphones to focus on sounds from a specific direction, enhancing the signal from the desired source while suppressing noise. Common algorithms include GCC-PHAT, MVDR, GSC, and DNN-based methods.
- Direction Of Arrival (DOA): Estimates the direction of the incoming sound. This is important for beamforming and source localization. Algorithms like SRP-PHAT, GCC-PHAT, and systems like ODAS are used.
BeamformIt - delay & sum beamforming
https://github.com/xanguera/BeamformIt
CGMM Beamforming
https://github.com/funcwj/CGMM-MVDR
MVDR Beamforming
https://github.com/DistantSpeechRecognition/mcse (MVDR + postfilter)
GSC Beamforming
Other DNN-based methods
https://github.com/fgnt/nn-gev
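The simplest of these, delay & sum (as in BeamformIt), aligns each channel by its steering delay and averages. A toy integer-delay sketch (real beamformers use fractional delays and weighting):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Toy delay-and-sum beamformer with integer-sample delays.

    channels: 2-D array (num_mics, num_samples).
    delays: samples by which to advance each channel so that a source
            in the target direction lines up across microphones.
    """
    num_mics, n = channels.shape
    out = np.zeros(n)
    for m in range(num_mics):
        d = delays[m]
        if d > 0:
            out[:n - d] += channels[m, d:]   # advance channel by d samples
        elif d < 0:
            out[-d:] += channels[m, :n + d]  # retard channel by |d| samples
        else:
            out += channels[m]
    return out / num_mics
```

Signals from the steered direction add coherently while off-axis signals are averaged incoherently, which is what suppresses interference.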
Voice Activity Detection
Detects whether the input signal contains speech, helping to reduce unnecessary processing when there is no speech. Common tools include Sohn VAD and WebRTC VAD.
Sohn VAD
https://github.com/eesungkim/Voice_Activity_Detector
WebRTC VAD
https://github.com/wiseman/py-webrtcvad
DNN VAD
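A toy energy-threshold detector illustrates the basic idea; real detectors such as WebRTC VAD combine spectral features with adaptive noise models:

```python
import math

def energy_vad(frame, threshold=0.02):
    """Toy energy-based VAD: flag a frame as speech when its RMS
    exceeds a fixed threshold. frame: list of float samples in [-1, 1]."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return rms > threshold
```

With py-webrtcvad the equivalent check is roughly `webrtcvad.Vad(3).is_speech(frame_bytes, 16000)`, operating on 10/20/30 ms frames of 16-bit mono PCM.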
Noise Suppression
Reduces background noise to improve the clarity of the spoken input.
MMSE-STSA
https://github.com/eesungkim/Speech_Enhancement_MMSE-STSA
NS of WebRTC audio processing
https://github.com/xiongyihui/python-webrtc-audio-processing
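Classical suppressors like MMSE-STSA work on the short-time magnitude spectrum. A cruder relative, spectral subtraction, shows the basic shape — here a single-frame sketch that assumes the noise magnitude spectrum is already known:

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, floor=0.02):
    """Single-frame spectral subtraction sketch.

    noisy: time-domain frame.
    noise_mag: estimated noise magnitude spectrum (rfft bins), e.g.
               averaged over frames that a VAD marked as noise-only.
    Subtracts the noise magnitude, applies a spectral floor to limit
    musical noise, and resynthesizes with the noisy phase.
    """
    spec = np.fft.rfft(noisy)
    mag = np.abs(spec)
    phase = np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))
```

In practice this runs frame by frame with overlap-add, and the noise estimate is updated whenever the VAD reports no speech.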
KWS
- Mycroft Precise - A lightweight, simple-to-use, RNN wake word listener
- Snowboy - DNN based hotword and wake word detection toolkit
- Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting
- ML-KWS-For-MCU - Perhaps the most promising option for resource-constrained devices such as the ARM Cortex-M7 microcontroller
- Porcupine - Lightweight, cross-platform engine to build custom wake words in seconds
ASR
NeMo | https://github.com/NVIDIA/NeMo
ESPNET | https://github.com/espnet
Speechbrain | https://github.com/speechbrain
Kaldi | https://github.com/kaldi-asr/kaldi
NLU
- Rasa NLU
- Snips NLU - a Python library that parses sentences written in natural language and extracts structured information.
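The intent/slot output these libraries produce can be mimicked with a couple of hand-written rules. The patterns below are purely hypothetical; Rasa NLU and Snips NLU learn such mappings from training data instead:

```python
import re

def parse_intent(text):
    """Toy rule-based NLU: map an utterance to an intent plus slots."""
    text = text.lower().strip()
    m = re.match(r"set (?:an? )?alarm (?:for|at) "
                 r"(\d{1,2}(?::\d{2})?\s*(?:am|pm)?)", text)
    if m:
        return {"intent": "set_alarm", "slots": {"time": m.group(1).strip()}}
    if re.search(r"\bweather\b", text):
        return {"intent": "get_weather", "slots": {}}
    return {"intent": "unknown", "slots": {}}
```

The structured result (intent name plus slot values) is what the Knowledge/Skill/Action layer consumes.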
TTS
Audio I/O
- portAudio, pyaudio
- libsoundio
- ALSA
- pulseAudio