Let's go over the algorithms needed to build a smart speaker or voice assistant. Organized around the required components, the flow looks roughly like this:


Mic -> Audio Processing -> KWS -> ASR -> NLU -> knowledge/Skill/Action -> TTS -> Speaker


A brief description of each module:

  • Audio Processing includes Acoustic Echo Cancellation (AEC), Beamforming, and Noise Suppression (NS).
  • Keyword Spotting (KWS) detects a wake word ("okay google") to start a conversation.
  • Speech To Text (STT or ASR) transcribes the spoken audio into text.
  • Natural Language Understanding (NLU) converts raw text into structured data (intent and slots).
  • Knowledge/Skill/Action - a knowledge-based model provides an answer.
  • Text To Speech (TTS) synthesizes the answer back into audio.
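The flow above can be sketched as a chain of function stubs. Every function here is a placeholder standing in for a real engine (SpeexDSP, an ASR toolkit, and so on); the names and return values are made up purely for illustration:

```python
# Toy end-to-end pipeline: each stage is a stub for a real component.

def audio_processing(samples):
    return samples                     # AEC / beamforming / NS would run here

def keyword_spotting(samples):
    return True                        # pretend the wake word was heard

def asr(samples):
    return "what time is it"           # speech -> text

def nlu(text):
    return {"intent": "get_time", "slots": {}}   # text -> structured intent

def skill(intent):
    return "It is 10 o'clock."         # intent -> answer text

def tts(text):
    return b"\x00\x01"                 # text -> synthesized audio bytes

def run_pipeline(mic_samples):
    samples = audio_processing(mic_samples)
    if not keyword_spotting(samples):
        return None                    # stay idle until the wake word fires
    return tts(skill(nlu(asr(samples))))

print(run_pipeline([0.0] * 160))  # -> b'\x00\x01'
```

The useful point of the sketch is the control flow: everything downstream of KWS runs only after the wake word fires, which is what keeps the always-on part of the device cheap.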

Let's briefly survey algorithms for each module, focusing on open-source implementations, in no particular order.

Audio Processing

Several Basic Filters for sound and speech processing

https://github.com/voidqk/sndfilter

reverb, dynamic range compression, lowpass, highpass, notch
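As a toy illustration of the kind of filter such a library provides, here is a one-pole low-pass filter in plain Python. The coefficient formula is the standard RC smoothing factor; the cutoff and sample rate below are arbitrary example values, and a real library like sndfilter uses biquads rather than this single pole:

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    # Standard RC smoothing coefficient for a one-pole low-pass filter.
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)           # y lags x, smoothing out fast changes
        out.append(y)
    return out

# A signal alternating at the Nyquist rate is strongly attenuated.
nyquist_noise = [(-1.0) ** n for n in range(200)]
smooth = one_pole_lowpass(nyquist_noise, 100.0, 16000.0)
print(max(abs(v) for v in smooth[50:]) < 0.1)  # -> True
```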

Automatic Gain Control

TF AGC: https://github.com/jorgehatccrma/pyagc
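The idea behind AGC can be sketched in a few lines: track the output level and slowly steer the gain toward a target. This is a toy version with made-up parameters; real AGCs (such as the one in pyagc or WebRTC) also handle clipping, noise floors, and attack/release asymmetry:

```python
import math

def agc(samples, target_rms=0.1, step=0.001):
    gain, out = 1.0, []
    power = target_rms ** 2            # running estimate of output power
    for x in samples:
        y = gain * x
        power = 0.99 * power + 0.01 * y * y
        rms = math.sqrt(power) + 1e-9
        # Nudge the gain up when the output is too quiet, down when too loud.
        gain *= (target_rms / rms) ** step
        out.append(y)
    return out

# A quiet 440 Hz tone gets pulled up toward the target level.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
boosted = agc(quiet)
tail_rms = math.sqrt(sum(v * v for v in boosted[-4000:]) / 4000)
print(0.08 < tail_rms < 0.12)  # -> True
```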

Acoustic Echo Cancellation

Removes echoes that can occur when a microphone picks up audio from a speaker, preventing feedback loops.

SpeexDSP

https://github.com/xiph/speexdsp

Daemon based on SpeexDSP AEC for the devices running Linux. https://github.com/voice-engine/ec

Residual Echo Suppression (RES) - implemented alongside the AEC in SpeexDSP
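The core of an AEC can be sketched as an NLMS adaptive filter: adapt an FIR estimate of the echo path from the far-end (loudspeaker) signal and subtract the estimated echo from the microphone signal. SpeexDSP's AEC (MDF) is far more sophisticated; this toy version only shows the principle, with arbitrary tap count and step size:

```python
import random

def nlms_aec(far, mic, taps=16, mu=0.5):
    w = [0.0] * taps                   # estimated echo-path impulse response
    buf = [0.0] * taps                 # most recent far-end samples
    out = []
    for f, m in zip(far, mic):
        buf = [f] + buf[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, buf))
        e = m - echo_est               # residual after subtracting the echo
        norm = sum(x * x for x in buf) + 1e-9
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out

# Simulated echo: the mic hears a delayed, attenuated copy of the far end.
random.seed(0)
far = [random.uniform(-1, 1) for _ in range(4000)]
mic = [0.0] * 3 + [0.6 * f for f in far[:-3]]      # 3-sample delay, gain 0.6
residual = nlms_aec(far, mic)
print(sum(e * e for e in residual[-500:]) / 500 < 1e-4)  # -> True
```

Once the filter has converged, the residual is near zero; any leftover echo is what the RES stage is there to suppress.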

Direction Of Arrival (DOA) - the most widely used DOA algorithm is GCC-PHAT

DOA (SRP-PHAT and GCC-PHAT)

https://github.com/wangwei2009/DOA

TDOA

https://github.com/xiongyihui/tdoa
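TDOA estimation reduces to finding the lag that maximizes the cross-correlation between two microphone signals. GCC-PHAT computes the same thing in the frequency domain with phase-transform weighting for robustness to reverberation; the brute-force time-domain toy below shows only the underlying idea:

```python
import random

def estimate_tdoa(sig_a, sig_b, max_lag=10):
    best_lag, best_score = 0, float("-inf")
    n = len(sig_a)
    for lag in range(-max_lag, max_lag + 1):   # brute-force correlation search
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

random.seed(1)
src = [random.uniform(-1, 1) for _ in range(500)]
mic1 = src
mic2 = [0.0] * 4 + src[:-4]            # mic2 hears the source 4 samples later
print(estimate_tdoa(mic1, mic2))       # -> 4
```

Given the estimated delay and the microphone spacing, the arrival angle follows from simple geometry, which is how DOA systems like ODAS use it.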

ODAS

https://github.com/introlab/odas

ODAS stands for Open embeddeD Audition System. It is a library dedicated to performing sound source localization, tracking, separation, and post-filtering.

Beamforming

Involves using multiple microphones to focus on sounds from a specific direction, enhancing the signal from the desired source while suppressing noise. Common approaches include delay-and-sum, MVDR, GSC, and DNN-based methods, often steered by a GCC-PHAT direction estimate.

  • Direction Of Arrival (DOA): Estimates the direction of the incoming sound. This is important for beamforming and source localization. Algorithms like SRP-PHAT, GCC-PHAT, and systems like ODAS are used.

BeamformIt - delay & sum beamforming

https://github.com/xanguera/BeamformIt
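Delay-and-sum is the simplest beamformer: time-align each channel by its arrival delay (e.g. from a DOA estimate) and average. The sketch below assumes integer-sample delays that are already known; BeamformIt additionally estimates the delays itself and weights the channels:

```python
import math, random

def delay_and_sum(channels, delays):
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i + d                  # undo this mic's arrival delay
            acc += ch[j] if 0 <= j < n else 0.0
        out.append(acc / len(channels))
    return out

# Two mics hear the same tone (the second one 3 samples late) plus
# independent noise; averaging the aligned channels reduces the noise power.
random.seed(2)
src = [math.sin(2 * math.pi * n / 40) for n in range(400)]
mics = [
    [s + random.uniform(-0.3, 0.3) for s in src],
    [0.0] * 3 + [src[n - 3] + random.uniform(-0.3, 0.3) for n in range(3, 400)],
]
beam = delay_and_sum(mics, [0, 3])
mse_beam = sum((beam[i] - src[i]) ** 2 for i in range(390)) / 390
mse_single = sum((mics[0][i] - src[i]) ** 2 for i in range(390)) / 390
print(mse_beam < mse_single)           # -> True
```

With M microphones and uncorrelated noise, averaging cuts the noise power by a factor of M, which is the whole appeal of the method.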

CGMM Beamforming

https://github.com/funcwj/CGMM-MVDR

MVDR Beamforming

https://github.com/DistantSpeechRecognition/mcse(mvdr + postfilter)

GSC Beamforming

Other DNN-based methods

https://github.com/fgnt/nn-gev

Voice Activity Detection

Detects whether the input signal contains speech, helping to reduce unnecessary processing when there is no speech. Common tools include Sohn VAD and WebRTC VAD.

Sohn VAD

https://github.com/eesungkim/Voice_Activity_Detector

WebRTC VAD

https://github.com/wiseman/py-webrtcvad

DNN VAD
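The crudest possible VAD is a frame-energy threshold, sketched below with arbitrary frame size and threshold. Sohn's VAD replaces the hard threshold with a statistical likelihood-ratio test, and WebRTC's uses Gaussian mixture models, but the frame-by-frame structure is the same:

```python
import math

def energy_vad(samples, frame_len=160, threshold=0.01):
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        flags.append(power > threshold)    # speech if frame power is high
    return flags

silence = [0.0] * 320
speech = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]
print(energy_vad(silence + speech))    # -> [False, False, True, True]
```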

Noise Suppression

Reduces background noise to improve the clarity of the spoken input.

MMSE-STSA
https://github.com/eesungkim/Speech_Enhancement_MMSE-STSA

NS of WebRTC audio processing
https://github.com/xiongyihui/python-webrtc-audio-processing
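A classic baseline here is spectral subtraction: estimate the noise magnitude spectrum from noise-only frames, subtract it bin by bin, and resynthesize with the noisy phase. MMSE-STSA refines this with a statistically optimal per-bin gain. The sketch below uses a naive O(n²) DFT so it stays dependency-free; a real implementation would use an FFT with overlapping windows:

```python
import cmath, math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_subtraction(frame, noise_mag):
    # Subtract the estimated noise magnitude per bin, keep the noisy phase.
    cleaned = []
    for Xk, nk in zip(dft(frame), noise_mag):
        mag = max(abs(Xk) - nk, 0.0)   # floor at zero: no negative magnitudes
        cleaned.append(cmath.rect(mag, cmath.phase(Xk)))
    return idft(cleaned)

# A tone corrupted by a fixed "hum"; the hum's spectrum (measured from a
# noise-only frame) is subtracted, leaving the tone nearly intact.
n = 32
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
hum = [0.3 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
noise_mag = [abs(v) for v in dft(hum)]
noisy = [a + b for a, b in zip(tone, hum)]
clean = spectral_subtraction(noisy, noise_mag)
print(max(abs(c - t) for c, t in zip(clean, tone)) < 1e-6)  # -> True
```

The zero-floor is what produces the "musical noise" artifact that motivates the smoother MMSE-style gains.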

KWS

  • Mycroft Precise - A lightweight, simple-to-use, RNN wake word listener
  • Snowboy - DNN based hotword and wake word detection toolkit
  • Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting
  • ML-KWS-For-MCU - perhaps the most promising option for resource-constrained devices such as an ARM Cortex-M7 microcontroller
  • Porcupine - Lightweight, cross-platform engine to build custom wake words in seconds

ASR

NeMo | https://github.com/NVIDIA/NeMo
ESPnet | https://github.com/espnet/espnet
SpeechBrain | https://github.com/speechbrain/speechbrain
Kaldi | https://github.com/kaldi-asr/kaldi

NLU

  • Rasa NLU
  • Snips NLU - a Python library that parses sentences written in natural language and extracts structured information.
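Before machine-learned NLU, intent parsing was done with hand-written rules, and a tiny rule-based version still illustrates what "intent plus slots" means. The intent names and patterns below are invented for illustration; Rasa and Snips learn this mapping from annotated training examples instead:

```python
import re

def parse(text):
    rules = [
        (r"set (?:an? )?alarm (?:for|at) (?P<time>\d{1,2}(?::\d{2})?)", "set_alarm"),
        (r"what(?:'s| is) the weather(?: in (?P<city>\w+))?", "get_weather"),
    ]
    for pattern, intent in rules:
        m = re.search(pattern, text.lower())
        if m:
            # Named groups that matched become the slot values.
            slots = {k: v for k, v in m.groupdict().items() if v}
            return {"intent": intent, "slots": slots}
    return {"intent": "unknown", "slots": {}}

print(parse("Set an alarm for 7:30"))
# -> {'intent': 'set_alarm', 'slots': {'time': '7:30'}}
```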

TTS

Audio I/O

  • portAudio, pyaudio
  • libsoundio
  • ALSA
  • pulseAudio
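For a feel of the sample format these APIs move around, here is a minimal WAV write/read round trip using only Python's standard-library `wave` and `struct` modules (an in-memory buffer stands in for a file; PortAudio/PyAudio stream the same 16-bit PCM frames to and from the hardware instead):

```python
import io, math, struct, wave

rate = 16000
tone = [int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate // 10)]

buf = io.BytesIO()                     # in-memory stand-in for a .wav file
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                  # mono
    w.setsampwidth(2)                  # 16-bit signed samples
    w.setframerate(rate)
    w.writeframes(struct.pack("<%dh" % len(tone), *tone))

buf.seek(0)
with wave.open(buf, "rb") as r:
    frames = struct.unpack("<%dh" % r.getnframes(), r.readframes(r.getnframes()))

print(list(frames) == tone)            # -> True
```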
