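• I reviewed the papers on FSMN, an acoustic model that performs very well in speech recognizers.
  • In signal-processing theory, an IIR filter can be approximated by a high-order FIR filter.
  • In RNN-family models, the recurrent layer is conceptually similar to a first-order IIR filter.
  • The key idea is to propose a DNN structure that behaves like a high-order FIR filter and can replace the recurrent layer.
  • A memory block is added to a feedforward neural network (FNN). It encodes long-context information from the frames before and after the current frame, and the network uses that encoding when it processes the current frame.
  • Compared with RNN-family models, the model is lighter and training is more stable.
  • Variants: scalar FSMN, vector FSMN, compact (vector) FSMN, Deep-FSMN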
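
The FIR view of the memory block can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the papers' implementation: the function names, the tap arrays `a` (look-back) and `c` (look-ahead), and zero padding at the utterance edges are choices made here. Each hidden dimension gets its own taps, which corresponds to the vector FSMN variant; using the same tap value across all dimensions would give the scalar variant.

```python
import numpy as np

def vfsmn_memory_block(h, a, c):
    """FIR-style memory over hidden activations (illustrative sketch).

    h: (T, D) hidden activations, one row per frame.
    a: (N1 + 1, D) look-back taps, a[i] weights h[t - i].
    c: (N2, D) look-ahead taps, c[j - 1] weights h[t + j].
    Returns a (T, D) array of memory outputs.
    """
    T, _ = h.shape
    n_back = a.shape[0] - 1
    n_ahead = c.shape[0]
    # Zero-pad in time so frames outside the utterance contribute nothing.
    padded = np.pad(h, ((n_back, n_ahead), (0, 0)))
    out = np.zeros(h.shape, dtype=float)
    for i in range(n_back + 1):        # past and current frames
        out += a[i] * padded[n_back - i : n_back - i + T]
    for j in range(1, n_ahead + 1):    # future frames
        out += c[j - 1] * padded[n_back + j : n_back + j + T]
    return out

def first_order_iir(x, alpha):
    """Recurrent reference: y[t] = x[t] + alpha * y[t - 1]."""
    y = np.zeros(len(x))
    prev = 0.0
    for t, value in enumerate(x):
        prev = value + alpha * prev
        y[t] = prev
    return y

def fir_taps_for_first_order_iir(alpha, order):
    """Truncated impulse response alpha**k, k = 0..order."""
    return alpha ** np.arange(order + 1)
```

With `|alpha| < 1`, the truncated taps from `fir_taps_for_first_order_iir` passed as `a` (and an empty `c`) reproduce the recurrent output up to an error that shrinks geometrically with the filter order, which is the IIR-to-FIR approximation the list above refers to.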

 

 


 

[1] https://arxiv.org/abs/1803.05030

 

Let's go over the algorithms needed to build a smart speaker or voice assistant. Organized around the required components, a simple flow might look like this:

 

Mic -> Audio Processing -> KWS -> ASR -> NLU -> knowledge/Skill/Action -> TTS -> Speaker


A brief description of each module

  • Audio Processing includes Acoustic Echo Cancellation, Beamforming, Noise Suppression (NS).
  • Keyword Spotting (KWS) detects a keyword (e.g., "Okay Google") to start a conversation.
  • Speech To Text (STT or ASR)
  • Natural Language Understanding (NLU) converts raw text into structured data.
  • Knowledge/Skill/Action - A knowledge-based model provides an answer.
  • Text To Speech (TTS)
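
To make the flow concrete, here is a minimal sketch of how the stages could be chained, with KWS acting as a gate in front of the more expensive stages. Every name in it (`VoiceAssistant`, `Intent`, the stage callables, `build_toy_assistant`) is hypothetical and chosen for illustration, and the stub stages stand in for real engines.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Intent:
    """Structured NLU output: an intent name plus slot values."""
    name: str
    slots: dict = field(default_factory=dict)

@dataclass
class VoiceAssistant:
    enhance: Callable[[bytes], bytes]        # Audio Processing
    detect_keyword: Callable[[bytes], bool]  # KWS
    transcribe: Callable[[bytes], str]       # ASR
    understand: Callable[[str], Intent]      # NLU
    act: Callable[[Intent], str]             # Knowledge/Skill/Action
    synthesize: Callable[[str], bytes]       # TTS

    def handle(self, mic_audio: bytes) -> Optional[bytes]:
        audio = self.enhance(mic_audio)
        if not self.detect_keyword(audio):
            return None  # no wake word: skip ASR, NLU, and TTS entirely
        text = self.transcribe(audio)
        intent = self.understand(text)
        reply = self.act(intent)
        return self.synthesize(reply)

def build_toy_assistant() -> VoiceAssistant:
    """Wire the pipeline with stub stages (placeholders, not real engines)."""
    return VoiceAssistant(
        enhance=lambda audio: audio,
        detect_keyword=lambda audio: audio.startswith(b"hey"),
        transcribe=lambda audio: "turn on the light",
        understand=lambda text: Intent("light_on", {"device": "light"}),
        act=lambda intent: "ok: " + intent.name,
        synthesize=lambda reply: reply.encode("utf-8"),
    )
```

Keeping each stage behind a plain callable makes it easy to swap one open-source component for another without touching the rest of the chain.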

Below is a quick, off-the-top-of-my-head list of algorithms for each module, focusing on open-source implementations.

Audio Processing

Several Basic Filters for sound and speech processing

https://github.com/voidqk/sndfilter

reverb, dynamic range compression, lowpass, highpass, notch

Automatic Gain Control

TF AGC: https://github.com/jorgehatccrma/pyagc

Acoustic Echo Cancellation

Removes echoes that can occur when a microphone picks up audio from a speaker, preventing feedback loops.
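
The basic idea can be sketched with a time-domain NLMS adaptive filter: estimate the echo path from the loudspeaker (far-end) signal and subtract the predicted echo from the microphone signal. This is a simplified stand-in chosen for illustration, not how SpeexDSP implements AEC, and the function name and parameters (`num_taps`, `step`, `eps`) are assumptions.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=64, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `far_end` from `mic`.

    Returns (residual, weights): the echo-reduced signal and the
    learned estimate of the echo path impulse response.
    """
    w = np.zeros(num_taps)       # echo path estimate
    x_buf = np.zeros(num_taps)   # recent far-end samples, newest first
    residual = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_estimate = w @ x_buf
        e = mic[n] - echo_estimate
        residual[n] = e
        # Normalized LMS update: step size scaled by the input energy.
        w += step * e * x_buf / (x_buf @ x_buf + eps)
    return residual, w
```

A real system also needs double-talk handling (freezing or slowing adaptation while the near-end user speaks) and a residual echo stage, which this sketch leaves out.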

SpeexDSP

https://github.com/xiph/speexdsp

A daemon based on SpeexDSP AEC for devices running Linux: https://github.com/voice-engine/ec

Residual Echo Suppression (RES) - also implemented in SpeexDSP

Direction Of Arrival (DOA) - The most widely used DOA algorithm is GCC-PHAT
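
As a rough sketch of GCC-PHAT for a two-microphone pair: the cross-power spectrum is whitened so only phase remains, and the peak of its inverse transform gives the time delay. The function names, the sign convention (positive when `sig` lags `ref`), and the far-field angle formula with a default speed of sound of 343 m/s are assumptions made for this example.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so index k corresponds to lag (k - max_shift).
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)

def doa_angle_deg(tau, mic_distance, speed_of_sound=343.0):
    """Far-field angle from broadside for a two-microphone pair."""
    ratio = np.clip(tau * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

Passing `max_tau = mic_distance / speed_of_sound` restricts the search to physically possible delays, which tends to make the peak picking more robust in reverberant rooms.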

DOA (SRP-PHAT and GCC-PHAT)

https://github.com/wangwei2009/DOA

TDOA

https://github.com/xiongyihui/tdoa

ODAS

https://github.com/introlab/odas

ODAS stands for Open embeddeD Audition System. This is a library dedicated to perform sound source localization, tracking, separation and post-filtering.

Beamforming

Involves using multiple microphones to focus on sounds from a specific direction, enhancing the signal from the desired source while suppressing noise. Common approaches include delay-and-sum (often driven by GCC-PHAT delay estimates), MVDR, GSC, and DNN-based methods.

  • Direction Of Arrival (DOA): Estimates the direction of the incoming sound. This is important for beamforming and source localization. Algorithms like SRP-PHAT, GCC-PHAT, and systems like ODAS are used.
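
The simplest of these, delay-and-sum, can be sketched as follows: each channel is time-aligned toward the target direction and the channels are averaged, so the target adds coherently while uncorrelated noise partially cancels. The sketch assumes integer sample delays that are already known (for example from a DOA or TDOA estimate); the function names are illustrative.

```python
import numpy as np

def advance(x, d):
    """Shift x earlier by d samples (d > 0 undoes a delay), zero-filling."""
    out = np.zeros_like(x)
    if d > 0:
        out[:-d] = x[d:]
    elif d < 0:
        out[-d:] = x[:d]
    else:
        out[:] = x
    return out

def delay_and_sum(channels, delays_samples):
    """Align and average microphone channels.

    channels: (M, N) array, one row per microphone.
    delays_samples: per-channel arrival delay (integer samples)
        relative to the reference microphone.
    """
    channels = np.asarray(channels, dtype=float)
    aligned = [advance(ch, int(d)) for ch, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)
```

Fractional delays, which matter for closely spaced microphones, would need interpolation or a frequency-domain phase shift instead of the integer shift used here.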

BeamformIt - delay & sum beamforming

https://github.com/xanguera/BeamformIt

CGMM Beamforming

https://github.com/funcwj/CGMM-MVDR

MVDR Beamforming

https://github.com/DistantSpeechRecognition/mcse (mvdr + postfilter)

GSC Beamforming

Other DNN-based methods

https://github.com/fgnt/nn-gev

Voice Activity Detection

Detects whether the input signal contains speech, helping to reduce unnecessary processing when there is no speech. Common tools include Sohn VAD and WebRTC VAD.
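
For intuition, the simplest possible detector is a per-frame energy threshold, sketched below. This is a deliberately crude stand-in, not the statistical model behind Sohn VAD or the classifier in WebRTC VAD; the function name, the 30 ms frame length, and the -40 dB threshold (relative to full scale 1.0) are assumptions.

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=30, threshold_db=-40.0):
    """Return one boolean per frame: True where the frame energy
    (in dB relative to full scale) exceeds the threshold."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len          # drop the trailing partial frame
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy + 1e-12)  # small offset avoids log(0)
    return energy_db > threshold_db
```

A fixed threshold breaks down in non-stationary noise, which is why practical detectors track the noise floor or use a trained model.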

Sohn VAD

https://github.com/eesungkim/Voice_Activity_Detector

WebRTC VAD

https://github.com/wiseman/py-webrtcvad

DNN VAD

Noise Suppression

Reduces background noise to improve the clarity of the spoken input.
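
A minimal way to see how this works is magnitude spectral subtraction, sketched below. It is a simpler technique swapped in for illustration, not MMSE-STSA or the WebRTC suppressor. The sketch assumes that the first `noise_frames` frames contain noise only and that the input is at least one frame long; the function name and all parameters are choices made here.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=10, frame_len=512, floor=0.05):
    """Subtract an average noise magnitude spectrum from each frame."""
    noisy = np.asarray(noisy, dtype=float)
    hop = frame_len // 2
    # Periodic sqrt-Hann: analysis * synthesis windows sum to 1 at 50% overlap.
    window = np.sqrt(np.hanning(frame_len + 1)[:-1])
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack(
        [noisy[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)   # noise estimate per bin
    # Keep a spectral floor so bins are attenuated rather than zeroed.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    clean *= window
    out = np.zeros(len(noisy))
    for i in range(n_frames):                     # overlap-add resynthesis
        out[i * hop : i * hop + frame_len] += clean[i]
    return out
```

The noisy phase is reused unchanged, and the first and last half frames are attenuated by the window; both are common simplifications in this kind of sketch.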

MMSE-STSA
https://github.com/eesungkim/Speech_Enhancement_MMSE-STSA

NS of WebRTC audio processing
https://github.com/xiongyihui/python-webrtc-audio-processing

KWS

  • Mycroft Precise - A lightweight, simple-to-use, RNN wake word listener
  • Snowboy - DNN-based hotword and wake word detection toolkit
  • Honk - PyTorch reimplementation of Google's TensorFlow CNNs for keyword spotting
  • ML-KWS-For-MCU - Perhaps the most promising option for resource-constrained devices such as ARM Cortex-M7 microcontrollers
  • Porcupine - Lightweight, cross-platform engine to build custom wake words in seconds

ASR

NeMo | https://github.com/NVIDIA/NeMo
ESPnet | https://github.com/espnet
SpeechBrain | https://github.com/speechbrain
Kaldi | https://github.com/kaldi-asr/kaldi

NLU

  • Rasa NLU
  • Snips NLU - a Python library that parses sentences written in natural language and extracts structured information.

TTS

Audio I/O

  • PortAudio, PyAudio
  • libsoundio
  • ALSA
  • PulseAudio
