Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models, ICASSP 2020, USC (intern), Google


- The encoder module of an RNN-T model trained for E2E ASR is reused as a feature extractor.
- An RNN with multi-head self-attention consumes the utterance-level, variable-length feature vectors produced by the encoder and classifies emotion through a softmax layer (see the sketch below).
- Evaluated on the IEMOCAP (WA/UA = 71.7/72.6) and SWBD-sentiment datasets
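
A minimal PyTorch sketch of this classifier, assuming the encoder features are precomputed; the `rnnt_encoder` handle, the 512-dim feature size, and all layer widths are illustrative assumptions rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """BiLSTM + multi-head self-attention over frozen RNN-T encoder
    features, mean-pooled over time into a softmax classifier."""
    def __init__(self, feat_dim=512, hidden=128, heads=4, n_classes=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):            # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)           # (B, T, 2*hidden)
        a, _ = self.attn(h, h, h)        # self-attention over time
        pooled = a.mean(dim=1)           # handles variable-length inputs
        return self.out(pooled)          # logits; softmax lives in the loss

# feats = rnnt_encoder(waveform)        # hypothetical frozen extractor
# logits = SentimentHead()(feats)
```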


Fusion Approaches for Emotion Recognition from Speech Using Acoustic and Text-Based Features


1) Text-based Model

- Uses 768-dim lexical features extracted from BERT as the base text features

- First conv1d: embedding layer that reduces the dimension (from 768 to 128)

- Second conv1d: captures relationships across neighboring elements of the sequence

- Mean pooling over time

- Softmax layer (a minimal sketch of this branch follows below)
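
A minimal sketch of this branch, assuming the BERT token embeddings are stacked as a (batch, 768, time) tensor; the 128-channel width of the second conv1d, its kernel size, and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """conv1d (768->128) -> conv1d (local context) -> mean pooling
    over time -> softmax classifier over BERT features."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.reduce = nn.Conv1d(768, 128, kernel_size=1)    # dim reduction
        self.context = nn.Conv1d(128, 128, kernel_size=3,
                                 padding=1)                 # neighbors
        self.out = nn.Linear(128, n_classes)

    def embed(self, x):                  # x: (B, 768, T) BERT features
        h = torch.relu(self.reduce(x))   # 768 -> 128 per time step
        h = torch.relu(self.context(h))  # neighboring-element context
        return h.mean(dim=2)             # mean pooling over time

    def forward(self, x):
        return self.out(self.embed(x))   # logits; softmax lives in the loss
```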


2) Audio-based Model

- Uses 36-dim acoustic features (pitch, jitter, shimmer, logHNR, loudness, 13 MFCCs, and their first-order derivatives) as the base features

- Two conv1d layers: model the temporal evolution of the input sequence

- Mean pooling over time

- Softmax layer (sketched below)
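
Since the audio branch mirrors the text branch, a sketch only needs the 36-dim input width changed; the kernel sizes and channel counts are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Two conv1d layers over the 36-dim acoustic sequence, then mean
    pooling over time and a softmax classifier."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(36, 128, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(128, 128, kernel_size=5, padding=2)
        self.out = nn.Linear(128, n_classes)

    def embed(self, x):                  # x: (B, 36, T) acoustic features
        h = torch.relu(self.conv1(x))
        h = torch.relu(self.conv2(h))    # temporal evolution
        return h.mean(dim=2)             # mean pooling over time

    def forward(self, x):
        return self.out(self.embed(x))   # logits; softmax lives in the loss
```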


3) Fusion Models

- One approach trains the text-based and audio-based models jointly from the start

- Another pre-trains the text-based and audio-based models separately, fixes the learned weights from the lower layers up, and then continues training the layers on top, among other variants (a sketch of this freeze-then-fine-tune variant follows below)
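
A hedged sketch of the freeze-then-fine-tune variant, reusing the TextBranch/AudioBranch sketches above; fusing by concatenating the pooled embeddings is one plausible reading, not necessarily the paper's exact fusion head:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Freezes two pre-trained branches and trains only a small
    classifier over their concatenated pooled embeddings."""
    def __init__(self, text_branch, audio_branch, n_classes=4):
        super().__init__()
        self.text, self.audio = text_branch, audio_branch
        for p in self.text.parameters():
            p.requires_grad = False      # fix pre-trained text weights
        for p in self.audio.parameters():
            p.requires_grad = False      # fix pre-trained audio weights
        self.fuse = nn.Linear(128 + 128, n_classes)

    def forward(self, text_x, audio_x):
        t = self.text.embed(text_x)      # (B, 128) frozen text embedding
        a = self.audio.embed(audio_x)    # (B, 128) frozen audio embedding
        return self.fuse(torch.cat([t, a], dim=1))
```

Training would then optimize only `self.fuse`, e.g. with `nn.CrossEntropyLoss`, which applies the softmax internally.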
