Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models, ICASSP 2020, USC (intern), Google


- The encoder module of an RNN-T model trained for E2E ASR is reused as a feature extractor.
- An RNN with multi-head self-attention consumes the utterance-level, variable-length feature vectors produced by the encoder and classifies emotion through a softmax layer (see the sketch below).
- Evaluated on the IEMOCAP (WA/UA = 71.7/72.6) and SWBD-sentiment datasets
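
A minimal PyTorch sketch of this classifier, assuming the encoder features are precomputed; the `rnnt_encoder` handle, the 512-dim feature size, and all layer widths are illustrative assumptions rather than the paper's exact values:

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """BiLSTM + multi-head self-attention over frozen RNN-T encoder
    features, mean-pooled over time into a softmax classifier."""
    def __init__(self, feat_dim=512, hidden=128, heads=4, n_classes=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):            # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)           # (B, T, 2*hidden)
        a, _ = self.attn(h, h, h)        # self-attention over time
        pooled = a.mean(dim=1)           # handles variable-length inputs
        return self.out(pooled)          # logits; softmax lives in the loss

# feats = rnnt_encoder(waveform)        # hypothetical frozen extractor
# logits = SentimentHead()(feats)
```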


Fusion Approaches for Emotion Recognition from Speech Using Acoustic and Text-Based Features


1) Text-based Model

- Uses 768-dim lexical features extracted from BERT as the base text features

- First conv1d: embedding layer that reduces the dimension (from 768 to 128)

- Second conv1d: captures relationships across neighboring elements of the sequence

- Mean pooling over time

- Softmax layer (a minimal sketch of this branch follows below)
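
A minimal sketch of this branch, assuming the BERT token embeddings are stacked as a (batch, 768, time) tensor; the 128-channel width of the second conv1d, its kernel size, and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """conv1d (768->128) -> conv1d (local context) -> mean pooling
    over time -> softmax classifier over BERT features."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.reduce = nn.Conv1d(768, 128, kernel_size=1)    # dim reduction
        self.context = nn.Conv1d(128, 128, kernel_size=3,
                                 padding=1)                 # neighbors
        self.out = nn.Linear(128, n_classes)

    def embed(self, x):                  # x: (B, 768, T) BERT features
        h = torch.relu(self.reduce(x))   # 768 -> 128 per time step
        h = torch.relu(self.context(h))  # neighboring-element context
        return h.mean(dim=2)             # mean pooling over time

    def forward(self, x):
        return self.out(self.embed(x))   # logits; softmax lives in the loss
```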


2) Audio-based Model

- Uses 36-dim acoustic features (pitch, jitter, shimmer, logHNR, loudness, 13 MFCCs, and their first-order derivatives) as the base features

- Two conv1d layers: model the temporal evolution of the input sequence

- Mean pooling over time

- Softmax layer (sketched below)
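
Since the audio branch mirrors the text branch, a sketch only needs the 36-dim input width changed; the kernel sizes and channel counts are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Two conv1d layers over the 36-dim acoustic sequence, then mean
    pooling over time and a softmax classifier."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(36, 128, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(128, 128, kernel_size=5, padding=2)
        self.out = nn.Linear(128, n_classes)

    def embed(self, x):                  # x: (B, 36, T) acoustic features
        h = torch.relu(self.conv1(x))
        h = torch.relu(self.conv2(h))    # temporal evolution
        return h.mean(dim=2)             # mean pooling over time

    def forward(self, x):
        return self.out(self.embed(x))   # logits; softmax lives in the loss
```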


3) Fusion Models

- One approach trains the text-based and audio-based models jointly from the start

- Another pre-trains the text-based and audio-based models separately, fixes the learned weights from the lower layers up, and then continues training the layers on top, among other variants (a sketch of this freeze-then-fine-tune variant follows below)
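
A hedged sketch of the freeze-then-fine-tune variant, reusing the TextBranch/AudioBranch sketches above; fusing by concatenating the pooled embeddings is one plausible reading, not necessarily the paper's exact fusion head:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Freezes two pre-trained branches and trains only a small
    classifier over their concatenated pooled embeddings."""
    def __init__(self, text_branch, audio_branch, n_classes=4):
        super().__init__()
        self.text, self.audio = text_branch, audio_branch
        for p in self.text.parameters():
            p.requires_grad = False      # fix pre-trained text weights
        for p in self.audio.parameters():
            p.requires_grad = False      # fix pre-trained audio weights
        self.fuse = nn.Linear(128 + 128, n_classes)

    def forward(self, text_x, audio_x):
        t = self.text.embed(text_x)      # (B, 128) frozen text embedding
        a = self.audio.embed(audio_x)    # (B, 128) frozen audio embedding
        return self.fuse(torch.cat([t, a], dim=1))
```

Training would then optimize only `self.fuse`, e.g. with `nn.CrossEntropyLoss`, which applies the softmax internally.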
