Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models, ICASSP 2020, USC (intern), Google
- The encoder module of an RNN-T model trained for the end-to-end ASR objective is reused as a feature extractor.
- An RNN with multi-head self-attention classifies the sentiment (via a softmax) from the variable-length, utterance-level feature vectors produced by the encoder, as sketched below.
- Evaluated on IEMOCAP (WA/UA = 71.7/72.6) and the SWBD-sentiment DB.
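A minimal PyTorch sketch of the classification head described above (RNN plus multi-head self-attention over the frozen ASR encoder features). The encoder dimension (512), hidden size, head count, and three sentiment classes are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Classifier over variable-length RNN-T encoder features (sketch)."""
    def __init__(self, enc_dim=512, hidden=128, num_heads=8, num_classes=3):
        super().__init__()
        # RNN over the sequence of pre-trained ASR encoder features
        self.rnn = nn.GRU(enc_dim, hidden, batch_first=True, bidirectional=True)
        # multi-head self-attention over the RNN outputs
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, feats):             # feats: (batch, frames, enc_dim)
        h, _ = self.rnn(feats)            # (batch, frames, 2*hidden)
        a, _ = self.attn(h, h, h)         # self-attention: query = key = value
        pooled = a.mean(dim=1)            # pool over time -> one vector per utterance
        return self.fc(pooled)            # logits; softmax is applied in the loss

feats = torch.randn(4, 200, 512)          # e.g. 200 frames of encoder output
logits = SentimentHead()(feats)           # (4, 3)
```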
Fusion Approaches for Emotion Recognition from Speech Using Acoustic and Text-Based Features
1) Text-based Model
- 768-dim lexical features extracted from BERT are used as the base text features.
- 1st conv1d: embedding layer that reduces the dimension (from 768 to 128).
- 2nd conv1d: captures relationships across neighboring elements of the sequence.
- mean pooling over time
- softmax layer (see the sketch after this list)
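A sketch of this text branch in PyTorch; only the 768-to-128 reduction is stated above, so the second kernel size (3) and the four-class output are assumptions.

```python
import torch
import torch.nn as nn

class TextModel(nn.Module):
    """Text branch: BERT features -> 2x conv1d -> mean pool -> softmax (sketch)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(768, 128, kernel_size=1)              # 768 -> 128 embedding
        self.conv2 = nn.Conv1d(128, 128, kernel_size=3, padding=1)   # neighboring tokens
        self.relu = nn.ReLU()
        self.fc = nn.Linear(128, num_classes)

    def forward(self, bert_feats):        # (batch, tokens, 768)
        x = bert_feats.transpose(1, 2)    # conv1d expects (batch, channels, time)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = x.mean(dim=2)                 # mean pooling over time
        return self.fc(x)                 # logits for the softmax layer
```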
2) Audio-based Model
- 36-dim acoustic features (pitch, jitter, shimmer, logHNR, loudness, 13 MFCCs, and their first-order derivatives) are used as the base features.
- 2 conv1d layers: model the temporal evolution of the input sequence.
- mean pooling over time
- softmax layer (a similar sketch follows this list)
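The audio branch follows the same pattern over the 36-dim frame-level features; the kernel sizes and channel counts here are assumed, not given above.

```python
import torch
import torch.nn as nn

class AudioModel(nn.Module):
    """Audio branch: 36-dim features -> 2x conv1d -> mean pool -> softmax (sketch)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(36, 128, kernel_size=5, padding=2)    # temporal evolution
        self.conv2 = nn.Conv1d(128, 128, kernel_size=5, padding=2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(128, num_classes)

    def forward(self, acoustic):          # (batch, frames, 36)
        x = acoustic.transpose(1, 2)      # (batch, 36, frames) for conv1d
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = x.mean(dim=2)                 # mean pooling over time
        return self.fc(x)
```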
3) Fusion Models
- Train the Text-based Model and Audio-based Model jointly at the same time.
- Or pre-train the Text-based Model and Audio-based Model separately, freeze the learned weights from the lower layers, and then train the remaining layers further, among other variants (a sketch of the frozen-branch variant follows).
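A sketch of the second strategy, reusing the TextModel and AudioModel classes from the sketches above: the pre-trained branches are frozen and only a new classifier over the concatenated pooled embeddings is trained. The joint variant simply skips the freezing and trains both branches end-to-end with the fusion head.

```python
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    """Late fusion with frozen pre-trained branches (sketch)."""
    def __init__(self, text_model, audio_model, num_classes=4):
        super().__init__()
        self.text_model, self.audio_model = text_model, audio_model
        for p in self.text_model.parameters():    # freeze pre-trained weights
            p.requires_grad = False
        for p in self.audio_model.parameters():
            p.requires_grad = False
        self.fc = nn.Linear(128 + 128, num_classes)  # only this layer is trained

    @staticmethod
    def embed(model, x):
        # run a branch up to its pooled embedding, skipping its own softmax head
        x = x.transpose(1, 2)
        x = model.relu(model.conv1(x))
        x = model.relu(model.conv2(x))
        return x.mean(dim=2)

    def forward(self, bert_feats, acoustic):
        t = self.embed(self.text_model, bert_feats)   # (batch, 128)
        a = self.embed(self.audio_model, acoustic)    # (batch, 128)
        return self.fc(torch.cat([t, a], dim=1))

# fusion = FusionModel(TextModel(), AudioModel())  # branches pre-trained beforehand
```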