Posts tagged 'ICASSP'

ICASSP 2022 | Language Modeling

2022. 5. 7. 20:24

Language Modeling

Technical Program Session SPE-4

CAPITALIZATION NORMALIZATION FOR LANGUAGE MODELING WITH AN ACCURATE AND EFFICIENT HIERARCHICAL RNN MODEL

Google Research

Problem

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text.
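
To make the task concrete, here is a minimal sketch of a frequency-based baseline truecaser. It is not the hierarchical RNN from the paper; the function names `train_truecaser` and `truecase` and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count the surface forms seen for each lowercased word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Skip the sentence-initial token: its capital letter is positional.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return counts

def truecase(text, counts):
    """Restore the most frequent casing of each known word."""
    restored = []
    for token in text.split():
        forms = counts.get(token.lower())
        # Unknown words fall back to lowercase (an assumption of this sketch).
        restored.append(forms.most_common(1)[0][0] if forms else token.lower())
    return " ".join(restored)

# Toy corpus, invented for this example.
corpus = ["We flew to Paris in June", "I met Anna in Paris", "They will visit Paris"]
model = train_truecaser(corpus)
print(truecase("ANNA WENT TO PARIS", model))
```

A lookup like this cannot use context or handle unseen words, which is the kind of limitation a word-and-character model is meant to address.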

Proposed method

A fast, accurate and compact two-level hierarchical word-and-character-based RNN

Used the truecaser to normalize user-generated text in a Federated Learning framework for language modeling.

Key Findings

In a real-user A/B experiment, the authors demonstrated that the improvement translates to reduced prediction error rates in a virtual keyboard application.

NEURAL-FST CLASS LANGUAGE MODEL FOR END-TO-END SPEECH RECOGNITION

Facebook AI, USA

Proposed method

Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition

A novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework.

Key Findings

NFCLM significantly outperforms NNLM by 15.8% relative in terms of WER.

NFCLM achieves performance similar to traditional NNLM and FST shallow fusion while being less prone to overbiasing and 12 times more compact, making it more suitable for on-device usage.

ENHANCE RNNLMS WITH HIERARCHICAL MULTI-TASK LEARNING FOR ASR

University of Missouri, USA

Proposed method

Key Findings

RESCOREBERT: DISCRIMINATIVE SPEECH RECOGNITION RESCORING WITH BERT

Amazon Alexa AI, USA; Emory University, USA

Problem

Second-pass rescoring improves the outputs of a first-pass decoder by applying lattice rescoring or n-best re-ranking.

Proposed method (RescoreBERT)

Authors showed how to train a BERT-based rescoring model with minimum WER (MWER) loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR.

The authors proposed a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill knowledge from a pretrained model. They also proposed an alternative discriminative loss.
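
The MWER objective mentioned above can be sketched for a single n-best list as follows. This is a minimal illustration, not the authors' implementation: it assumes scores are costs (lower is better), that first-pass and rescoring scores are combined linearly with a hypothetical `weight`, and that the word-error counts per hypothesis are already known.

```python
import math

def mwer_loss(first_pass_scores, rescoring_scores, edit_errors, weight=1.0):
    """Expected (mean-shifted) word errors over an n-best list.

    Assumption of this sketch: scores are costs, so the hypothesis
    posterior is a softmax over the negated combined scores.
    """
    combined = [f + weight * r for f, r in zip(first_pass_scores, rescoring_scores)]
    # Numerically stable softmax over negated costs.
    best = min(combined)
    unnorm = [math.exp(-(s - best)) for s in combined]
    total = sum(unnorm)
    posteriors = [u / total for u in unnorm]
    # Subtract the mean error count as a baseline; because the posteriors
    # sum to one, this only shifts the loss by a constant.
    mean_err = sum(edit_errors) / len(edit_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, edit_errors))

# Two hypotheses: the first has no word errors, the second has two.
print(mwer_loss([0.0, 5.0], [0.0, 0.0], [0, 2]))
```

The loss decreases as probability mass moves toward hypotheses with fewer word errors, which is what makes it a discriminative objective for a rescoring model.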

Key Findings

Reduced WER by 6.6%/3.4% relative on the LibriSpeech clean/other test sets over a BERT baseline without a discriminative objective.

The model also reduces both latency and WER (by 3 to 8% relative) compared with an LSTM rescoring model.

Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages

Cognitive Systems Lab, University of Bremen, Germany

Problem

Handling out-of-vocabulary (OOV) words, i.e. words unseen in training.

Morphologically rich languages have a high type-token ratio, so their OOV percentage is also quite high.

Sub-word segmentation has been found to be one of the major approaches in dealing with OOVs.
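
The two quantities in the problem statement can be computed as in the sketch below. The function names and the toy English sentences are assumptions for illustration; the paper's experiments use a Malayalam-English corpus, which is not reproduced here.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never appear in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy data, invented for this example.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()
print(type_token_ratio(train))
print(oov_rate(train, test))
```

A language with rich morphology produces many distinct surface forms per lemma, which raises the type-token ratio and, for a fixed training set, the share of test words that fall outside the vocabulary.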

Proposed method

This paper presents a hybrid sub-word segmentation algorithm to deal with OOVs.

A sub-word segmentation evaluation methodology is also presented.

All experiments are conducted on a conversational code-switched Malayalam-English corpus.

ICASSP 2022 | Speech Recognition: Robust Speech Recognition I

2022. 5. 7. 18:08

Speech Recognition: Robust Speech Recognition I

Technical Program Session SPE-2

AUDIO-VISUAL MULTI-CHANNEL SPEECH SEPARATION, DEREVERBERATION AND RECOGNITION

The Chinese University of Hong Kong; Tencent AI lab

Problem

Accurate recognition of cocktail party speech, which is characterised by interference from overlapping speakers, background noise and room reverberation.

Proposed method

This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all three stages of the system.

The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively.

BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION

Google, Inc.

Problem

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker’s face.

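In practice, multiple faces are often present. Traditionally, a separate active speaker detection (ASD) model was used to pick out, at every timestep, the active speaker's face that matches the audio. More recently, an attention module has been added instead, so that the audio and all face candidates are fed into the model and handled end-to-end without designing a separate ASD model.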

Proposed method

2.1. A/V Backbone: Shared Audio-Visual Frontend

Acoustic Features. log mel filterbank

Audio and Video Synchronization. resample video

Visual Features. ConvNet on top of the synchronized video

Attention Mechanism. An attention module over the candidate faces soft-selects the one matching the audio.

2.2. ASR Model - Transformer-Transducer Model

For ASR, the weighted visual features and input acoustic features are then concatenated along the last dimension, producing audio-visual features which are then fed to the ASR encoder.

2.3. ASD Model

For ASD, the attention scores are used directly for the model prediction. For each audio query and each timestep, the attention scores give a measure of how well each candidate video corresponds to the audio.
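
The soft-selection and concatenation steps in 2.1 to 2.3 can be sketched for a single timestep as below. This is an illustration under stated assumptions, not the paper's architecture: it uses plain dot-product attention with the acoustic feature as the query, assumes audio and visual features share a dimension, and the names `fuse_frame`, `softmax` and `dot` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_frame(audio_feat, face_feats):
    """Soft-select a face for one timestep and build the A/V feature.

    audio_feat: acoustic feature vector for this frame (the query).
    face_feats: one visual feature vector per candidate face.
    """
    scores = [dot(audio_feat, face) for face in face_feats]
    weights = softmax(scores)  # these weights double as the ASD prediction
    dim = len(face_feats[0])
    weighted_visual = [
        sum(w * face[d] for w, face in zip(weights, face_feats))
        for d in range(dim)
    ]
    # Concatenate along the feature (last) dimension for the ASR encoder.
    return audio_feat + weighted_visual, weights

fused, weights = fuse_frame([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
print(len(fused), weights)
```

Because the same attention weights feed the ASR input and act as the ASD output, a single module serves both tasks.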

3. MULTI-TASK LOSS FOR A/V ASR AND ASD

ASD. For active speaker detection, the normalized attention weights can be used to train the attention module directly with cross entropy loss.

ASR. RNN-T loss

MTL Loss. We combine the ASD and ASR losses with a weighted linear sum of the losses
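
A minimal sketch of the combined objective follows. The RNN-T loss is treated as an already computed number, and the weight parameters `asr_weight` and `asd_weight` and their default values are assumptions; the notes only state that the two losses are combined as a weighted linear sum.

```python
import math

def asd_cross_entropy(attention_weights, active_index, eps=1e-12):
    """Cross entropy between normalized attention weights and the true speaker."""
    # eps guards against log(0); it is an implementation detail of this sketch.
    return -math.log(max(attention_weights[active_index], eps))

def multitask_loss(asr_loss, asd_loss, asr_weight=1.0, asd_weight=0.5):
    """Weighted linear sum of the ASR (e.g. RNN-T) and ASD losses."""
    return asr_weight * asr_loss + asd_weight * asd_loss

# Placeholder ASR loss value; in practice it would come from an RNN-T loss.
asd = asd_cross_entropy([0.5, 0.5], active_index=0)
print(multitask_loss(2.0, asd))
```

Tuning the two weights trades off recognition accuracy against speaker-detection accuracy during training.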

Key Findings

This paper presents a multi-task learning (MTL) approach for a model that can simultaneously perform audio-visual ASR and active speaker detection, improving on previous work on multi-person audio-visual ASR.

Combining the two tasks is enough to significantly improve the performance of the model in the ASD task relative to the baseline.

IMPROVING NOISE ROBUSTNESS OF CONTRASTIVE SPEECH REPRESENTATION LEARNING WITH SPEECH RECONSTRUCTION

The Ohio State University, USA, Microsoft Corporation

Problem

Noise Robust ASR

Proposed method

In this paper, authors employ a noise-robust representation learned by a refined self-supervised framework of wav2vec 2.0 for noisy speech recognition. They combine a reconstruction module with contrastive learning and perform multi-task continual pre-training to explicitly reconstruct the clean speech from the noisy input.
