Model Overview

  • Whisper is a Transformer-based encoder-decoder model.

Training Data

  • Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.

Whisper ASR V1 and V2

  • Trained on 680,000 hours of audio and corresponding transcripts from the internet.
  • Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.

Whisper ASR V3

  • Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
  • V3 shows a 10% to 20% reduction in errors compared to V2.

Training Details

  • Initial models were trained with the AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period (a schedule sketch follows after this list).
  • No data augmentation or regularization was used initially due to the diversity and size of the dataset.
  • For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
  • Different max learning rates were used for different model sizes.
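A minimal PyTorch sketch of this schedule (AdamW, gradient-norm clipping, linear warmup then linear decay); the stand-in model, max learning rate, warmup length, and total step count are illustrative placeholders, not the values used for any particular Whisper size.

# Hedged sketch, not the official training code.
import torch

model = torch.nn.Linear(80, 512)              # stand-in for the actual model
max_lr, warmup_steps, total_steps = 1e-3, 2048, 1_000_000

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

def lr_lambda(step: int) -> float:
    # Linear warmup to max_lr, then linear decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    batch_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()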

Hyperparameters

General Hyperparameters

Hyperparameters for Whisper Large V2

Model Learning Rates

 

 

Summary

SPE-54: Keyword Spotting


Unified Speculation, Detection, and Verification Keyword Spotting

Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang, Amazon Alexa Science

 


Problem

 

- Accurate and timely recognition of the trigger keyword is vital.

- There is a trade-off to be made between accuracy and latency.

 

Proposed method

 

- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword detection model.

- We propose a latency-aware max-pooling loss and show empirically that it teaches a model to maximize accuracy under a latency constraint.

- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves different accuracy-latency trade-offs across these three tasks.

 

 

 

1. Unified speculation, detection, and verification model

- Speculation makes an early decision, which can be used to give a head-start to downstream processes on the device.

- Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context.

- Verification verifies previous decision by observing even more audio after the keyword span.

 

2. Model architecture and training strategy

- CRNN architecture

- Multi-task learning with different target latencies using the newly proposed latency-aware max-pooling loss (see the sketch below).
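A minimal sketch of how a latency-aware max-pooling loss could look, assuming per-frame keyword/non-keyword logits, a known keyword end frame for positive utterances, and a latency budget in frames; this is one reading of the idea, not the authors' implementation.

# Hedged sketch of a latency-aware max-pooling loss.
# logits: (T, 2) per-frame non-keyword/keyword scores for one utterance,
# kw_end: frame index where the keyword ends, max_latency: allowed extra frames.
import torch
import torch.nn.functional as F

def latency_aware_max_pool_loss(logits, is_keyword, kw_end=None, max_latency=0):
    probs = F.softmax(logits, dim=-1)[:, 1]          # per-frame keyword posterior
    if is_keyword:
        # Only frames up to kw_end + max_latency may fire, which pushes the model
        # to reach a confident decision within the latency budget.
        window = probs[: kw_end + max_latency + 1]
        t = torch.argmax(window)
        target = torch.tensor([1])
    else:
        # For negatives, penalize the most confident (hardest) frame.
        t = torch.argmax(probs)
        target = torch.tensor([0])
    return F.cross_entropy(logits[t].unsqueeze(0), target)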


Temporal Early Exiting for Streaming Speech Commands Recognition

Comcast Applied AI, University of Waterloo

 


Problem

 

Voice queries take time to process: 

 

Stage 1: The user is speaking (seconds). 

Stage 2: Finish ASR transcription (~50ms). 

Stage 3: Information retrieval (~500ms).

 

 

 

Proposed method

 

- Use a streaming speech commands model for the top-K voice queries.

- Apply a training objective that enables better early exiting across time, i.e., returning a prediction before the entire audio is observed.

- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off.

 

Model

- GRU Model

- Per-frame output probability distribution over K commands (classes); a confidence-threshold early-exit sketch follows below.
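A minimal sketch of the streaming setup described above: a GRU producing a per-frame distribution over K commands, with confidence-threshold early exiting at inference time. Shapes, sizes, and the threshold value are illustrative assumptions, not the authors' configuration.

# Hedged sketch of confidence-threshold early exiting over a streaming GRU classifier.
import torch
import torch.nn.functional as F

class StreamingCommandClassifier(torch.nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_commands=20):
        super().__init__()
        self.gru = torch.nn.GRU(n_mels, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, n_commands)

    def forward(self, frames):                 # frames: (1, T, n_mels)
        h, _ = self.gru(frames)
        return self.head(h)                    # per-frame logits: (1, T, n_commands)

def early_exit(model, frames, threshold=0.9):
    # Return (prediction, exit_frame) as soon as the per-frame confidence
    # exceeds the threshold; otherwise fall back to the last frame.
    logits = model(frames)[0]                  # (T, n_commands)
    probs = F.softmax(logits, dim=-1)
    for t in range(probs.size(0)):
        conf, pred = probs[t].max(dim=-1)
        if conf.item() >= threshold:
            return pred.item(), t
    return probs[-1].argmax().item(), probs.size(0) - 1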

 

Early-Exiting Objectives

 

Connectionist temporal classification (CTC):

Last-frame cross entropy (LF):

All-frame cross entropy (AF):
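A hedged reading of the LF and AF objectives listed above (the paper's exact formulations may differ): LF applies cross entropy only at the final frame, while AF averages it over all frames; CTC is the standard torch.nn.CTCLoss over the frame sequence and is omitted here.

# Hedged sketch of the LF and AF objectives. logits: (T, K); label_idx: utterance class.
import torch
import torch.nn.functional as F

def last_frame_ce(logits: torch.Tensor, label_idx: int) -> torch.Tensor:
    return F.cross_entropy(logits[-1:], torch.tensor([label_idx]))

def all_frame_ce(logits: torch.Tensor, label_idx: int) -> torch.Tensor:
    targets = torch.full((logits.size(0),), label_idx, dtype=torch.long)
    return F.cross_entropy(logits, targets)   # averages the per-frame cross entropies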

 

Findings

 

1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].

2. The observed exit indices correlate with the optimal indices for all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one.


Self-supervised Learning for Speech and Audio Processing I

Technical Program Session MLSP-3

 


UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS

 

Verily Life Sciences, Boston, USA and Mountain View, California, USA

 


Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2 second context-windows achieve 96% the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near optimal performance on all tasks.
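A minimal sketch of the evaluation recipe described in the abstract: time-average frozen per-frame representations and fit a simple linear classifier on top. The feature arrays are assumed to come from some frozen encoder; this is not the authors' 600M-parameter Conformer pipeline.

# Hedged sketch of a linear probe on time-averaged representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def time_average(frame_features: np.ndarray) -> np.ndarray:
    # frame_features: (T, D) frozen per-frame embeddings -> (D,) utterance vector
    return frame_features.mean(axis=0)

def fit_linear_probe(X_frames, y):
    # X_frames: list of (T_i, D) arrays from the frozen encoder; y: task labels
    X = np.stack([time_average(f) for f in X_frames])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf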

 

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747197


 

Proposed method


Key Findings

 


 

A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION

NEL-SLIP, University of Science and Technology of China (USTC), Hefei, China

 


Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It was shown that wav2vec2.0 has a good robustness against the domain shift, while the noise robustness is still unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve the ASR performance on the noisy test set, which however brings a performance degradation on the clean test set. To avoid this issue, in this work we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and the corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method can not only improve the ASR performance on the noisy test set which surpasses the original wav2vec2.0, but also ensure a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.

 

https://ieeexplore.ieee.org/document/9747379

 


AN ADAPTER BASED PRE-TRAINING FOR EFFICIENT AND SCALABLE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING

Huawei R&D UK, University of Oxford

 


https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747374

 


CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS

University of Wroclaw, Poland, NavAlgo, France, NVIDIA, Poland, Universite de Toulon, France


We identify a performance trade-off between the tasks of phoneme categorization and phoneme and word segmentation in several self-supervised learning algorithms based on Contrastive Predictive Coding (CPC). Our experiments suggest that context building networks, albeit necessary for high performance on categorization tasks, harm segmentation performance by causing a temporal shift on the learned representations. Aiming to tackle this trade-off, we take inspiration from the leading approaches on segmentation and propose multi-level Aligned CPC (mACPC). It builds on Aligned CPC (ACPC), a variant of CPC which exhibits the best performance on categorization tasks, and incorporates multi-level modeling and optimization for detection of spectral changes. Our methods improve in all tested categorization metrics and achieve state-of-the-art performance in word segmentation.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746102

 


 

CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING

National Taiwan University, The Chinese University of Hong Kong


SUPERB

 

A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data, has fueled the research for speech representation learning. The SUPERB demonstrates speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of the self-supervised learning upstream model followed by downstream tasks arouses more attention in the speech community, characterizing the adversarial robustness of such paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and the attacks generated by zero-knowledge adversaries are with transferability. The XAB test verifies the imperceptibility of crafted adversarial attacks.

 

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9747242


 


There have been various generic and language-specific approaches to sub-word segmentation to handle the OOV problem for machine translation and ASR tasks. 

 

Various subword units like phoneme, syllable, character, morpheme, and their combinations have been used in different approaches to subword modelling. There have also been both generic and language-specific approaches. Listed below are some of the major sub-word segmentation approaches. One of the earlier approaches to ASR was Korean syllable-based segmentation [8]. Some of the earlier language-specific approaches were for German LVCSR [10] and Polish [11]. There was a morpheme-based OOV handling approach for the Turkish ASR keyword spotting task [9] and for multiple languages [12]. 

 

Popular recent approaches to unsupervised segmentation:

Both the Byte Pair Encoding (BPE) and WordPiece algorithms work by merging adjacent characters.

 

BPE: the merge pair is chosen based on frequency (merging adjacent characters); see the sketch below.

WordPiece: the merge is chosen to maximize the training-data likelihood (merging adjacent characters).

Unigram and BPE dropout [14] are some of the sub-word segmentation regularization techniques.
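A minimal sketch of a single frequency-based BPE merge step over a word-frequency vocabulary (WordPiece would instead pick the merge that maximizes training-data likelihood); the vocabulary format and helper names are illustrative.

# Hedged sketch of one BPE merge step.
from collections import Counter

def most_frequent_pair(vocab):                 # vocab: {('l','o','w','</w>'): freq, ...}
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(vocab, pair):
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged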

 

Libraries implementing segmentation algorithms (a SentencePiece usage sketch follows below):

SentencePiece,

subword-nmt [16],

Morfessor [17],

MorphAGram [15].
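A minimal usage sketch of the SentencePiece Python API from the list above; the file names, vocabulary size, and model type are illustrative.

# Hedged usage sketch of SentencePiece training and encoding.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="subword", vocab_size=2000, model_type="unigram"
)
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("speech recognition example", out_type=str))  # sub-word pieces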

 

 

[1] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages, ICASSP 2022

[7] M. Huck, S. Riess, and A. Fraser, "Target-side Word Segmentation Strategies for Neural Machine Translation," in Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 56–67, Copenhagen, Denmark, 2017.

[8] D. Kiecza, T. Schultz and A. Waibel, “Data-Driven Determination of Appropriate Dictionary Units for Korean LVCSR”, in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[9] Y. He, B. Hutchinson, P. Baumann, M. Ostendorf, E. Fosler-Lussier, and J. Pierrehumbert, "Subword-Based Modeling For Handling OOV Words In Keyword Spotting", in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Italy, 2014.

[10] A. El-Desoky Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney, "Sub-Lexical Language Models For German LVCSR", in Proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), 2010.

[11] M.A.B. Shaik, A.E.-D. Mousa, R. Schluter, and H. Ney, “Using morpheme and syllable based sub-words for Polish LVCSR”, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4680–4683, 2011.

[12] M. Creutz, T. Hirsimäki, M. Kurimo, A. Puurula, “Morph-based speech recognition and modeling of out-of-vocabulary words across languages” in ACM Transactions on Speech and Language Processing (TSLP). 5(1):3, 2007

[13] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[14] I. Provilkov, D. Emelianenko and E. Voita, “BPE-Dropout: Simple and Effective Subword Regularization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, 2020.

[15] R. Eskander, F. Callejas, E. Nichols, J. Klavans, and S. Muresan, "MorphAGram: Evaluation and Framework for Unsupervised Morphological Segmentation", in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7112–7122, 2020.

[16] “Subword-nmt”, Available at: https://github.com/rsennrich/subword-nmt [Accessed : 10 January, 2021]

[17] “Morfessor”, Available at: https://github.com/aaltospeech/morfessor [Accessed : 10 January, 2021].

 

 

Language Modeling

Technical Program Session SPE-4

 


CAPITALIZATION NORMALIZATION FOR LANGUAGE MODELING WITH AN ACCURATE AND EFFICIENT HIERARCHICAL RNN MODEL

 

Google Research


Problem

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text.

 

Proposed method

A fast, accurate and compact two-level hierarchical word-and-character-based RNN

 

Used the truecaser to normalize user-generated text in a Federated Learning framework for language modeling.

Key Findings

 

In a real user A/B experiment, the authors demonstrated that the improvement translates to reduced prediction error rates in a virtual keyboard application.


 

NEURAL-FST CLASS LANGUAGE MODEL FOR END-TO-END SPEECH RECOGNITION

Facebook AI, USA


 

Proposed method

Neural-FST Class Language Model (NFCLM) for end-to-end speech recognition

 

A novel method that combines neural network language models (NNLMs) and finite state transducers (FSTs) in a mathematically consistent framework.

 

Key Findings

 

NFCLM significantly outperforms NNLM by 15.8% relative in terms of WER.

 

NFCLM achieves similar performance as traditional NNLM and FST shallow fusion while being less prone to overbiasing and 12 times more compact, making it more suitable for on-device usage.

 


ENHANCE RNNLMS WITH HIERARCHICAL MULTI-TASK LEARNING FOR ASR

 

University of Missouri, USA


Proposed method

 

 

 

 

Key Findings

 


RESCOREBERT: DISCRIMINATIVE SPEECH RECOGNITION RESCORING WITH BERT

1Amazon Alexa AI, USA 2Emory University, USA


Problem

 

Second-pass rescoring improves the outputs from a first-pass decoder by rescoring lattices or re-ranking n-best lists.

 

Proposed method (RescoreBERT)

 

The authors showed how to train a BERT-based rescoring model with a minimum WER (MWER) loss, to incorporate the improvements of a discriminative loss into the fine-tuning of deep bidirectional pretrained models for ASR.

 

The authors proposed a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill knowledge from a pretrained model, and further proposed an alternative discriminative loss (an MWER-style sketch follows below).
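A hedged sketch of an MWER-style loss over an n-best list, as one reading of the general recipe rather than the exact RescoreBERT objective; scores are assumed to be the combined first-pass plus rescorer costs (lower is better), and errors the per-hypothesis word-error counts.

# Hedged sketch of a minimum-WER-style rescoring loss over an n-best list.
import torch

def mwer_loss(scores: torch.Tensor, errors: torch.Tensor) -> torch.Tensor:
    # Hypothesis posteriors from the negated costs, restricted to the n-best list.
    probs = torch.softmax(-scores, dim=-1)
    # Expected word errors, baselined by the average over the list.
    errors = errors.float()
    relative_errors = errors - errors.mean()
    return torch.sum(probs * relative_errors)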

Key Findings

 

Reduced WER by 6.6%/3.4% relative on the LibriSpeech clean/other test sets over a BERT baseline without the discriminative objective.

 

Found that it reduces both latency and WER (by 3 to 8% relative) over an LSTM rescoring model.


Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages

 

Cognitive Systems Lab, University Bremen, Germany


Problem

 

Dealing with Out Of Vocabulary (OOV) words or unseen words

 

For morphologically rich languages with a high type-token ratio, the OOV percentage is also quite high.

 

Sub-word segmentation has been found to be one of the major approaches in dealing with OOVs.

 

Proposed method 

 

This paper presents a hybrid sub-word segmentation algorithm to deal with OOVs.

 

A sub-word segmentation evaluation methodology is also presented.

 

All experiments are done on a conversational code-switched Malayalam-English corpus.

Speech Recognition: Robust Speech Recognition I

Technical Program Session SPE-2

 


AUDIO-VISUAL MULTI-CHANNEL SPEECH SEPARATION, DEREVERBERATION AND RECOGNITION

The Chinese University of Hong Kong; Tencent AI lab


Problem

 

Accurate recognition of cocktail party speech, characterised by interference from overlapping speakers, background noise and room reverberation.

 

Proposed method

 

In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all three stages of the system is proposed.

 

The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches, based on DNN-WPE and spectral mapping respectively.

 

 


BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION

Google, Inc.


Problem

 

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker’s face.

 

In practice there are often multiple faces on screen. Traditionally, a separate active speaker detection (ASD) model was used to pick out, at every time step, the active speaker's face that matches the audio; more recently, an attention module is added so that, rather than designing a separate ASD component, the audio and all face candidates are fed into the model and handled in an end-to-end way.

 

Proposed method

2.1. A/V Backbone: Shared Audio-Visual Frontend

 

Acoustic Features. Log mel filterbank.

Audio and Video Synchronization. Resample the video.

Visual Features. A ConvNet on top of the synchronized video.

Attention Mechanism. Soft-selects the face track matching the audio (see the sketch below).
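A minimal sketch of this attention-based soft selection, assuming per-frame audio features as queries and N candidate face tracks as keys/values; dimensions and projection sizes are illustrative, not the paper's configuration.

# Hedged sketch: audio queries attend over candidate face tracks; the weights
# serve as an ASD signal and the weighted visual features feed the ASR encoder.
import torch
import torch.nn.functional as F

class FaceAttention(torch.nn.Module):
    def __init__(self, d_audio=512, d_video=512, d_att=256):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_audio, d_att)
        self.k_proj = torch.nn.Linear(d_video, d_att)
        self.scale = d_att ** 0.5

    def forward(self, audio, faces):
        # audio: (T, d_audio); faces: (N, T, d_video) candidate face tracks
        q = self.q_proj(audio)                               # (T, d_att)
        k = self.k_proj(faces)                               # (N, T, d_att)
        scores = torch.einsum("td,ntd->tn", q, k) / self.scale
        weights = F.softmax(scores, dim=-1)                  # per-frame weights over faces
        visual = torch.einsum("tn,ntd->td", weights, faces)  # soft-selected visual features
        return weights, visual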

 

2.2. ASR Model - Transformer-Transducer Model

 

For ASR, the weighted visual features and input acoustic features are then concatenated along the last dimension, producing audio-visual features which are then fed to the ASR encoder.

 

2.3. ASD Model

 

For ASD, the attention scores are used directly for the model prediction. For each audio query and each timestep, the attention scores give a measure of how well each candidate video corresponds to the audio.

 

3. MULTI-TASK LOSS FOR A/V ASR AND ASD

ASD. For active speaker detection, the normalized attention weights can be used to train the attention module directly with cross entropy loss.

ASR. RNN-T loss

 

MTL Loss. The ASD and ASR losses are combined as a weighted linear sum of the losses (see the sketch below). 
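A minimal sketch of the combination, assuming the ASD term is a cross entropy on the per-frame attention weights against the true active-speaker index and the ASR term is a precomputed RNN-T loss; the weighting value is illustrative.

# Hedged sketch of the multi-task loss.
import torch
import torch.nn.functional as F

def asd_loss(attn_weights: torch.Tensor, speaker_idx: torch.Tensor) -> torch.Tensor:
    # attn_weights: (T, N) per-frame distribution over N face candidates;
    # speaker_idx: (T,) index of the true active speaker per frame.
    return F.nll_loss(torch.log(attn_weights + 1e-8), speaker_idx)

def multitask_loss(rnnt_loss, attn_weights, speaker_idx, asd_weight=0.1):
    # Weighted linear sum of the ASR (RNN-T) and ASD (cross-entropy) losses.
    return rnnt_loss + asd_weight * asd_loss(attn_weights, speaker_idx)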

 

 

Key Findings

 

This paper presents multi-task learning (MTL) for a model that can simultaneously perform audio-visual ASR and active speaker detection, improving on previous work on multi-person audio-visual ASR.

 

Combining the two tasks is enough to significantly improve the performance of the model in the ASD task relative to the baseline.

 


IMPROVING NOISE ROBUSTNESS OF CONTRASTIVE SPEECH REPRESENTATION LEARNING WITH SPEECH RECONSTRUCTION

The Ohio State University, USA, Microsoft Corporation


Problem

 

Noise Robust ASR

 

Proposed method

 

In this paper, the authors employ a noise-robust representation learned by a refined self-supervised framework of wav2vec 2.0 for noisy speech recognition. They combine a reconstruction module with contrastive learning and perform multi-task continual pre-training to explicitly reconstruct the clean speech from the noisy input (a loss sketch follows below).
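A minimal sketch of the idea, assuming the reconstruction term pulls the noisy-input encoder features toward features of the paired clean speech and is added to the usual wav2vec 2.0 contrastive loss; function names and the weighting are illustrative placeholders, not the paper's implementation.

# Hedged sketch of a contrastive + reconstruction multi-task pre-training loss.
import torch
import torch.nn.functional as F

def continual_pretrain_loss(contrastive_loss: torch.Tensor,
                            noisy_features: torch.Tensor,
                            clean_targets: torch.Tensor,
                            recon_weight: float = 1.0) -> torch.Tensor:
    # noisy_features: encoder outputs for the noisy input, (T, D)
    # clean_targets: features of the paired clean speech, (T, D), used as targets
    recon_loss = F.l1_loss(noisy_features, clean_targets.detach())
    return contrastive_loss + recon_weight * recon_loss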

 

 


 

This post explains how to query data from a Hive table and download the table data to a client.

 

1. Querying data

  • Query the previously created Hive table from beeline.
SELECT
  *
FROM speech_db.speech_internal_db
WHERE ymd between '2021-06-11' and '2021-06-11';

+----------------------+--------------------------+------------------------------------+---------------------+
| speech_internal_db.indx  | speech_internal_db.path_wav  |      speech_internal_db.utterance      | speech_internal_db.ymd  |
+----------------------+--------------------------+------------------------------------+---------------------+
| 1                    | /root/1.wav              | This is an example                 | 2021-06-11          |
| 2                    | /root/2.wav              | Let us learn apache hive together  | 2021-06-11          |
+----------------------+--------------------------+------------------------------------+---------------------+

2. Saving the results

  • Save the query results to a file. The results are stored in the /user/new/download directory.
  • The following properties control whether the Hive output is compressed (e.g., with gzip):
    • hive.exec.compress.output: whether to compress the output
    • mapred.output.compression.codec: the compression codec to use; uses one of the values configured in io.compression.codecs in core-site.xml
set hive.exec.compress.output=false;


-- Save the results to /user/new/download.
INSERT OVERWRITE DIRECTORY '/user/new/download'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
STORED AS TEXTFILE
SELECT
  *
FROM speech_db.speech_internal_db
WHERE ymd between '2021-06-11' and '2021-06-11';

3. Check that the files were saved to /user/new/download, then download them. Delete the files after downloading.

[hadoop] [user@user-MacBookPro-5 ~/Downloads 15:12:34] hadoop fs -get /user/new/download

[hadoop] [user@user-MacBookPro-5 ~/Downloads 15:13:59] ls download/000000_0
download/000000_0

 

This post summarizes the process of turning data on a client into a Hive table. One of my current tasks is managing a speech DB, so I will walk through the exercise with a simple related dataset.

The overall process is as follows.

  1. Upload the file to the desired location on Hadoop HDFS.
  2. Create a temporary external table to read the uploaded file. It should be dropped once the data load is complete.
  3. Read the data from the temporary external table and load it into the final internal table.
    • At this point, change the file format to ORC or PARQUET.
  4. Verify that the data has been loaded correctly into the final internal table.
  5. DROP the temporary external table and delete the uploaded file.

Preparing the data

[hadoop] [user@user-MacBookPro-5 ~/Downloads 14:40:35] hadoop fs -put /Users/user/Downloads/example-2021-06-11.txt /ns/new/speech_db/txt
[hadoop] [user@user-MacBookPro-5 ~/Downloads 14:41:07] hadoop fs -ls /ns/new/speech_db/txt
Found 1 items
-rw-rw-r--   3 user new        108 2021-05-22 14:41 /ns/new/speech_db/txt/example-2021-06-11.txt
[hadoop] [user@user-MacBookPro-5 ~/Downloads 14:41:14]

We will create a table from the following file.

The file consists of INDX, PATH_WAV, and UTTERANCE columns, separated by \t (tab).

Tables can be created from many file formats such as CSV and JSON and from many sources such as RDBMSs and logs; whatever migration tool is used, a table can be created once the data is uploaded to the shared Hadoop HDFS.

Creating the temporary external table


  1. Create an external table, pointing at the designated location in HDFS, as a temporary table
    -- Drop the table if it already exists.
    DROP TABLE IF EXISTS speech_db.speech_external_db;
     
    -- Create the table. The first (header) row is skipped.
    CREATE EXTERNAL TABLE speech_db.speech_external_db (
      INDX int
      , PATH_WAV string
      , UTTERANCE string
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t' -- fields are separated by \t
    LINES TERMINATED BY '\n'  -- lines are separated by \n
    STORED AS TEXTFILE
    LOCATION '/ns/new/speech_db/txt' -- the parent directory of the uploaded file
    TBLPROPERTIES ('skip.header.line.count'='1');
  2. Verify that the temporary external table can be queried
    SELECT * FROM speech_db.speech_external_db LIMIT 2;
     
    +----------------------+--------------------------+------------------------------------+
    | speech_external_db.indx  | speech_external_db.path_wav  |      speech_external_db.utterance      |
    +----------------------+--------------------------+------------------------------------+
    | 1                    | /root/1.wav              | This is an example                 |
    | 2                    | /root/2.wav              | Let us learn apache hive together  |
    +----------------------+--------------------------+------------------------------------+
    1 row selected (0.274 seconds)

Creating the final internal table


  1. Create the final internal table
    • Partition by date; the file format is ORC.
    -- Create the table if it does not exist. Data is stored in the ORC file format.
    CREATE TABLE IF NOT EXISTS speech_db.speech_internal_db (
      INDX int
      , PATH_WAV string
      , UTTERANCE string
    )
    PARTITIONED BY ( ymd string ) -- the partition column name is ymd
    STORED AS ORC;
  2. Check the location of the final table
    • You can confirm it was created at /ns/new/speech_db.db/speech_internal_db.
    SHOW CREATE TABLE speech_db.speech_internal_db;
     
    +----------------------------------------------------+
    |                   createtab_stmt                   |
    +----------------------------------------------------+
    | CREATE TABLE `speech_db.speech_internal_db`(                          |
    |   `indx` int,                                      |
    |   `path_wav` string,                               |
    |   `utterance` string)                              |
    | PARTITIONED BY (                                   |
    |   `ymd` string)                                    |
    | ROW FORMAT SERDE                                   |
    |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      |
    | STORED AS INPUTFORMAT                              |
    |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  |
    | OUTPUTFORMAT                                       |
    |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
    | LOCATION                                           |
    |   'hdfs://hadoop-ns/new/speech_db.db/speech_internal_db'  |
    | TBLPROPERTIES (                                    |
    |   'transient_lastDdlTime'='1621662306')            |
    +----------------------------------------------------+
    16 rows selected (0.351 seconds)
  3. Read the data from the temporary table and load it into the final table
    • Note that if the final table partition already contains data, it will be OVERWRITTEN.
    -- Add the 2021-06-11 partition if it does not exist.
    ALTER TABLE speech_db.speech_internal_db ADD IF NOT EXISTS PARTITION (ymd='2021-06-11');
     
    
    -- Caution: if the final table's 2021-06-11 partition already contains data, it will be overwritten.
    INSERT OVERWRITE TABLE speech_db.speech_internal_db PARTITION (ymd='2021-06-11')
    SELECT * FROM speech_db.speech_external_db;
  4. Verify that the data was loaded into the final table
    SELECT * FROM speech_db.speech_internal_db WHERE ymd='2021-06-11';
     
    +----------------------+--------------------------+------------------------------------+---------------------+
    | speech_internal_db.indx  | speech_internal_db.path_wav  |      speech_internal_db.utterance      | speech_internal_db.ymd  |
    +----------------------+--------------------------+------------------------------------+---------------------+
    | 1                    | /root/1.wav              | This is an example                 | 2021-06-11          |
    | 2                    | /root/2.wav              | Let us learn apache hive together  | 2021-06-11          |
    +----------------------+--------------------------+------------------------------------+---------------------+
    2 rows selected (0.209 seconds)
  5. Verify that the partition directory was created
    [hadoop] [user@user-MacBookPro-5 ~/Downloads 14:41:14] hadoop fs -ls /ns/new/speech_db.db/speech_internal_db
    Found 1 items
    drwxrwxr-x   - user new          0 2021-05-22 15:01 /ns/new/speech_db.db/speech_internal_db/ymd=2021-06-11
  6. Drop the temporary table
    DROP TABLE speech_db.speech_external_db;
  7. Delete the uploaded file under the temporary table's LOCATION directory
    [doopey] [user ~/Downloads/hive 16:32:08] hadoop fs -rm -r /ns/new/speech_db/txt/example-2021-06-11.txt
    21/05/22 15:03:53 INFO fs.TrashPolicyDefault: Moved: 'hdfs://hadoop-ns/new/speech_db/txt/example-2021-06-11.txt' to trash at: hdfs://hadoop-ns/user/user/.Trash/Current/new/speech_db/txt/example-2021-06-11.txt
