There are generally three ways to perform text-only adaptation:
Injecting synthesized speech data into the model
generate audio for the training texts via TTS and inject it into the model
LM fusion
Fusion and biasing (shallow fusion):
during decoding, interpolate the posterior word probabilities with text priors from an external LM (see the scoring sketch after this list)
a more recent approach estimates the internal LM probabilities and discounts them, i.e., rescores with the ratio of external to internal LM probabilities
Rescoring and reranking
after decoding, use a powerful external LM to update the scores and rerank the n-best results or the recognition lattice
These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight used for the external LM.
Explicit separation of internal LMs
force the E2E decoder/predictor to behave more like a language model (e.g., the hybrid autoregressive transducer (HAT), the modular hybrid autoregressive transducer, and the factorized transducer)
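As a rough sketch of the two fusion variants above (not taken from [1] or [2]; the weights λ_ext and λ_int are the interpolation hyperparameters mentioned earlier), the decoding scores can be written as:

```latex
% Shallow fusion: interpolate E2E posteriors with an external LM prior
\hat{y} = \arg\max_{y} \Big[ \log P_{\mathrm{E2E}}(y \mid x) + \lambda_{\mathrm{ext}} \log P_{\mathrm{ext}}(y) \Big]

% Internal LM estimation/subtraction: additionally discount the E2E model's implicit text prior
\hat{y} = \arg\max_{y} \Big[ \log P_{\mathrm{E2E}}(y \mid x) - \lambda_{\mathrm{int}} \log P_{\mathrm{int}}(y) + \lambda_{\mathrm{ext}} \log P_{\mathrm{ext}}(y) \Big]
```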
Reference
[1] External Language Model Integration for Factorized Neural Transducers
[2] In-situ Text-only Adaptation of Speech Models with Low-overhead Speech Imputations
This paper presents a method for jointly pre-training speech and text in an encoder-decoder framework to improve performance in speech translation and recognition tasks.
Key Takeaways:
Architecture: The method utilizes an Attention-based Encoder-Decoder (AED) framework to integrate data from different modalities (speech and text) for representation learning.
Shared Encoder and Decoder: The STPT framework uses a shared encoder and decoder for both the speech and text modalities, which allows the model to integrate knowledge from both domains.
Acoustic and Linguistic Representation Learning: The STPT framework is designed to learn both acoustic features from speech and linguistic features from text during the pre-training stage. This is crucial for speech translation models, which must understand the sounds of speech as well as the meaning of words.
Joint Pre-Training Phase (Multi-Task Learning): The proposed Speech and Text joint Pre-Training (STPT) framework integrates different pre-training tasks to build a robust model capable of handling multiple aspects of speech and language; it incorporates four self-supervised and supervised subtasks designed for cross-modality learning.
Text-to-Text (T2T): This self-supervised task helps the model learn linguistic patterns in the text. It's similar to how models like BERT learn by predicting masked words in a sentence.
Speech SSL learning (SSL): This is another self-supervised task focused on learning from the speech data alone, likely involving predicting some masked or hidden parts of the speech input.
Speech-to-Phoneme (S2P): A supervised task where the model is trained to predict phoneme units from speech data. Phonemes are the smallest units of sound in a language, so this task helps the model learn the sounds that make up speech.
Speech-to-Subword (S2T): Also a supervised task, where the model learns to predict subword units from the speech input. Subwords are larger than phonemes and can carry more linguistic information, like syllables or parts of words.
Loss Functions: Pre-training is guided by a loss function for each of the four tasks:
L_T2T: The loss for the Text-to-Text task.
L_SSL: The loss for the speech SSL task, which involves masked prediction.
L_S2P: The loss for the Speech-to-Phoneme task, which involves phoneme-unit sequence classification.
L_S2T: The loss for the Speech-to-Subword task, involving sequential prediction of subword tokens.
Final Loss: The overall pre-training objective is a combination of these losses, guiding the model to learn both modality-specific and cross-modal representations (see the sketch after this list).
Improved Performance: The STPT method effectively fuses speech and text information into one model, leading to significant improvements in performance. It achieves 1.7 to 2.3 BLEU score improvements on the MUST-C speech translation dataset and comparable word error rates (WERs) to the wav2vec 2.0 model on the LibriSpeech speech recognition task.
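A hedged sketch of the combined pre-training objective; the per-task weights α_i are placeholders, and the paper's exact weighting may differ:

```latex
\mathcal{L}_{\mathrm{final}} = \alpha_1 \mathcal{L}_{\mathrm{T2T}} + \alpha_2 \mathcal{L}_{\mathrm{SSL}} + \alpha_3 \mathcal{L}_{\mathrm{S2P}} + \alpha_4 \mathcal{L}_{\mathrm{S2T}}
```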
This paper presents a new model, SpeechUT, which aims to bridge the gap between speech and text representations in the context of pre-training for speech-to-text tasks.
Key Takeaways:
Tasks: SpeechUT incorporates three unsupervised pre-training tasks: speech-to-unit (S2U), masked unit modeling (MUM), and unit-to-text (U2T). These tasks help to learn better representations for the speech and text modalities.
Architecture: SpeechUT comprises a speech encoder, unit encoder, and text decoder, along with speech and unit pre-nets to process the inputs.
Unified-Modal Speech-Unit-Text Pre-training Model (SpeechUT): The proposed model is designed to connect the representations of speech and text through a shared unit encoder. It allows for pre-training with unpaired speech and text data, which can be beneficial for tasks like automatic speech recognition (ASR) and speech translation (ST). SpeechUT is a new pre-training method using hidden-unit representations to connect speech encoders and text decoders.
Discrete Representation (Units): SpeechUT leverages hidden-unit representations as an interface to align speech and text. This is done by decomposing the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be pre-trained separately with large amounts of unpaired data. The model uses discrete unit sequences produced by off-line generators, allowing pre-training on large-scale unpaired speech and text (see the sketch after this list).
Embedding Mixing: An embedding mixing mechanism is introduced to better align speech and unit representations.
Pre-Training and Fine-Tuning Methods: The paper describes how SpeechUT is pre-trained with the mentioned tasks and fine-tuned for specific ASR and ST tasks.
Pre-Training Tasks: SpeechUT includes three unsupervised pre-training tasks: speech-to-unit, masked unit modeling, and unit-to-text.
Fine-Tuning: For downstream tasks like ASR and ST, SpeechUT is fine-tuned without introducing new parameters, utilizing the pre-trained modules.
Performance: The paper reports that SpeechUT achieves substantial improvements over strong baselines and sets new state-of-the-art performance on the LibriSpeech ASR and MuST-C ST benchmarks.
Detailed Analyses: The paper includes detailed analyses to understand the proposed SpeechUT model better, and the code and pre-trained models are made available for the community.
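To make the discrete-unit interface concrete, here is a minimal sketch of an off-line unit generator of the kind described above: cluster frame-level speech features with k-means and map each frame to its cluster ID. The random features, the number of clusters, and the de-duplication step are illustrative assumptions, not the SpeechUT recipe.

```python
# Hedged sketch of an off-line unit generator: k-means over frame-level features,
# then each frame is mapped to its cluster ID to form a discrete "unit" sequence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))   # stand-in for frame-level SSL features (T, D)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
unit_ids = kmeans.predict(features)       # (T,) discrete unit sequence

# Collapse consecutive repeats, as is common when treating units as pseudo-text.
deduped = [int(u) for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]
print(deduped[:20])
```

In practice the features would come from a pre-trained speech encoder rather than random numbers, and the resulting unit vocabulary is what the shared unit encoder consumes.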
LibriSpeech
Description: A corpus of read English speech derived from audiobooks from the LibriVox project, carefully segmented and aligned, with a sampling rate of 16 kHz.
Fisher Corpus
Data Size: Part 1 consists of 984 hours, and the entire collection has over 2000 hours of English conversational telephone speech.
Wall Street Journal (WSJ0 and WSJ1)
Description: A corpus of read speech with texts drawn from Wall Street Journal news text, used for research on large-vocabulary Continuous Speech Recognition (CSR) systems.
National Speech Corpus (Part 1, Part 6)
Data Size: The entire corpus is approximately 1.2 TB in size (specific hours not provided).
Description: A large-scale Singapore English corpus for automatic speech recognition (ASR) research, designed to improve speech engines’ accuracy for locally accented English.
VCTK
Data Size: 44 Hours (Each of the 109 native English speakers reads about 400 sentences.)
Description: A dataset designed for text-to-speech research, containing audio recordings of speakers with various accents reading newspaper excerpts, the Rainbow Passage, and an elicitation paragraph.
VoxPopuli (EN)
Data Size: 543 Hours (Part of a larger corpus with 1.8K hours of transcribed speeches in 16 languages.)
Description: A large-scale multilingual corpus with unlabelled and transcribed speech data in multiple languages, intended for unsupervised and semi-supervised learning.
Europarl-ASR (EN)
Data Size: 1300 hours of English-language annotated speech data.
Description: A corpus of parliamentary debates for ASR training and benchmarking, containing speeches and their official transcripts from the European Parliament.
TED-LIUM 3
Description: This audio dataset is derived from TED Talks and includes 2351 audio talks. It features aligned automatic transcripts and takes into account speech disfluencies such as repetitions and hesitations.
AMI Meeting Corpus
Description: The AMI Meeting Corpus is a multi-modal dataset that includes synchronized recordings using various signals. It features close-talking and far-field microphones, individual and room-view video cameras, and outputs from a slide projector and an electronic whiteboard.
English Broadcast News
Data Size: 140 hours of carefully transcribed data, with an additional 9000 hours of TV shows with closed captions used for training.
Description: This dataset is for speech recognition systems that deal with wide-band signals from a variety of speakers in different background noise conditions, speaking on various news topics. The data is similar to written English, with lightly supervised transcripts for training.
Whisper is a Transformer-based encoder-decoder model.
Training Data
Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.
Whisper ASR V1 and V2
Trained on 680,000 hours of audio and corresponding transcripts from the internet.
Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.
Whisper ASR V3
Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
V3 shows a 10% to 20% reduction in errors compared to V2.
Training Details
Initial models were trained with the AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period (see the sketch below).
No data augmentation or regularization was used initially due to the diversity and size of the dataset.
For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
Different max learning rates were used for different model sizes.
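A minimal, hedged sketch of that optimization recipe in PyTorch (AdamW, gradient norm clipping, linear LR decay after warmup). The toy model, data, peak learning rate, and step counts are placeholders, not the actual Whisper training configuration.

```python
import torch

model = torch.nn.Linear(80, 512)                      # stand-in for the real encoder-decoder
peak_lr, warmup_steps, total_steps = 1e-3, 100, 1000  # placeholder schedule values

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def lr_lambda(step):
    if step < warmup_steps:                           # linear warmup to the peak LR
        return (step + 1) / warmup_steps
    # then linear decay from the peak LR down to zero
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
loss_fn = torch.nn.MSELoss()

for step in range(200):                               # dummy training loop with random data
    x, target = torch.randn(16, 80), torch.randn(16, 512)
    loss = loss_fn(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()
```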
- Accurate and timely recognition of the trigger keyword is vital.
- A trade-off must be made between accuracy and latency.
Proposed method
- We propose a CRNN-based unified speculation, detection, and verification (USDV) keyword detection model.
- We propose a latency-aware max-pooling loss and show empirically that it teaches the model to maximize accuracy under a latency constraint.
- A USDV model can be trained in a multi-task learning (MTL) fashion and achieves different accuracy-latency trade-offs across the three tasks.
1. Unified speculation, detection, and verification model
- Speculation makes an early decision, which can be used to give a head-start to downstream processes on the device.
- Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context.
- Verification verifies the previous decision by observing even more audio after the keyword span.
2. Model architecture and training strategy
- CRNN architecture
- Multi-task learning with different target latencies using the newly proposed latency-aware max-pooling loss (sketched below).
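A hedged sketch of what a latency-aware max-pooling loss could look like, based only on the description above; the two-class per-frame output, shapes, and windowing rule are illustrative assumptions, not the authors' implementation.

```python
# For positive utterances, apply cross entropy at the frame with the highest keyword
# posterior within a latency budget after the keyword ends; for negative utterances,
# apply it at the frame most likely to false-trigger.
import torch
import torch.nn.functional as F

def latency_aware_max_pooling_loss(logits, is_keyword, keyword_end, max_latency_frames):
    """logits: (T, 2) per-frame logits, class 1 = keyword."""
    posteriors = logits.softmax(dim=-1)[:, 1]
    if is_keyword:
        limit = min(len(posteriors), keyword_end + max_latency_frames + 1)
        frame = torch.argmax(posteriors[:limit])   # best frame within the latency budget
        target = torch.tensor([1])
    else:
        frame = torch.argmax(posteriors)           # frame most likely to false-trigger
        target = torch.tensor([0])
    return F.cross_entropy(logits[frame].unsqueeze(0), target)

# Toy usage: a positive clip whose keyword ends at frame 60, with a 10-frame budget.
logits = torch.randn(100, 2, requires_grad=True)
loss = latency_aware_max_pooling_loss(logits, is_keyword=True, keyword_end=60, max_latency_frames=10)
loss.backward()
```

One plausible way to realize the multi-task setup above is to train the three task heads with different latency budgets: a zero or negative budget for speculation, a small positive budget for detection, and a larger one for verification.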
Temporal Early Exiting for Streaming Speech Commands Recognition
Comcast Applied AI, University of Waterloo
Problem
Voice queries take time to process:
Stage 1: The user is speaking (seconds).
Stage 2: Finish ASR transcription (~50ms).
Stage 3: Information retrieval (~500ms).
Proposed method
- Use a streaming speech commands model for the top-K voice queries.
- Apply a training objective that enables better early exiting across time, i.e., returning a prediction before the entire audio is observed.
- Use early exiting with a confidence threshold to adjust the latency-accuracy trade-off.
Model
- GRU Model
- Per-frame output probability distribution over K commands (classes).
Early-Exiting Objectives
Connectionist temporal classification (CTC): the standard alignment-free sequence loss over the per-frame outputs.
Last-frame cross entropy (LF): cross entropy between the utterance label and the prediction at the final frame only.
All-frame cross entropy (AF): cross entropy between the utterance label and the prediction at every frame, averaged over time.
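A minimal sketch (not the authors' code) of a per-frame GRU classifier with the LF and AF objectives and confidence-threshold early exiting; the feature dimensions, number of classes, and the 0.9 threshold are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

class StreamingGRUClassifier(torch.nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_classes=10):
        super().__init__()
        self.gru = torch.nn.GRU(n_mels, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x):                         # x: (B, T, n_mels)
        h, _ = self.gru(x)
        return self.head(h)                       # per-frame logits: (B, T, K)

def last_frame_loss(logits, labels):              # LF: supervise only the final frame
    return F.cross_entropy(logits[:, -1], labels)

def all_frame_loss(logits, labels):               # AF: supervise every frame with the utterance label
    B, T, K = logits.shape
    return F.cross_entropy(logits.reshape(B * T, K), labels.repeat_interleave(T))

def early_exit(frame_logits, threshold=0.9):      # frame_logits: (T, K) for one utterance
    conf, pred = frame_logits.softmax(-1).max(-1)
    for t in range(conf.shape[0]):
        if conf[t] >= threshold:                  # exit at the first sufficiently confident frame
            return t, pred[t].item()
    return conf.shape[0] - 1, pred[-1].item()

model = StreamingGRUClassifier()
x, y = torch.randn(4, 50, 40), torch.randint(0, 10, (4,))
logits = model(x)
print(last_frame_loss(logits, y).item(), all_frame_loss(logits, y).item())
print(early_exit(logits[0]))
```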
Findings
1. The all-frame objective (AF) performs best, perhaps because it explicitly trains the hidden features to be more discriminative, similar to deep supervision [1].
2. The observed indices correlate with the optimal indices for all models and datasets, with the AF-0.5 model consistently exiting earlier than the LF one does.
Self-supervised Learning for Speech and Audio Processing I
Technical Program Session MLSP-3
UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS
Verily Life Sciences, Boston, USA and Mountain View, California, USA
Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION
NEL-SLIP, University of Science and Technology of China (USTC), Hefei, China
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It has been shown that wav2vec2.0 is robust to domain shift, but its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve ASR performance on a noisy test set, but at the cost of performance degradation on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and its corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method not only improves ASR performance on the noisy test set, surpassing the original wav2vec2.0, but also incurs only a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS
University of Wroclaw, Poland; NavAlgo, France; NVIDIA, Poland; Université de Toulon, France
We identify a performance trade-off between the tasks of phoneme categorization and phoneme and word segmentation in several self-supervised learning algorithms based on Contrastive Predictive Coding (CPC). Our experiments suggest that context building networks, albeit necessary for high performance on categorization tasks, harm segmentation performance by causing a temporal shift on the learned representations. Aiming to tackle this trade-off, we take inspiration from the leading approaches on segmentation and propose multi-level Aligned CPC (mACPC). It builds on Aligned CPC (ACPC), a variant of CPC which exhibits the best performance on categorization tasks, and incorporates multi-level modeling and optimization for detection of spectral changes. Our methods improve in all tested categorization metrics and achieve state-of-the-art performance in word segmentation.
CHARACTERIZING THE ADVERSARIAL VULNERABILITY OF SPEECH SELF-SUPERVISED LEARNING
National Taiwan University, The Chinese University of Hong Kong
SUPERB
A leaderboard named Speech processing Universal PERformance Benchmark (SUPERB), which aims at benchmarking the performance of a shared self-supervised learning (SSL) speech model across various downstream speech tasks with minimal modification of architectures and a small amount of data, has fueled the research for speech representation learning. SUPERB demonstrates that speech SSL upstream models improve the performance of various downstream tasks through just minimal adaptation. As the paradigm of a self-supervised upstream model followed by downstream tasks attracts more attention in the speech community, characterizing the adversarial robustness of such a paradigm is of high priority. In this paper, we make the first attempt to investigate the adversarial vulnerability of this paradigm under attacks from both zero-knowledge and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries, and that the attacks generated by zero-knowledge adversaries are transferable. An XAB test verifies the imperceptibility of the crafted adversarial attacks.