Neural Machine Translation with Byte-Level Subwords

- https://arxiv.org/pdf/1909.03341

- https://git.opendfki.de/yhamidullah/fairseq-stable/-/tree/noencoder/examples/byte_level_bpe

 

 

Byte Level Text Representation

 

Encoding Byte-Level Representation. We consider UTF-8 encoding of text, which encodes each Unicode character into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are 138K Unicode characters covering over 150 languages, we represent a sentence in any language as a sequence of UTF-8 bytes (248 out of 256 possible bytes).
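As a quick illustration (not from the paper), Python's built-in encoder shows how characters from different scripts map to one to four UTF-8 bytes:

# Illustrative only: UTF-8 encodes each Unicode character into 1 to 4 bytes.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), [hex(b) for b in encoded])
# 'A' -> 1 byte, 'é' -> 2 bytes, '中' -> 3 bytes, '😀' -> 4 bytes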

 

A byte sequence representation of text is often much longer (up to 4x) than a character sequence representation, which makes it computationally demanding to use bytes as they are. As an alternative, we consider segmenting a byte sequence into variable-length n-grams (byte-level "subwords"). Specifically, we learn a BPE vocabulary on the byte-level representation, which extends the UTF-8 byte set with byte n-grams. We denote this type of vocabulary as B(yte-level)BPE in the rest of the paper. Figure 1 shows an example of BBPE tokenization.
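As a rough sketch of what this looks like in practice (illustrative only, not the paper's exact pipeline or Figure 1): the text is first mapped to UTF-8 byte symbols, and BPE is then learned over those symbols, so a learned token can be a byte n-gram that spans partial characters:

# Illustrative sketch: represent text as UTF-8 byte tokens, the base alphabet BBPE starts from.
text = "神经机器翻译"  # hypothetical example sentence
byte_tokens = [f"<{b:02X}>" for b in text.encode("utf-8")]
print(byte_tokens)
# Each Chinese character becomes three byte tokens, e.g. '神' -> ['<E7>', '<A5>', '<9E>'].
# A BBPE vocabulary then merges frequent byte n-grams (possibly spanning partial
# characters) into single symbols, shortening the sequence.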

 

BBPE symbols can be partial characters shared by different characters or combinations of complete and partial characters. This arbitrariness may necessitate incorporating a larger context surrounding each symbol for disambiguation and for learning the character boundaries. In this work, we base our experiments on Transformer (Vaswani et al. 2017) models. We propose to use either a depth-wise convolutional layer (Kaiser, Gomez, and Chollet 2017) or a bidirectional recurrent layer with gated recurrent units (GRU; Cho et al. 2014) to contextualize BBPE embeddings before feeding them into the model.
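The paper's equations for these layers are not reproduced above; the following is a minimal PyTorch-style sketch of the two contextualization options under assumed hyperparameters (embed_dim and kernel_size are illustrative, and BBPEContextualizer is a hypothetical module name, not the authors' code):

import torch
import torch.nn as nn

class BBPEContextualizer(nn.Module):
    """Sketch: contextualize BBPE embeddings before the Transformer."""
    def __init__(self, vocab_size, embed_dim=512, mode="conv", kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mode = mode
        if mode == "conv":
            # Depth-wise 1-D convolution: one filter per embedding channel.
            self.ctx = nn.Conv1d(embed_dim, embed_dim, kernel_size,
                                 padding=kernel_size // 2, groups=embed_dim)
        else:
            # Bidirectional GRU; half-sized hidden state so that the
            # concatenation of both directions keeps the model dimension.
            self.ctx = nn.GRU(embed_dim, embed_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens):          # tokens: (batch, seq_len)
        x = self.embed(tokens)          # (batch, seq_len, embed_dim)
        if self.mode == "conv":
            x = self.ctx(x.transpose(1, 2)).transpose(1, 2)
        else:
            x, _ = self.ctx(x)
        return x                        # fed to the Transformer encoder/decoder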

 

Decoding with Byte-Level Subwords. While any sentence can be represented as a byte sequence, the converse is not necessarily true: there are byte sequences that do not translate to valid character sequences. Empirically, we find that invalid outputs from trained models are very rare. We do not observe any in the experiments described below (note that one of them has a large test set of 165K examples). A common error pattern in half-trained models is redundant repeating bytes. In our system, we try to recover as many Unicode characters as possible from this error pattern efficiently in linear time. The algorithm is as follows: for a given byte sequence {B_k}, k = 1..N, we denote the maximum number of characters that we can recover from its first k bytes as f(k). Then f(k) has optimal substructure and can be solved by dynamic programming:

f(k) = max_{t = 1,2,3,4} ( f(k-t) + g(k-t+1, k) )    (1)

where g(i, j) = 1 if the byte span B_i, ..., B_j corresponds to a valid character, otherwise 0. When f(k) is calculated recursively, we also record the selections at each position k so that we can recover the solution through backtracking. The design of UTF-8 encoding ensures the uniqueness of this recovery process: for a character UTF-8 encoded with multiple bytes, its trailing bytes will not make a valid UTF-8 encoded character. Then the best selection in Eq. 1 is unique and so is the final solution.
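Below is a small Python sketch of this recovery procedure (my own rendering of the recurrence in Eq. 1, not the authors' code; recover_utf8 is a hypothetical function name):

def recover_utf8(byte_seq: bytes) -> str:
    """Recover as many valid Unicode characters as possible from byte_seq
    using f(k) = max_t f(k-t) + g(k-t+1, k), t in {1,2,3,4}."""
    n = len(byte_seq)
    f = [0] * (n + 1)        # f[k]: max characters recoverable from the first k bytes
    back = [1] * (n + 1)     # back[k]: chosen segment length t at position k

    def g(i, j):
        # 1 if byte_seq[i:j] decodes to exactly one valid character, else 0.
        try:
            return 1 if len(byte_seq[i:j].decode("utf-8")) == 1 else 0
        except UnicodeDecodeError:
            return 0

    for k in range(1, n + 1):
        best, best_t = f[k - 1], 1              # default: skip one (invalid) byte
        for t in range(1, min(4, k) + 1):
            cand = f[k - t] + g(k - t, k)
            if cand > best:
                best, best_t = cand, t
        f[k], back[k] = best, best_t

    # Backtrack through the recorded selections to rebuild the character sequence.
    chars, k = [], n
    while k > 0:
        t = back[k]
        if g(k - t, k):
            chars.append(byte_seq[k - t:k].decode("utf-8"))
        k -= t
    return "".join(reversed(chars))

For example, recover_utf8("中文".encode("utf-8")[:-1]) drops the truncated final character and returns "中".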

BILINGUAL END-TO-END ASR WITH BYTE-LEVEL SUBWORDS, Apple 2022

- https://arxiv.org/pdf/2205.00485

 

 

Byte-level models have been proposed for natural language processing (NLP) [9] [10] [11]. The idea is to convert text to a sequence of variable-length UTF-8 codewords and to have the model predict one byte at each decoding step. The advantages of byte-level representation are compactness and universality, as any combination of languages may be represented with an output dimension of only 256. However, a sequence represented at the byte level is always much longer than its character-level counterpart for languages such as Chinese and Japanese [12], because many characters of these languages are represented by multiple bytes in UTF-8. As a result, a byte-level model can be error-prone, since it needs to make multiple predictions for many single characters, and each prediction has a chance of making a mistake. To compensate for this drawback, [12] proposes byte-level subwords for neural machine translation. The idea is to apply byte pair encoding (BPE) [13] to UTF-8 codeword sequences, an approach referred to as byte-level BPE (BBPE). BBPE inherits the advantages of the UTF-8 byte-level representation: it is able to represent all languages while keeping the output dimension in check. At the same time, as BBPE tokens are in general longer than byte-level tokens, the approach reduces the number of steps required by the decoding process.

 

In this work, we investigate bilingual (English and Mandarin) E2E ASR models by exploring different types of output representations, including character-level, BPE, byte-level (UTF-8) and BBPE. Similar to some of the previous work cited, we build a single E2E model for utterance-based bilingual speech recognition. Our contributions are threefold. First, we compare the strengths and weaknesses of different output representations in monolingual and bilingual use cases. Second, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the BBPE representation leads to accuracy improvements in the bilingual scenario. Finally, we analyze different representations and show how we might improve them for multilingual ASR.

 

OUTPUT REPRESENTATIONS FOR E2E ASR

 

Character-level Representation

 

Using a character-level representation in an E2E model means that the output symbol set for the model is the set of graphemes of the target language. In addition to graphemes, the output representation may also contain punctuation marks, digits, emojis or special tokens such as begin-of-sentence (BOS) or end-of-sentence (EOS). According to [14] [15], character-level representation is often a good representation for Mandarin E2E models, and this serves as one of the baselines in our experiments.

 

BPE Representation

 

The BPE algorithm [13] starts from the character representation and iteratively merges the most frequent bigrams given a training text corpus. At the end of this process, the BPE algorithm produces a symbol set that consists of subwords with different lengths. This symbol set can then be used by an E2E model as its output units. It is common to keep the single characters in the final symbol set, so unseen words in the test set can still be represented by the symbol set. For English, BPE is widely used in E2E ASR systems, as it improves accuracy and reduces computation due to the use of frequent subwords and the resulting shorter labeling sequences.
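As a reminder of the mechanics (a toy sketch, not the subword-nmt implementation), one BPE training run repeatedly counts adjacent-symbol pairs over the corpus, weighted by word frequency, and merges the most frequent pair:

from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Toy BPE learner; corpus_words maps a word (tuple of symbols) to its count."""
    vocab = dict(corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = {}
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: word -> frequency; single characters remain in the symbol set.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
merges, segmented = learn_bpe(corpus, num_merges=3)
print(merges)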

 

Byte-level Representation

 

Scalability is one of the important aspects in designing an output representation for a multilingual E2E ASR model. As the model supports more languages, the size of the symbol set increases. To tackle this problem, [8] proposes a byte-level representation based on UTF-8. Instead of using characters or subwords as the symbols, a byte-level model uses UTF-8 codewords as the output symbol set. The resulting representation is compact, as UTF-8 has only 256 possible byte values, so each symbol uses one byte. Yet this representation is capable of representing any language, and adding more languages does not increase the size of the symbol set, which is an advantage over the character-level and BPE representations. However, the byte-level representation has two drawbacks. First, it increases the length of the sequence by up to 4x [12], which increases the number of decoding steps during inference. Second, not all byte sequences are valid UTF-8 sequences, which means byte-level models may generate invalid byte sequences that require special handling.

 

To repair an invalid byte sequence, [8] proposes a dynamic programming algorithm to recover the Unicode characters given any byte sequence. We use this post-processing approach to recover characters from byte sequences as much as possible.

 

Byte-level BPE Representation

 

To circumvent the increase in sequence length of the byte-level representation, [12] proposes byte-level BPE (BBPE) for neural machine translation, which applies BPE to byte-represented text. The advantage of this approach is that it reduces the sequence length by adopting frequent byte-level subwords while keeping the size of the symbol set in check. It is important to note that BBPE is equivalent to BPE for many Latin-based languages, since in UTF-8 the basic Latin characters are single-byte units. However, for languages like Chinese or Japanese, characters can use multiple bytes, so BBPE could be helpful. Similar to the byte-level representation, the BBPE representation might generate invalid byte sequences, and post-processing using dynamic programming is necessary to remedy that. Another aspect is that if we keep all the single-byte UTF-8 codewords in the symbol set after BPE, BBPE can represent all languages, as with the byte-level representation.
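A quick check (not from the paper) of why BBPE coincides with BPE for ASCII/Latin text but differs for CJK text:

# For ASCII text, byte symbols and character symbols are identical, so BPE and
# BBPE start from the same alphabet and learn the same merges.
word = "lower"
print(list(word))                                                    # ['l', 'o', 'w', 'e', 'r']
print([bytes([b]).decode("latin-1") for b in word.encode("utf-8")])  # identical

# For Chinese, each character expands to three byte symbols, so BBPE merges can
# form partial characters or cross character boundaries.
word = "翻译"
print(list(word))                  # 2 character symbols
print(list(word.encode("utf-8")))  # 6 byte symbols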

 

Reference

 

[1] Anjuli Kannan, Arindrima Datta, Tara Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, and Zhifeng Chen, “Large-scale multilingual speech recognition with a streaming end-to-end model,” in Proceedings of the INTERSPEECH, 2019. 

[2] Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Ankish Bansal, Markus Müller, Sergio Murillo, Ariya Rastrow, Sri Garimella, et al., “Streaming end-to-end bilingual ASR systems with joint language identification,” in arXiv preprint arXiv:2007.03900, 2020. 

[3] Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert, “Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters,” in Proceedings of the INTERSPEECH, 2020, pp. 4751–4755. 

[4] Ke Li, Jinyu Li, Guoli Ye, Rui Zhao, and Yifan Gong, “Towards code-switching ASR for end-to-end CTC models,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019. 

[5] Changhao Shan, Chao Weng, Guangsen Wang, Dan Su, Min Luo, Dong Yu, and Lei Xie, “Investigating end-to-end speech recognition for mandarin-english code-switching,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019. 

[6] Zimeng Qiu, Yiyuan Li, Xinjian Li, Florian Metze, and William M. Campbell, “Towards context-aware end-to-end code-switching speech recognition,” in Proceedings of the INTERSPEECH, 2020.

[8] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5621–5625. 

[9] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya, “Multilingual language processing from bytes,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies, 2016, pp. 1296–1306. 

[10] Marta Ruiz Costa-Jussà, Carlos Escolano Peinado, and José Adrián Rodríguez Fonollosa, “Byte-based neural machine translation,” in Proceedings of the First Workshop on Subword and Character Level Models in NLP, 2017, pp. 154–158. 

[11] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel, “ByT5: Towards a token-free future with pre-trained byte-to-byte models,” 2021. 

[12] Changhan Wang, Kyunghyun Cho, and Jiatao Gu, “Neural machine translation with byte-level subwords,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9154–9160.

OPTIMIZING BYTE-LEVEL REPRESENTATION FOR END-TO-END ASR, Apple 2024

Introduction

End-to-end (E2E) neural networks are flexible and accurate models for multilingual automatic speech recognition (ASR). The output of such a multilingual model is often the union of the characters or subwords of the supported languages. However, as the number of languages increases, the size of the output layer increases, which can negatively affect compute, memory usage and asset size. This problem is more prominent when the system supports languages that have large character sets, such as Chinese, Japanese and Korean (CJK). To tackle this problem, previous work proposed the use of byte-level representation for E2E ASR [1, 2]. By using UTF-8 [3] codewords as the underlying base tokens, the output vocabulary is no longer constrained by the character sets of each language, allowing developers to choose a vocabulary size based on compute and memory constraints. One well-known multilingual ASR system that uses UTF-8 subwords is Whisper [4].

UTF-8 aims to represent all the characters used in major languages. The encoding and decoding processes are designed to be simple and efficient. UTF-8 is a variable length prefix code where each character is represented by one to four bytes. Most byte sequences are not valid UTF-8 strings, and the UTF-8 decoder needs to detect invalid sequences. UTF-8 also provides backward compatibility, where ASCII characters are represented by a single byte and they are the same as the ASCII encoding. While UTF-8 has proven to be an effective output representation for ASR, it is unclear whether it is optimal. For example, characters with similar pronunciations or meaning are not guaranteed to share the same prefixes. In addition, the large number of invalid byte sequences means the model needs to identify valid UTF-8 strings, an additional burden.
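As a brief illustration of this prefix-code structure (standard UTF-8 behaviour, shown here for concreteness): the lead byte's high bits encode the sequence length, continuation bytes start with 10, and the decoder rejects anything else:

for ch in ["A", "é", "中", "😀"]:
    print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])
# 'A'  ['01000001']                                      1 byte:  0xxxxxxx
# 'é'  ['11000011', '10101001']                          2 bytes: 110xxxxx 10xxxxxx
# '中' ['11100100', '10111000', '10101101']               3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
# '😀' ['11110000', '10011111', '10011000', '10000000']   4 bytes: 11110xxx 10xxxxxx ...

# Most byte sequences are invalid, e.g. a stray continuation byte on its own:
try:
    bytes([0xA9]).decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid:", err)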

 

UTF-8 BASED REPRESENTATION

UTF-8 based models have been proposed for natural language processing (NLP) [5] [6] [7]. The idea is to convert text to a sequence of variable-length UTF-8 codewords, and to have the model predict one byte at each decoding step. The advantages of byte-level representation are compactness and universality, as any combination of languages may be represented with an output dimension of only 256. However, a sequence represented at byte level is often longer than its character-level counterpart, especially for CJK languages [8]. This is because while Latin characters are represented by a single byte, many CJK characters and accented characters are represented by multiple bytes. As a result, a byte-level model can be error-prone since it needs to make multiple predictions for many single characters, and each prediction might make a mistake.

To compensate for the drawback of making byte level mistakes, [1, 2] propose byte-level subwords for E2E ASR. The idea is to apply byte pair encoding (BPE) [9] to UTF-8 codeword sequences to create UTF-8 subwords. As subwords are in general longer than byte-level tokens, this approach reduces the number of steps required by the decoding process. However, BPE does not guarantee that the output will be a valid UTF-8 sequence. To repair an invalid byte sequence, [1] proposes a dynamic programming algorithm to recover as many characters as possible given any byte sequence. While this dynamic programming approach ensures the output sequence is always valid, it optimizes for the number of valid characters, not ASR quality.

 

Reference

[1] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 5621–5625.

[2] L. Deng, R. Hsiao, and A. Ghoshal, “Bilingual end-to-end ASR with byte-level subwords,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022.

[8] Changhan Wang, Kyunghyun Cho, and Jiatao Gu, “Neural machine translation with byte-level subwords,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9154–9160.

[9] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.

Overview:

  • The DynamicBatchSampler groups data samples into batches.
  • It takes into account the length of each sample, a target batch size, and a maximum token limit per batch.
  • A batch is generated when either the batch size or the sum of sample lengths reaches the specified limits.

Example Code:

import random

class DynamicBatchSampler:
    def __init__(self, data_lengths, batch_size, max_tokens):
        """
        Initializes the DynamicBatchSampler with data lengths, target batch size, and max tokens constraint.

        :param data_lengths: List of lengths of each data sample.
        :param batch_size: The base size of each batch.
        :param max_tokens: The maximum number of tokens (or length sum) allowed per batch.
        """
        self.data_lengths = data_lengths
        self.batch_size = batch_size
        self.max_tokens = max_tokens

    def __iter__(self):
        # Shuffle the indices to get a different sampling order each time.
        indices = list(range(len(self.data_lengths)))
        random.shuffle(indices)

        batch = []
        total_length = 0

        for idx in indices:
            length = self.data_lengths[idx]
            # Finalize the current batch if adding this sample would exceed the
            # max-token budget or the batch-size limit (skip the check while the
            # batch is still empty, so we never yield an empty batch).
            if batch and (total_length + length > self.max_tokens or len(batch) >= self.batch_size):
                # Yield the current batch and reset for the next one.
                yield batch
                batch = []
                total_length = 0

            # Add the current sample to the batch.
            batch.append(idx)
            total_length += length

        # Yield any remaining samples as the last batch.
        if batch:
            yield batch

# Example usage:
data_lengths = [5, 8, 3, 7, 10, 2, 4, 6]  # Example lengths of samples.
batch_size = 3  # Maximum number of samples per batch.
max_tokens = 15  # Maximum total length of samples in a batch.

sampler = DynamicBatchSampler(data_lengths, batch_size, max_tokens)

# Iterate over the batches generated by the DynamicBatchSampler.
for batch in sampler:
    print(f"Batch indices: {batch}, Batch lengths: {[data_lengths[i] for i in batch]}")

Explanation:

  • Data Lengths: [5, 8, 3, 7, 10, 2, 4, 6] represents the lengths of different samples.
  • Batch Size: 3 means each batch can have up to 3 samples.
  • Max Tokens: 15 restricts each batch to a maximum total length of 15.

Output (one possible run; the exact batches depend on the random shuffle order):

Batch indices: [6, 7], Batch lengths: [4, 6]
Batch indices: [4, 0], Batch lengths: [10, 5]
Batch indices: [1, 2, 5], Batch lengths: [8, 3, 2]
Batch indices: [3], Batch lengths: [7]

How It Works:

  • The sampler iterates over the data indices and groups them into batches.
  • A batch is finalized when adding another sample would exceed max_tokens or batch_size.
  • The example shows how batches are formed dynamically based on the length constraints, making it flexible for varying data sizes.

There are generally three ways to perform text-only adaptation:

 

  • Injecting synthesized speech data into the model
    • generate audio for training texts via TTS and inject it into the model
  • LM fusion
    • Fusion and biasing (shallow fusion):
      • during decoding, interpolate posterior word probabilities with text priors from external LMs (a minimal sketch of this interpolation follows this list)
      • another recent approach is to extract internal LM probabilities and discount with the ratio of external and internal LM probabilities
    • Rescoring and reranking
      • after decoding, use a powerful external LM to update scores and rerank n-best results or recognition lattice
    • These techniques incur a significant overhead at inference time due to the external LM and also require careful tuning of the interpolation weight used for the external LM.
  • Explicit separation of internal LMs
    • force the E2E decoder/predictor to behave more like a language model (e.g. Hybrid autoregressive transducer (HAT), Modular hybrid autoregressive transducer, and Factorized transducer)
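Below is a minimal Python sketch of shallow-fusion scoring at one beam-search step (hypothetical interfaces: asr_log_probs and lm_log_probs stand in for the decoder's and the external LM's per-token log probabilities; the interpolation weight lm_weight is the quantity that needs careful tuning):

import math

def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine E2E-model posteriors with external-LM priors for each candidate."""
    return {
        tok: asr_log_probs[tok] + lm_weight * lm_log_probs.get(tok, -math.inf)
        for tok in asr_log_probs
    }

# Toy example: the external LM shifts the ranking toward the in-domain hypothesis.
asr = {"recognize speech": math.log(0.40), "wreck a nice beach": math.log(0.45)}
lm = {"recognize speech": math.log(0.70), "wreck a nice beach": math.log(0.05)}
fused = shallow_fusion_scores(asr, lm, lm_weight=0.5)
print(max(fused, key=fused.get))  # -> "recognize speech"

The internal-LM discounting variant mentioned above would additionally subtract a weighted estimate of the internal LM score from the fused score.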

 

Reference

[1] External Language Model Integration for Factorized Neural Transducers

[2] In-situ text-only adaptation of speech models with low-overhead speech imputations

 

 

Model Overview

  • Whisper is a Transformer-based encoder-decoder model.

Training Data

  • Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.

Whisper ASR V1 and V2

  • Trained on 680,000 hours of audio and corresponding transcripts from the internet.
  • Data distribution includes 65% English audio (438k hours), 18% non-English audio with English transcripts, and 17% non-English audio with corresponding transcripts, spanning 98 languages.

Whisper ASR V3

  • Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
  • V3 shows a 10% to 20% reduction in errors compared to V2

Training Details

  • Initial models were trained with AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period.
  • No data augmentation or regularization was used initially due to the diversity and size of the dataset.
  • For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
  • Different max learning rates were used for different model sizes.

Hyperparameters

General Hyperparameters

Hyperparameters for Whisper Large V2

Model Learning Rates

 

 

Summary

There have been various generic and language-specific approaches to sub-word segmentation for handling the OOV problem in machine translation and ASR tasks.

 

Various subword units such as phonemes, syllables, characters, morphemes, and combinations of these have been used in different approaches to subword modelling. There have also been both generic and language-specific approaches. Below are some of the major sub-word segmentation approaches. One of the earlier approaches for ASR was Korean syllable-based segmentation [8]. Some of the earlier language-specific approaches targeted German LVCSR [10] and Polish [11]. There was a morpheme-based OOV handling approach for a Turkish ASR keyword spotting task [9] and for multiple languages [12].

 

The popular recent approaches to unsupervised segmentation are Byte Pair Encoding (BPE) and WordPiece. Both algorithms work by iteratively merging adjacent characters.

 

BPE: the merge pair is chosen based on frequency (merging adjacent characters).

WordPiece: the merge pair is chosen to maximize the likelihood of the training data (merging adjacent characters).

Unigram and BPE dropout [14] are some of the sub-word segmentation regularization techniques.

 

Libraries implementing segmentation algorithms

sentencepiece,

subword-nmt [16],

Morfessor [17],

MorphAGram [15].

 

 

[1] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages, ICASSP 2022

[7] M. Huck, S. Riess, and A. Fraser, “Target-side Word Segmentation Strategies for Neural Machine Translation”, in Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 56–67, Copenhagen, Denmark, 2017.

[8] D. Kiecza, T. Schultz and A. Waibel, “Data-Driven Determination of Appropriate Dictionary Units for Korean LVCSR”, in proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[9] Y. He, B. Hutchinson, P. Baumann, M. Ostendorf, E. Fosler-Lussier, and J. Pierrehumbert, “Subword-Based Modeling For Handling OOV Words In Keyword Spotting”, in proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Italy, 2014.

[10] A. El-Desoky Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney, “Sub-Lexical Language Models For German LVCSR”, in proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), 2010.

[11] M.A.B. Shaik, A.E.-D. Mousa, R. Schluter, and H. Ney, “Using morpheme and syllable based sub-words for Polish LVCSR”, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4680–4683, 2011.

[12] M. Creutz, T. Hirsimäki, M. Kurimo, A. Puurula, “Morph-based speech recognition and modeling of out-of-vocabulary words across languages” in ACM Transactions on Speech and Language Processing (TSLP). 5(1):3, 2007

[13] M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[14] I. Provilkov, D. Emelianenko and E. Voita, “BPE-Dropout: Simple and Effective Subword Regularization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, 2020.

[15] R. Eskander, F. Callejas, E. Nichols, J. Klavans, and S. Muresan, “MorphAGram: Evaluation and Framework for Unsupervised Morphological Segmentation”, in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7112–7122, 2020.

[16] “Subword-nmt”, Available at: https://github.com/rsennrich/subword-nmt [Accessed : 10 January, 2021]

[17] “Morfessor”, Available at: https://github.com/aaltospeech/morfessor [Accessed : 10 January, 2021].

 

 

For background on FSTs and their use in ASR, see "Speech Recognition with Weighted Finite-State Transducers" by Mohri, Pereira and Riley (in Springer Handbook on Speech Processing and Speech Communication, 2008). It describes the general approach that Kaldi uses, but some details differ, particularly with respect to how disambiguation symbols and weight-pushing are handled.

 

Overview of graph creation

 

The overall picture of decoding-graph creation is that we construct the graph HCLG = H o C o L o G. The meaning of each symbol is as follows:

 

G is an acceptor (i.e., its input and output symbols are identical) that encodes the grammar or language model.
L is the lexicon; its output symbols are words and its input symbols are phones.
C represents context dependency: its output symbols are phones and its input symbols are context-dependent phones, i.e. windows of N phones (see "Phonetic context windows").
H contains the HMM definitions; its output symbols represent context-dependent phones and its input symbols are transition-ids, which encode the pdf-id and other information (see the integer identifiers used by the TransitionModel).

 

 

 

The details are explained further below. We want HCLG to be determinized and minimized, and in order to determinize HCLG we need to insert disambiguation symbols.

 

We also want HCLG to be as stochastic as possible; in conventional recipes this is ensured by a "weight-pushing" operation. Kaldi's approach to ensuring stochasticity is different: it guarantees that no graph-creation step "takes away" stochasticity. See "Preserving stochasticity and testing it" for details.

 

Summarized in one line, the process is as follows:

 

HCLG = asl(min(rds(det(H' o min(det(C o min(det(L o G))))))))

where asl=="add-self-loops",
      rds=="remove-disambiguation-symbols", 
      H' == H without the self-loops

 

 

 

Weight-pushing is not part of the recipe. Instead, we aim to ensure that if G is stochastic, no graph-creation stage destroys that stochasticity. Of course, G will typically not be exactly stochastic, because of the way ARPA language models with backoff are represented as FSTs, but at least our approach ensures that the non-stochasticity "stays put" and does not get worse than it was at the start. This approach avoids the danger that the weight-pushing operation might fail or make things worse.

 

 

Disambiguation symbols

 

Disambiguation symbols are symbols such as #1, #2, #3 that are inserted at the end of a phoneme sequence in the lexicon. One of these symbols must be appended when:

1) the phoneme sequence is a prefix of another phoneme sequence in the lexicon, or

2) it appears as the pronunciation of more than one word.

For example, if two words such as "read" and "red" share the phone sequence r eh d, their lexicon entries become r eh d #1 and r eh d #2.

-> These disambiguation symbols are needed so that the product (L o G) is determinizable.

 

3) We put the symbol #0 on the backoff arcs of the language model G. This makes G determinizable after epsilon removal, since the determinization algorithm we use removes epsilons.

4) We put the symbol #1 in place of the epsilons that appear on the left (input) side of the context FST (C) at the beginning of an utterance, before any symbols have been output. This is needed to fix a rather minor problem that arises with words that have an empty phonetic representation (e.g., the beginning- and end-of-sentence symbols <s> and </s>).

 

 

The following is an overview of how one would formally prove that the intermediate stages of graph compilation (e.g., LG, CLG, HCLG) are determinizable. This is important for ensuring that the recipe never fails.

 

The general setup is as follows:

1) G must be determinizable; this is why the #0 symbol is needed (G is actually deterministic, so it is determinizable). 2) Then we want L to be such that L o G is determinizable for any determinizable G. [The same applies to C, with L o G taking the place of G on the right.] There are still many details of the theory to fill in, but we believe it is sufficient for L to have the following two properties:

 

1) L must be functional, i.e. it must map each input sequence to at most one output sequence:

- equivalently: any input-sequence on L must induce a unique output-sequence

- equivalently: for any linear acceptor A, A o L is a linear transducer or empty.

2) L has the twins property, i.e. there are no two states reachable with the same input-symbol sequence, that each have a cycle with the same input sequence but different weight or output sequence.

 

The same applies to the C transducer. We believe the transducers that our scripts and programs currently create have these properties.

 


The ContextFst Object

 

The ContextFst object (C) is a dynamically generated FST object that represents a transducer from context-dependent phones to context-independent phones. The purpose of this object is to avoid having to enumerate all possible phones in context, which can be difficult when the context width (N) or the number of phones is large.

 

The constructor ContextFst::ContextFst requires the context-width (N) and central-position (P), which for a triphone system are N=3 and P=1 respectively (see "Phonetic context windows" for further explanation).

 

It also requires the integer ID of a special symbol, the "subsequential symbol" (referred to as '$' above), which the FST outputs N-P-1 times after it has seen all the phones (this ensures that the context FST is output-deterministic). In addition, it requires the lists of integer ids of the phones and of the disambiguation symbols.

 

The vocabulary on the output side of the ContextFst consists of the set of phones, the disambiguation symbols, and the subsequential symbol.

 

The vocabulary on the input side is dynamically generated and (apart from epsilon, which is not used) corresponds to phones in context, disambiguation symbols, and a special symbol that we write as #-1 that takes the place of epsilon in the "traditional recipe", but which we treat as any other disambiguation symbol (it is necessary to ensure determinizability in the sense we prefer, i.e. with epsilon removal).

 

 

As in the traditional recipe, the subsequential symbol '$' has no corresponding symbol on the input side. The symbol ids on the input side that correspond to the disambiguation symbols are not necessarily the same as the integer identifiers used on the output side for the corresponding symbols.

 

The ContextFst object has a function ILabelInfo() that returns a reference to an object of type std::vector<std::vector<int32> >, which enables the user to work out the "meaning" of each symbol on the input side. The correct interpretation of this object is described in more detail in "The ilabel_info object".

 

 

There is a special "Matcher" object called ContextMatcher intended for use in the composition algorithm involving ContextFst (a Matcher is something OpenFst's composition algorithm uses for arc lookup). ContextMatcher makes the use of the ContextFst object more efficient by avoiding the allocation of more states than necessary (the issue is that with the normal matcher, every time we want any arc out of a state, we would have to allocate the destination states of all the other arcs out of that state).

 

There is an associated function ComposeContextFst() which performs FST composition when the left-hand argument of the composition is of type ContextFst, and which uses this matcher. There is also a function ComposeContext(), which is similar but creates the ContextFst object itself.


Avoiding weight pushing

 

Weight-pushing is an FST operation that ensures that the arc probabilities out of each state "sum to one" in the appropriate sense. Kaldi handles the weight-pushing issue somewhat differently from the traditional recipe. Weight-pushing in the log semiring can be helpful in speeding up search. However, in some cases weight-pushing can have bad effects. The problem is that statistical language models, when represented as FSTs, generally "add up to more than one", because some words are counted twice: once directly and once via backoff arcs.

 

We decided never to push weights, and instead we handle the problem in a different way. First, a definition: we call an FST "stochastic" if the weights out of each state sum to one, and the reader should assume we are talking about the log semiring here, so this means "sum to one" and not "take the max".

 

Each stage of graph creation has the property that if its input is stochastic, its output is stochastic. That is, if G is stochastic then LG is stochastic; if LG is stochastic then det(LG) is stochastic; if det(LG) is stochastic then min(det(LG)) is stochastic, and so on. This means that each individual operation must "preserve stochasticity" in the appropriate sense. For example, one could try the weight-pushing algorithm and have it signal failure if it fails because the original G fst summed to more than one.

 

We want to preserve stochasticity in a stronger sense. That is, we first measure, for G, the minimum and maximum over all states of the sum of the arc probabilities plus the final-prob. This is what our program "fstisstochastic" does (a toy sketch of this check appears at the end of this section). If G is stochastic, both of these numbers will be one (you would actually see zeros from the program, because it operates in log space; this is what "log semiring" refers to). We aim to preserve stochasticity in the sense that these minimum and maximum values never "get worse", i.e. they never move further away from one. In fact, this happens naturally when each algorithm preserves stochasticity in a "local" way. There are a number of algorithms that need to preserve stochasticity, including:

 

Minimization
Determinization
Epsilon removal
Composition (with particular FSTs on the left)

There are also one or two minor algorithms that need to preserve stochasticity, like adding a subsequential-symbol loop. Minimization naturally preserves stochasticity, as long as we don't do any weight pushing as part of it (we use our program "fstminimizeencoded" which does minimization without weight pushing). Determinization preserves stochasticity as long as we do it in the same semiring that we want to preserve stochasticity in (this means the log semiring; this is why we use our program fstdeterminizestar with the option --determinize-in-log=true).

Regarding epsilon removal: firstly, we have our own version of epsilon removal "RemoveEpsLocal()" (fstrmepslocal), which doesn't guarantee to remove all epsilons but does guarantee to never "blow up". This algorithm is unusual among FST algorithms in that, to do what we need it to do and preserve stochasticity, it needs to "keep track of" two semirings at the same time. That is, if it is to preserve equivalence in the tropical semiring and stochasticity in the log semiring, which is what we need in practice, it actually has to "know about" both semirings simultaneously. This seems to be an edge case where the "semiring" concept somewhat breaks down.

Composition on the left with the lexicon L, the context FST C and the H transducer (which encodes the HMM structure) all have to preserve stochasticity. Let's discuss this in the abstract: we ask, when composing A o B, what are sufficient properties that A must have so that A o B will be stochastic whenever B is stochastic? We believe these properties are:

- For any symbol sequence that can appear on the input of B, the inverse of A must accept that sequence (i.e. it must be possible for A to output that sequence), and:

- For any such symbol sequence (say, S), if we compose A with an unweighted linear FST with S on its input, the result will be stochastic.

 

These properties are true of C, L and H, at least if everything is properly normalized (i.e. if the lexicon weights sum to one for any given word, and if the HMMs in H are properly normalized and we don't use a probability scale). However, in practice in our graph creation recipes we use a probability scale on the transition probabilities in the HMMs (similar to the acoustic scale). This means that the very last stages of graph creation typically don't preserve stochasticity. Also, if we are using a statistical language model, G will typically not be stochastic in the first place. What we do in this case is we measure at the start how much it "deviates from stochasticity" (using the program fstisstochastic), and during subsequent graph creation stages (except for the very last one) we verify that the non-stochasticity does not "get worse" than it was at the beginning.
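A conceptual Python sketch of the kind of check fstisstochastic performs (my own toy version over an explicit arc list, not the Kaldi/OpenFst implementation): for every state, sum the arc probabilities plus the final probability, convert to a log-semiring cost, and report the minimum and maximum over states.

import math

def stochasticity_range(arcs, final_probs):
    """arcs: state -> list of arc probabilities out of that state.
    final_probs: state -> final probability (0.0 if the state is not final).
    Returns (min, max) over states of -log(sum of probabilities); a perfectly
    stochastic FST gives (0.0, 0.0)."""
    costs = []
    for state, probs in arcs.items():
        total = sum(probs) + final_probs.get(state, 0.0)
        costs.append(-math.log(total))
    return min(costs), max(costs)

# Toy FST: state 0 is stochastic, state 1 "adds up to more than one"
# (as a backoff LM state might), state 2 sums to less than one.
arcs = {0: [0.3, 0.7], 1: [0.6, 0.6], 2: [0.5]}
finals = {2: 0.3}
print(stochasticity_range(arcs, finals))  # approximately (-0.182, 0.223)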

 


[1] https://kaldi-asr.org/doc/graph.html
