Various generic and language-specific sub-word segmentation approaches have been proposed to handle the out-of-vocabulary (OOV) problem in machine translation and automatic speech recognition (ASR) tasks.

 

Subword units such as phonemes, syllables, characters, morphemes, and combinations of these have been used in different approaches to subword modelling. Some of the major sub-word segmentation approaches are listed below. One of the earlier approaches in ASR was syllable-based segmentation for Korean [8]. Other early language-specific approaches targeted German LVCSR [10] and Polish LVCSR [11]. Morpheme-based OOV handling has been applied to a Turkish ASR keyword spotting task [9] and to speech recognition across multiple languages [12].

 

Popular recent approaches to unsupervised segmentation

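Both the Byte Pair Encoding (BPE) and WordPiece algorithms work by iteratively merging adjacent symbols, starting from individual characters. They differ in how the pair to merge is chosen: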

 

BPE: the pair to merge is chosen by frequency (the most frequent adjacent pair is merged).

WordPiece [13]: the pair to merge is the adjacent pair whose merge most increases the likelihood of the training data.
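The difference between the two selection criteria can be sketched on a toy corpus. This is a minimal illustration, not a reference implementation: the corpus, the function names, and the WordPiece score `count(ab) / (count(a) * count(b))` are assumptions made here. That score is a commonly described approximation of the likelihood gain, and actual WordPiece implementations may differ.

```python
from collections import Counter

def count_symbols_and_pairs(corpus):
    """Count symbols and adjacent symbol pairs, weighted by word frequency.

    `corpus` maps a word, given as a tuple of symbols, to its frequency.
    """
    symbols, pairs = Counter(), Counter()
    for word, freq in corpus.items():
        for s in word:
            symbols[s] += freq
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return symbols, pairs

def best_pair_bpe(corpus):
    """BPE criterion: pick the most frequent adjacent pair."""
    _, pairs = count_symbols_and_pairs(corpus)
    return max(pairs, key=lambda p: (pairs[p], p))

def best_pair_wordpiece(corpus):
    """Likelihood-style criterion (assumed approximation):
    count(ab) / (count(a) * count(b))."""
    symbols, pairs = count_symbols_and_pairs(corpus)
    return max(pairs, key=lambda p: (pairs[p] / (symbols[p[0]] * symbols[p[1]]), p))

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (split into characters) -> frequency.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

print(best_pair_bpe(corpus))        # most frequent pair
print(best_pair_wordpiece(corpus))  # pair with the highest score
print(merge_pair(corpus, best_pair_bpe(corpus)))
```

On this corpus the frequency criterion selects the common pair `u`+`g`. The likelihood-style criterion prefers `g`+`s`, a rarer pair whose parts seldom occur apart relative to how often they occur together.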

Sampling from a unigram language model and BPE-dropout [14] are sub-word regularization techniques: they expose the model to multiple segmentations of the same word during training.
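The idea behind BPE-dropout can be sketched as follows: during segmentation, each applicable merge is skipped with some probability, so the same word can be split differently on different passes. This is a simplified sketch under assumptions: the merge list and the names `bpe_segment`, `dropout`, and `rng` are hypothetical, and the procedure follows the general description in [14] rather than any particular library's code.

```python
import random

def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` using an ordered list of BPE merges.

    At every step, each merge applicable to the current symbol sequence
    is dropped with probability `dropout`. Among the merges that survive,
    the one learned earliest (lowest rank) is applied. Segmentation stops
    when no merge survives.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge list, in the order the merges were learned.
merges = [("u", "g"), ("h", "ug"), ("hug", "s")]

print(bpe_segment("hugs", merges))               # deterministic BPE
print(bpe_segment("hugs", merges, dropout=1.0))  # every merge dropped
print(bpe_segment("hugs", merges, dropout=0.5, rng=random.Random(0)))
```

With `dropout=0.0` this reduces to ordinary BPE segmentation, and with `dropout=1.0` the word falls back to individual characters. Intermediate values yield a random mix of segmentations.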

 

Libraries implementing segmentation algorithms

sentencepiece,

subword-nmt [16],

Morfessor [17],

MorphAGram [15].

 

 

[1] Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages, ICASSP 2022

[7] M. Huck, S. Riess, and A. Fraser, “Target-side Word Segmentation Strategies for Neural Machine Translation”, in Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 56–67, Copenhagen, Denmark, 2017.

[8] D. Kiecza, T. Schultz, and A. Waibel, “Data-Driven Determination of Appropriate Dictionary Units for Korean LVCSR”, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[9] Y. He, B. Hutchinson, P. Baumann, M. Ostendorf, E. Fosler-Lussier, and J. Pierrehumbert, “Subword-Based Modeling For Handling OOV Words In Keyword Spotting”, in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Italy, 2014.

[10] A. El-Desoky Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney, “Sub-Lexical Language Models For German LVCSR”, in Proceedings of the 2010 IEEE Spoken Language Technology Workshop (SLT), 2010.

[11] M. A. B. Shaik, A. E.-D. Mousa, R. Schlüter, and H. Ney, “Using morpheme and syllable based sub-words for Polish LVCSR”, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4680–4683, 2011.

[12] M. Creutz, T. Hirsimäki, M. Kurimo, and A. Puurula, “Morph-based speech recognition and modeling of out-of-vocabulary words across languages”, in ACM Transactions on Speech and Language Processing (TSLP), 5(1):3, 2007.

[13] M. Schuster and K. Nakajima, “Japanese and Korean voice search”, in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[14] I. Provilkov, D. Emelianenko, and E. Voita, “BPE-Dropout: Simple and Effective Subword Regularization”, in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, 2020.

[15] R. Eskander, F. Callejas, E. Nichols, J. Klavans, and S. Muresan, “MorphAGram: Evaluation and Framework for Unsupervised Morphological Segmentation”, in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 7112–7122, 2020.

[16] “Subword-nmt”, available at: https://github.com/rsennrich/subword-nmt [Accessed: 10 January 2021].

[17] “Morfessor”, available at: https://github.com/aaltospeech/morfessor [Accessed: 10 January 2021].

 

 
