1) VTLP based data augmentation

- vocal tract length perturbation (VTLP) [3], has shown gains on the TIMIT phoneme recognition task. VTLP was further extended to large vocabulary continuous speech recognition (LVCSR) in [4]. In [3] the VTLP warping factors for each utterance is randomly chosen from a range (e.g. [0.9, 1.1]). Using these sampled warping factors, improvement was reported on TIMIT phoneme recognition task. In [4], VTLP was used in large vocabulary continuous speech recognition (LVCSR) tasks, and an observation was made that selecting VTLP warping factors from a limited set of perturbation factors, was better.

 

2) Tempo perturbation based data augmentaion

- Speech rate perturbation, where the speech rate of the audio was modified by randomly selected factor, was investigated in [6]. In speech rate modification, the tempo of the signal is modified while ensuring that the pitch and spectral evelope of the signal does not change. The WSOLA [16] based implementation in the tempo command of the SoX tool was used to achieve this perturbation.

 

3) Speed perturbation based data augmentation

- To modify the speed of a signal we just resample the signal. The speed function of Sox was used for this. Two additional copies of the original training data were created by modifying the speed to 90% and 110% of the original rate.

 

4) SpecAugment

 

**In order to implement speed perturbation, we resample the signal using the speed function of the Sox audio manipulation tool; SoX, audio manipulation tool, (accessed March 25, 2015). [Online]. Available: http://sox.sourceforge.net/

 


 

[1] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition.” in INTERSPEECH, 2015, pp. 3586–3589.

 

 

+ Recent posts