Model Overview

  • Whisper is a Transformer-based encoder-decoder model.
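As a rough sketch of the input side of that architecture (constants taken from the Whisper paper; the helper name is our own): audio is resampled to 16 kHz, converted to an 80-channel log-Mel spectrogram with a 10 ms hop, and processed in 30-second chunks, which the encoder's convolutional stem then downsamples by a factor of 2 along time.

```python
SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz audio
HOP_LENGTH = 160       # 10 ms stride between spectrogram frames
CHUNK_SECONDS = 30     # audio is padded/trimmed to 30 s windows
N_MELS = 80            # mel channels (large-v3 uses 128)

def encoder_input_frames(seconds: int = CHUNK_SECONDS) -> int:
    """Number of log-Mel frames the encoder sees for one audio chunk."""
    return seconds * SAMPLE_RATE // HOP_LENGTH

frames = encoder_input_frames()       # 3000 frames per 30 s chunk
encoder_positions = frames // 2       # conv stem halves the time axis -> 1500
```

The decoder then attends over those encoder positions while autoregressively emitting text tokens.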

Training Data

  • Whisper ASR models are trained on a mixture of English-only and multilingual data, with a substantial amount of weakly labeled and pseudolabeled audio.

Whisper ASR V1 and V2

  • Trained on 680,000 hours of audio and corresponding transcripts from the internet.
  • Data distribution: 65% English audio with English transcripts (438k hours), 18% non-English audio paired with English transcripts (used for translation into English), and 17% non-English audio with transcripts in the same language, together spanning 98 languages.
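The split is easy to sanity-check with quick arithmetic (the computed figure differs slightly from the reported 438k hours because the percentages are rounded):

```python
TOTAL_HOURS = 680_000

# Rounded shares of the training mixture, as reported for Whisper V1/V2.
shares = {
    "english_audio_and_transcripts": 0.65,
    "non_english_audio_english_transcripts": 0.18,
    "non_english_audio_and_transcripts": 0.17,
}

hours = {name: int(TOTAL_HOURS * share) for name, share in shares.items()}
# 0.65 * 680,000 = 442,000 hours, close to the reported 438k.
```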

Whisper ASR V3

  • Trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
  • V3 shows a 10% to 20% reduction in errors compared to V2, depending on the language.
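That figure is a relative improvement in error rate; a small hypothetical helper makes the arithmetic explicit:

```python
def relative_error_reduction(wer_old: float, wer_new: float) -> float:
    """Fraction by which the error rate dropped, e.g. 0.10 for a 10% reduction."""
    return (wer_old - wer_new) / wer_old

# A 20% relative reduction takes a hypothetical 10.0% WER down to 8.0%;
# the WER values here are illustrative, not reported numbers.
```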

Training Details

  • Initial models were trained with AdamW optimizer, gradient norm clipping, and a linear learning rate decay after a warmup period.
  • No data augmentation or regularization was used initially due to the diversity and size of the dataset.
  • For Whisper Large V2, additional techniques like SpecAugment, Stochastic Depth, and BPE Dropout were introduced for regularization.
  • Different max learning rates were used for different model sizes.
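The warmup-then-linear-decay schedule described above can be sketched as a plain function (the peak learning rate, warmup length, and total step count below are illustrative placeholders, not the paper's values):

```python
def learning_rate(step: int, peak_lr: float = 1e-3,
                  warmup_steps: int = 2048,
                  total_steps: int = 1_048_576) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear warmup
    frac_left = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, frac_left)                # linear decay to zero
```

In practice this schedule would drive an AdamW optimizer step that also applies gradient norm clipping before each update.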

Hyperparameters

  • Tables of general hyperparameters, Whisper Large V2 hyperparameters, and per-model max learning rates appear in the original source but are not reproduced here.
Summary
