Mel Spectrogram
Definition
A time-frequency representation of audio where the frequency axis is scaled to match human auditory perception.
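A mel spectrogram is a common input representation for modern speech recognition models. It is created by computing a short-time Fourier transform (STFT) of the audio signal, taking the magnitude or power of each frame, and then pooling the linear frequency bins into bands spaced evenly on the mel scale. The mel scale is roughly linear below about 1 kHz and logarithmic above it, so higher frequencies, where human hearing resolves pitch less finely, are represented at coarser resolution.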
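The STFT-then-mel-mapping procedure can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the HTK-style mel formula and triangular filters are one common choice among several, and the default parameters (400-sample window, 160-sample hop, 80 bands) are assumptions picked for 16 kHz speech.

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumed choice; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Frequencies of the linear STFT bins, from 0 Hz to the Nyquist frequency.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 points equally spaced in mel give the triangle edges in Hz.
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, centre, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        # Triangle: rises from lo to centre, falls from centre to hi, zero elsewhere.
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(signal, sr, n_fft=400, hop=160, n_mels=80):
    # Slice the signal into overlapping Hann-windowed frames (no padding).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    # Power spectrum of each frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Pool linear bins into mel bands: shape (n_mels, n_frames).
    return mel_filterbank(sr, n_fft, n_mels) @ power.T

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
mel = mel_spectrogram(tone, sr)
print(mel.shape)  # (n_mels, n_frames)
```

With these settings the lowest mel bands are narrower than the 40 Hz spacing of the FFT bins, so a few of them may come out empty; production libraries typically handle this with normalization or a larger FFT size.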
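The result is a 2D image-like representation where the x-axis is time, the y-axis is mel-scaled frequency, and pixel values represent energy, usually on a logarithmic scale. Whisper, for example, uses 80-channel log-mel spectrograms (128 channels in its large-v3 model). This representation is effective because it preserves the spectral envelope, which carries most of the information relevant to speech, while discarding phase and fine frequency detail that matter less perceptually.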
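To make the image-like layout concrete, the sketch below applies log compression to a stand-in array of mel-band energies and shows the resulting shape. The values here are assumptions for illustration: 16 kHz audio, a 10 ms hop, 80 bands, a 30-second clip padded so that every hop yields a frame, and random numbers in place of real filterbank output. The `log_mel` helper and its floor value are hypothetical, not any particular model's preprocessing.

```python
import numpy as np

def log_mel(mel_power, floor=1e-10):
    # Clamp before taking the log so silent bins do not produce -inf.
    return np.log10(np.maximum(mel_power, floor))

sr, hop, n_mels = 16000, 160, 80      # assumed: 16 kHz audio, 10 ms hop, 80 bands
seconds = 30
n_frames = seconds * sr // hop        # one frame per hop, assuming padded input

rng = np.random.default_rng(0)
mel_power = rng.random((n_mels, n_frames))   # stand-in for real mel-band energies

image = log_mel(mel_power)
print(image.shape)  # rows are mel bands (y-axis), columns are time frames (x-axis)
```

Each column is one time step and each row one mel band, which is why convolutional and transformer encoders designed for images or sequences can consume the array directly.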