Feature Extraction
Definition
The process of converting raw audio waveforms into numerical representations suitable for machine learning models.
Feature extraction transforms raw audio into compact, informative representations that highlight speech-relevant characteristics while discarding irrelevant variation. The most common features in modern ASR are mel-frequency cepstral coefficients (MFCCs) and log-mel spectrograms.
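The process typically involves splitting the audio signal into short overlapping frames (commonly about 25 ms long with a 10 ms shift), multiplying each frame by a window function, applying a Fourier transform to convert each frame from the time domain to the frequency domain, mapping frequencies to a perceptual scale (mel or Bark), and taking the logarithm of the resulting energies. Further transformations are optional: a discrete cosine transform of the log-mel energies yields MFCCs, and delta computation captures how the features change over time.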
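The steps described here can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`frame_signal`, `mel_filterbank`, `log_mel`, `mfcc`, `delta`), the parameter defaults (16 kHz audio, 400-sample frames, 160-sample hop, 512-point FFT, 40 mel bands, 13 cepstral coefficients), the particular mel formula, and the simple two-point delta are all choices made for the example. Production toolkits differ in details such as pre-emphasis, filter normalization, and padding.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def frame_signal(x, frame_len, hop):
    # Slice the signal into overlapping frames; assumes len(x) >= frame_len.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose edges are equally spaced on the mel scale.
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mel_max = float(hz_to_mel(sr / 2.0))
    edges = mel_to_hz(np.linspace(0.0, mel_max, n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    # Frame, window, FFT, mel mapping, then log compression.
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energies = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energies + 1e-10)  # small floor avoids log(0)

def mfcc(log_mel_feats, n_ceps=13):
    # Unnormalized DCT-II across the mel bands; keep the first n_ceps terms.
    n_mels = log_mel_feats.shape[1]
    n = np.arange(n_mels)
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel_feats @ basis.T

def delta(feats):
    # Central difference over time, with edge frames repeated as padding.
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0
```

With these defaults, one second of 16 kHz audio produces 98 frames, so `log_mel` returns an array of shape (98, 40) and `mfcc` applied to it returns shape (98, 13). Stacking `mfcc` output with its `delta` along the feature axis is one common way to add temporal context.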