How Automatic Speech Recognition Works

8 min read · March 7, 2026

Automatic speech recognition (ASR) is the technology that converts spoken language into written text. Every time you dictate a message, use a voice assistant, or speak into an app like ummless, an ASR system is processing your voice through a sophisticated pipeline of signal processing and machine learning. Understanding how this pipeline works helps you get better results from any voice-to-text tool.

The ASR Pipeline at a Glance

An ASR system takes raw audio as input and produces a text transcript as output. Between those two endpoints, the audio passes through several stages:

  1. Audio capture and preprocessing
  2. Feature extraction
  3. Acoustic modeling
  4. Language modeling
  5. Decoding and output

Each stage transforms the data, progressively moving from raw sound waves toward meaningful language. Let's walk through each one.

Stage 1: Audio Capture and Preprocessing

The pipeline begins the moment a microphone converts air pressure variations into an electrical signal. This analog signal is then digitized through an analog-to-digital converter (ADC), which samples the waveform at a fixed rate — typically 16 kHz for speech, meaning 16,000 amplitude measurements per second.
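The arithmetic here is easy to verify with a toy signal. The sketch below digitizes one second of a 440 Hz tone at 16 kHz, the way an ADC would quantize a microphone signal to 16-bit samples (the tone and bit depth are illustrative, not real microphone input):

```python
import numpy as np

SAMPLE_RATE = 16_000          # 16,000 amplitude measurements per second
duration_s = 1.0

# A continuous-looking 440 Hz waveform, sampled at discrete instants
t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
signal = np.sin(2 * np.pi * 440.0 * t)

# Quantize to 16-bit integers, as a typical ADC would
pcm = np.round(signal * 32767).astype(np.int16)

print(len(pcm))   # 16000 samples for one second of audio
```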

Before any recognition happens, the raw audio undergoes preprocessing:

  • Noise reduction filters out background sounds like fans, traffic, or keyboard clicks. Spectral subtraction and adaptive filtering are common techniques.
  • Voice activity detection (VAD) identifies which segments of audio contain speech and which are silence. This prevents the system from wasting computation on dead air.
  • Normalization adjusts the volume level so that quiet speakers and loud speakers produce comparable input to the next stage.
  • Echo cancellation removes feedback when the microphone picks up audio from speakers, which is critical in conferencing scenarios.

The quality of this preprocessing step has an outsized impact on final accuracy. A clean, well-captured audio signal makes every downstream stage perform better.

Stage 2: Feature Extraction

Raw audio samples — even after preprocessing — contain far more information than an ASR system needs. The signal includes details about the speaker's vocal timbre, room acoustics, and microphone characteristics that are irrelevant to what words were said. Feature extraction distills the audio into a compact representation that emphasizes linguistically relevant information.

The most widely used feature representation is the Mel-frequency cepstral coefficient (MFCC). The extraction process works like this:

  1. The audio is divided into short overlapping frames, typically 25 milliseconds long with a 10-millisecond shift between frame starts (so consecutive frames overlap by 15 milliseconds).
  2. Each frame is multiplied by a window function (usually Hamming) to reduce edge artifacts.
  3. A Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain, producing a power spectrum.
  4. The power spectrum is mapped onto the Mel scale, a perceptual scale that mirrors how the human ear perceives pitch — compressing higher frequencies where human hearing is less sensitive.
  5. A discrete cosine transform (DCT) decorrelates the Mel-filtered values, producing the final MFCC vector.

The result is a sequence of compact feature vectors — usually 13 coefficients per frame, often augmented with delta and delta-delta features that capture how the spectrum changes over time. This gives the system roughly 39 values per 10-millisecond step, a dramatic compression from the 160 raw samples that arrive in each step (and the 400 samples spanned by each 25-millisecond frame).
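The five steps above can be sketched end to end in numpy. This is a toy implementation for illustration — real systems use a tested library such as librosa or torchaudio, which add refinements (pre-emphasis, liftering, careful filterbank edge handling) omitted here:

```python
import numpy as np

SR = 16_000
FRAME = 400      # 25 ms at 16 kHz
HOP = 160        # 10 ms shift between frame starts

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=SR):
    """Triangular filters spaced evenly on the mel scale (step 4)."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fb[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fb[i - 1, j] = (right - j) / max(right - center, 1)
    return fb

def mfcc(audio, n_coeffs=13, n_fft=512):
    n_frames = 1 + (len(audio) - FRAME) // HOP
    fb = mel_filterbank(n_fft=n_fft)
    window = np.hamming(FRAME)
    feats = []
    for i in range(n_frames):
        frame = audio[i * HOP : i * HOP + FRAME] * window      # steps 1-2
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2         # step 3
        log_mel = np.log(fb @ power + 1e-10)                   # step 4
        # Step 5: DCT-II decorrelates the log-mel energies
        n = len(log_mel)
        k = np.arange(n_coeffs)[:, None]
        dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
        feats.append(dct @ log_mel)
    return np.array(feats)

audio = np.random.randn(SR)          # one second of noise as a stand-in
features = mfcc(audio)
print(features.shape)                # (98, 13): ~one frame per 10 ms
```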

Modern end-to-end systems sometimes learn their own features directly from raw audio using convolutional layers, but MFCCs remain the conceptual foundation.

Stage 3: The Acoustic Model

The acoustic model is the core of the ASR system. It takes the sequence of feature vectors and predicts which speech sounds — called phonemes — are most likely at each time step.

Traditional Approach: HMMs and GMMs

For decades, the dominant approach combined Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs). Each phoneme was modeled as an HMM with several states, and each state's emission probability was modeled by a GMM. The Viterbi algorithm found the most likely sequence of phoneme states given the observed features.
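The Viterbi search is short enough to sketch directly. The toy below decodes a two-state HMM with invented probabilities — in a real recognizer the states would be phoneme sub-states and the emission probabilities would come from GMMs (or, later, neural networks):

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state path for an observation sequence."""
    n_states = len(start_p)
    T = len(obs)
    # delta[t, s]: log-prob of the best path ending in state s at time t
    delta = np.full((T, n_states), -np.inf)
    backptr = np.zeros((T, n_states), dtype=int)
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            backptr[t, s] = np.argmax(scores)
            delta[t, s] = scores[backptr[t, s]] + np.log(emit_p[s, obs[t]])
    # Trace the best path backwards from the final best state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Two phoneme states, two observable feature "symbols" (all values invented)
start = np.array([0.8, 0.2])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # trans[i, j]: P(j | i)
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # emit[s, o]: P(o | s)
print(viterbi([0, 0, 1, 1], start, trans, emit))  # → [0, 0, 1, 1]
```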

Modern Approach: Deep Neural Networks

Starting around 2012, deep neural networks replaced GMMs as the emission model within HMM frameworks, dramatically improving accuracy. Today, most production ASR systems use fully neural architectures:

  • Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks process the feature sequence sequentially, maintaining a hidden state that captures temporal context.
  • Connectionist Temporal Classification (CTC) allows the network to output character or phoneme sequences without requiring pre-aligned training data. The CTC loss function marginalizes over all possible alignments between input frames and output symbols.
  • Transformer-based models like OpenAI's Whisper use self-attention to process the entire utterance in parallel, capturing long-range dependencies more effectively than RNNs.
  • Encoder-decoder architectures with attention mechanisms generate the transcript token by token, attending to relevant parts of the audio at each step.
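CTC's alignment trick is easiest to see in its collapsing rule. In greedy decoding, the per-frame outputs are collapsed by first merging consecutive repeats and then dropping the blank token — the blank is what lets the network emit genuine double letters. A minimal sketch, with an illustrative blank symbol:

```python
BLANK = "-"

def ctc_collapse(frame_outputs):
    """Collapse per-frame symbols into the final label sequence."""
    collapsed = []
    prev = None
    for sym in frame_outputs:
        if sym != prev:            # merge consecutive repeats
            collapsed.append(sym)
        prev = sym
    return "".join(s for s in collapsed if s != BLANK)  # drop blanks

print(ctc_collapse(["c", "c", "-", "a", "a", "t", "t"]))  # → cat
print(ctc_collapse(["h", "e", "l", "-", "l", "o"]))       # → hello
```

Note the blank between the two l's in the second example: without it, the repeated l's would be merged into one.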

Apple's on-device speech recognition — used by ummless on macOS — employs a neural acoustic model optimized to run on the Apple Neural Engine, balancing accuracy with the latency and privacy requirements of local processing.

Stage 4: The Language Model

The acoustic model proposes candidate transcriptions, but spoken language is inherently ambiguous. "Recognize speech" and "wreck a nice beach" are acoustically similar. The language model resolves this ambiguity by scoring how likely each candidate is as a sequence of words in the target language.

Language models operate at the word or subword level and assign probabilities based on context:

  • N-gram models estimate the probability of a word given the previous N-1 words. A trigram model predicts each word based on the two preceding words. These are fast and compact but capture limited context.
  • Neural language models use RNNs or Transformers to estimate word probabilities conditioned on the entire preceding context. They capture long-range dependencies and generalize better to unseen word combinations.
  • Domain-specific models can be fine-tuned on specialized vocabularies — medical terminology, legal language, or programming jargon — to improve accuracy in specific contexts.
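An n-gram model is simple enough to build from scratch. The sketch below estimates a bigram model from a three-sentence toy corpus by raw maximum likelihood — real LMs train on vastly more text and apply smoothing so unseen pairs don't get probability zero:

```python
from collections import Counter, defaultdict

corpus = [
    "recognize speech with this tool",
    "recognize speech quickly",
    "wreck a nice beach",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) by maximum likelihood over the toy corpus."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][word] / context_counts[prev]

print(bigram_prob("recognize", "speech"))  # → 1.0
print(bigram_prob("recognize", "beach"))   # → 0.0
```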

The language model acts as a prior that biases the system toward plausible English (or whatever the target language is). When the acoustic model is uncertain between two interpretations, the language model breaks the tie in favor of the more linguistically natural option.

Stage 5: Decoding

Decoding is the search process that combines acoustic model scores and language model scores to find the single best transcript for a given utterance. This is a computationally intensive search over an enormous space of possible word sequences.

Beam search is the most common decoding strategy. Rather than exploring every possible path (which would be exponentially expensive), beam search maintains only the top K candidates at each time step, pruning unlikely paths early. The beam width K controls the tradeoff between search thoroughness and speed.
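The pruning idea can be sketched with invented per-step log-probabilities standing in for combined acoustic and language model scores (a real decoder searches over lattices of subword units, not whole words):

```python
import math

def beam_search(step_scores, beam_width):
    """step_scores[t] maps each candidate token to its log-prob at step t."""
    beams = [([], 0.0)]                # (token sequence, total log-prob)
    for scores in step_scores:
        candidates = []
        for seq, seq_score in beams:
            for token, logp in scores.items():
                candidates.append((seq + [token], seq_score + logp))
        # Prune: keep only the top beam_width hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

steps = [
    {"recognize": math.log(0.6), "wreck": math.log(0.4)},
    {"speech": math.log(0.7), "a": math.log(0.3)},
]
best_seq, best_score = beam_search(steps, beam_width=2)
print(best_seq)   # → ['recognize', 'speech']
```

With a wider beam the search keeps more alternatives alive at each step, at proportionally higher cost; a beam width of 1 degenerates to greedy decoding.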

The decoder produces the final output: a sequence of words, optionally with confidence scores, timestamps, and punctuation. Some systems also perform post-processing steps like:

  • Inverse text normalization — converting spoken forms like "twenty-six dollars" to "$26"
  • Punctuation insertion — adding periods, commas, and question marks based on prosodic cues and language model predictions
  • Capitalization — identifying proper nouns and sentence boundaries
  • Disfluency removal — filtering out "um," "uh," and false starts
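Two of these post-processing steps are easy to sketch with simple rules: a filler-word filter for disfluency removal, and a toy inverse-text-normalization rule for dollar amounts. Production systems use trained models and far richer rule sets; the word lists here are illustrative.

```python
import re

FILLERS = {"um", "uh", "er", "hmm"}

def remove_disfluencies(text):
    """Drop common filler words from a transcript."""
    words = [w for w in text.split() if w.lower().strip(",") not in FILLERS]
    return " ".join(words)

NUMBER_WORDS = {"twenty-six": 26, "ten": 10, "five": 5}  # tiny illustrative subset

def inverse_normalize_dollars(text):
    """Convert e.g. 'twenty-six dollars' to '$26' for the phrases we know."""
    for phrase, value in NUMBER_WORDS.items():
        text = re.sub(rf"\b{phrase} dollars\b", f"${value}", text)
    return text

raw = "um so that costs uh twenty-six dollars"
print(inverse_normalize_dollars(remove_disfluencies(raw)))
# → so that costs $26
```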

How This Applies to You

When you speak into ummless, your voice passes through this entire pipeline — audio capture, feature extraction, acoustic modeling, language modeling, and decoding — in real time. On macOS, Apple's SFSpeechRecognizer handles these stages using models optimized for the Neural Engine.

Understanding the pipeline explains practical tips for better recognition:

  • Speak clearly and at a consistent pace to give the acoustic model clean input.
  • Minimize background noise so preprocessing doesn't have to fight for your voice.
  • Use natural phrasing so the language model can leverage its understanding of common word patterns.
  • Pause between distinct thoughts to help the VAD and decoder segment your speech correctly.

The ASR transcript is just the starting point. In ummless, that raw transcript then flows into the AI refinement stage, where a large language model restructures and polishes the text. The better the initial transcription, the better the final refined output.

Where ASR Is Heading

The field is converging on large, end-to-end models trained on massive multilingual datasets. These models collapse the traditional pipeline stages into a single neural network that maps audio directly to text. They handle diverse accents, background noise, and domain vocabulary with less manual engineering than earlier systems.

At the same time, on-device ASR is becoming increasingly capable. Running recognition locally — as ummless does — eliminates network latency, works offline, and keeps your audio data private. As device hardware improves and models get more efficient through quantization and distillation, the accuracy gap between on-device and cloud-based ASR continues to narrow.