How Speech-to-Text Models Are Trained: From Raw Audio to Accurate Transcription
7 min read · March 7, 2026
Automatic speech recognition (ASR) has improved dramatically in recent years. Models like OpenAI's Whisper, Apple's on-device speech framework, and Google's Universal Speech Model can transcribe spoken language with near-human accuracy. But how do these models actually learn to convert sound waves into text? Understanding the training pipeline reveals why modern ASR is so capable — and where it still falls short.
The Training Data Problem
Every machine learning model is only as good as its training data, and ASR models are especially data-hungry. Training a competitive speech-to-text model requires tens of thousands of hours of transcribed audio. Whisper, for example, was trained on 680,000 hours of multilingual audio scraped from the web.
Sources of Training Data
Training data for ASR models typically comes from several sources:
- Audiobooks and read speech — LibriSpeech, a popular benchmark dataset, contains 1,000 hours of read English speech from public domain audiobooks. Read speech is clean and well-enunciated, making it ideal for initial training.
- Broadcast media — News programs, podcasts, and radio broadcasts provide diverse speaking styles, accents, and background noise conditions.
- Conversational speech — Datasets like Switchboard and Fisher contain spontaneous telephone conversations, capturing the disfluencies, interruptions, and informal patterns of real speech.
- Web-scraped audio — Large-scale models increasingly use audio paired with subtitles or captions from the internet, trading curation quality for sheer volume.
Data Preprocessing
Raw audio undergoes significant preprocessing before it reaches the model:
- Segmentation — Long audio files are split into shorter utterances, typically 10-30 seconds.
- Alignment — Audio segments are aligned with their corresponding transcripts. Poor alignment introduces noise into training.
- Normalization — Transcripts are standardized: numbers are spelled out, punctuation is normalized, and speaker labels are removed or standardized.
- Filtering — Segments with excessive background noise, music, or unintelligible speech are discarded. Quality filtering can remove 20-40% of web-scraped data.
- Augmentation — Training data is artificially expanded by adding background noise, changing playback speed, shifting pitch, or applying room impulse responses to simulate different acoustic environments.
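Noise augmentation is simple enough to sketch directly. The snippet below mixes a background-noise clip into a speech signal at a chosen signal-to-noise ratio; the function name and interface are illustrative, not taken from any particular toolkit:

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a speech signal at a target SNR (in dB)."""
    # Tile or trim the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / noise_power == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Applying this with SNRs sampled from, say, 5-20 dB per training example yields a model that has seen the same utterance under many acoustic conditions.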
Turning Sound Into Features
Neural networks cannot process raw audio waveforms efficiently. Instead, audio is converted into an image-like time-frequency representation called a spectrogram.
Mel Spectrograms
The standard input representation for modern ASR models is the log-mel spectrogram. Here is how it works:
- The raw audio waveform is divided into short overlapping frames (typically 25ms windows with 10ms hop).
- A Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain.
- The frequency axis is warped to the mel scale, which mimics human auditory perception — we are more sensitive to differences in low frequencies than high frequencies.
- The power values are converted to a logarithmic scale.
The result is a 2D matrix where the x-axis is time, the y-axis is frequency (on the mel scale), and the values represent energy. This representation captures the essential features of speech — formants, pitch contours, and temporal patterns — in a compact format.
Whisper, for example, uses 80-channel log-mel spectrograms computed from 25ms windows with 10ms stride, normalized to the range [-1, 1].
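The four steps above can be implemented from scratch in a few dozen lines of NumPy. This is a teaching sketch, not production code (real pipelines use optimized libraries such as librosa or torchaudio), but it follows the same frame-FFT-mel-log recipe:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Compute a log-mel spectrogram: frame -> FFT -> mel warp -> log."""
    # 1. Split the waveform into overlapping frames
    #    (400 samples = 25 ms window, 160 samples = 10 ms hop at 16 kHz).
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)  # taper frames to reduce spectral leakage

    # 2. FFT each frame and take the power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 3. Warp the frequency axis with a triangular mel filterbank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    mel_power = power @ fbank.T

    # 4. Convert power to a logarithmic scale (floored to avoid log(0)).
    return np.log10(np.maximum(mel_power, 1e-10)).T  # shape: (n_mels, n_frames)
```

One second of 16 kHz audio produces a matrix of shape (80, 98): 80 mel channels by 98 frames, one frame every 10 ms.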
Model Architecture
Modern ASR models overwhelmingly use the transformer architecture, the same family of models behind GPT and BERT. There are three main architectural approaches.
Encoder-Decoder Models
Whisper uses a standard encoder-decoder transformer:
- The encoder processes the mel spectrogram and produces a sequence of hidden representations that capture acoustic and linguistic information.
- The decoder generates text tokens autoregressively, attending to both the encoder output and previously generated tokens.
This architecture naturally handles the variable-length mapping between audio frames and text tokens.
Connectionist Temporal Classification (CTC)
CTC-based models, such as wav2vec 2.0 when fine-tuned for ASR, use only an encoder. The CTC loss function handles the alignment problem by marginalizing over all possible alignments between the input audio and the output text. CTC models are simpler and faster at inference but historically less accurate than encoder-decoder models.
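The key idea, that many frame-level alignments collapse to one transcript, is easiest to see in CTC's collapse rule: merge consecutive repeats, then remove blanks. A minimal sketch, with `-` standing in for the blank symbol:

```python
def ctc_collapse(path: list[str], blank: str = "-") -> str:
    """Collapse a frame-level CTC path into a transcript:
    merge repeated symbols, then drop blank symbols."""
    out = []
    prev = None
    for sym in path:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank; blanks separate genuine repeats.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)
```

Both `hh-ee-ll-llo-` and `h-e-l-lo` collapse to `hello`; training marginalizes over every path that collapses to the reference text.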
Transducer Models
The RNN-Transducer (RNN-T) architecture, used in many production on-device systems, combines an audio encoder with a prediction network and a joint network. It emits output tokens incrementally as audio streams in, making it well-suited for real-time applications.
The Training Process
Pre-training vs. Fine-tuning
Large ASR models are typically trained in two phases:
- Pre-training — The model learns general audio representations from massive amounts of data. For self-supervised models like wav2vec 2.0, this phase uses unlabeled audio. For supervised models like Whisper, it uses weakly-labeled web data.
- Fine-tuning — The pre-trained model is adapted to specific domains, languages, or tasks using smaller, high-quality labeled datasets.
Training Objectives
The training objective depends on the architecture:
- Cross-entropy loss — Encoder-decoder models minimize the cross-entropy between predicted and actual text tokens at each decoding step.
- CTC loss — CTC models minimize the negative log-likelihood of the correct transcript, summed over all valid alignments.
- Contrastive loss — Self-supervised pre-training uses contrastive objectives, where the model learns to distinguish true audio representations from distractors.
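The contrastive objective can be made concrete with a toy single-vector version. In wav2vec 2.0-style pre-training the similarities are computed over batches of quantized latents; this simplified NumPy sketch of the InfoNCE-style loss keeps only the essentials:

```python
import numpy as np

def info_nce_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style contrastive loss: the context vector should be more
    similar (by cosine) to the true target than to any distractor."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarity to the positive first, then to each distractor.
    sims = np.array([cos(context, positive)] + [cos(context, d) for d in distractors])
    logits = sims / temperature
    # Softmax cross-entropy with the positive at index 0.
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

The loss is near zero when the context matches the positive and large when it matches a distractor, which is exactly the pressure that makes the learned representations discriminative.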
Multitask Training
Whisper's key innovation was multitask training. The same model learns to perform transcription, translation, language identification, and timestamp prediction by encoding the task as special tokens in the decoder input. This forces the model to develop robust internal representations that generalize across tasks.
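Concretely, the task specification is just a sequence of special tokens prepended to the decoder input. The token names below follow Whisper's published format; the helper function itself is illustrative:

```python
def whisper_style_prompt(language: str, task: str, timestamps: bool) -> list[str]:
    """Assemble a Whisper-style multitask decoder prompt.

    task is "transcribe" or "translate"; language is an ISO code like "en".
    """
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt
```

Swapping `<|transcribe|>` for `<|translate|>` turns the same model into a speech translator, with no architectural change at all.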
Practical Training Details
Training a model like Whisper requires significant compute:
- Hundreds of GPUs running for weeks
- Mixed-precision training (FP16) to reduce memory usage
- Gradient accumulation across multiple batches
- Learning rate warmup followed by cosine decay
- Dropout and label smoothing for regularization
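The warmup-then-cosine-decay schedule from the list above fits in a few lines. The specific hyperparameters here (peak rate, step counts) are placeholders, not Whisper's actual values:

```python
import math

def lr_schedule(step: int, peak_lr: float = 1e-3,
                warmup_steps: int = 2000, total_steps: int = 100_000) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly so early updates with random weights stay small.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup protects the randomly initialized model from destructive early updates; the slow cosine tail lets the model settle into a minimum.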
Evaluation and Benchmarks
Word Error Rate
The primary metric for ASR evaluation is Word Error Rate (WER), defined as:
WER = (Substitutions + Insertions + Deletions) / Total Reference Words
A WER of 5% means that, on average, 5 errors occur for every 100 reference words. Note that because insertions count as errors, WER can exceed 100%. Human transcription error rates on conversational speech are typically cited at around 5%, so models approaching this threshold are considered near-human.
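The metric is a word-level edit distance, computed with the standard dynamic-programming recurrence. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed as the word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` is one substitution over six reference words, or about 16.7%. Production evaluation also normalizes text first (casing, punctuation, number formats) so that formatting differences do not inflate the score.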
Standard Benchmarks
Models are evaluated on standardized test sets:
- LibriSpeech test-clean — Read speech in quiet conditions. State-of-the-art WER is below 2%.
- LibriSpeech test-other — Read speech with more challenging acoustic conditions. State-of-the-art WER is around 4%.
- Switchboard — Conversational telephone speech. WER around 5-6%.
- Common Voice — Crowd-sourced recordings across many languages and accents.
Beyond WER
WER has known limitations. It treats all errors equally — misrecognizing "the" is penalized the same as misrecognizing a critical keyword. Researchers are increasingly supplementing WER with:
- Semantic error rate — Measures meaning preservation rather than exact word match.
- Latency — How quickly the model produces output, critical for real-time applications.
- Robustness — Performance degradation under noise, accents, and domain shift.
Where the Field Is Heading
Several trends are shaping the next generation of ASR models:
- On-device models — Apple, Google, and others are pushing high-quality ASR onto phones and laptops, eliminating the need for cloud processing. This is the approach Ummless takes, leveraging Apple's on-device speech framework for privacy-preserving transcription.
- Multimodal models — Models that jointly process audio, text, and visual information can leverage context to improve accuracy.
- Personalization — Adapting models to individual speakers, vocabularies, and domains with minimal additional data.
- Lower resource languages — Extending high-quality ASR to the thousands of languages that lack large training datasets, using techniques like cross-lingual transfer learning.
Understanding how these models are built gives you a deeper appreciation for what happens when you speak into a microphone and see text appear on screen. It is not magic — it is decades of research in signal processing, linguistics, and machine learning, distilled into models that can run on the device in your pocket.