End-to-End Speech-to-Text

Definition

An automatic speech recognition (ASR) approach in which a single neural network maps audio input directly to text output, without separate acoustic and language model components.

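End-to-end speech-to-text systems replace the traditional pipeline of a separate acoustic model, pronunciation dictionary, and language model with a single neural network trained to map audio directly to text. Examples include Whisper (an attention-based encoder-decoder Transformer), wav2vec 2.0 (a self-supervised pretrained encoder, typically fine-tuned with a CTC objective), and recognizers built on the Conformer encoder architecture.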
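
To make the "audio directly to text" idea concrete, here is a minimal sketch of greedy CTC decoding, one common way an end-to-end network's per-frame outputs are turned into a transcript. It is an illustration under stated assumptions, not how any particular model above is implemented: the vocabulary, the frame scores, and the function name ctc_greedy_decode are all hypothetical, and a real system would obtain the scores from a trained network rather than a hand-written list.

```python
# Hypothetical toy vocabulary; index 0 is the CTC blank symbol.
VOCAB = ["<blank>", "c", "a", "t"]
BLANK_ID = 0


def ctc_greedy_decode(frame_scores, vocab=VOCAB, blank_id=BLANK_ID):
    """Turn per-frame scores into text: argmax, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for scores in frame_scores:
        # Pick the highest-scoring symbol for this audio frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the symbol changes and is not the blank.
        if best != previous and best != blank_id:
            decoded.append(vocab[best])
        previous = best
    return "".join(decoded)


# Made-up scores standing in for a network's output on six audio frames.
frames = [
    [0.10, 0.80, 0.05, 0.05],  # "c"
    [0.10, 0.70, 0.10, 0.10],  # "c" again, merged with the previous frame
    [0.90, 0.03, 0.03, 0.04],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "a"
    [0.10, 0.10, 0.10, 0.70],  # "t"
    [0.80, 0.10, 0.05, 0.05],  # blank
]

print(ctc_greedy_decode(frames))  # prints "cat"
```

No pronunciation dictionary or separate language model appears anywhere in this path: the only lookup is from output indices to characters. Production systems often replace the greedy step with beam search, which is omitted here for brevity.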

The advantages include a simpler training pipeline, joint optimization of all components toward a single objective, and the ability to learn representations directly from data rather than relying on hand-designed features and a fixed pronunciation dictionary. The trade-off is that end-to-end models typically require much more training data to reach competitive accuracy.
