Automatic Speech Recognition (ASR)

Definition

Technology that converts spoken language into written text.

Automatic speech recognition is the broad field of converting audio signals containing human speech into text transcripts. Modern ASR systems use deep neural networks, typically transformer-based architectures, and the largest are trained on hundreds of thousands of hours of transcribed audio.

ASR pipelines generally include an acoustic model that maps audio features to linguistic units, a language model that predicts likely word sequences, and a decoder that combines both to produce the final transcript. End-to-end models like Whisper collapse these stages into a single neural network.
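
As a rough illustration of how a decoder can combine the two models, the sketch below picks the candidate transcript with the best weighted sum of an acoustic score and a language model score. The candidate transcripts, the probabilities, and the names ACOUSTIC_LOG_PROBS, LM_LOG_PROBS, decode, and lm_weight are all made up for this example; a real decoder searches over a very large hypothesis space (for instance with beam search) rather than scoring a fixed list.

```python
import math

# Hypothetical acoustic scores: how well each candidate matches the audio.
# These numbers are invented for illustration, not produced by a real model.
ACOUSTIC_LOG_PROBS = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}

# Hypothetical language model scores: how plausible each word sequence is.
LM_LOG_PROBS = {
    "recognize speech": math.log(0.01),
    "wreck a nice beach": math.log(0.0001),
}


def decode(acoustic, lm, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def combined_score(transcript):
        # Log-linear combination of the acoustic and language model scores.
        return acoustic[transcript] + lm_weight * lm[transcript]

    return max(acoustic, key=combined_score)


best = decode(ACOUSTIC_LOG_PROBS, LM_LOG_PROBS)
acoustic_only = decode(ACOUSTIC_LOG_PROBS, LM_LOG_PROBS, lm_weight=0.0)
print(best)
print(acoustic_only)
```

With these invented scores, the acoustic model alone slightly prefers "wreck a nice beach", while adding the language model score shifts the choice to "recognize speech". This is the kind of correction the language model is meant to provide.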

Frequently Asked Questions

What is automatic speech recognition?

ASR is the technology that converts spoken language into written text using machine learning models trained on large datasets of transcribed audio.

How accurate is modern ASR?

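State-of-the-art ASR models achieve word error rates (WER) below 5% on clean speech, which is roughly fewer than one error per 20 reference words and approaches human-level transcription accuracy. Error rates are typically higher on noisy, accented, or overlapping speech.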
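
Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR output, divided by the number of words in the reference. The following is a minimal sketch of that calculation; the function name word_error_rate is chosen for this example, and it splits on whitespace without the text normalization (casing, punctuation) that evaluation toolkits usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the output contains many extra words.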
