Word Error Rate (WER)

Definition

The standard metric for evaluating speech recognition accuracy, calculated as the ratio of errors to total reference words.

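Word Error Rate is computed as (Substitutions + Insertions + Deletions) / Total Reference Words, where the error counts come from the minimum-edit-distance alignment between the system output and the reference transcript. A WER of 0% means perfect transcription. Because insertions count as errors while the denominator is fixed at the reference length, values above 100% are possible when the system produces many insertions. Human transcription error rates on clean speech are typically 4-5%, so WERs approaching this range indicate near-human performance.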
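The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `wer` is hypothetical, words are split on whitespace only, and no text normalization (casing, punctuation, number formatting) is applied, although real evaluations usually normalize both transcripts first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words.

    Assumption: words are obtained by whitespace splitting, with no
    normalization of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    # The minimum edit distance equals S + I + D for the best alignment.
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 / 6
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    # Three insertions against a one-word reference: 3 / 1, i.e. 300%
    print(wer("hello", "hello there big world"))
```

The second call shows how WER can exceed 100%: the hypothesis adds three words to a one-word reference, so the error count is larger than the denominator.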

WER has known limitations: it treats all word errors equally, so misrecognizing 'the' is penalized the same as misrecognizing a critical keyword. Despite this, WER remains the de facto standard for comparing ASR systems across benchmarks like LibriSpeech, Switchboard, and Common Voice.

Frequently Asked Questions

What is a good word error rate?

A WER below 5% is considered near-human accuracy for clean speech. For noisy or conversational speech, WERs of 10-15% are typical for state-of-the-art systems.
