Transformer Model
Definition
A neural network architecture based on self-attention mechanisms that processes entire sequences in parallel.
The transformer architecture, introduced in 2017, revolutionized natural language processing and subsequently speech recognition. Unlike RNNs, which process a sequence one step at a time, transformers use self-attention to relate every position in a sequence to every other position simultaneously.
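The all-positions-at-once computation can be sketched as plain scaled dot-product self-attention. This is a minimal illustration, not a full transformer layer: the learned query/key/value projection matrices are omitted (identity projections are assumed), and there is no multi-head split or positional encoding.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) matrix of frame/token embeddings.
    Simplification: query, key, and value projections are identity here;
    a real transformer layer learns a separate weight matrix for each.
    """
    d = X.shape[-1]
    # (seq_len, seq_len) score matrix: every position compared to every other
    scores = X @ X.T / np.sqrt(d)
    # softmax over the key axis to get attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output position is a weighted mix of all input positions
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # toy sequence: 5 positions, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # (5, 8): one updated vector per position, computed in parallel
```

Note that the whole score matrix is produced by a single matrix multiply, which is why there is no sequential dependency between positions, unlike an RNN's step-by-step recurrence.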
In ASR, transformers power models like Whisper (encoder-decoder), wav2vec 2.0 (encoder-only with CTC), and Conformer (combining convolution with attention). The self-attention mechanism is particularly effective for speech because it can capture both local acoustic patterns and long-range linguistic dependencies. The parallel processing nature of transformers also makes them efficient to train on modern GPU hardware.
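For the CTC-based encoder-only setup mentioned above (wav2vec 2.0 style), the simplest way to turn per-frame label predictions into text is greedy CTC decoding: collapse repeated labels, then drop blanks. The sketch below assumes per-frame argmax ids are already available; the example ids are invented for illustration, and production systems typically use beam search with a language model instead.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: collapse repeats, then remove blank tokens.

    frame_ids: per-frame argmax label ids from a CTC-trained encoder.
    blank: id of the CTC blank symbol (assumed to be 0 here).
    """
    out = []
    prev = None
    for t in frame_ids:
        # keep a label only when it differs from the previous frame
        # and is not the blank symbol
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# hypothetical frame ids: "h h <blank> e l l <blank> l o"
ids = [8, 8, 0, 5, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(ids))  # [8, 5, 12, 12, 15] -> "h e l l o"
```

The blank symbol is what lets CTC emit genuine double letters: the blank between the two runs of 12 above prevents them from being collapsed into one.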