Transformer Architecture

Definition

A neural network architecture based on self-attention that powers modern language models and speech recognition systems.

The transformer architecture, introduced in the 2017 paper 'Attention Is All You Need,' replaced recurrent processing with self-attention, allowing models to capture relationships between all positions in a sequence simultaneously. The architecture consists of stacked layers of multi-head self-attention and feed-forward networks, with residual connections and layer normalization.
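The components above can be sketched in a few lines. The following is a minimal single-head illustration in NumPy (real transformers use multiple heads, learned parameters, and trained weights; the weight matrices and dimensions here are arbitrary assumptions for demonstration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Project the sequence into queries, keys, and values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    # Every position attends to every other position simultaneously.
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def transformer_layer(x, Wq, Wk, Wv, W1, W2):
    # Attention sublayer with residual connection and layer normalization.
    x = layer_norm(x + self_attention(x, Wq, Wk, Wv))
    # Position-wise feed-forward sublayer, same residual + norm pattern.
    ff = np.maximum(0, x @ W1) @ W2  # ReLU between two linear maps
    return layer_norm(x + ff)

# Toy dimensions (assumed for illustration only).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d_model)) * 0.1

out = transformer_layer(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (4, 8) — output keeps the input's sequence shape
```

Because the attention score matrix is computed for all positions in one matrix multiplication rather than step by step, the whole sequence can be processed in parallel, which is the property that makes transformers efficient to train on GPUs.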

Transformers have become the dominant architecture for both NLP and ASR. Encoder-only variants (BERT, wav2vec) excel at understanding tasks. Decoder-only variants (GPT) excel at generation. Encoder-decoder variants (T5, Whisper) handle sequence-to-sequence tasks. The architecture's parallelism enables efficient training on modern GPU hardware.
