Tokenization

Definition

The process of converting raw text into a sequence of tokens that a language model can process.

Tokenization transforms input text into the numerical token IDs that neural networks require. Modern tokenizers use learned subword algorithms — Byte Pair Encoding (BPE), WordPiece, or the unigram language model popularized by the SentencePiece library — that balance vocabulary size against sequence length by splitting rare words into common subword pieces.
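To make the idea concrete, here is a minimal sketch of a single BPE merge step: count adjacent symbol pairs across a toy corpus and fuse the most frequent pair into a new symbol. The corpus, helper names, and frequencies are illustrative assumptions, not any real tokenizer's implementation.

```python
# Minimal sketch of one BPE merge step (illustrative toy example only).
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
```

A full BPE trainer simply repeats this merge step until the vocabulary reaches a target size, recording the merge order so the same merges can be replayed at inference time.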

Tokenization is a critical but often overlooked step. Different models use different tokenizers, so the same text produces different token sequences (and different token counts) for different models. Tokenization errors or inconsistencies can affect model performance, especially for code, URLs, and non-English languages.
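The point that different tokenizers produce different token counts can be shown with greedy longest-match segmentation over two hypothetical vocabularies; both vocabularies below are invented for illustration, not taken from any real model.

```python
# Greedy longest-match segmentation: the same text yields different
# token sequences (and counts) under different vocabularies.
def tokenize(text, vocab):
    """Split text greedily into the longest matching vocab entries,
    falling back to single characters for unknown spans."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens

vocab_a = {"token", "ization", "un"}          # hypothetical model A vocab
vocab_b = {"tok", "en", "iz", "ation"}        # hypothetical model B vocab

print(tokenize("tokenization", vocab_a))  # ['token', 'ization'] -> 2 tokens
print(tokenize("tokenization", vocab_b))  # ['tok', 'en', 'iz', 'ation'] -> 4 tokens
```

Because token counts differ per model, anything metered in tokens (context limits, API pricing, latency) differs too, even for identical input text.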
