How Large Language Models Work

8 min read · March 7, 2026

Large language models (LLMs) are the technology behind AI text refinement in ummless. When you speak a rough thought and ummless transforms it into polished prose, an LLM is doing the heavy lifting. Understanding how these models work — even at a high level — helps you write better prompts, choose the right presets, and develop realistic expectations about what AI can and cannot do with your text.

What Is a Large Language Model?

An LLM is a neural network trained on vast amounts of text data to predict the next token in a sequence. Despite that simple objective, the resulting model develops a rich understanding of language structure, grammar, facts, reasoning patterns, and stylistic conventions. Models like Claude, GPT, and Llama contain billions of parameters — numerical weights learned during training — that encode this knowledge.

The key insight is that next-token prediction, at sufficient scale, produces emergent capabilities. A model trained to predict "the cat sat on the ___" learns not just that "mat" is likely, but develops internal representations of syntax, semantics, and world knowledge that generalize to tasks the model was never explicitly trained for.

Tokenization: Breaking Text Into Pieces

Before an LLM can process text, it must convert characters into numerical tokens. Tokenization is the bridge between human-readable text and the model's internal mathematics.

Modern LLMs use subword tokenization algorithms like Byte-Pair Encoding (BPE) or SentencePiece. Rather than treating each word as a single token (which would require an impossibly large vocabulary) or each character as a token (which would make sequences too long), subword tokenizers find a middle ground:

  • Common words like "the" or "and" become single tokens.
  • Less common words are split into subword pieces: "unforgettable" might become ["un", "forget", "table"].
  • Rare words or technical terms are broken into smaller fragments, ensuring any text can be represented.

A typical LLM has a vocabulary of 32,000 to 100,000 tokens. Each token is mapped to an integer ID, and each ID is associated with a learned embedding vector — a dense numerical representation that captures the token's meaning in context.
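To make the subword idea concrete, here is a toy tokenizer that uses greedy longest-match against a tiny hand-made vocabulary. Real BPE tokenizers apply a learned sequence of merge rules instead, and the vocabulary and IDs below are invented purely for illustration:

```python
# Toy subword tokenizer: greedy longest-match against a small, hypothetical
# vocabulary. Real BPE applies learned merge rules, but the effect is the
# same: common words stay whole, rare words split into known pieces.
VOCAB = {"the": 0, "cat": 1, "un": 2, "forget": 3, "table": 4,
         "t": 5, "a": 6, "b": 7, "l": 8, "e": 9}

def tokenize(word: str) -> list[int]:
    ids = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                ids.append(VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return ids

print(tokenize("the"))            # [0] -- a common word is one token
print(tokenize("unforgettable"))  # [2, 3, 4] -- split into subword pieces
```

Note that "unforgettable" never appears in the vocabulary, yet it is still representable, which is exactly the property that lets any input text be tokenized.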

Tokenization matters practically. When ummless sends your transcribed speech to Claude for refinement, the text is first tokenized. The token count affects processing speed and cost. Efficient tokenization means your refinement completes faster.

The Transformer Architecture

Nearly all modern LLMs are built on the Transformer architecture, introduced in 2017. The Transformer replaced earlier recurrent architectures with a mechanism called self-attention that processes all positions in a sequence simultaneously rather than one at a time.

A Transformer consists of stacked layers, each containing two main sub-layers:

Self-Attention

Self-attention allows every token in the input to "look at" every other token and determine how relevant each one is. For the sentence "The developer tested the code she wrote," the pronoun "she" needs to attend to "developer" to resolve the reference correctly.

The mechanism works through three learned projections for each token:

  • Query (Q): What information is this token looking for?
  • Key (K): What information does this token offer?
  • Value (V): What content does this token carry?

Attention scores are computed by taking the dot product of queries and keys, scaling the result, applying a softmax to get a probability distribution, and then using those probabilities to create a weighted sum of values. This produces a new representation for each token that incorporates information from across the entire sequence.
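The computation above can be sketched in a few lines of NumPy. This is a single attention head with random vectors standing in for learned projections; a real model would also mask future positions during generation, which is omitted here for clarity:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: (seq_len, d_k) arrays of query/key/value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, head dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 8): one new representation per token
```

Each row of `w` is the probability distribution describing how much that token "looks at" every other token.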

In practice, Transformers use multi-head attention — running multiple attention computations in parallel, each with different learned projections. Different heads can specialize in different types of relationships: one head might track syntactic dependencies, another might track semantic similarity, and another might track positional relationships.

Feed-Forward Networks

After attention, each token's representation passes through a position-wise feed-forward network — typically two linear transformations with a nonlinear activation function between them. This component adds representational capacity and allows the model to transform the attended information into useful features for the next layer.

Layer Norms and Residual Connections

Each sub-layer is wrapped with a residual connection (adding the sub-layer's input to its output) and layer normalization. Residual connections allow gradients to flow through deep networks during training, and layer normalization stabilizes the activations.
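A minimal sketch of how these pieces wrap a sub-layer, assuming the post-norm arrangement from the original Transformer paper (many modern models normalize before the sub-layer instead):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU between, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    # Residual connection (x + fn(x)) wrapped in layer normalization.
    return layer_norm(x + fn(x))

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32
x = rng.normal(size=(4, d_model))                      # 4 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = sublayer(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(y.shape)  # (4, 8): same shape in and out, so layers can stack
```

The input and output shapes match, which is what lets dozens of identical layers be stacked on top of each other.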

Stacking It All Together

A full LLM stacks dozens of these Transformer layers. Claude and similar models may have 80 or more layers. As information flows through the stack, the model builds progressively more abstract representations — from surface-level token features in early layers to high-level semantic and reasoning features in later layers.

Training: Learning from Text

LLM training happens in two major phases:

Pre-training

The model is trained on a massive corpus of text — books, articles, code, websites — using self-supervised learning. For each position in the training data, the model predicts the next token given all preceding tokens. The prediction is compared against the actual next token, and the resulting error signal is used to update the model's billions of parameters through backpropagation and gradient descent.
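The error signal described above is the cross-entropy between the model's predicted distribution and the actual next token. A minimal NumPy version, with toy logits standing in for real model outputs:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of next-token predictions.
    logits: (seq_len, vocab) raw scores; targets: (seq_len,) true token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab = 10
targets = np.array([3, 7, 1])

# Confident, correct predictions: large logit on the true token -> loss near 0.
good = np.full((3, vocab), -10.0)
good[np.arange(3), targets] = 10.0
print(next_token_loss(good, targets))                      # close to 0

# A model that knows nothing predicts uniformly -> loss = ln(vocab_size).
print(next_token_loss(np.zeros((3, vocab)), targets))      # ln(10) ≈ 2.30
```

Training drives this number down across billions of positions; everything else the model learns is a side effect of minimizing it.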

Pre-training requires enormous computational resources — thousands of GPUs running for weeks or months. The result is a general-purpose language model that can complete text in a wide variety of styles and domains.

Post-training

Raw pre-trained models are capable but not particularly useful or safe. Post-training aligns the model with human preferences through techniques like:

  • Supervised fine-tuning (SFT): Training on curated examples of helpful, well-formatted responses to instructions.
  • Reinforcement Learning from Human Feedback (RLHF): Training a reward model on human preference data, then using reinforcement learning to optimize the LLM's outputs against that reward model.
  • Constitutional AI (CAI): Using a set of principles to guide the model's self-improvement through critique and revision.

Post-training transforms the model from a next-token predictor into an assistant that follows instructions, maintains appropriate boundaries, and produces outputs in useful formats.

Inference: Generating Text

When you use ummless to refine a transcript, the model performs inference — generating new text based on a prompt. Here's what happens:

  1. Your prompt (the system instructions from your preset, plus your raw transcript) is tokenized and fed through the Transformer stack to produce a representation of the full input.
  2. The model predicts a probability distribution over its entire vocabulary for the next token.
  3. A sampling strategy selects one token from this distribution:
    • Greedy decoding always picks the highest-probability token. It's deterministic but can produce repetitive or generic text.
    • Temperature sampling scales the logits before softmax. Higher temperature flattens the distribution (more random, more creative), lower temperature sharpens it (more focused, more predictable).
    • Top-p (nucleus) sampling samples from the smallest set of tokens whose cumulative probability exceeds a threshold p, balancing diversity and coherence.
    • Top-k sampling restricts sampling to the k highest-probability tokens.
  4. The selected token is appended to the sequence, and the model runs another forward pass to predict the following token. (In practice, cached attention states from earlier steps mean only the new token needs to be processed.)

This autoregressive process continues until the model produces a stop token or reaches a maximum length. The result is streamed back token by token — which is why you see refined text appear progressively in ummless.
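The sampling strategies above can be combined in a single function. A simplified NumPy sketch — production samplers differ in detail, but the logic is the same:

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token id from raw logits.
    temperature scales the logits; top_k and top_p prune the distribution."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # Keep only the k highest-probability tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(int(np.argmax(logits)))    # 0 -- greedy decoding always picks this
print(sample(logits, temperature=0.7, top_k=2, rng=np.random.default_rng(0)))
```

With `top_k=1` this collapses to greedy decoding; raising the temperature spreads probability mass toward the lower-ranked tokens.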

Context Windows and Attention

Every LLM has a context window — the maximum number of tokens it can process at once. This window includes both the input prompt and the generated output. Claude's context window is large enough to handle substantial transcripts, but understanding this constraint explains why very long recordings may need to be processed in segments.
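The budgeting implication is simple but easy to forget: input and output share one window. A sketch, with an illustrative window size rather than any specific model's limit:

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 200_000) -> bool:
    # The window must hold the prompt AND everything the model will generate.
    # 200_000 is an illustrative figure, not a guarantee for any model.
    return prompt_tokens + max_output_tokens <= context_window

print(fits_in_context(5_000, 2_000))      # True: a typical transcript fits
print(fits_in_context(199_500, 2_000))    # False: must be split into segments
```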

Within the context window, the self-attention mechanism allows the model to reference any part of the input when generating each output token. This is what enables the model to maintain coherence across long passages and follow complex instructions that reference specific parts of your transcript.

How This Powers Refinement

When ummless refines your speech, it constructs a prompt that includes:

  • System instructions from your selected preset — defining the desired output format, tone, and structure.
  • Your raw transcript — the speech-to-text output from the ASR system.
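Conceptually, the assembled prompt looks something like the following. The structure and field names are illustrative assumptions, not ummless's actual internals:

```python
# Hypothetical sketch of refinement prompt assembly. The dict layout and
# example preset text are invented for illustration.
def build_prompt(preset_instructions: str, transcript: str) -> dict:
    return {
        "system": preset_instructions,   # format, tone, and structure rules
        "user": transcript,              # raw ASR output to be refined
    }

preset = "Rewrite as polished prose. Remove filler words. Keep the meaning."
raw = "um so basically the uh deadline moved to friday"
prompt = build_prompt(preset, raw)
print(prompt["system"])
```

The model sees both parts in one context window, which is why clear, specific preset instructions have such direct leverage over the output.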

The LLM processes this prompt through its Transformer layers, drawing on its trained knowledge of language, writing conventions, and instruction-following to produce refined output. It restructures sentences, removes filler words, fixes grammar, and reformats the text according to the preset's specifications — all through the same next-token prediction mechanism.

The quality of refinement depends on several factors: the accuracy of the original transcript, the clarity of the preset instructions, and the model's capabilities. By understanding that the model generates text one token at a time based on statistical patterns, you can write better presets that give clear, specific instructions and produce more consistent results.

Limitations to Keep in Mind

LLMs are powerful but not infallible:

  • They can hallucinate — generating plausible-sounding but incorrect information. In the context of refinement, this means the model might occasionally alter the meaning of what you said.
  • They're sensitive to prompting — small changes in instructions can produce meaningfully different outputs. This is why preset design matters.
  • They don't truly understand — they process statistical patterns in text, not meaning in the human sense. They can produce remarkably coherent output without any grounded understanding of the world.
  • They're frozen in time — a model's knowledge is limited to its training data cutoff. It won't know about events that happened after training.

These limitations are manageable when you know they exist. Review your refined output, iterate on your presets, and treat the LLM as a powerful tool rather than an infallible editor.