How LLMs Actually Work: The Intuition Behind Every Token
You type a message. A few seconds later, a machine writes back something that feels like thinking. It argues, it explains, it cracks jokes. But behind that illusion of understanding is a system that does exactly one thing: predict the next token. The fact that this single trick produces something that looks like intelligence is one of the most surprising results in the history of computer science. Let’s open the box and see how it actually works.
First, Forget Everything You Think You Know
LLMs don’t “understand” language the way you do. They don’t have a database of facts they look up. They don’t parse grammar rules. They don’t think, then speak. What they do is something stranger and, arguably, more elegant: they learn statistical patterns across billions of sentences, and those patterns turn out to encode meaning, logic, and even reasoning. Nobody explicitly programmed any of that.
Think of it this way. A child who reads a million books doesn’t memorize them. They develop intuition, a sense of what words belong together, what follows what, what makes sense. LLMs do something eerily similar, except they do it with math.
Step 1: Turning Words Into Numbers
A neural network can’t read text. It can only process numbers. So the first step is converting raw text into a sequence of integers. This process is called tokenization.
Here’s the counterintuitive part: tokens aren’t always words. The model uses a fixed vocabulary of subword units, typically tens of thousands to a few hundred thousand of them. Common words like “the” get their own token. Rare words get split: “unbelievable” might become ["un", "believ", "able"]. This is why models sometimes struggle with spelling and letter counting. They see token IDs, not individual letters.
The algorithm behind this (usually BPE, short for Byte Pair Encoding) works by starting with individual characters, then iteratively merging the most frequent adjacent pairs until the vocabulary hits the target size. It’s a compression algorithm repurposed for language.
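The merge loop described above can be sketched in a few lines. This is a toy version under stated assumptions: the function name `learn_bpe_merges` and the tiny corpus are made up for illustration, and production tokenizers typically work on bytes and handle whitespace and pre-tokenization, which this sketch ignores.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Each distinct word becomes a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


words = ["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

After two merges, “hug” has become a single symbol because its pieces were the most frequent neighbors. Run the loop for tens of thousands of merges over a large corpus and you get a vocabulary of the kind described above.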
Key insight: The model never sees raw text. It sees a sequence of integer IDs. But these IDs are just indices with no meaning yet. For that, we need the next step.
Step 2: From IDs to Meaning, Embeddings
A token ID like 3797 is just a number. It doesn’t encode that “cat” is an animal, that it’s related to “kitten”, or that it’s nothing like “democracy.” To give it meaning, we use that integer as an index into a giant lookup table called the embedding matrix, and retrieve a dense vector of, say, 4,096 floating-point numbers.
This vector is the model’s starting representation of that token. But here’s what’s remarkable: these vectors aren’t hand-designed. They’re learned during training, and they end up encoding semantic relationships geometrically. Words with similar meanings cluster together. In the classic word-embedding example, “king” minus “man” plus “woman” lands near “queen.” That’s not a party trick; structure like this falls out of the training objective rather than being programmed in.
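To make the lookup concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the three-word vocabulary, the token IDs, and the 4-dimensional vectors are hand-picked, whereas real embedding rows are learned and have thousands of dimensions.

```python
import math

# Hypothetical toy vocabulary and embedding table (hand-picked, not learned).
vocab = {"cat": 0, "kitten": 1, "democracy": 2}
embedding_matrix = [
    [0.9, 0.8, 0.1, 0.0],  # row 0: "cat"
    [0.8, 0.9, 0.2, 0.0],  # row 1: "kitten"
    [0.0, 0.1, 0.9, 0.8],  # row 2: "democracy"
]


def embed(token):
    """Token -> integer ID -> row of the embedding matrix."""
    return embedding_matrix[vocab[token]]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(embed("cat"), embed("kitten")))     # close to 1
print(cosine_similarity(embed("cat"), embed("democracy")))  # close to 0
```

The lookup itself is nothing more than indexing a row. The geometry (which rows point in similar directions) is where the meaning lives.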
Positional Encoding
There’s a problem, though. If you just look up embeddings for each token, you lose all information about word order. “The dog bit the man” and “the man bit the dog” would produce the same set of vectors. To fix this, we add positional encodings: a second set of vectors that encode where each token sits in the sequence. The original Transformer used sine and cosine waves at different frequencies; later models either learn position vectors directly or, as in many recent LLMs, encode relative position by rotating the query and key vectors inside attention (rotary position embeddings, RoPE).
The result: each token enters the model as a single vector that encodes both what it is and where it is.
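Here is a small sketch of the sine-and-cosine scheme from the original Transformer, with the position vector simply added to the token vector. The function names are made up for this example, and the 4-dimensional size is only to keep the output readable.

```python
import math


def sinusoidal_position(pos, d_model):
    """Position vector: sine on even dimensions, cosine on odd dimensions."""
    vec = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_position(token_vectors):
    """Add the matching position vector to each token embedding."""
    d_model = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]


# The same token embedding at two different positions.
same_token = [1.0, 0.0, 0.0, 0.0]
out = add_position([same_token, same_token])
print(out[0])  # [1.0, 1.0, 0.0, 1.0]
print(out[1])  # differs from out[0], so the model can tell the positions apart
```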
Step 3: Attention, the Mechanism That Changed Everything
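This is the core innovation. The 2017 paper “Attention Is All You Need” introduced the Transformer, an architecture built almost entirely around a mechanism that lets every token look at the other tokens in the sequence and decide what’s relevant. Attention itself predates the paper; the new idea was to drop recurrence and rely on attention alone. The mechanism is called self-attention, and it’s the reason Transformers work.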
Here’s the intuition. Imagine you’re reading the sentence: “The cat sat because it was tired.” When you hit the word “it”, your brain instantly resolves that “it” refers to “cat.” You don’t think about this. You just know. Self-attention is the mathematical machinery that lets a model do the same thing.
How It Works (Without the Math)
For each token, the model computes three things:
- Query: “What am I looking for?”
- Key: “What do I contain?”
- Value: “What information do I carry?”
The attention score between two tokens is the dot product of one token’s Query and another’s Key, scaled down by the square root of the key dimension so the scores stay in a stable range. High score = high relevance. A softmax normalizes each token’s scores into a probability distribution (they sum to 1), and the output for each token is a weighted sum of all Values.
In our example, when processing “it”, the Query for “it” has a high dot-product with the Key for “cat”, so “cat” gets high attention weight, and its Value (encoding the concept of a cat) gets mixed into the representation of “it.”
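For readers who do want a little of the math, here is a minimal sketch of a single attention head in plain Python. The Query, Key, and Value vectors are hand-picked so that “it” attends to “cat”; in a real model they come from learned projections of each token’s vector. The division by the square root of the key dimension is the standard scaling from the Transformer paper, and the optional `causal` flag (a name chosen for this sketch) mimics how generative models hide later tokens.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head, one sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # With a causal mask, token i may only look at tokens 0..i.
        visible = i + 1 if causal else len(keys)
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in keys[:visible]
        ]
        weights = softmax(scores)
        # Output = weighted sum of the visible Value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["cat", "sat", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
queries = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]  # only "it" is looking for something

outputs, weights = attention(queries, keys, values)
print(dict(zip(tokens, weights[2])))  # most of the weight lands on "cat"
```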
Multi-Head Attention
One attention computation captures one type of relationship. But language has many simultaneous relationships: syntactic (subject-verb), semantic (pronoun-antecedent), positional (adjacent words), long-range (topic coherence). So the model runs attention multiple times in parallel, typically 32 to 128 “heads”, each learning to focus on different patterns. Their outputs are concatenated and projected back down.
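Why it matters: Before attention, recurrent models processed text sequentially (left to right). Information from the word at position 1 had to survive roughly 200 update steps to influence position 200. With attention, every token can directly access any other token it is allowed to see in a single step. In generative LLMs that means all earlier tokens; a causal mask hides later ones so the model cannot peek at the text it is supposed to predict. This is what made long-range understanding practical.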
Step 4: The Full Transformer Layer
Attention is only half of each Transformer layer. After attention, the output passes through a feed-forward network (FFN): two linear transformations with a nonlinearity in between. If attention is about which tokens to combine, the FFN is about what to compute from those combinations.
Each layer also includes:
- Layer normalization: keeps values from exploding or vanishing
- Residual connections: a “skip connection” that adds the input directly to the output, letting gradients flow cleanly and letting each layer learn a refinement rather than a complete transformation
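Put together, one layer can be sketched as follows. This is a simplified illustration, not any particular model’s implementation: it assumes the “pre-norm” arrangement (normalize, transform, then add back), omits the learned scale and shift of layer normalization, uses ReLU as the nonlinearity, and takes the attention step as a placeholder function `attend` supplied by the caller.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def feed_forward(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU nonlinearity in between."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def transformer_block(xs, attend, ffn_params):
    """xs: list of token vectors. attend: maps a list of vectors to a list of vectors."""
    # Attention sub-layer: mixes information across tokens, then the residual add.
    attended = attend([layer_norm(x) for x in xs])
    xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, attended)]
    # Feed-forward sub-layer: applied to each token independently, then the residual add.
    out = []
    for x in xs:
        f = feed_forward(layer_norm(x), *ffn_params)
        out.append([a + b for a, b in zip(x, f)])
    return out
```

Notice what the residual connections buy you: if both sub-layers output zeros, the block passes its input through unchanged. Each layer only has to learn what to add.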
A modern LLM stacks 32 to 128 of these layers (sometimes more). Research has shown that different depths learn different things. Early layers tend to capture syntax: part-of-speech, phrase boundaries. Middle layers encode semantic meaning like word sense and entity types. Deep layers handle abstract reasoning, factual recall, and planning.
| Depth | What it tends to learn | Example |
|---|---|---|
| Layers 1–10 | Syntax, local structure | Subject-verb agreement |
| Layers 11–40 | Semantics, relationships | “It” refers to “cat” |
| Layers 41+ | Facts, reasoning, planning | “Capital of France is Paris” |
This isn’t a strict division; it’s a gradient. But probing experiments consistently find this layered emergence of capability.
Step 5: Predicting the Next Token
After passing through all layers, each token’s vector has been refined from a raw embedding into a rich, contextual representation. The final step: a prediction head that converts this vector into a probability distribution over the entire vocabulary.
This is a single matrix multiplication followed by a softmax function. The output is a list of 50,000+ probabilities, one per token in the vocabulary, that sum to 1. The next token is then sampled from this distribution: usually not by always taking the highest-probability token (greedy decoding), but with controlled randomness (temperature, top-p) to keep the output varied.
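Here is a minimal sketch of that last step, starting from the raw scores (logits) the prediction head produces. The function names are chosen for this example, and details such as the order in which filters are applied vary between real implementations.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities. Temperature < 1 sharpens, > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Zero out everything else and renormalize what is left.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return the index of one sampled token."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(softmax(logits))                   # baseline distribution
print(softmax(logits, temperature=0.5))  # more peaked on the first token
print(sample_next_token(logits, temperature=0.8, top_p=0.9, rng=random.Random(0)))
```

As the temperature approaches zero, sampling behaves more and more like greedy decoding. Top-p trims the long tail of unlikely tokens so that randomness is spent only on plausible candidates.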
The chosen token is appended to the input, and the entire process repeats. Token by token, the model generates its response. This is called autoregressive generation, where each prediction is conditioned on everything that came before.
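The loop itself is short. In this sketch the “model” is a hypothetical lookup table (`TOY_TABLE`) that only looks at the last token, which is a big simplification: a real LLM conditions on the entire context. The loop structure is the point, and it uses greedy decoding to stay deterministic.

```python
# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
TOY_TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "sat": {"on": 1.0},
    "on": {"a": 1.0},
    "a": {"mat": 1.0},
    "mat": {"<end>": 1.0},
}


def toy_next_token_probs(context):
    """Receives the full context, like a real model, but only uses the last token."""
    return TOY_TABLE[context[-1]]


def generate(prompt, next_token_probs, max_new_tokens=10, end_token="<end>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)       # one full pass over the context
        next_token = max(probs, key=probs.get)  # greedy: take the most likely token
        if next_token == end_token:
            break
        tokens.append(next_token)               # the output becomes part of the input
    return tokens


print(generate(["the"], toy_next_token_probs))
# ['the', 'cat', 'sat', 'on', 'a', 'mat']
```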
This is the entire trick. There is no separate “thinking” module, no planning algorithm, no hand-built world model. Just next-token prediction, applied with enough parameters and enough data, producing behavior that looks like understanding.
Step 6: Training, Where the Magic Comes From
The architecture is elegant, but it’s inert until trained. Training is where billions of parameters learn their values. Think of it like educating a student from scratch. The student goes through distinct phases: absorbing knowledge, learning how to apply it, and then sitting exams that sharpen their judgment. Each phase builds on the last, and none of them alone is sufficient.
Phase 1: Pre-training, Reading Everything
A student who wants to become a doctor first has to read. Textbooks, research papers, case studies, patient notes. Years of it. They’re not being tested yet. They’re building a model of the world: how symptoms relate to diagnoses, how drugs interact, what a healthy reading looks like versus a worrying one. The more they read, the richer and more interconnected their knowledge becomes.
Pre-training works the same way. The model is fed a huge fraction of the written internet: books, articles, code, forums, Wikipedia, scientific papers, hundreds of billions to trillions of tokens in total. For each position in each document, it attempts to predict the next token. When it’s wrong, the error signal flows backwards through the network (backpropagation), nudging every parameter slightly toward being less wrong.
The error is measured with the cross-entropy loss: the model’s predicted probability distribution is compared to reality (which token actually came next), and the gap is quantified as the negative log of the probability the model gave to the correct token. Over hundreds of thousands to millions of gradient updates, each averaged over a large batch of text, the model gets better and better at predicting what comes next in human text.
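A tiny worked example may help. The sketch below computes the cross-entropy loss for one prediction and takes a single gradient step. As a simplification, the step is applied directly to the logits; in real training the gradient flows further back into the network’s parameters. The three-token vocabulary and the learning rate are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(logits, target_index):
    """Cross-entropy loss for one prediction, plus its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    # For softmax + cross-entropy, the gradient is (predicted - actual).
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad


# Hypothetical vocabulary: index 0 = "mat", 1 = "floor", 2 = "democracy".
logits = [0.0, 0.0, 0.0]  # an untrained model: every token equally likely
target = 0                # the token that actually came next was "mat"

loss_before, grad = loss_and_grad(logits, target)
learning_rate = 0.5
logits = [l - learning_rate * g for l, g in zip(logits, grad)]
loss_after, _ = loss_and_grad(logits, target)

print(loss_before)  # ln(3), about 1.0986
print(loss_after)   # smaller: the model is now slightly less wrong
```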
After pre-training, the model has absorbed an enormous amount of knowledge. It knows that Paris is the capital of France, that recursion requires a base case, that “the cat sat on the” is very likely followed by “mat” or “floor”. But knowing things is not the same as knowing how to respond helpfully. A pre-trained model asked a question will just as readily continue the question as answer it, because both are valid completions of the pattern. The student has read every book in the library. They still don’t know how to sit an exam.
The compute cost is staggering. Frontier models train on tens of thousands of GPUs for months, consuming on the order of gigawatt-hours of electricity. But the result is a model that has compressed an extraordinary amount of human knowledge into its weights.
Phase 2: Supervised Fine-tuning, Learning the Format
Before any reinforcement learning, there’s a quieter step: supervised fine-tuning (SFT). Human contractors write examples of ideal behavior, a question followed by a good answer, a request followed by a helpful response. The model is trained on these examples directly, the same way pre-training works, but on a much smaller and carefully curated dataset.
This is where the model learns the format of being an assistant. It learns that when a human asks a question, the appropriate next tokens are an answer, not another question. It learns to be direct, to structure its responses, to stay on topic.
The student analogy: this is the study group before the exam. Someone who has already passed sits down with the student and walks them through a set of worked examples. Not to transfer knowledge (the student already has that from pre-training) but to show them how that knowledge is supposed to be expressed in this context.
Phase 3: Reinforcement Learning, the Exam That Sharpens Judgment
Fine-tuning gets the format right, but it doesn’t teach the model to discriminate between responses that are merely correct and responses that are genuinely good. The distinction between an adequate answer and an excellent one requires feedback, not just examples.
This is where reinforcement learning from human feedback (RLHF) comes in.
Human raters are shown multiple model outputs for the same prompt and asked to rank them by quality. Which response is more accurate? More honest? More clearly explained? A separate reward model is trained on many thousands of these rankings to predict which outputs humans prefer. Then the LLM is fine-tuned using reinforcement learning (commonly PPO) to generate responses that score highly on this reward signal, while a penalty keeps it close enough to its fine-tuned starting point that it doesn’t collapse into reward-hacking gibberish. A simpler alternative, DPO, skips the separate reward model and the reinforcement learning loop and trains directly on the preference pairs.
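Two small formulas sit at the heart of this phase, sketched below. The first is a pairwise preference loss of the kind commonly used to train reward models (a Bradley-Terry style objective); the second is a reward with a penalty for drifting away from a reference model. The source does not specify either formula, so treat both, along with the function names and the `beta` value, as assumptions for illustration.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Small when the reward model scores the human-preferred response higher.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-diff))


def penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward minus a penalty for assigning the response a very different
    log-probability than the reference model did."""
    return reward - beta * (logprob_policy - logprob_reference)


# The reward model agrees with the human ranking: low loss.
print(preference_loss(2.0, 0.0))
# The reward model disagrees with the human ranking: high loss.
print(preference_loss(0.0, 2.0))
# A response the tuned model likes far more than the reference did gets docked.
print(penalized_reward(1.0, logprob_policy=-1.0, logprob_reference=-3.0))
```

The penalty term is what stops the model from chasing quirks of the reward model: the further its outputs drift from what the reference model would plausibly say, the less the reward is worth.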
Think of it as the exam. The student writes an answer. The examiner marks it. The mark isn’t just right-or-wrong; it’s a graded judgment of quality, clarity, and usefulness. The student takes the feedback, adjusts their approach, and writes the next answer slightly better. Repeat this across millions of examples, and the student learns not just what is true but how to communicate it in the way that earns the best marks.
This is what turns a text-completion engine into a useful assistant. It’s also why models can feel like they have opinions and preferences. They’ve learned, from thousands of human comparisons, which response styles people actually find valuable. The model isn’t expressing its own preferences. It’s reflecting yours back at you, optimized and amplified.
What LLMs Don’t Do
Understanding what LLMs aren’t is as important as understanding what they are.
They don’t retrieve facts from a database. Knowledge is encoded implicitly in the weights, not stored in any queryable form. This is why they can be confidently wrong (hallucination). The model produces whatever token sequence has high probability given the context, whether or not it’s true.
They don’t reason in the way humans do. When a model solves a math problem step by step, it’s generating tokens that look like reasoning, because it’s seen millions of examples of step-by-step solutions in training data. Whether this constitutes “real” reasoning is one of the biggest open questions in AI.
They don’t learn from your conversation. The model is stateless between requests. On every turn it receives the full conversation so far as input text, processes it, and generates the reply token by token. It doesn’t update its weights. It doesn’t remember you tomorrow (unless a memory system is built on top).
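A sketch of what a chat application does around the model makes the point. The names `render_prompt` and `chat_turn` and the “role: text” layout are hypothetical; real systems use model-specific chat templates. The `generate` argument stands in for the model, which sees only the text it is handed.

```python
def render_prompt(history):
    """Flatten the whole conversation into one block of text."""
    lines = [f"{role}: {text}" for role, text in history]
    return "\n".join(lines) + "\nassistant:"


def chat_turn(history, user_message, generate):
    """One turn: resend everything so far, get a reply, return the longer history."""
    history = history + [("user", user_message)]
    reply = generate(render_prompt(history))
    return history + [("assistant", reply)]


# Stand-in "model" that records what it was shown on each call.
prompts_seen = []


def fake_generate(prompt):
    prompts_seen.append(prompt)
    return "ok"


history = chat_turn([], "My name is Ada.", fake_generate)
history = chat_turn(history, "What is my name?", fake_generate)

# The second call only "knows" the name because the first turn was sent again.
print("My name is Ada." in prompts_seen[1])  # True
```

The memory lives in the application’s `history` list, not in the model. Drop the list and the model has no trace of the conversation.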
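They don’t have goals or desires. The model was trained to optimize a mathematical objective (predict the next token well, later adjusted toward responses people rate highly). At inference time it isn’t pursuing anything; it is computing a probability distribution. Anything that looks like motivation is a pattern it’s reproducing from training data.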
The Unreasonable Effectiveness of Scale
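Perhaps the most startling finding in LLM research is that bigger models don’t just get better. They can develop qualitatively new capabilities. A model with 1 billion parameters might struggle with basic arithmetic. At 10 billion, it handles simple math. At 100 billion, it can write code, explain metaphors, and translate between languages it was barely trained on. (The exact sizes are illustrative; data and training quality shift these thresholds considerably.) These are called emergent abilities: on many benchmarks they seem to appear suddenly at certain scale thresholds rather than gradually, although some researchers argue that part of the suddenness comes from all-or-nothing scoring rather than from the model itself.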
The scaling laws discovered by researchers show that a model’s loss (its next-token prediction error) falls as a smooth power-law function of three variables: parameter count, dataset size, and compute budget. Double the compute, get a predictable improvement in loss, even when specific downstream skills are harder to forecast. This is what has driven the race to build ever-larger models. It’s one of the most reliable empirical relationships in AI.
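What “power law” means in practice is easy to see with numbers. The constants below (`a` and `alpha`) are invented for illustration and do not come from any published fit; published fits also usually include an irreducible-loss term that this sketch leaves out.

```python
def predicted_loss(compute, a=10.0, alpha=0.05):
    """Hypothetical power law: loss = a * compute^(-alpha)."""
    return a * compute ** -alpha


for compute in (1e20, 2e20, 4e20, 8e20):
    print(f"{compute:.0e} FLOPs -> loss {predicted_loss(compute):.4f}")

# Each doubling multiplies the loss by the same factor, 2 ** -alpha,
# no matter how much compute you started with.
ratio_small = predicted_loss(2e20) / predicted_loss(1e20)
ratio_large = predicted_loss(2e22) / predicted_loss(1e22)
print(ratio_small, ratio_large)
```

That constant ratio is what makes the relationship useful for planning: the curve never jumps, it just keeps bending the same way, so the loss of a larger run can be estimated from smaller ones.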
So, What Is an LLM?
An LLM is a compression algorithm. It compresses the statistical regularities of human language, and with them, to a surprising degree, human knowledge and reasoning, into a set of neural network weights. When you prompt it, you’re giving it a context, and it’s decompressing the most likely continuation.
The fact that “most likely continuation” can include correct code, accurate medical advice, creative fiction, and nuanced philosophical argument tells us something profound about the structure of language itself. Language isn’t just communication; it’s a lossy encoding of thought. And a model that masters the patterns of language ends up mastering a shadow of the patterns of thought.
That’s the real lesson of LLMs. Not that we’ve built a thinking machine, but that the boundary between statistical patterns and genuine understanding is far blurrier than anyone expected.