Thinking Tokens and the Reasoning Gap in Agent Systems
LLMs predict the next token. That’s the whole machinery. But for hard problems like multi-step math, code debugging, or planning across many constraints, committing to output tokens while you’re still figuring out the answer leads to shallow, brittle results. Thinking mode was built to fix this. Understanding how it works, and where it breaks down, is essential if you’re building anything serious with language models today.
The Core Problem: No Scratch Space
When a standard LLM generates a response, it’s doing two things at once: reasoning through the problem and producing the final answer. There’s no separation. Every token it outputs is committed, and once it’s in the sequence, it shapes every subsequent token.
For simple questions this is fine. For anything multi-hop, it’s a fundamental bottleneck. The model must jump from question to answer without any space to explore wrong paths, backtrack, or build up intermediate conclusions. It’s like being asked to write the final draft of a proof without scratch paper.
Thinking mode decouples these two phases. It gives the model a scratchpad region, a block of generation where it can explore hypotheses, flag uncertainties, and explicitly reject its own earlier reasoning before writing the clean answer.
What’s Actually Happening Inside One Inference Call
When thinking mode is active, a single model call produces two distinct regions of output.
Phase 1: The thinking block. The model generates tokens into a scratchpad. These are computationally identical to any other tokens: same vocabulary, same attention heads, same softmax. What makes them “thinking tokens” is behavioral. The model has been fine-tuned to use this region for exploration rather than final output. It can write “wait, that’s wrong, let me reconsider” and mean it, because those words are scaffolding, not answers.
Every thinking token sits in the KV cache and is attended to by everything that follows. By the time the model transitions to the output phase, it’s already built up a dense prefix of intermediate reasoning. The answer is being generated by a model that has already done the hard work.
Phase 2: The output block. The model writes the user-facing response. By this point the reasoning is done. The serving layer (Anthropic, Google, OpenAI) separates the two regions and returns them in distinct API fields. Whether thinking tokens are shown to users, hidden, or returned only to developers is a product choice. Under the hood, they’re just tokens.
<thought>…</thought>) pause briefly then loop back as the next input — they never leave the LLM. Once </thought> is emitted, the visible answer tokens pause then stream to the Agent Runtime.
Why It Actually Improves Accuracy
The gains aren’t mysterious once you understand autoregressive mechanics.
When a model writes an intermediate reasoning step like “the equation has two terms, so isolating x means subtracting 7 from both sides,” those tokens narrow the probability distribution for what comes next. The correct next token becomes dramatically more likely because the context now points strongly toward it. This is the insight behind chain-of-thought prompting: intermediate tokens aren’t just explanation, they’re load-bearing scaffolding that shapes the probability landscape.
Self-correction also becomes possible. In standard generation, a wrong intermediate step is locked in. It’s in the cache and it’s conditioning everything downstream. In the thinking scratchpad, the model can generate a hypothesis, reason about it across subsequent tokens, and explicitly reject it. Standard generation structurally cannot do this, because there’s no space for dead ends in a final answer.
The thinkingBudget parameter in modern APIs controls exactly how many tokens the model gets for this phase. More tokens, more space for complex reasoning. Simple arithmetic might need 256. A multi-step proof or a nuanced code debugging session might warrant 4,096 or more.
The Hidden Problem: Agentic Systems
Everything described above happens within a single inference call. The benefits are real and measurable. But the picture changes completely when you build an agent loop.
A ReACT loop (Reason, Act, Observe, Reason again) isn’t controlled by the model’s autoregressive generation. It’s controlled by an agent runtime. The loop works like this:
- The runtime calls the model with a message history.
- The model generates a response (possibly with thinking tokens and a tool call).
- The runtime extracts the tool call and executes it.
- The runtime constructs a new message history: prior messages + tool call + tool result.
- The runtime calls the model again, a completely fresh inference pass.
Step 5 is where thinking tokens disappear. The runtime builds the new message history from structured response objects: visible text, function calls, function results. Thinking tokens are treated as ephemeral metadata. They were useful for producing that turn’s output, but they are not part of the conversation transcript.
The model at turn N+1 has no access to the reasoning from turn N. It must re-derive intent, re-interpret intermediate results, and reconstruct the logical chain from scratch from a sparser context. For short chains this is tolerable. For long, multi-step tasks where you check three subsystems, compare results, and act on the worst, the model progressively loses coherence. It may repeat actions, misread its own earlier outputs, or drift from the original plan.
This is not a bug in thinking mode. It’s a fundamental architectural consequence. In pure autoregressive generation, the KV cache is continuous across the full sequence. In an agentic loop, each model call is a fresh pass with a freshly built context window. Thinking tokens from call N are structurally incapable of influencing call N+1 unless the runtime explicitly includes them, which by default it does not.
The Fix: Materialise Reasoning as Tool Arguments
The most practical solution is to force the model to express its reasoning as a required parameter in every tool call. Add a thought field to your tool’s input schema:
{
"name": "getSubsystemStatus",
"parameters": {
"thought": {
"type": "string",
"description": "Your reasoning for why you're checking this subsystem now."
},
"subsystem": { "type": "string" }
},
"required": ["thought", "subsystem"]
}
The model then produces calls like:
getSubsystemStatus(
thought: "Cooling is at 72%, below the 80% threshold. Checking hydraulics next to compare.",
subsystem: "hydraulics"
)
This thought string survives across turns because function call arguments are first-class conversation objects. The runtime serialises them faithfully into every subsequent message history, so the model at turn N+1 can attend to the reasoning from turn N because it’s literally a string in the context window.
You get roughly 90% of the reasoning continuity benefit at around 1% of the token cost: one sentence versus hundreds of tokens of scratchpad.
A Quick Note on What “Thinking Tokens” Actually Means
The industry uses one label for two fundamentally different mechanisms.
The first, and currently dominant, approach is discrete text tokens with API-level separation. This is how Claude’s extended thinking, Gemini 2.5, and almost certainly OpenAI’s o1/o3 series work. The thinking tokens are regular tokens; the infrastructure just separates and routes them differently. The model has been fine-tuned to use the designated region for exploration.
The second, more experimental approach is latent continuous reasoning, as in models like Meta’s Coconut that reason directly in the embedding space, producing hidden-state vectors instead of words. These can’t be read. The hypothesis is that natural language is an inefficient bottleneck for reasoning: a 4096-dimensional float vector carries more information than a single word token. The trade-off is interpretability.
Both approaches improve reasoning quality. The architectural implications for agent systems are the same either way: reasoning that happens inside a single call stays inside that call.
The thinking budget parameter is a dial, not a switch. Simple arithmetic might need 256 tokens of reasoning. A multi-step debugging session with tool calls spanning several subsystems might need 4,096 or more. The right number depends on task complexity, and the failure mode when it’s too low is the same as no thinking at all: the model commits to answers before it has done the work.
The boundary between what thinking tokens can and can’t do is sharp. Within a single inference call, they’re genuinely powerful. Across a call boundary in an agent loop, they’re gone. Knowing which side of that boundary your problem lives on is most of the work.