
Why Your Claude Code Limit Runs Out in 20 Minutes (And the Fix Nobody Talks About)

Your tool output isn't stored once — it's re-sent as input tokens on every API call. Here's the specific mechanism behind the fastest context burns, traced from source code.

Claude Code · Context Mode · Token Optimization · MCP · Developer Tools

Anthropic's Lydia Hallie confirmed it this week: limits are tighter, 1M-context sessions burn faster, and "most of the fastest burn came down to a few token-heavy patterns."

She's right. But there's a specific mechanism behind those "token-heavy patterns" that I can prove from the source code. I've been staring at this problem for months because I maintain context-mode, an MCP plugin used by 57,800+ developers across 12 platforms.

How Claude Code Manages Your Conversation

Claude Code maintains an array called mutableMessages that holds your entire conversation history. Every user message, every assistant response, every tool result. This array is sent to the Anthropic API on every single turn.

When you ask Claude to run a command, the tool output gets pushed into this array. That output stays there until compaction fires.

Here's the part most people miss: that tool output isn't just stored once. It's re-sent as input tokens on every subsequent API call. The API is stateless. The client sends the full conversation history every time.
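To make that concrete, here's a minimal sketch of the pattern using the public @anthropic-ai/sdk. This illustrates the mechanism, not Claude Code's actual code; the model id and file name are placeholders:

```typescript
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const history: Anthropic.Messages.MessageParam[] = [];

// A 59KB tool result enters the history once (in a real run this
// follows a matching tool_use block from the assistant)...
const hugeJson = readFileSync("issues.json", "utf8"); // stand-in for gh output
history.push({
  role: "user",
  content: [{ type: "tool_result", tool_use_id: "toolu_01", content: hugeJson }],
});

async function nextTurn(userText: string) {
  history.push({ role: "user", content: userText });
  // ...but the ENTIRE array goes over the wire on every call, so that
  // one tool result is re-billed as input tokens on every later turn.
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 1024,
    messages: history,
  });
  history.push({ role: "assistant", content: response.content });
  return response;
}
```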

The Compounding Cost

Let's say you run gh issue list and it returns 59KB of JSON. At roughly 4 bytes per token, that's about 15,000 tokens.

On the next turn, those 15,000 tokens are sent again as input. And the turn after that. And the turn after that. If you have 50 turns before compaction, that single tool call cost you 15,000 x 50 = 750,000 input tokens.

Now multiply that by 20 tool calls in a session, each averaging 30KB of output. That's 600KB of tool output living in mutableMessages, re-sent 50 times: 30MB of re-sent text, roughly 7.5 million input tokens, from tool output alone.

This is where your limit goes.

Source Code Evidence

I traced this through Claude Code's implementation:

  • The conversation array is passed to API calls at print.ts:L2965 and L3857
  • Tool results are included as FunctionCallOutput items in for_prompt()
  • There is a truncate_function_output_payload() function that shortens large outputs, but it never removes them entirely
  • They stay in the conversation until the compact service runs

The 429 rate limit is enforced server-side: ccrClient.ts:L623-627 reads the Retry-After header the server sends back. SerialBatchEventUploader.ts:L22-23 confirms that multiple sessions share the same rate-limit pool, so parallel sessions burn the same quota.
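For completeness, here's roughly what honoring that header looks like. This is a generic sketch, not the actual ccrClient.ts code:

```typescript
// Retry a request, waiting as long as the server's Retry-After header
// demands on each 429. Because the limit is enforced server-side,
// waiting is the only correct response.
async function fetchWithRetry(url: string, init?: RequestInit): Promise<Response> {
  for (let attempt = 0; attempt < 5; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    const retryAfterSeconds = Number(res.headers.get("retry-after") ?? "1");
    await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
  }
  throw new Error("rate limited: retries exhausted");
}
```

Since the pool is shared, retrying from a second session in parallel just makes both sessions wait longer.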

Lydia's Tips, Quantified

Lydia recommended CLAUDE_CODE_AUTO_COMPACT_WINDOW=200000 to cap your context window. This helps because compaction fires earlier, clearing stale tool output. But here's the catch: if tool output has already consumed 150K of that 200K-token window, you have 50K tokens left for actual reasoning. The tool output ate your thinking space.

Her tip to start fresh sessions instead of resuming after an hour makes sense for a related reason: Anthropic's prompt cache has a 5-minute TTL. After an hour, the cache is cold, so resuming means re-sending the entire mutableMessages array without a single cache hit. Every token is billed at full price.
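For reference, prompt caching is opt-in per content block in the public API. A hedged sketch of marking a stable prefix as cacheable — the cache_control field is the real API; the function and model id are illustrative:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function cachedTurn(
  systemPrompt: string,
  history: Anthropic.Messages.MessageParam[],
) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // placeholder model id
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        // Cached prefix, ~5 minute TTL. Resume after an hour and this
        // whole prefix is billed at the full input price again.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: history, // cache hits require an identical prefix
  });
}
```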

What If Tool Output Never Entered the Conversation?

This is the problem I set out to solve with context-mode.

context-mode is an MCP server that intercepts tool calls before they execute. The PreToolUse hook (hooks/core/routing.mjs:L142-263) classifies every tool call. If it's a command that produces large output — gh, docker, kubectl, npm test, git log, Playwright snapshots, file reads, web fetches — it redirects execution to a sandboxed subprocess.
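Claude Code's hook interface is what makes this interception possible: a PreToolUse hook receives the pending tool call as JSON on stdin and can deny it with instructions for the model. Here's a heavily simplified sketch of that shape — the real routing.mjs redirects execution rather than just denying, and the pattern list here is invented for illustration:

```typescript
#!/usr/bin/env node
// Simplified sketch of a PreToolUse hook (NOT the actual routing.mjs).
import { stdin, stdout } from "node:process";

// Hypothetical classifier: commands known to produce large output.
const HEAVY = /\b(gh|docker|kubectl|npm test|git log|curl|wget)\b/;

let raw = "";
for await (const chunk of stdin) raw += chunk;
const { tool_name, tool_input } = JSON.parse(raw);

const command: string = tool_input?.command ?? "";
if (tool_name === "Bash" && HEAVY.test(command)) {
  // Deny the call and tell the model what to do instead.
  stdout.write(JSON.stringify({
    hookSpecificOutput: {
      hookEventName: "PreToolUse",
      permissionDecision: "deny",
      permissionDecisionReason:
        "Large-output command: run it in the sandbox and print a summary.",
    },
  }));
}
// Empty output / exit 0 lets the tool call through unchanged.
```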

The subprocess runs the command. The raw output stays in the subprocess. Only your printed summary (stdout) enters the conversation. The raw data gets indexed into an FTS5 full-text search database with BM25 ranking. You can search it anytime, but it never touches mutableMessages.
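The FTS5 part is standard SQLite. A minimal sketch of the indexing and search side, assuming the better-sqlite3 package — table and column names are invented for illustration:

```typescript
import Database from "better-sqlite3";

const db = new Database(":memory:"); // ephemeral, per-process

db.exec(`CREATE VIRTUAL TABLE content USING fts5(tool, body)`);

export function index(tool: string, body: string): void {
  db.prepare("INSERT INTO content (tool, body) VALUES (?, ?)").run(tool, body);
}

export function search(query: string, limit = 5) {
  // FTS5's bm25() is lower-is-better, so ascending ORDER BY ranks best first.
  return db
    .prepare(
      `SELECT tool, snippet(content, 1, '[', ']', '…', 12) AS hit
         FROM content WHERE content MATCH ?
        ORDER BY bm25(content) LIMIT ?`,
    )
    .all(query, limit);
}
```

The conversation only ever sees the short snippets returned by search(), never the indexed body.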

Measured Results

These are real benchmarks, not estimates:

| Scenario | Raw Output | In Context | Reduction |
|---|---|---|---|
| GitHub issues (gh issue list) | 58.9 KB | 1,139 bytes | 98% |
| Nginx access log analysis | 45.1 KB | 155 bytes | 99.7% |
| Playwright page snapshot | 56 KB | 430 bytes | 99.2% |
| 20 GitHub issues with comments | 58.9 KB | 1.1 KB | 98% |

The Compounding Effect in Practice

Without context-mode: 20 tool calls x 30KB average = 600KB in mutableMessages. Over 50 turns, that's 30MB of re-sent input, roughly 7.5 million tokens.

With context-mode: 20 tool calls x 1KB average = 20KB in mutableMessages. Over 50 turns, that's 1MB of re-sent input, roughly 250,000 tokens.

Same work. Same results. 30x less token consumption. Your limit lasts 30x longer.
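The arithmetic behind those numbers, at the same rough 4 bytes per token:

```typescript
const BYTES_PER_TOKEN = 4; // rough average for JSON-heavy output
const TURNS = 50;
const TOOL_CALLS = 20;

function resentInputTokens(avgOutputBytes: number): number {
  const residentBytes = TOOL_CALLS * avgOutputBytes; // lives in mutableMessages
  return (residentBytes / BYTES_PER_TOKEN) * TURNS;  // re-sent on every turn
}

console.log(resentInputTokens(30_000)); // without context-mode: 7,500,000
console.log(resentInputTokens(1_000));  // with context-mode:      250,000
// 7,500,000 / 250,000 = 30x
```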

How the Interception Works

The routing engine classifies tool calls into categories:

  • Bash commands containing curl, wget, or inline HTTP calls are blocked entirely and redirected to sandbox execution
  • WebFetch calls are denied and replaced with ctx_fetch_and_index, which converts HTML to markdown, chunks it, and indexes it into FTS5 (sketched after this list) — the raw HTML (often 100KB+) never enters context
  • Large output commands are intercepted by the PostToolUse hook (hooks/posttooluse.mjs:L38-49), which captures 13 event categories into a persistent SessionDB
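To make the WebFetch replacement concrete, here's a sketch of the fetch-convert-chunk-index flow. The flow is from the description above; the code itself is illustrative, reusing the index() helper from the FTS5 sketch and the turndown package for HTML-to-markdown:

```typescript
import TurndownService from "turndown";
import { index } from "./content-store"; // the FTS5 sketch above

const turndown = new TurndownService();

async function fetchAndIndex(url: string): Promise<string> {
  const html = await (await fetch(url)).text(); // often 100KB+
  const markdown = turndown.turndown(html);     // strips markup
  // Chunk and index; none of this enters the conversation.
  for (let i = 0; i < markdown.length; i += 2_000) {
    index(url, markdown.slice(i, i + 2_000));
  }
  // Only this one-line receipt is returned to the model.
  return `Indexed ${markdown.length} chars from ${url}; search to read.`;
}
```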

When compaction fires, the PreCompact hook builds a resume snapshot from these events. The model recovers its working state without re-dumping raw data into the conversation.

The session continuity system uses a two-database architecture. A persistent SessionDB per project stores events in real time. An ephemeral FTS5 ContentStore per process holds indexed tool output. When compaction fires, only a compact summary table plus search queries are injected; the model searches for details on demand via the existing search tool. Raw session events never enter context.

What This Means for Lydia's Advice

Every tip she gave still applies. Use Sonnet over Opus. Lower the effort level when you don't need deep reasoning. Cap your context window. Start fresh sessions.

But combine those tips with keeping tool output out of your conversation, and the effect multiplies. A capped 200K-token window where tool output takes 20K instead of 150K leaves 180K for reasoning instead of 50K. Compaction fires less often. Cache hits are more frequent because your conversation is smaller and more stable.

The limit isn't just about how many tokens you use. It's about how many tokens you re-use, turn after turn, without realizing it.


57,800+ developers. 47,600 npm installs. 10,200 marketplace installs. 6,400 GitHub stars. 12 platforms: Claude Code, Gemini CLI, VS Code Copilot, Cursor, OpenCode, Codex CLI, KiloCode, Kiro, Antigravity, Zed, and more.

Open source under Elastic License 2.0: github.com/mksglu/context-mode