Prompt caching for cost: cutting Anthropic bills 60%+
Anthropic's prompt caching is the highest-ROI optimization for heavy Claude API users. Done right, costs drop 50-80% on cached content. Done wrong, you save nothing.
This post is the practical guide — mechanics, patterns, and what to actually configure.
The mechanics, briefly
When you send a prompt to Claude with cache_control markers, Anthropic stores the marked prefix in a cache. Next time you send a request with the same exact prefix, you pay 10% of the regular input price for that prefix instead of 100%.
The caveats:
- Exact match required: any character change in the cached prefix = cache miss.
- TTL: 5 minutes default; 1-hour available at premium.
- Per-request granularity: each cache_control marker creates one cache breakpoint.
- Order matters: only a prefix can be cached; you can't cache content in the middle of the prompt without also caching everything before it.
What to cache (high-ROI candidates)
Anything stable and reused across multiple calls:
CLAUDE.md / system prompts
Your project's CLAUDE.md. ~120 lines, 2k tokens. Reused in every agent interaction in that project.
messages = [
    {
        "role": "user",
        "content": [
            {
                # Stable prefix: cached after the first call
                "type": "text",
                "text": claude_md_content,
                "cache_control": {"type": "ephemeral"}
            },
            {
                # Per-turn input: sits after the breakpoint, never cached
                "type": "text",
                "text": user_prompt,
            }
        ]
    }
]
The CLAUDE.md gets cached. Second call: 10% of input cost.
Tool definitions
If you're building agentic loops with custom tools, the tool definitions are static. Cache them.
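A minimal sketch of that, with placeholder tool names and schemas: put the cache_control marker on the last tool definition, and the whole tools array becomes part of the cached prefix.
tools = [
    {
        "name": "read_file",                      # placeholder tool
        "description": "Read a file from the working tree.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",                      # placeholder tool
        "description": "Run the project's test suite.",
        "input_schema": {"type": "object", "properties": {}},
        # Marking the final tool caches the entire tools array.
        "cache_control": {"type": "ephemeral"},
    },
]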
Reference documents
Long reference docs the agent consults — API specs, glossary, internal documentation.
Codebase summaries
If you've extracted a "this is what this codebase does" summary (~5-10k tokens) for the agent's context, cache it.
What NOT to cache
Per-request user input
The user's question changes every turn. Don't put it in the cached prefix.
Recently modified files
If a file is in the cached prefix and you modify it, the cache misses next call. For files the agent edits, cache them only if the edit cadence is much lower than the call cadence.
Random data
Caching helps repeated content. Random per-call payloads (timestamps, IDs) prevent cache hits — exclude them from the cached prefix.
Cache structuring patterns
Pattern 1: layered cache breakpoints
Up to 4 cache_control markers per request. Use them strategically:
[CLAUDE.md global] <- breakpoint 1 (most stable)
[Project-specific reference] <- breakpoint 2 (changes monthly)
[Session-specific context] <- breakpoint 3 (changes per session)
[User's current question] <- not cached (per-turn)
The first three are cached, ordered from most to least stable. A change invalidates the cache only from that breakpoint onward; everything before it still hits.
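In SDK terms, the layering might look like the sketch below (the content variables are placeholders); each marked block is its own breakpoint.
system = [
    # Breakpoint 1: global CLAUDE.md, most stable
    {"type": "text", "text": claude_md_global,
     "cache_control": {"type": "ephemeral"}},
    # Breakpoint 2: project reference, changes monthly
    {"type": "text", "text": project_reference,
     "cache_control": {"type": "ephemeral"}},
    # Breakpoint 3: session context, changes per session
    {"type": "text", "text": session_context,
     "cache_control": {"type": "ephemeral"}},
]

# The per-turn question goes in messages, after the cached prefix.
messages = [{"role": "user", "content": current_question}]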
Pattern 2: cache the most-stable-first
Order matters. Put your most-stable content earliest in the prompt; less-stable content later. The cache is the prefix.
If you put session context before CLAUDE.md, every session change invalidates everything. Put CLAUDE.md first.
Pattern 3: separate one-off from recurring
Don't mix one-off content (this turn's question, this turn's file diff) into the cached prefix. They'll cache-miss every turn anyway and bloat the cache.
Concrete cost example
A typical Claude Code session for refactoring:
- CLAUDE.md: 2k tokens (cached).
- Tool definitions: 1k tokens (cached).
- Codebase summary: 5k tokens (cached, occasionally invalidated).
- Per-turn context (current diff, current question): 1-2k tokens (not cached).
20 turns in a session.
Without caching
20 × (2k + 1k + 5k + 1.5k) = 20 × 9.5k = 190k input tokens. At $3/Mt for Sonnet: $0.57 per session input.
With caching
First call: 8k written to the cache (25% write premium, ~$0.03) plus 1.5k per-turn input at full price. Each of the 19 subsequent calls: 1.5k at full price + 8k read from cache at 10% = 2.3k effective tokens, about $0.007 at $3/Mt. Total: ~$0.03 + $0.005 + 19 × $0.007 ≈ $0.17 per session input.
Savings: ~70%. Across thousands of sessions/month, this is meaningful.
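The same arithmetic as a quick script, assuming Sonnet-class input pricing, a 25% cache-write premium, and 10% cache-read pricing:
# Rough session-cost check; prices are $ per input token.
BASE = 3.00 / 1_000_000
WRITE = BASE * 1.25     # cache-write premium
READ = BASE * 0.10      # cache-read price

cached_prefix, per_turn, turns = 8_000, 1_500, 20

uncached = turns * (cached_prefix + per_turn) * BASE
cached = (cached_prefix * WRITE + per_turn * BASE           # first call
          + (turns - 1) * (per_turn * BASE + cached_prefix * READ))

print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
# uncached $0.57 vs cached $0.17 -> roughly 70% cheaper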
Anti-patterns
Caching content that changes
Caching [CLAUDE.md][live timestamp][user question] defeats the purpose: the timestamp invalidates the cache every turn. Move dynamic content out of the cached prefix.
Caching too aggressively
You only get 4 breakpoints. Using all 4 for marginal savings (caching a 200-token snippet) wastes a precious slot.
Reserve breakpoints for content over ~1k tokens. (Anthropic also enforces a minimum cacheable prompt length, roughly 1,024 tokens on most models, so very short prefixes won't be cached at all.)
Caching per-call
Cache writes have a setup cost (a 25% premium on the first call). If your "cache" is only used once, you've spent more than necessary.
Confirm reuse before adding cache_control.
Forgetting cache TTL
5-minute default, refreshed each time the cached prefix is read. If your sessions are bursty (5 minutes of work, 30 minutes idle, then 5 more minutes), the cache evicts during the idle gap. Either use 1-hour caching (extra fee) or accept the periodic re-cache cost.
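If you go the 1-hour route, the longer TTL is requested per block; as of this writing the shape is roughly the following (the ttl field shipped as a beta, so verify against Anthropic's current docs):
{
  "type": "text",
  "text": "...contents of CLAUDE.md...",
  "cache_control": {"type": "ephemeral", "ttl": "1h"}
}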
How to measure impact
Anthropic returns cache info in API responses:
"usage": {
"input_tokens": 1500,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 8000,
"output_tokens": 500
}
Track:
- cache_creation_input_tokens: tokens you wrote to the cache (25% premium).
- cache_read_input_tokens: tokens you read from the cache (90% discount).
Cache hit rate = cache_read_input_tokens / (cache_read_input_tokens + input_tokens), i.e. the share of input served from cache. Aim for >70% on stable workflows.
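As a quick sanity check, here's that ratio in Python, taking the usage object from the response above as a plain dict:
def cache_hit_rate(usage: dict) -> float:
    # Share of input tokens served from the cache on this request.
    read = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    return read / (read + fresh) if (read + fresh) else 0.0

# For the example response above: 8000 / (8000 + 1500) ≈ 0.84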
Tooling
Several tools help with caching:
Anthropic SDK (Python/TypeScript)
Both expose cache_control markers as parameters. Set them in your message construction; the SDK handles the wire format.
Claude Code
Uses caching automatically. You don't configure; check usage logs to see cache hit rates.
Custom integrations
If you're building your own agent loop, you set the cache_control fields on the content blocks yourself. The Anthropic SDK is the simplest path.
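For orientation, a bare-bones request with a cached system block looks roughly like this (model name and file paths are illustrative; endpoint and headers follow the Messages API docs):
import os
import requests

claude_md_content = open("CLAUDE.md").read()       # stable prefix
user_prompt = "Summarize the open TODOs."          # per-turn input

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "claude-sonnet-4-20250514",       # illustrative model id
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": claude_md_content,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": user_prompt}],
    },
)
print(resp.json()["usage"])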
Monitoring
Log cache_read vs cache_creation token counts. Build a dashboard. Track cost per session. Sudden drops in cache_read indicate a prefix change broke caching.
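A minimal per-request hook for that logging, using the usage field names shown earlier (the logger setup is illustrative):
import logging

log = logging.getLogger("cache-metrics")

def record_usage(session_id: str, usage: dict) -> None:
    # One log line per API call; aggregate these into your dashboard.
    created = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    log.info("session=%s cache_read=%d cache_creation=%d uncached=%d",
             session_id, read, created, fresh)
    # A sustained spike in cache_creation (or drop in cache_read) means
    # the cached prefix changed and calls are re-writing the cache.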
When caching breaks
Three failure modes to watch:
Cache prefix mutated
You changed CLAUDE.md slightly. Now every turn is a cache miss until the cache repopulates.
Diagnostic: high cache_creation_input_tokens after a change.
Fix: don't change CLAUDE.md mid-session, or accept the re-cache cost.
Order changed
You moved a section in your prompt. The cache prefix is now different from the previous prefix.
Diagnostic: high cache_creation_input_tokens after a refactor.
Fix: keep prompt structure stable.
Provider-side caching changes
Anthropic occasionally changes cache behavior. Rare but real. Watch the changelog.
Cost summary
For a heavy Claude Code user (20 sessions × 20 turns/day, ~5k tokens cached prefix):
- Without caching: ~$50-80/month input cost.
- With caching: ~$10-20/month input cost.
The difference is real. For users running thousands of API calls/day (CI integrations, automated workflows), the savings scale linearly.
Verdict
Prompt caching is the highest-ROI Anthropic API optimization in 2026. The setup is simple (add cache_control markers to stable content); the savings are 50-80%.
What to cache:
- CLAUDE.md / system prompts.
- Tool definitions.
- Reference documents.
- Codebase summaries.
What NOT to cache:
- Per-turn user input.
- Recently modified files.
- Anything random.
Anthropic's documentation has the wire-format details. The principles transfer regardless of which SDK you use.
For Claude Code users: caching happens automatically. You benefit without configuring. Monitor your cache hit rates if you're cost-sensitive.
mq-dir doesn't relate directly to caching but pairs with the broader AI workflow — the file manager visualizes work; the cache is server-side optimization.
mq-dir is fully open source.
MIT licensed, zero telemetry. Read the source, file an issue, send a PR.