Prompt caching for cost: cutting Anthropic bills 60%+

Prompt caching is the highest-ROI optimization for heavy Claude API users. Done right, it cuts costs 50-80%. Here's how — with concrete patterns.

Honam Kang · 5 min read

Anthropic's prompt caching is the highest-ROI optimization for heavy Claude API users. Done right, costs drop 50-80% on cached content. Done wrong, you save nothing.

This post is the practical guide — mechanics, patterns, and what to actually configure.

The mechanics, briefly

When you send a prompt to Claude with cache_control markers, Anthropic stores the marked prefix in a cache. The next request with the exact same prefix pays 10% of the regular input price for that prefix instead of 100%; the initial cache write carries a 25% premium.

The caveats:

  1. Exact match required: any character change in the cached prefix = cache miss.
  2. TTL: 5 minutes default; 1-hour available at premium.
  3. Per-request granularity: each cache_control marker creates one cache breakpoint.
  4. Order matters: the cache is the prefix; you can't cache the middle and not the prefix.

What to cache (high-ROI candidates)

Anything stable and reused across multiple calls:

CLAUDE.md / system prompts

Your project's CLAUDE.md. ~120 lines, 2k tokens. Reused in every agent interaction in that project.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": claude_md_content,               # stable prefix
                    "cache_control": {"type": "ephemeral"},  # cache breakpoint
                },
                {
                    "type": "text",
                    "text": user_prompt,                     # per-turn, uncached
                },
            ],
        }
    ],
)

The CLAUDE.md block gets cached. From the second call onward, that prefix costs 10% of the regular input price.

Tool definitions

If you're building agentic loops with custom tools, the tool definitions are static. Cache them.
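A sketch of how this can look, using two hypothetical tools (read_file, run_tests). A cache_control marker on the last tool definition caches the entire tools array as part of the prefix:

```python
# Hypothetical tool definitions for an agent loop. The cache_control
# marker goes on the LAST tool: it caches the whole tools array as
# part of the prompt prefix.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite.",
        "input_schema": {"type": "object", "properties": {}},
        "cache_control": {"type": "ephemeral"},  # cache breakpoint
    },
]
```

Pass this as tools= to messages.create; as long as the definitions don't change between calls, they hit the cache.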

Reference documents

Long reference docs the agent consults — API specs, glossary, internal documentation.

Codebase summaries

If you've extracted a "this is what this codebase does" summary (~5-10k tokens) for the agent's context, cache it.

What NOT to cache

Per-request user input

The user's question changes every turn. Don't put it in the cached prefix.

Recently modified files

If a file is in the cached prefix and you modify it, the cache misses next call. For files the agent edits, cache them only if the edit cadence is much lower than the call cadence.

Random data

Caching helps repeated content. Random per-call payloads (timestamps, IDs) prevent cache hits — exclude them from the cached prefix.

Cache structuring patterns

Pattern 1: layered cache breakpoints

Up to 4 cache_control markers per request. Use them strategically:

[CLAUDE.md global]               <- breakpoint 1 (most stable)
[Project-specific reference]     <- breakpoint 2 (changes monthly)
[Session-specific context]       <- breakpoint 3 (changes per session)
[User's current question]        <- not cached (per-turn)

The first three are cached, most stable first. A change at any layer invalidates the cache only from that breakpoint onward.
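A minimal sketch of the layering (build_system_blocks is a hypothetical helper; each layer ends in its own breakpoint, staying within the 4-marker budget):

```python
def build_system_blocks(global_md: str, project_ref: str, session_ctx: str) -> list:
    """Stable-first layering: one cache breakpoint per layer."""
    return [
        {"type": "text", "text": global_md,
         "cache_control": {"type": "ephemeral"}},  # breakpoint 1: most stable
        {"type": "text", "text": project_ref,
         "cache_control": {"type": "ephemeral"}},  # breakpoint 2: changes monthly
        {"type": "text", "text": session_ctx,
         "cache_control": {"type": "ephemeral"}},  # breakpoint 3: per session
    ]
```

The user's current question goes in the messages array after these blocks, with no marker.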

Pattern 2: cache the most-stable-first

Order matters. Put your most-stable content earliest in the prompt; less-stable content later. The cache is the prefix.

If you put session context before CLAUDE.md, every session change invalidates everything. Put CLAUDE.md first.

Pattern 3: separate one-off from recurring

Don't mix one-off content (this turn's question, this turn's file diff) into the cached prefix. It guarantees a cache miss every turn, so you pay the write premium on every call instead of the discounted read.

Concrete cost example

A typical Claude Code session for refactoring:

  • CLAUDE.md: 2k tokens (cached).
  • Tool definitions: 1k tokens (cached).
  • Codebase summary: 5k tokens (cached, occasionally invalidated).
  • Per-turn context (current diff, current question): 1-2k tokens (not cached).

20 turns in a session.

Without caching

20 × (2k + 1k + 5k + 1.5k) = 20 × 9.5k = 190k input tokens. At $3/Mt for Sonnet: $0.57 per session input.

With caching

First call: 8k tokens written to the cache (25% write premium, ~$0.03) plus 1.5k regular input (~$0.005). Each of the subsequent 19 calls: 1.5k user input at full price + 8k cached at 10% = 2.3k effective tokens, ~$0.007. Total: ~$0.035 + 19 × $0.007 ≈ $0.17 per session input.

Savings: ~70%. Across thousands of sessions/month, this is meaningful.
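The arithmetic can be sanity-checked in a few lines, assuming Sonnet's $3/Mtok base input rate, a 25% cache-write premium, and the 90% cache-read discount:

```python
MTOK = 1_000_000
base = 3.00               # assumed Sonnet input rate, $/Mtok
write_rate = base * 1.25  # cache write: 25% premium
read_rate = base * 0.10   # cache read: 90% discount

prefix, per_turn, turns = 8_000, 1_500, 20

uncached = turns * (prefix + per_turn) * base / MTOK
cached = (prefix * write_rate                        # turn 1: write the prefix
          + turns * per_turn * base                  # every turn: fresh input
          + (turns - 1) * prefix * read_rate) / MTOK  # turns 2-20: read it

print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, "
      f"savings {1 - cached / uncached:.0%}")
# → uncached $0.57, cached $0.17, savings 71%
```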

Anti-patterns

Caching content that changes

Caching [CLAUDE.md][live timestamp][user question] defeats the purpose — the timestamp invalidates every turn. Move dynamic content out of the cached prefix.

Caching too aggressively

You only get 4 breakpoints. Using all 4 for marginal savings (caching a 200-token snippet) wastes a precious slot.

Reserve breakpoints for content >1k tokens.

Caching per-call

Cache writes have a setup cost (a 25% premium over regular input tokens). If your "cache" is only written and never read, you've spent more than an uncached call.

Confirm reuse before adding cache_control.

Forgetting cache TTL

5-minute default. If your sessions are bursty (5 minutes of work, 30 minutes idle, then 5 more minutes), the cache evicts during idle. Either use 1-hour caching (extra fee) or accept the periodic re-cache cost.
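If you do opt into the longer TTL, it's requested per breakpoint. A sketch, assuming the extended cache TTL option is available on your account ("ttl" is the field name in Anthropic's extended-TTL docs):

```python
# Sketch: opting one breakpoint into the 1-hour TTL. Default is the
# 5-minute TTL when no "ttl" is given.
reference_doc = "...long, stable reference text..."

block = {
    "type": "text",
    "text": reference_doc,
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}
```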

How to measure impact

Anthropic returns cache info in API responses:

"usage": {
  "input_tokens": 1500,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 8000,
  "output_tokens": 500
}

Track:

  • cache_creation_input_tokens: tokens you wrote to the cache (25% premium).
  • cache_read_input_tokens: tokens you read from cache (90% discount).

Cache hit rate = cache_read / (cache_read + cache_creation) over the cached prefix. Aim for >70% on stable workflows.
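A small helper for tracking this across a session (a sketch; prefix_hit_rate is a hypothetical name, and it counts only prefix tokens, cache reads vs. cache writes):

```python
def prefix_hit_rate(usages: list) -> float:
    """Share of cached-prefix tokens served from cache: reads vs. writes."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = read + written
    return read / total if total else 0.0
```

For the 20-turn session above (one 8k write, then 19 reads of the same prefix), this gives 0.95.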

Tooling

Several tools help with caching:

Anthropic SDK (Python/TypeScript)

Both expose cache_control markers as parameters. Set them in your message construction; the SDK handles the wire format.

Claude Code

Uses caching automatically. There's nothing to configure; check your usage logs to see cache hit rates.

Custom integrations

If you're building your own agent loop, you set cache_control markers on the content blocks yourself. The Anthropic SDK is the simplest path.

Monitoring

Log cache_read vs cache_creation token counts. Build a dashboard. Track cost per session. Sudden drops in cache_read indicate a prefix change broke caching.
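A sketch of such an accumulator (CacheUsageTracker is a hypothetical name; the rate multipliers are my assumptions matching Sonnet's published pricing, so adjust base_per_mtok per model):

```python
class CacheUsageTracker:
    """Accumulates the usage block of each API response for one session."""

    def __init__(self) -> None:
        self.regular = 0   # input_tokens (full price)
        self.read = 0      # cache_read_input_tokens (10% of base)
        self.created = 0   # cache_creation_input_tokens (25% premium)

    def record(self, usage: dict) -> None:
        self.regular += usage.get("input_tokens", 0)
        self.read += usage.get("cache_read_input_tokens", 0)
        self.created += usage.get("cache_creation_input_tokens", 0)

    def input_cost_usd(self, base_per_mtok: float = 3.00) -> float:
        return (self.regular * base_per_mtok
                + self.read * base_per_mtok * 0.10
                + self.created * base_per_mtok * 1.25) / 1_000_000
```

Feed it each response's usage dict; a sudden jump in created relative to read is the signal that a prefix change broke caching.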

When caching breaks

Three failure modes to watch:

Cache prefix mutated

You changed CLAUDE.md slightly. Now every turn is a cache miss until the cache repopulates.

Diagnostic: high cache_creation_input_tokens after a change. Fix: don't change CLAUDE.md mid-session, or accept the re-cache cost.

Order changed

You moved a section in your prompt. The cache prefix is now different from the previous prefix.

Diagnostic: high cache_creation_input_tokens after a refactor. Fix: keep prompt structure stable.

Provider-side caching changes

Anthropic occasionally changes cache behavior. Rare but real. Watch the changelog.

Cost summary

For a heavy Claude Code user (a few 20-turn sessions per day, ~8k-token cached prefix):

  • Without caching: ~$50-80/month input cost.
  • With caching: ~$15-25/month input cost.

The difference is real. For users running thousands of API calls/day (CI integrations, automated workflows), the savings scale linearly.

Verdict

Prompt caching is the highest-ROI Anthropic API optimization in 2026. The setup is simple (add cache_control markers to stable content); the savings are 50-80%.

What to cache:

  1. CLAUDE.md / system prompts.
  2. Tool definitions.
  3. Reference documents.
  4. Codebase summaries.

What NOT to cache:

  1. Per-turn user input.
  2. Recently modified files.
  3. Anything random.

Anthropic's documentation has the wire-format details. The principles transfer regardless of which SDK you use.

For Claude Code users: caching happens automatically. You benefit without configuring. Monitor your cache hit rates if you're cost-sensitive.

mq-dir doesn't relate directly to caching but pairs with the broader AI workflow — the file manager visualizes work; the cache is server-side optimization.

Open source

mq-dir is fully open source.

MIT licensed, zero telemetry. Read the source, file an issue, send a PR.

★ Star on GitHub →

Frequently asked questions

How much does prompt caching actually save?

For typical agentic workflows with stable context (CLAUDE.md, tool definitions, reference docs): 50-80% on input tokens. The cache hit costs 10% of the regular input price. The break-even is roughly 2-3 reuses; beyond that you're saving.

Ready to try mq-dir?

A native quad-pane file manager built for AI multi-tasking on macOS. Free, MIT licensed, zero telemetry.

v0.1.0-beta.11 · MIT · macOS 14.0+ · github