Prompt Caching Explained: What Claude Code's Cache Write Actually Means

Prompt Caching in Claude Code: From Cache Write to Cost-Efficient LLM Engineering in 2026

May 12, 2026 · Analysis Report · AI Engineering

💡 The 'Cache Write' line in Claude Code's status screen is more than a usage counter. This report starts with what that entry actually means at the technical level, then covers how LLM prompt caching works, how much control developers have over it, where it pays off, token efficiency strategies, and the performance optimization trends — and security risks — that define AI engineering practice in 2026.

What Is 'Cache Write' in Claude Code?

Claude Code's /status and usage screens break token counts into four categories per model: Input · Output · Cache Read · Cache Write. The 'cache' here is not a browser cache, a CDN, or a vector store used in RAG. It refers specifically to the KV cache (Key-Value cache) — the intermediate state produced during model inference — persisted on the server side across requests.

During inference, an LLM computes attention over all input tokens, producing Key/Value matrices for each layer. Rather than recomputing these from scratch on every request, the model can persist them to storage — a Cache Write — and retrieve them on subsequent calls that share the same prefix — a Cache Read. Claude Code's Cache Write count is the number of tokens newly committed to the server-side KV cache. Cache Read is the count of tokens served from that existing cache.

🧠 In other words, a Cache Write is not a cost — it is upfront investment that unlocks a 90% discount on subsequent calls. You pay a 25% surcharge on the initial Write and recoup it across repeated hits.

What Gets Cached — and How Much Control Do You Have?

What the Cache Actually Stores

The cache stores the model's internal representations (activations) of the tokens it has processed — it is not a key-value store where you can arbitrarily insert data. What actually gets cached is whatever occupies the leading, stable portion of the prompt — specifically:

System prompt — persona definitions, policies, and role instructions

Tool definitions — JSON schemas, function signatures (often several thousand tokens)

Large documents or codebases injected via RAG

Accumulated conversation context — the static portion of a multi-turn session

The common thread: anything that does not change between requests — the static prefix — is a candidate for caching.

Can You Write to the Cache Directly?

The short answer: no — you cannot write arbitrary data to the cache the way you would with a PUT or SET in a database. However, you can influence what gets cached indirectly by controlling where cache breakpoints are placed in the prompt. Each provider exposes this differently:

Provider Control Mechanism Developer Control
Anthropic Explicit — cache_control: ephemeral Developer draws the cache boundary
OpenAI Fully automatic (prompts ≥ 1,024 tokens) Indirect control via prompt structure
Google Gemini Implicit + cachedContents API Explicit cache objects with configurable TTL

When Cache Write counts rise in Claude Code, it is not because the user issued a caching command. It means Claude Code's internal agent loop automatically marked the system prompt, tool definitions, and file context it loaded as cache candidates.

Caching Policies Compared: Anthropic, OpenAI, and Google (2026)

All three providers have converged on roughly 90% Cache Read discounts, but the differences that matter in practice are the minimum cacheable unit, TTL, and whether a Write surcharge applies.

📊 Maximum Cache Read Discount by Provider

Anthropic
90%
OpenAI
~90%
Google Gemini
90%
Item Anthropic OpenAI Google
Control Explicit breakpoints Fully automatic Implicit + Explicit
Write Surcharge +25% None Charged per storage duration
Minimum Unit 1,024–2,048 tokens 1,024 tokens 32,768 tokens (explicit)
Default TTL 5 min (1-hour option available) 5–10 min 1 hour to several days

Sources: Anthropic Docs (prompt-caching), OpenAI Blog "API Prompt Caching", Google Vertex AI Context Caching documentation

Why Effective Cache Design Is a Core Engineering Skill

🔥 The Context Inflation Problem

Production prompts in 2026 are not single-line queries. Tens of thousands of lines of codebase, hundreds of pages of PDFs, and dozens of tool definitions can enter context simultaneously. Without caching, reprocessing all of that on every request adds latency in the tens of seconds.

💰 The Break-Even Problem

With Anthropic, a Cache Write costs roughly 25% more than a standard input token. That means you need at least two cache hits before the economics tip in your favor. The skill is not enabling caching — it is designing prompts so the cache actually gets reused.

⚠️ The Rigidity of Prefix-Only Matching

Cache matching is exact, starting from the first byte. A single now() timestamp or user-specific token inserted anywhere in the prefix invalidates everything downstream.

When Caching Pays Off: Practical Use Cases

System Prompt Persona · Policy Tool Definitions JSON Schema Static Context RAG · Codebase Dynamic Input User Query Cacheable (Static Prefix) Not Cached

Large-codebase agents — project tree, key source files, and build system documentation placed after the system prompt and cached. This is exactly how Claude Code operates internally.

Multi-tool agents — dozens of function signatures and JSON schemas cached once, eliminating the fixed overhead paid on every turn.

Detailed personas and few-shot examples (5K–50K tokens) — write once, get 90% off on every subsequent call.

Evaluation pipelines — running the same system prompt against hundreds of different inputs guarantees automatic cache hits on every request.

Token Efficiency: Four Strategies That Work Together

Caching is one lever among several. In practice, effective token efficiency combines all four of the following axes.

🧱 ① Prompt Layering

Arrange prompt sections in strict static-to-dynamic order: [System] → [Tools/Docs] → [Persistent Context] → [User Input]. Any dynamic content that appears before the static prefix breaks the cache for everything that follows.

📦 ② Data Density

Represent the same information in fewer tokens by preferring Markdown, CSV, or YAML over verbose JSON. Current-generation models parse structured text well, and the token savings are real.

✂️ ③ Output Token Control

Output tokens cost 3–5× more than input tokens. Keep them in check with max_tokens limits, explicit brevity instructions, and structured output constraints (JSON Schema).

⏰ ④ TTL-Aware Request Design

Anthropic's default TTL is five minutes. For workflows where user interactions are spaced further apart, batch requests together or switch to the one-hour TTL option to avoid paying the Write surcharge repeatedly.

The 2026 Playbook: Compound AI Systems

The most effective engineers today do not build around a single model and a single prompt. They design Compound AI Systems — pipelines that combine multiple models, caching layers, and retrieval mechanisms in a coordinated architecture.

Strategy Core Idea
🛣️ Model Routing Route simple, cache-friendly tasks to smaller models (Haiku/GPT-4o mini); reserve complex reasoning for Sonnet/Opus/o1.
🧠 Semantic Caching Use Redis or pgvector to serve semantically similar queries directly from a response cache, bypassing LLM calls entirely.
📦 Batch API Run non-latency-sensitive workloads as overnight batches at ~50% lower cost.
🏗️ Hybrid Prompt Enforce a hard static/dynamic boundary in every prompt, and track Cache Hit Rate as a first-class engineering KPI.
🧠 The competitive edge in 2026 is not the size of the model you use. It is how intelligently you cache and how dynamically you route requests to the right model for the job.

Cache Pitfalls: Technical Limits and Security Risks

Technical Limits

🔴 Fragile invalidation — a single extra space or a date field inserted anywhere in the prefix invalidates the entire downstream cache.

🔴 TTL backfire — if no cache hit occurs within five minutes of a Write, you have paid the 25% surcharge for nothing, ending up more expensive than not caching at all.

🔴 Granularity dead zones — OpenAI's 128-token caching granularity means prompts just below a threshold (e.g., 1,023 tokens) have a 0% cache hit rate.

Security and Privacy Risks

In the 2025–2026 timeframe, the prompt cache has emerged as a new attack surface. Academic and industry reports identify four primary threats.

⚠️ Timing side-channels (PROMPTPEEK class) — by measuring response latency differences between cache hits and misses, an attacker can infer which sensitive tokens are present in the server's cache (NDSS 2025).

⚠️ Multi-tenant KV cache leakage — in multi-user deployments where KV caches are not physically isolated, portions of one user's prompt can leak at the token level to another user.

⚠️ Semantic cache poisoning — an attacker who injects a malicious response into a semantic cache causes that response to be returned to legitimate users asking similar questions.

⚠️ Quota drain — high-volume requests with minor variations force repeated Cache Writes, intentionally inflating the target's costs via the write surcharge.

Standard Mitigations

Risk Recommended Response
Invalidation Fix static-before-dynamic order; push timestamps to the final prompt block.
5-min TTL Batch requests; switch to the one-hour TTL option when needed.
Side-Channel Add response jitter; exclude sensitive tokens from cached sections.
Cross-Tenant For sensitive workloads, require Dedicated or Private Cache options from the provider.

What That Cache Write Count Really Tells You

The Cache Write count in Claude Code's status screen is not a vanity metric. It is an economic indicator: how much static context the current session has committed to the KV cache, and how many future requests are positioned to receive a 90% input token discount. You cannot PUT arbitrary data into the cache directly, but you can drive cache efficiency through prompt structure — specifically, the ordering and stability of system prompts and tool definitions.

🎯 Four-Layer Design for LLM Cost Engineering in 2026

① Keep static context as long and stable as possible.

② Isolate all dynamic content at the end.

③ Route requests to the right model for the task complexity.

④ Design with cache TTL constraints and security risks in mind.

→ The goal is not to use fewer tokens — it is to accumulate more reusable ones.

References

📚 Anthropic Prompt Caching Docs — docs.anthropic.com/en/docs/build-with-claude/prompt-caching

📚 OpenAI API Prompt Caching — openai.com/index/api-prompt-caching

📚 Google Vertex AI Context Caching — cloud.google.com/vertex-ai/generative-ai/docs/context-caching

📚 NDSS Symposium 2025 — "PROMPTPEEK: Side-Channel Attacks on LLM Prompt Caching"

📚 arXiv 2025 — "Privacy Risks in Multi-tenant KV-Cache Systems"

This content is provided for informational purposes only. Decisions about adopting specific technologies or services should be based on your own environment and requirements.

© 2026 OguTech Notes. All rights reserved.

S
SW Develope
Software development notes

Collecting and distilling software development resources — each post reviewed before publishing.

Written from publicly available data and references. Last updated: June 8, 2026

댓글

이 블로그의 인기 게시물

Cutting Claude Code Token Usage by 75%: What the Caveman Technique Actually Delivers

Claude Code ultracode — What It Is, How to Enable It, and Who Can Use It

Does Open-Source Headroom Cut LLM Costs by 90%? A Fact Check