Prompt Caching in 2026: How to Cut Your LLM API Costs by 90%

Three months ago I was staring at an invoice from Anthropic. $847 for the month. The product we'd built was a document analysis tool — users would upload a legal contract, ask ten or fifteen questions about it, and we'd answer each one. Every question hit the API with the full 40,000-token contract prepended. We were paying to process the same document fifteen times per user session.
The fix took ninety minutes and dropped our bill to $74 the following month.
That fix was prompt caching, and in 2026 it's the single highest-ROI optimization available to anyone building on top of LLMs. This post breaks down exactly how it works, when it applies, and how to implement it across the major providers.
What Prompt Caching Actually Is
When you send a request to an LLM API, the model processes every token in your prompt from scratch — your system prompt, any context you've prepended, the conversation history, the user's message, all of it. For a 40,000-token document that's significant compute on every call.
Prompt caching tells the API: "The first N tokens of this prompt are always the same — process them once, store the result, and reuse it for subsequent requests." On Anthropic's API, cached tokens cost 0.1× the standard input price on reuse (a 90% discount), while the initial cache write costs 1.25× to create the cache entry. Run the numbers and the breakeven lands at roughly 1.3 total uses of the same prefix: caching pays for itself by the second call, and every call after that is dramatically cheaper.
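The breakeven arithmetic is easy to sanity-check. A minimal sketch, assuming the multipliers above (1.25× write, 0.1× read; verify both against the current pricing page):

```python
# Relative input cost of N identical prompts, in units of one uncached call.
# Multipliers assumed from the text: 1.25x cache write, 0.1x cache read.
def relative_cost(uses: int, cached: bool) -> float:
    if not cached:
        return float(uses)
    # First call writes the cache; every later call reads it.
    return 1.25 + 0.1 * (uses - 1)

print(f"{relative_cost(2, cached=False):.2f}")  # 2.00 (no caching)
print(f"{relative_cost(2, cached=True):.2f}")   # 1.35 (caching already ahead)
```

Solving 1.25 + 0.1(k−1) = k gives k ≈ 1.28, which is where the "roughly 1.3 uses" figure comes from.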
The key constraint: caching only applies to a prefix of your prompt. Everything up to the cache boundary must be identical across calls. The user's message and any other variable content come after the cached block.
```
[CACHED PREFIX — same every call]   [DYNAMIC — varies per call]
  System prompt                       User message
  + Long document/context             + Conversation turn
  + Few-shot examples
```
This is why document Q&A, code analysis, and RAG with static knowledge bases are perfect use cases. The expensive context is fixed; only the question changes.
The Problem: Paying to Re-Read the Same Document Fifteen Times
Here's what our original (expensive) code looked like:
```python
import anthropic

client = anthropic.Anthropic()

def answer_question(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Here is a legal contract:\n\n{document}\n\nQuestion: {question}"
            }
        ]
    )
    return response.content[0].text
```
Each call to answer_question sends the full document as input tokens. For a 40,000-token contract, at Claude Opus pricing ($15 per million input tokens), that's $0.60 per question. Fifteen questions per session = $9.00 per user session. At 90 sessions per month, we hit $810.
```
$ python3 estimate_cost.py --tokens 40000 --calls 15 --sessions 90
Monthly estimate: $810.00
Cache savings at 90% discount: $729.00
```
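The estimate is a one-liner to reproduce. A sketch of what a script like estimate_cost.py might compute (the $15-per-million Opus input price is taken from the text; check the current pricing page):

```python
# Back-of-envelope monthly cost of resending the same context on every call.
# Assumed: $15 per million input tokens (Claude Opus), 90% discount on reads.
PRICE_PER_MTOK = 15.00  # USD per million input tokens

def monthly_cost(tokens: int, calls_per_session: int, sessions: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MTOK * calls_per_session * sessions

cost = monthly_cost(40_000, 15, 90)
print(f"Monthly estimate: ${cost:.2f}")                     # Monthly estimate: $810.00
print(f"Cache savings at 90% discount: ${cost * 0.9:.2f}")  # $729.00
```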
The document never changes within a session. We were throwing money away.
How Prompt Caching Works: The Mechanics

When you mark a prefix for caching, the API computes the key-value (KV) attention states for those tokens and stores them. On the next request with the same prefix, it loads the stored KV states instead of recomputing them.
Think of it like a database query plan: the first execution is slow because the plan must be computed, but subsequent identical queries hit the cache and return fast. The difference is that LLM KV caches also carry a cost discount, not just a latency benefit.
Cache Lifetime and Invalidation
On Anthropic's API, cache entries have a 5-minute TTL that resets on each cache hit. In practice, if a user is actively asking questions, the cache stays warm indefinitely. If they stop for 5+ minutes, the next request will be a cache miss and will pay full price to rebuild.
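The reset-on-hit behavior is worth internalizing, because it means a busy session never pays the write cost twice. A toy in-memory model of the semantics (illustrative only; the real cache lives server-side):

```python
import time

class EphemeralCache:
    """Toy model: entries expire `ttl` seconds after their LAST read or write."""

    def __init__(self, ttl_seconds: float = 300.0):  # 300 s = the 5-minute TTL
        self.ttl = ttl_seconds
        self._store: dict = {}

    def put(self, key, value) -> None:
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss: nothing stored for this prefix
        value, last_used = entry
        if time.monotonic() - last_used > self.ttl:
            del self._store[key]  # expired: the next request pays full price
            return None
        self._store[key] = (value, time.monotonic())  # a hit resets the clock
        return value
```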
OpenAI uses automatic prefix caching on supported models (GPT-4o and o-series) — no API changes needed, the discount applies automatically for prefixes longer than 1,024 tokens. The discount is 50% on cache hits (less generous than Anthropic's 90%, but zero implementation cost).
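Because OpenAI's caching is implicit, the way to confirm it is firing is the usage block on each response, which reports cached prompt tokens under prompt_tokens_details (field names as of this writing; check the current API reference). A small helper, assuming the usage object has been serialized to a dict (e.g. via response.usage.model_dump()):

```python
# Fraction of prompt tokens served from OpenAI's prefix cache for one call.
# `usage` is response.usage as a plain dict; field names are assumptions
# based on the current API shape.
def cache_hit_fraction(usage: dict) -> float:
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# A 2,048-token prompt whose first 1,024 tokens hit the cache:
print(cache_hit_fraction({
    "prompt_tokens": 2048,
    "prompt_tokens_details": {"cached_tokens": 1024},
}))  # 0.5
```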
Google's Gemini API uses "context caching" with explicit TTL management, billed as a separate storage cost plus a discounted inference cost. For very long contexts (1M+ tokens), this model can be more cost-effective than Anthropic's approach.
Provider comparison (as of April 2026):
```
┌──────────────────┬────────────────┬───────────────┬────────────────────────┐
│ Provider         │ Cache discount │ Min prefix    │ Implementation         │
├──────────────────┼────────────────┼───────────────┼────────────────────────┤
│ Anthropic Claude │ 90% on reuse   │ 1,024 tokens  │ Explicit cache_control │
│ OpenAI GPT-4o    │ 50% automatic  │ 1,024 tokens  │ None (automatic)       │
│ Google Gemini    │ ~75% + storage │ 32,768 tokens │ Explicit TTL mgmt      │
└──────────────────┴────────────────┴───────────────┴────────────────────────┘
```
Implementation: Anthropic Prompt Caching
Enabling caching on Anthropic's API requires adding a cache_control block to the content you want cached. The cache boundary goes at the end of the prefix you want stored.
Basic Document Q&A with Caching
```python
import anthropic

client = anthropic.Anthropic()

def answer_question_cached(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a legal document analyst. Answer questions accurately based only on the provided contract.",
                "cache_control": {"type": "ephemeral"}  # Cache this system prompt too
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Here is the contract to analyze:\n\n{document}",
                        "cache_control": {"type": "ephemeral"}  # Cache breakpoint
                    },
                    {
                        "type": "text",
                        "text": f"Question: {question}"
                        # No cache_control — this is dynamic
                    }
                ]
            }
        ]
    )

    # Check what the API actually cached
    usage = response.usage
    print(f"Cache write: {usage.cache_creation_input_tokens}")
    print(f"Cache read: {usage.cache_read_input_tokens}")
    print(f"Regular: {usage.input_tokens}")

    return response.content[0].text
```
On the first call, cache_creation_input_tokens will equal your document size. On subsequent calls within 5 minutes, cache_read_input_tokens will be that size and input_tokens will only reflect the new question text.
Terminal Output — First Call vs. Second Call
```
# First call (cache miss — building cache)
$ python3 qa.py --doc contract.txt --q "What is the contract term?"
Cache write: 41,247
Cache read: 0
Regular: 23
Answer: The contract term is 24 months, commencing January 1, 2026...

# Second call (cache hit — 41K tokens at the discounted rate)
$ python3 qa.py --doc contract.txt --q "Who are the parties involved?"
Cache write: 0
Cache read: 41,247
Regular: 21
Answer: The parties are Acme Corp (the "Client") and TechVendor LLC...
```
The 41,247-token document is paid for only once (at the 1.25× cache-write rate); every subsequent question costs just the ~20 question tokens plus the output.
The Gotcha That Cost Us Two Days
We implemented caching, deployed it, and saw... zero cache hits in production. The API was charging full price every time. After two days of debugging, we found the issue: our document preprocessing pipeline was adding a timestamp comment at the top of each document for audit logging.
```python
from datetime import datetime

# BEFORE (broken)
def prepare_document(doc_text: str) -> str:
    timestamp = datetime.utcnow().isoformat()
    return f"<!-- Processed: {timestamp} -->\n{doc_text}"  # Changes every call!

# AFTER (fixed)
def prepare_document(doc_text: str) -> str:
    return doc_text.strip()  # Normalize only — no dynamic content in cached prefix
```
Rule: Everything in your cached prefix must be deterministically identical across calls. No timestamps, no session IDs, no random seeds, no dynamic interpolation. If even one character differs, the API treats it as a cache miss.
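One cheap guardrail against exactly this bug (our own addition, not an API feature): hash the exact prefix you intend to cache and flag whenever it changes between calls that should be identical. A sketch with an in-process store; the helper name is hypothetical:

```python
import hashlib

# doc_id -> sha256 of the last cached prefix we sent for that document.
# In production this might live in Redis or your metrics pipeline.
_last_prefix_hash: dict = {}

def check_prefix_stable(doc_id: str, cached_prefix: str) -> bool:
    """Return False if the prefix drifted since the last call (a cache-buster)."""
    digest = hashlib.sha256(cached_prefix.encode("utf-8")).hexdigest()
    previous = _last_prefix_hash.get(doc_id)
    _last_prefix_hash[doc_id] = digest
    return previous is None or previous == digest
```

Wire the False branch to a log line or alert; a timestamp bug like the one above shows up on the second call instead of on the invoice.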
A second gotcha: the cache is per API key but not per user. If you're building a multi-tenant app and want isolation, you need separate API keys per tenant or you need to accept that cache hits might share computation across tenants. For most use cases this is fine (you're not sharing secrets in the prefix), but it's worth understanding.
Multi-Turn Conversations: Caching the Growing History
For chatbot-style applications, the optimal caching strategy is to put the cache breakpoint at the end of the conversation history, excluding only the latest user turn.
Only the new turn is re-processed:

```python
def chat_with_caching(messages: list[dict], system: str, doc: str) -> str:
    """
    messages: the full conversation history, ending with the latest user turn.
    The latest turn is split off so the cache boundary always sits at the
    end of the history.
    """
    latest_user_turn = messages[-1]
    history = messages[:-1]

    # Build the cached prefix: system prompt + document
    system_block = [
        {
            "type": "text",
            "text": system + f"\n\nDocument:\n{doc}",
            "cache_control": {"type": "ephemeral"}
        }
    ]

    history_messages = list(history)

    # Mark the end of the history as a cache boundary
    if history_messages:
        last_msg = history_messages[-1].copy()
        if isinstance(last_msg["content"], str):
            last_msg["content"] = [
                {
                    "type": "text",
                    "text": last_msg["content"],
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        history_messages[-1] = last_msg

    all_messages = history_messages + [latest_user_turn]

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        system=system_block,
        messages=all_messages
    )
    return response.content[0].text
```

In a 20-turn conversation with a 40,000-token document, without caching you'd pay for 40,000 tokens × 20 turns = 800,000 full-price input tokens. With caching, you pay the 1.25× write once, then each turn costs a 0.1× read of the prefix plus roughly 200 full-price tokens for the new message: about 130,000 token-equivalents in total, an input-cost reduction of roughly 84% for long conversations.

When Prompt Caching Doesn't Help

Not every LLM application benefits from prompt caching. If every prompt is unique (no shared prefix), if your prompts are shorter than the 1,024-token minimum, or if calls arrive further apart than the cache TTL, caching buys you nothing, and the 1.25× write premium can even cost you money.

Companion Code
Working implementations for all patterns in this post are in the companion repo: github.com/amtocbot-droid/amtocbot-examples/tree/main/prompt-caching-2026
The repo includes:
- basic_caching.py — Single-document Q&A with cache hit/miss logging
- multi_turn_caching.py — Conversation history caching pattern
- cache_warming.py — Pre-warming strategy for high-traffic documents
- openai_auto_cache.py — OpenAI GPT-4o automatic prefix caching comparison
- cost_estimator.py — CLI tool to estimate savings for your use case
Conclusion
Prompt caching is not a premature optimization. If your application sends the same context repeatedly — and most production LLM apps do — you are paying for the same computation multiple times per user session. The Anthropic implementation takes about 90 minutes to add, reduces costs by 80-90% for eligible patterns, and reduces latency on cache-hit calls as a bonus.
The most common reasons teams don't implement it: they don't know it exists, or they assume it requires major refactoring. Neither is true. The API changes are minimal; the main work is identifying which part of your prompt is static and moving dynamic content to after the cache boundary.
Start by measuring your current cache hit rate (even if it's zero). Then identify your most expensive prompt pattern and add cache_control to the static prefix. Check the usage stats in the response to confirm the cache is working. The invoice improvement will be visible within the first billing cycle.
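Measuring the hit rate takes only the usage fields shown earlier (cache_creation_input_tokens and cache_read_input_tokens). A sketch that aggregates them across a batch of responses, assuming each usage object has been converted to a dict:

```python
# Share of cacheable input tokens served from cache rather than re-written.
# 0.0 means caching is not working; healthy steady-state workloads trend high.
def cache_hit_rate(usages: list[dict]) -> float:
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = read + written
    return read / total if total else 0.0

# One write followed by two reads of the same 41,247-token prefix:
session = [
    {"cache_creation_input_tokens": 41_247, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 41_247},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 41_247},
]
print(f"{cache_hit_rate(session):.2f}")  # 0.67
```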
Sources
- Anthropic Prompt Caching Documentation — Official API reference with pricing and implementation details
- OpenAI Prompt Caching Guide — Automatic prefix caching for GPT-4o and o-series models
- Google Gemini Context Caching — Explicit TTL-based caching with storage billing model
- Anthropic API Pricing Page — Current token prices and cache write/read multipliers
About the Author
Toc Am
Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.
Published: 2026-04-19 · Written with AI assistance, reviewed by Toc Am.