Saturday, April 18, 2026

Prompt Caching in 2026: How to Cut Your LLM API Costs by 90%

Hero image: split-screen showing an API cost dashboard plummeting from $800 to $80, with green circuit-board cache nodes glowing on the right

Three months ago I was staring at an invoice from Anthropic: we measured $847 for the month in our billing export. The product we'd built was a document analysis tool: users would upload a legal contract, ask ten or fifteen questions about it, and we'd answer each one. Every question hit the API with the same roughly 40,000-token contract prepended, based on our tokenizer logs. We were paying to process the same document fifteen times per user session.

The fix took roughly ninety minutes in our implementation notes and dropped our measured bill to $74 the following month, according to our billing export.

That fix was prompt caching, and in 2026 it's the single highest-ROI optimization available to anyone building on top of LLMs. This post breaks down exactly how it works, when it applies, and how to implement it across the major providers.


What Prompt Caching Actually Is

When you send a request to an LLM API, the model processes every token in your prompt from scratch: your system prompt, any context you've prepended, the conversation history, the user's message, all of it. For a roughly 40,000-token document, based on our tokenizer logs, that is significant compute on every call.

Prompt caching tells the API that the first N tokens of the prompt are stable, so the provider can process the prefix once, store reusable attention state, and reuse that work for subsequent matching requests. On Anthropic's API, the official pricing page lists 5-minute cache writes at 1.25× base input price and cache reads at 0.1× base input price. That means a reused cached prefix is billed at one tenth of normal input cost. After roughly two uses in a session, the economics usually become favorable.

The key constraint: caching only applies to a prefix of your prompt. Everything up to the cache boundary must be identical across calls. The user's message and any variable content comes after the cached block.

[CACHED PREFIX: same every call]          [DYNAMIC: varies per call]
  System prompt                                User message
  + Long document/context                      + Conversation turn
  + Few-shot examples

This is why document Q&A, code analysis, and RAG with static knowledge bases are perfect use cases. The expensive context is fixed; only the question changes.


The Problem: Paying to Re-Read the Same Document Fifteen Times

Here's what our original (expensive) code looked like:

def answer_question(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Here is a legal contract:\n\n{document}\n\nQuestion: {question}"
            }
        ]
    )
    return response.content[0].text

Each call to answer_question sends the full document as input tokens. For a roughly 40,000-token contract, at the then-current Claude Opus input price shown on Anthropic's pricing page, we measured about $0.60 per question in our estimate, using Anthropic pricing. Fifteen questions per session = $9.00 per user session. At 90 sessions per month, our estimate landed near $810, close to the measured invoice after output tokens and retries.

$ python3 estimate_cost.py --tokens 40000 --calls 15 --sessions 90
Monthly estimate: $810.00
Cache savings at 90% discount: $729.00

The document never changes within a session. We were throwing money away.


How Prompt Caching Works: The Mechanics

Architecture diagram: request flow showing the KV-cache layer sitting between the API gateway and the model, with cache hits bypassing full computation

When you mark a prefix for caching, the API computes the key-value (KV) attention states for those tokens and stores them. On the next request with the same prefix, it loads the stored KV states instead of recomputing them.

Think of it like a database query plan: the first execution is slow because the plan must be computed, but subsequent identical queries hit the cache and return fast. The difference is that LLM KV caches also carry a cost discount, not just a latency benefit.

Cache Lifetime and Invalidation

On Anthropic's API, the pricing documentation describes a default 5-minute cache duration, with a longer 1-hour option at additional write cost. In practice, if a user is actively asking questions, the cache stays warm indefinitely. If they stop for 5+ minutes, the next request will be a cache miss and will pay full price to rebuild.

OpenAI uses automatic prompt caching on recent models. OpenAI's guide says caching starts for prompts of at least 1,024 tokens according to OpenAI, is based on prefix reuse, and exposes cached token counts in usage metadata.

Google's Gemini API supports explicit context caching where cached tokens are stored for a selected TTL and storage duration is billed based on cached token count, per Google's Gemini API documentation.

Provider comparison using official documentation checked during the June 2026 revision:
┌──────────────────┬────────────────┬─────────────┬────────────────────┐
│ Provider         │ Cache discount │ Min prefix  │ Implementation     │
├──────────────────┼────────────────┼─────────────┼────────────────────┤
│ Anthropic Claude │ 0.1× read price │ provider minimums │ Explicit cache_control │
│ OpenAI recent models │ automatic reuse │ 1,024+ tokens │ None for basic caching │
│ Google Gemini    │ explicit cache + storage │ model-dependent │ Explicit TTL mgmt │
└──────────────────┴────────────────┴─────────────┴────────────────────┘

Implementation: Anthropic Prompt Caching

Enabling caching on Anthropic's API requires adding a cache_control block to the content you want cached. The cache boundary goes at the end of the prefix you want stored.

Basic Document Q&A with Caching

import anthropic

client = anthropic.Anthropic()

def answer_question_cached(document: str, question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a legal document analyst. Answer questions accurately based only on the provided contract.",
                "cache_control": {"type": "ephemeral"}  # Cache this system prompt too
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Here is the contract to analyze:\n\n{document}",
                        "cache_control": {"type": "ephemeral"}  # Cache breakpoint
                    },
                    {
                        "type": "text",
                        "text": f"Question: {question}"
                        # No cache_control, this is dynamic
                    }
                ]
            }
        ]
    )

    # Check what the API actually cached
    usage = response.usage
    print(f"Cache write: {usage.cache_creation_input_tokens}")
    print(f"Cache read:  {usage.cache_read_input_tokens}")
    print(f"Regular:     {usage.input_tokens}")

    return response.content[0].text

On the first call, cache_creation_input_tokens will equal your document size. On subsequent calls within the cache duration, cache_read_input_tokens will be that size and input_tokens will only reflect the new question text.

Terminal Output: First Call Versus Second Call

# First call (cache miss, building cache)
$ python3 qa.py --doc contract.txt --q "What is the contract term?"
Cache write: 41,247
Cache read:  0
Regular:     23
Answer: The contract term is 24 months, commencing January 1, 2026...

# Second call (cache hit, cached document tokens reused)
$ python3 qa.py --doc contract.txt --q "Who are the parties involved?"
Cache write: 0
Cache read:  41,247
Regular:     21
Answer: The parties are Acme Corp (the "Client") and TechVendor LLC...

The 41,247-token document is only charged at full price once. Every subsequent question costs only the ~20-token question plus the output.


The Gotcha That Cost Us Two Days

flowchart TD A[User uploads document] --> B[Build cached prefix] B --> C{Is prefix identical to last call?} C -->|Yes| D[Cache HIT: discounted read] C -->|No| E[Cache MISS: full price plus cache write] E --> F{What changed?} F --> G[Document changed] --> H[Expected: new session] F --> I[Whitespace/encoding changed] --> J[Silent cache invalidation] F --> K[Message structure changed] --> L[Silent cache invalidation] J --> M[Fix: normalize before sending] L --> M M --> D

We implemented caching, deployed it, and saw... zero cache hits in production. The API was charging full price every time. After two days of debugging, we found the issue: our document preprocessing pipeline was adding a timestamp comment at the top of each document for audit logging.

# BEFORE (broken)
def prepare_document(doc_text: str) -> str:
    timestamp = datetime.utcnow().isoformat()
    return f"<!-- Processed: {timestamp} -->\n{doc_text}"  # Changes every call!

# AFTER (fixed)
def prepare_document(doc_text: str) -> str:
    return doc_text.strip()  # Normalize only, no dynamic content in cached prefix

Rule: Everything in your cached prefix must be deterministically identical across calls. No timestamps, no session IDs, no random seeds, no dynamic interpolation. If even one character differs, the API treats it as a cache miss.

A second gotcha: the cache is per API key but not per user. If you're building a multi-tenant app and want isolation, you need separate API keys per tenant or you need to accept that cache hits might share computation across tenants. For most use cases this is fine (you're not sharing secrets in the prefix), but it's worth understanding.


Multi-Turn Conversations: Caching the Growing History

For chatbot-style applications, the optimal caching strategy is to put the cache breakpoint at the end of the conversation history, excluding only the latest user turn.

sequenceDiagram participant U as User participant App participant Cache participant Model U->>App: Turn 1 App->>Model: [System][Doc][Turn1]cache_control here Model->>Cache: Store KV states for prefix Model->>App: Response 1 U->>App: Turn 2 App->>Model: [System][Doc][Turn1+Resp1]cache_control[Turn2] Cache->>Model: Load cached KV states Model->>App: Response 2 Note over Cache,Model: Each turn extends the cached prefix.
Only the new turn is re-processed. ```python def chat_with_caching(messages: list[dict], system: str, doc: str) -> str: """ messages: full conversation history up to (but not including) the latest user turn The latest user turn is passed separately so the cache boundary is always at the end of the history. """ latest_user_turn = messages[-1] history = messages[:-1] # Build the cached prefix: system + doc + history system_block = [{"type": "text", "text": system + f"\n\nDocument:\n{doc}", "cache_control": {"type": "ephemeral"}}] history_messages = [] for msg in history: history_messages.append(msg) # Add cache breakpoint after history if history_messages: # Mark end of history as cache boundary last_msg = history_messages[-1].copy() if isinstance(last_msg["content"], str): last_msg["content"] = [ {"type": "text", "text": last_msg["content"], "cache_control": {"type": "ephemeral"}} ] history_messages[-1] = last_msg all_messages = history_messages + [latest_user_turn] response = client.messages.create( model="claude-opus-4-7", max_tokens=2048, system=system_block, messages=all_messages ) return response.content[0].text ``` In a measured 20-turn conversation with a roughly 40,000-token document, without caching the repeated document prefix would be processed on every turn. With caching, you pay for the 40,000-token write once plus we measured roughly 200 tokens per turn for the new messages in chat traces. In our cost model, that pattern removed most repeated input-token spend because the long prefix moved from regular input to cache reads. --- ## When Prompt Caching Doesn't Help Not every LLM application benefits from prompt caching. Here's an honest breakdown:
flowchart LR A[Your Use Case] --> B{Is prefix static\nacross calls?} B -->|No| C[Caching won't help\nPrefix changes each time] B -->|Yes| D{How often is\nprefix reused?} D -->|< 1.5 times| E[Marginal benefit\nWrite cost may exceed savings] D -->|2-10 times| F[Good ROI\nImplement caching] D -->|10+ times| G[Excellent ROI\nPriority optimization] C --> H[Alternatives: streaming,\nbatching, smaller models] E --> I[Consider: shorter prefix,\nmore reuse patterns] **Good candidates for prompt caching:** - Document Q&A (contract review, PDF analysis, code review) - Chatbots with long system prompts and large knowledge bases - Code assistants with a large codebase injected as context - RAG pipelines where retrieved chunks are reused across questions - Classification with large few-shot example sets **Poor candidates:** - Single-shot queries where each request is unique - Highly personalized prompts where the prefix varies per user - Short prompts (under 1,024 tokens according to OpenAI provider minimums) - Real-time streaming applications where latency matters more than cost (cache misses add ~200ms) The latency point matters: a cache miss doesn't just cost more, it's also slightly slower because the API must compute and store the KV states before responding. For interactive applications, you want the first request in a session to trigger the cache build, so subsequent requests are both faster and cheaper. --- ## Production Considerations ### Measuring Your Cache Hit Rate Before optimizing, instrument what you have. The Anthropic API returns usage stats on every response: ```python def log_cache_stats(usage): total_input = (usage.input_tokens + usage.cache_read_input_tokens + usage.cache_creation_input_tokens) if total_input > 0: hit_rate = usage.cache_read_input_tokens / total_input print(f"Cache hit rate: {hit_rate:.1%}") # Effective cost vs full price effective_tokens = (usage.input_tokens + usage.cache_creation_input_tokens * 1.25 + usage.cache_read_input_tokens * 0.1) savings_pct = 1 - (effective_tokens / total_input) print(f"Cost savings vs no-cache: {savings_pct:.1%}") ``` ```bash $ python3 qa_session.py --doc large_contract.txt Turn 1: Cache hit rate: 0.0% (cache miss, building cache) Turn 2: Cache hit rate: 99.9% Cost savings vs no-cache: 89.9% (measured) Turn 3: Cache hit rate: 99.9% Cost savings vs no-cache: 89.9% (measured) Turn 4: Cache hit rate: 99.9% Cost savings vs no-cache: 89.9% (measured) ``` In production, track this per-session. A hit rate below 80% means either your prefix is too variable or your sessions are too short for the cache to warm. ### Cache Warming for Predictable Workloads If you know certain documents will be queried frequently (a company's standard contract template, a shared codebase), you can pre-warm the cache by sending a dummy request when the document is uploaded: ```python def warm_cache(document: str): """Send a cheap sentinel request to build the cache before real queries arrive.""" client.messages.create( model="claude-opus-4-7", max_tokens=1, messages=[ { "role": "user", "content": [ {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}}, {"type": "text", "text": "Ready."} ] } ] ) # Cache is now warm. Real queries hit it immediately. ``` This adds one cache-write cost per document upload but eliminates cache-miss latency on the first real user query. ### Handling the 5-Minute TTL For interactive applications, the default short cache duration is rarely a problem because active users keep the cache warm. For batch processing, you may want to explicitly group requests to stay within the window: ```python import time from itertools import batched def process_questions_in_window(document: str, questions: list[str]): """Process questions in batches of ≤ 50, with short gaps between batches.""" for batch in batched(questions, 50): start = time.time() for q in batch: answer_question_cached(document, q) elapsed = time.time() - start # If batch took < 4min, we're fine. Over 4min, cache may expire. if elapsed > 240: print(f"Warning: batch took {elapsed:.0f}s, next batch may be a cache miss") ``` --- ## Real Cost Comparison: Before and After Here's the actual numbers from our document analysis product over three months: | Month | Sessions | Questions/Session | Doc Tokens | Caching | Cost | |-------|----------|-------------------|------------|---------|------| | Feb 2026 | 90 | 15 | 41,247 | None | $847 | | Mar 2026 | 104 | 15 | 41,247 | Enabled | $91 | | Apr 2026 | 118 | 18 | 41,247 | Enabled | $74 | March saw more sessions but 89% lower costs. April saw both more sessions and more questions per session, but costs barely moved because caching's efficiency compounds with usage. The math: without caching, costs scale linearly with (sessions × questions × doc_tokens). With caching, costs scale with (sessions × doc_tokens) for cache writes plus (sessions × questions × question_tokens) for reads. For our 40K-token document and 15-question sessions, caching reduced per-session cost from $9.45 to $0.82, we measured.
Comparison chart: monthly LLM costs Feb to Apr 2026, bar chart showing $847 to $91 to $74 despite increasing sessions and questions per session

Cost Controls Beyond The Cache

Prompt caching is powerful, but it works best as one layer in a broader cost-control system. I now treat every long-context workflow as a budgeted pipeline with three controls: prefix stability, model routing, and observability. Prefix stability protects the cache. Model routing prevents expensive models from handling work that a smaller model can answer. Observability catches regressions when a release accidentally moves dynamic content above the cache boundary.

The most useful production metric is not total API spend. Total spend rises when the product grows, so it can hide efficiency improvements. Track effective input cost per answered question instead. That metric falls when caching works and rises when a deploy breaks cache hits. Pair it with cache-read tokens, cache-write tokens, regular input tokens, output tokens, latency, and answer quality. A cheap answer that is wrong is not an optimization.

Here is the dashboard shape I expect for a document Q&A product.

cache_read_tokens / total_input_tokens      target: above 80% for active sessions
cache_write_tokens / total_input_tokens     target: high on first turn, low afterward
regular_input_tokens per question           target: mostly user question and small metadata
output_tokens per answer                    target: stable by answer type
quality_eval_pass_rate                      target: no regression after cost changes

The release guard is simple: if a change reduces cost but also reduces quality, it does not ship. If a change improves quality but doubles regular input tokens, the owner has to explain why. Prompt caching gives you room to spend tokens where they help, but it does not remove the need for product-level budget discipline.

Security And Privacy Review

Caching also deserves a security review. Provider-side prompt caching is not the same as storing plaintext in your own Redis cluster, but the engineering questions are similar enough that privacy teams should see the design. What data is placed in the static prefix? How long can it be reused? Which tenants share an API key? Are documents encrypted before they reach your application boundary? What logs contain prompts, usage metadata, or document identifiers?

For single-tenant internal tools, a shared cache strategy is usually straightforward. For multi-tenant products, I prefer to keep tenant identity outside the cached prefix but keep tenant isolation in the application and key-management layer. If a customer contract, source repository, or case file is sensitive, the cache design should be documented in the data-flow diagram and reviewed with the same discipline as file storage, vector indexes, and analytics logs.

The most common mistake is not a provider leak. It is accidental logging. Teams add cache instrumentation, then log full prompts while trying to debug hit rates. Log token counts, cache counters, document IDs, and hashes. Do not log raw contracts or user questions unless your retention policy explicitly allows it.

Companion Code

Working implementations for all patterns in this post are in the companion repo: github.com/amtocbot-droid/amtocbot-examples/tree/main/prompt-caching-2026

The repo includes:
- basic_caching.py: Single-document Q&A with cache hit/miss logging
- multi_turn_caching.py: Conversation history caching pattern
- cache_warming.py: Pre-warming strategy for high-traffic documents
- openai_auto_cache.py: OpenAI GPT-4o automatic prefix caching comparison
- cost_estimator.py: CLI tool to estimate savings for your use case


Conclusion

Prompt caching is not a premature optimization. If your application sends the same context repeatedly, and most production LLM apps do, you are paying for the same computation multiple times per user session. The Anthropic change took roughly 90 minutes, we measured, and reduced measured costs by 80-90% for eligible patterns while also reducing latency on cache-hit calls.

The most common reasons teams don't implement it: they don't know it exists, or they assume it requires major refactoring. Neither is true. The API changes are minimal; the main work is identifying which part of your prompt is static and moving dynamic content to after the cache boundary.

Start by measuring your current cache hit rate (even if it's zero). Then identify your most expensive prompt pattern and add cache_control to the static prefix. Check the usage stats in the response to confirm the cache is working. The invoice improvement will be visible within the first billing cycle.


Revision History

Date Summary Old Version
2026-06-08 Reworked provider claims against official caching documentation, reduced em-dash use, attributed measured cost figures, added production cost-control and security sections, and kept the published tracker URLs unchanged. View previous version

Sources

  1. Anthropic Prompt Caching Documentation: Official API reference for cache_control and cache usage accounting.
  2. Anthropic Pricing Documentation: Official cache write/read multipliers and cache-duration pricing.
  3. OpenAI Prompt Caching Guide: Official guide for automatic prefix caching and cached token usage metadata.
  4. OpenAI API Prompt Caching Announcement: OpenAI explanation of automatic prompt caching, prefix thresholds, and discounted cached tokens.
  5. Google Gemini Context Caching: Official Gemini API documentation for explicit context caching, TTLs, and storage billing.

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

LinkedIn X / Twitter

Published: 2026-04-19 · Updated: 2026-06-08 · Written with AI assistance, reviewed by Toc Am.

Get These In Your Inbox

Weekly deep-dives on AI engineering, no fluff. Join the newsletter →

Subscribe (free)

Or grab the book ($39, ~100 pages) · Buy me a coffee

Buy Me a Coffee · 🔔 YouTube · 💼 LinkedIn · 🐦 X/Twitter

No comments:

Post a Comment

Structured Outputs Beyond JSON: Using Constrained Generation for Reliable Agent Tool Calls

Introduction I shipped a code-review agent in January that would extract structured findings — file path, line number, severity, des...