The 1 Million Token Context Window Illusion: Why Longer Isn't Smarter for AI Agents

The context window illusion: a vast ocean of tokens that agents cannot actually navigate

Introduction

Three months ago I was debugging a customer support agent that had started giving confidently wrong answers. Not hallucinating — worse. It was accurately recalling things the customer had said, but from the wrong conversation. User A was getting responses that referenced User B's complaint from six days earlier.

We'd built the system on Gemini 2.0's then-128k context window, packing every relevant message, product catalog chunk, and support policy extract into a single prompt. The logic was sound: bigger context means better recall, right? We hadn't hit the token limit. The model wasn't forgetting anything. So what went wrong?

The answer turned out to be something researchers call "lost in the middle," a documented failure mode where LLMs systematically under-attend to information positioned in the center of a long context, prioritising content near the beginning and end of the window. We weren't running out of space. We were running out of attention.

That incident reframed how I think about context windows. Not as storage, but as working memory. And working memory has constraints that raw size doesn't capture.

With Gemini 2.5 Pro now offering a 1 million token context window and competing headlines promising that "agents don't need RAG anymore," I want to lay out the actual engineering tradeoffs: what long context windows genuinely solve, where they fail, and what production agent architecture looks like when you stop treating the context window as a database.


The Problem: Context Windows Aren't Memory

The conceptual confusion starts with naming. We call it a "context window". The word window implies a view into a larger space, a moving frame. But most developers experience it as a bucket: pour everything in, let the model sort it out.

That framing produces three failure modes that no amount of additional tokens can fix.

Failure mode 1: Attention dilution

In 2023, Liu et al. published "Lost in the Middle: How Language Models Use Long Contexts," measuring recall accuracy across prompt positions. The finding was stark: GPT-3.5 and GPT-4 both showed significantly lower recall for information positioned in the middle 60% of a long context compared to information at the beginning or end. The curve wasn't gradual; it was a valley. Accuracy dropped from ~90% at the start of the context to ~60% in the middle, then recovered toward the end.

Follow-up work from Anthropic and Google has refined this picture. Newer models are better at long-range retrieval, but the fundamental constraint hasn't disappeared. Attention weights are softmax-normalised, so each query spreads a fixed budget of attention across every token in the window; as the context grows, any individual fact gets a thinner slice of that attention, while the quadratic cost of attention makes the long prompt slower and more expensive to process. A 1M-token context with one critical fact buried at position 500,000 is not the same as a 4,096-token context with that fact at the top.

Claude 3.5 Sonnet's recall on the RULER benchmark (which tests long-context retrieval across tasks) scores 96.5% at 32k tokens and drops to around 87% at 128k. That 9.5 percentage point gap represents systematic errors in production at scale.
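
You don't have to take the benchmark numbers on faith; a crude position-sensitivity probe against your own model and data is about twenty lines. The sketch below is a simplified needle-in-a-haystack check, not the RULER harness: the filler file, the needle, and the model name are placeholders to swap for your own.

# position_probe.py: crude "lost in the middle" probe (sketch, not RULER)
from anthropic import Anthropic

client = Anthropic()

NEEDLE = "The access code for the staging cluster is 7421."
QUESTION = "What is the access code for the staging cluster? Reply with the number only."

def recalled_at(position_fraction: float, approx_tokens: int = 30_000) -> bool:
    """Plant the needle at a relative position inside filler text and test recall."""
    filler = open("filler.txt").read()[: approx_tokens * 4]  # ~4 chars per token
    cut = int(len(filler) * position_fraction)
    haystack = filler[:cut] + "\n" + NEEDLE + "\n" + filler[cut:]
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=20,
        messages=[{"role": "user", "content": f"{haystack}\n\n{QUESTION}"}],
    )
    return "7421" in response.content[0].text

for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"needle at {frac:.0%} of context: recalled={recalled_at(frac)}")

Run it a few times per position; a single pass is noisy, but the pattern of middle positions failing while the ends succeed is usually visible even at this scale.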

Failure mode 2: Cost and latency

At current pricing (April 2026), Gemini 2.5 Pro charges $1.25/million input tokens for prompts up to 200k and $2.50/million for longer prompts. A single agent request stuffing 800k tokens into the prompt costs about $2.00 in input tokens alone (800k × $2.50/million), before output. At even modest volume (100 requests/day), that's roughly $200/day in context costs, or $6,000/month, for a single agent that might take 40-60 seconds to respond due to the prefill latency of processing 800k tokens.

Compare that to a hybrid RAG approach: a dense retrieval step costs ~$0.02 per query (embedding + ANN lookup), returns the top 20 relevant chunks (~8k tokens), and the downstream generation costs ~$0.03. The total cost per request is ~$0.05 vs. $2.00+, roughly a 40x cost differential for the same logical operation.
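
To make the arithmetic explicit, here is the same comparison as a back-of-the-envelope script. The prices are the April 2026 figures quoted above and the RAG-side costs are the rough estimates from this section, so treat the output as an illustration rather than a quote:

# cost_compare.py: back-of-the-envelope cost per request, long context vs. RAG
LONG_CONTEXT_TOKENS = 800_000
PRICE_PER_M_ABOVE_200K = 2.50       # $/1M input tokens for prompts over 200k

RAG_RETRIEVAL_COST = 0.02           # embedding + ANN lookup
RAG_GENERATION_COST = 0.03          # ~8k-token prompt plus output

long_context_cost = LONG_CONTEXT_TOKENS / 1_000_000 * PRICE_PER_M_ABOVE_200K
rag_cost = RAG_RETRIEVAL_COST + RAG_GENERATION_COST

print(f"long context: ${long_context_cost:.2f}/request")    # $2.00
print(f"hybrid RAG:   ${rag_cost:.2f}/request")              # $0.05
print(f"differential: {long_context_cost / rag_cost:.0f}x")  # 40x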

Failure mode 3: No persistence

Every context window is ephemeral. When a session ends, the context is gone. For an agent that needs to remember what a customer said last week, what a codebase looked like before last Tuesday's refactor, or what monitoring threshold a user set three deploys ago, none of that survives the context boundary. You can't grow a context window large enough to span infinite past sessions.

This is the root cause of the bug I opened with. We'd built a stateless system and dressed it up as a persistent one. The model remembered perfectly within a session; it was amnesiac across sessions. That's not a context size problem. It's an architectural one.


Architecture comparison: naive long-context vs. tiered memory architecture for production AI agents

How It Works: Tiered Memory Architecture

Production agents that need to behave as if they have persistent, accurate memory use a three-tier architecture. Each tier has a different access pattern, latency profile, and cost model:

Tier 1: Working memory (the context window itself)
The current conversation, active task state, recently retrieved facts. This is what you put in the prompt. Size: 4k-32k tokens. Latency: 0ms (already loaded). Cost: prompt tokens.

Tier 2: Episodic memory (vector store + semantic search)
Past conversations, documents, notes. Retrieved on-demand by semantic similarity. Size: unlimited. Latency: 50-200ms. Cost: embedding + ANN query (~$0.02/call).

Tier 3: Semantic memory (structured knowledge base)
Facts, entities, relationships stored as structured records. Retrieved by exact match or structured query. Size: unlimited. Latency: 1-10ms. Cost: database query (~$0.001/call).

The key insight is that these tiers serve different query types. "What did the user say in the last message?" is a Tier 1 question. "What's the user's history with billing complaints?" is a Tier 2 question. "What's the user's account tier and contract renewal date?" is a Tier 3 question. Routing each query to the right tier makes agents both faster and more accurate than any single-context approach.

Here's how data flows through this architecture:

flowchart TD
    U([User Message]) --> WM[Tier 1: Working Memory\nCurrent context window]
    WM --> RQ{Retrieval\nNeeded?}
    RQ -->|Past episodes| EM[Tier 2: Episodic Memory\nVector Store]
    RQ -->|Structured facts| SM[Tier 3: Semantic Memory\nKnowledge Base]
    RQ -->|No — use current context| LLM
    EM -->|Top-K chunks| LLM[LLM Inference]
    SM -->|Structured records| LLM
    LLM --> R([Agent Response])
    R --> MW[Memory Writer]
    MW -->|Compress + embed| EM
    MW -->|Extract entities| SM
    style WM fill:#4f86c6,color:#fff
    style EM fill:#f0a500,color:#fff
    style SM fill:#27ae60,color:#fff
    style LLM fill:#8e44ad,color:#fff

Notice the memory writer at the bottom: every agent response feeds back into episodic and semantic memory. The agent isn't just reading from memory; it's continuously writing to it. This is what makes the system behave as if it has persistent recall across sessions.


Implementation Guide

Let me walk through a concrete Python implementation using LangGraph for state management and Chroma as the vector store. The full working code is in the companion repo: github.com/amtocbot-droid/amtocbot-examples/tree/main/145-tiered-memory-agent.

Setting up the memory tiers

# memory_tiers.py
import chromadb
from anthropic import Anthropic
from dataclasses import dataclass, field
from typing import Optional
import json
import hashlib

client = Anthropic()
chroma = chromadb.PersistentClient(path="./agent_memory")

@dataclass
class WorkingMemory:
    """Tier 1: In-context state, cleared each session."""
    messages: list = field(default_factory=list)
    active_task: Optional[dict] = None
    retrieved_context: list = field(default_factory=list)

    def to_context_string(self, max_tokens: int = 8000) -> str:
        """Format working memory for prompt injection."""
        parts = []
        if self.active_task:
            parts.append(f"Current task: {json.dumps(self.active_task)}")
        if self.retrieved_context:
            parts.append("Retrieved context:")
            for chunk in self.retrieved_context[:5]:  # cap at 5 chunks
                parts.append(f"  - {chunk['content'][:500]}")
        # honour the rough token budget (~4 characters per token)
        return "\n".join(parts)[: max_tokens * 4]


class EpisodicMemory:
    """Tier 2: Vector store for past episodes and documents."""

    def __init__(self, collection_name: str):
        self.collection = chroma.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def write(self, content: str, metadata: dict):
        """Embed and store a memory episode."""
        doc_id = hashlib.sha256(content.encode()).hexdigest()[:16]
        # Use a dedicated embedding model (e.g. Voyage or OpenAI embeddings)
        # in production; Chroma's default embedder is fine for this demo
        self.collection.add(
            documents=[content],
            metadatas=[metadata],
            ids=[doc_id]
        )

    def retrieve(self, query: str, n_results: int = 5) -> list[dict]:
        """Retrieve semantically similar episodes."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results
        )
        if not results["documents"][0]:
            return []
        return [
            {"content": doc, "metadata": meta, "distance": dist}
            for doc, meta, dist in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0]
            )
        ]


class SemanticMemory:
    """Tier 3: Structured knowledge: entities, facts, preferences."""

    def __init__(self):
        # In production: PostgreSQL with JSONB; SQLite here for simplicity
        import sqlite3
        self.conn = sqlite3.connect("./agent_memory/semantic.db")
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS entities (
                entity_id TEXT PRIMARY KEY,
                entity_type TEXT,
                attributes TEXT,
                updated_at TEXT
            )
        """)
        self.conn.commit()

    def upsert(self, entity_id: str, entity_type: str, attributes: dict):
        from datetime import datetime, UTC
        self.conn.execute("""
            INSERT INTO entities (entity_id, entity_type, attributes, updated_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(entity_id) DO UPDATE SET
              attributes = json_patch(attributes, excluded.attributes),
              updated_at = excluded.updated_at
        """, (entity_id, entity_type, json.dumps(attributes),
              datetime.now(UTC).isoformat()))
        self.conn.commit()

    def get(self, entity_id: str) -> Optional[dict]:
        cursor = self.conn.execute(
            "SELECT attributes FROM entities WHERE entity_id = ?", (entity_id,)
        )
        row = cursor.fetchone()
        return json.loads(row[0]) if row else None

Routing queries to the right tier

The routing logic is the critical piece. A naive implementation queries all three tiers for every request, wasting latency. A smarter implementation classifies the query before retrieval:

# memory_router.py
from anthropic import Anthropic
import json

from memory_tiers import WorkingMemory

client = Anthropic()

ROUTER_PROMPT = """You are a memory routing classifier. Given a user query, decide which memory tiers to query.

Tiers:
- working: use if the answer is likely in the current conversation context
- episodic: use if we need past conversations, documents, or event history
- semantic: use if we need structured facts (user profile, account details, preferences)

Respond with JSON only. Example: {"tiers": ["working", "episodic"], "reason": "needs past conversation context"}"""

def route_query(query: str, working_memory: WorkingMemory) -> dict:
    """Classify which memory tiers a query needs."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        system=ROUTER_PROMPT,
        messages=[{"role": "user", "content": f"Query: {query}\n\nCurrent context summary: {working_memory.to_context_string()[:500]}"}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"tiers": ["working", "episodic"], "reason": "parse error, defaulting"}

The routing call costs ~50 tokens on Haiku ($0.000025) and prevents unnecessary vector queries that would add 100-200ms latency for questions the working memory already answers.

LangGraph state integration

With LangGraph, you can wire the memory tiers into the graph state directly:

# agent_graph.py
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

from memory_tiers import WorkingMemory, EpisodicMemory, SemanticMemory, client
from memory_router import route_query

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    working_memory: WorkingMemory
    retrieved_docs: list[dict]
    final_response: str

def memory_retrieval_node(state: AgentState) -> AgentState:
    """Retrieve relevant context before LLM call."""
    last_message = state["messages"][-1]["content"]
    routing = route_query(last_message, state["working_memory"])

    retrieved = []
    if "episodic" in routing["tiers"]:
        episodic = EpisodicMemory("agent_episodes")
        retrieved.extend(episodic.retrieve(last_message, n_results=3))

    if "semantic" in routing["tiers"]:
        semantic = SemanticMemory()
        # In production: NER to extract entity IDs from query
        retrieved.extend([{"content": str(r), "source": "semantic"}
                         for r in [semantic.get("user_profile")] if r])

    # Return only the updated key; returning the full state would re-append
    # messages through the operator.add reducer
    return {"retrieved_docs": retrieved}

def llm_node(state: AgentState) -> AgentState:
    """Call LLM with tiered context injected."""
    context = "\n\n".join([
        state["working_memory"].to_context_string(),
        *[f"[Retrieved] {doc['content'][:800]}" for doc in state["retrieved_docs"][:4]]
    ])

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=f"You are a helpful assistant with access to memory context.\n\n{context}",
        messages=state["messages"]
    )
    # Return only the updated key (same reducer caveat as above)
    return {"final_response": response.content[0].text}

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("retrieve", memory_retrieval_node)
graph.add_node("generate", llm_node)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", END)
agent = graph.compile()

flowchart LR
    Q([User Query]) --> R[Route Query\nHaiku classifier\n~50 tokens, $0.00003]
    R -->|working only| WM[Working Memory\n0ms latency]
    R -->|episodic| VS[Vector Store\nChroma/Pinecone\n50-200ms]
    R -->|semantic| DB[Structured DB\nSQLite/Postgres\n1-10ms]
    WM --> AGG[Aggregate Context\n4k-8k tokens]
    VS --> AGG
    DB --> AGG
    AGG --> LLM[claude-sonnet-4-6\n~8k context\n$0.03-0.05/call]
    LLM --> OUT([Response])
    OUT --> MW[Memory Writer\nasync, non-blocking]
    MW -.-> VS
    MW -.-> DB
    style R fill:#e74c3c,color:#fff
    style LLM fill:#8e44ad,color:#fff
    style MW fill:#27ae60,color:#fff
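
One piece the snippets above don't show is the write path from the architecture diagram: the memory writer that feeds each exchange back into the episodic and semantic tiers. Here's a minimal sketch of what it could look like, reusing the classes defined earlier. The entity-extraction prompt, the per-user collection naming, and the fire-and-forget threading are my assumptions, not code from the companion repo:

# memory_writer.py: sketch of the write path (Tier 2 + Tier 3 updates)
import json
import threading
from datetime import datetime, UTC

from memory_tiers import EpisodicMemory, SemanticMemory, client

EXTRACT_PROMPT = (
    "Extract stable facts about the user from this exchange as JSON: "
    '{"entity_id": "...", "entity_type": "...", "attributes": {...}}. '
    "Respond with JSON only, or the word NONE."
)

def write_memory(user_msg: str, agent_msg: str, user_id: str) -> None:
    """Persist one exchange: raw episode to Tier 2, extracted facts to Tier 3."""
    episode = f"User: {user_msg}\nAgent: {agent_msg}"
    EpisodicMemory(f"episodes_{user_id}").write(
        episode,
        metadata={"user_id": user_id, "ts": datetime.now(UTC).isoformat()},
    )

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        system=EXTRACT_PROMPT,
        messages=[{"role": "user", "content": episode}],
    )
    text = response.content[0].text.strip()
    if text != "NONE":
        try:
            fact = json.loads(text)
            SemanticMemory().upsert(fact["entity_id"], fact["entity_type"], fact["attributes"])
        except (json.JSONDecodeError, KeyError):
            pass  # extraction is best-effort; never block the response path

def write_memory_async(user_msg: str, agent_msg: str, user_id: str) -> None:
    """Fire-and-forget so memory writes never add latency to the response."""
    threading.Thread(
        target=write_memory, args=(user_msg, agent_msg, user_id), daemon=True
    ).start()

Calling write_memory_async after each response keeps memory writes off the critical path, which is what the dotted edges in the diagram above are meant to convey.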


The Debugging Story You Should Learn From

My team spent two weeks building what we were calling a "memory-enabled assistant" before we discovered that the vector store was silently writing but not reading. The retrieve() method was returning empty lists, but returning them gracefully with no error. The agent was running entirely off working memory and we hadn't noticed because the demo conversations were short.

The tell was a production log line I almost ignored:

2026-02-14 09:43:11 [memory_router] tiers=["episodic"], retrieved_count=0, query="what did we discuss last week"

retrieved_count=0 on a query that explicitly asks about last week. We'd run 200 conversations before this. The vector store had to have data.

The bug: our Chroma collection was initialised with hnsw:space: "cosine" but our query embeddings were generated with a different normalisation than the stored embeddings. We'd switched embedding models mid-development and not re-indexed. Every query returned cosine distance > 0.95 (near-random), and our retrieval threshold of 0.7 silently filtered everything out.

The fix was adding one log line and one metric: retrieved_count per query. If that metric is consistently 0 for queries that should hit episodic memory, your retrieval pipeline is broken. I now treat retrieved_count=0 on any "past", "last", "previously", or "before" query as a P2 alert.

import logging

logger = logging.getLogger(__name__)

def retrieve(self, query: str, n_results: int = 5) -> list[dict]:
    results = self.collection.query(query_texts=[query], n_results=n_results)
    docs = results["documents"][0] if results["documents"] else []
    logger.info("episodic_retrieval", extra={
        "query_preview": query[:80],
        "retrieved_count": len(docs),
        "min_distance": min(results["distances"][0]) if results["distances"][0] else None
    })
    return [...]  # as before
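
The log line catches the symptom. To catch the root cause (an embedding-model switch without a re-index), you can also stamp each collection with the model it was indexed with and refuse to query when the stamp doesn't match. A sketch; the EMBED_MODEL constant and the metadata key are conventions I'm inventing here, not anything Chroma enforces:

# embed_guard.py: fail fast on embedding-model mismatches (sketch)
import chromadb

EMBED_MODEL = "all-MiniLM-L6-v2"  # whichever model you actually index with

chroma = chromadb.PersistentClient(path="./agent_memory")

def guarded_collection(name: str):
    """Open a collection, stamping it with the embedding model on creation and
    raising if an existing collection was indexed with a different one."""
    col = chroma.get_or_create_collection(
        name=name,
        metadata={"hnsw:space": "cosine", "embed_model": EMBED_MODEL},
    )
    stamped = (col.metadata or {}).get("embed_model")
    if stamped and stamped != EMBED_MODEL:
        raise RuntimeError(
            f"Collection '{name}' was indexed with {stamped} but queries will "
            f"use {EMBED_MODEL}: re-index before querying."
        )
    return col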

Comparison: When to Use What

Context strategy comparison: long context, RAG, and tiered memory side by side

quadrantChart
    title Context Strategy Selection
    x-axis Low Session Count --> High Session Count
    y-axis Short Document Corpus --> Large Document Corpus
    quadrant-1 Tiered Memory Required
    quadrant-2 Long Context + Episodic
    quadrant-3 Long Context Sufficient
    quadrant-4 RAG + Semantic Memory
    Single-session doc QA: [0.1, 0.65]
    Code review assistant: [0.35, 0.4]
    Customer support bot: [0.75, 0.55]
    Legal document analysis: [0.2, 0.85]
    Personal AI assistant: [0.85, 0.7]
    In-context few-shot: [0.15, 0.2]

The selection framework:

| Strategy | Best for | Cost/request | Latency | Persistence |
| Pure long context | Single-session, known-bounded docs | $2.00+ (800k tokens) | 40-60s | None |
| RAG only | Large static corpora, single-turn Q&A | $0.05 | 300-500ms | None |
| Tiered memory | Multi-session agents, user history | $0.05-0.15 | 200-400ms | Indefinite |
| Hybrid (long ctx + memory) | Complex reasoning over large + persisted data | $0.50-1.00 | 20-40s | Session |

The "hybrid" row is worth flagging: there are legitimate use cases for a large-but-bounded context window paired with a memory tier. Legal document analysis is a genuine hybrid problem: you reason over a full 200-page contract (Tier 1: the whole document), reference past analyses of similar contracts (Tier 2: episodic), and check specific regulatory facts (Tier 3: semantic). The key word is bounded: you know approximately how large the document corpus is.

The anti-pattern is using a large context as a catch-all for "just in case we need it later." That's where you get the $2-per-request bills and the lost-in-the-middle recall errors.


Production Considerations

Token budget enforcement

Add a hard cap on context size in your agent loop. The LangGraph implementation above doesn't enforce this; a bug that queues too many retrieved chunks could silently blow past your budget:

MAX_CONTEXT_TOKENS = 12_000  # conservative limit

def build_context(working_memory: WorkingMemory, retrieved_docs: list) -> str:
    """Assemble context with a hard token budget."""
    parts = [working_memory.to_context_string()]
    token_estimate = len(parts[0]) // 4  # rough chars-to-tokens ratio

    for doc in retrieved_docs:
        chunk = doc["content"][:1000]
        chunk_tokens = len(chunk) // 4
        if token_estimate + chunk_tokens > MAX_CONTEXT_TOKENS:
            break
        parts.append(f"[Memory] {chunk}")
        token_estimate += chunk_tokens

    return "\n\n".join(parts)

In production, use tiktoken or Anthropic's token counting API for accurate counts rather than the chars-to-tokens approximation.
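
For example, swapping the chars-to-tokens heuristic for an API-backed count is a one-line wrapper. A sketch; it assumes the token-counting endpoint available in recent versions of the Anthropic SDK and reuses the client from memory_tiers.py:

def count_tokens(text: str, model: str = "claude-sonnet-4-6") -> int:
    """Exact input-token count via the API instead of the len(text) // 4 estimate."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens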

Memory consolidation

Vector stores grow indefinitely without pruning. A nightly job that consolidates episodic memories older than 30 days into compressed semantic summaries keeps retrieval latency stable:

# Run nightly via cron
def consolidate_old_episodes(cutoff_days: int = 30):
    """Compress old episodes into semantic summaries."""
    # retrieve_older_than / delete_older_than: helpers that filter episodes by a
    # timestamp stored in their metadata (not shown in the snippets above)
    old_episodes = episodic.retrieve_older_than(cutoff_days)
    if not old_episodes:
        return

    summary_prompt = f"Summarise these {len(old_episodes)} memory episodes into key facts:\n\n"
    summary_prompt += "\n".join(ep["content"][:300] for ep in old_episodes)

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=500,
        messages=[{"role": "user", "content": summary_prompt}]
    )
    semantic.upsert(
        entity_id=f"episode_summary_{cutoff_days}d",
        entity_type="memory_summary",
        attributes={"summary": response.content[0].text, "episode_count": len(old_episodes)}
    )
    episodic.delete_older_than(cutoff_days)

Benchmarking your retrieval quality

Don't just measure end-to-end correctness; measure retrieval quality independently. At least monthly, run this check:

python3 scripts/eval_memory_retrieval.py \
  --test-set data/memory_eval_set.jsonl \
  --collection agent_episodes \
  --metrics recall@5 mrr ndcg@10

A retrieval recall@5 below 0.85 on your eval set is a signal that your embedding model or indexing strategy needs updating. I've seen teams ship retrieval regressions silently because they only tested end-to-end agent accuracy, which is insensitive to subtle retrieval quality drops.
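
If you don't want a separate harness, recall@k against a small labelled set is only a few lines. A sketch, assuming a JSONL eval file where each line carries a query plus the IDs of the episodes that should come back; the field names and the doc_id metadata key are my assumptions, not the companion repo's format:

# eval_recall.py: recall@k over the EpisodicMemory wrapper (sketch)
import json

from memory_tiers import EpisodicMemory

def recall_at_k(eval_path: str, collection: str, k: int = 5) -> float:
    """Fraction of eval queries whose top-k results include a relevant episode."""
    memory = EpisodicMemory(collection)
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)  # {"query": "...", "relevant_ids": ["..."]}
            results = memory.retrieve(case["query"], n_results=k)
            retrieved = {r["metadata"].get("doc_id") for r in results}
            hits += bool(retrieved & set(case["relevant_ids"]))
            total += 1
    return hits / total if total else 0.0

if __name__ == "__main__":
    score = recall_at_k("data/memory_eval_set.jsonl", "agent_episodes")
    print(f"recall@5 = {score:.2f}")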


Conclusion

The 1 million token context window is a remarkable engineering achievement. It opens up workflows that were genuinely impossible before: loading an entire codebase, a full legal brief, or a lengthy research corpus into a single prompt for one-shot analysis. For those bounded, single-session use cases, it's the right tool.

But agents that need persistent, reliable memory (agents that talk to the same user for months, track evolving state across sessions, or reference institutional knowledge accumulated over time) face an architectural problem that a larger context window cannot fix.

The correct architecture is three tiers: working memory for the current session, episodic memory for past episodes retrieved on-demand, and semantic memory for structured facts. Each tier has a different cost, latency, and persistence profile. Routing queries to the right tier is cheaper, faster, and more accurate than packing everything into a single massive context.

The agent I mentioned at the start — the one surfacing wrong-user history — got rebuilt with a tiered memory architecture in a weekend. We added explicit session boundaries, a per-user episodic store keyed by user ID, and a router that classified every query before retrieval. The cross-user contamination disappeared. Retrieval latency averaged 180ms. Cost per request dropped from $0.85 to $0.07.

The million-token context window didn't solve our problem. Understanding what the context window is for did.


Sources

  1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.

  2. Anthropic. (2025). Claude 3.5 Sonnet model card, RULER benchmark results. Anthropic Technical Report. anthropic.com/claude

  3. Google DeepMind. (2025). Gemini 2.5 Pro technical report, 1M context evaluation. deepmind.google/technologies/gemini

  4. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

  5. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. CHI 2023. (An influential early tiered memory design for LLM agents.)

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.


Published: 2026-04-24 · Written with AI assistance, reviewed by Toc Am.

