Context Engineering: The Skill That Replaced Prompt Engineering

Prompt engineering taught you what to ask — context engineering teaches you what the model needs to know before it answers, and that difference is what separates toy demos from production AI systems.
For the past few years, the AI community obsessed over prompt engineering. Craft the perfect instruction. Use the magic words. Add "think step by step." Chain your prompts. There were courses, certifications, and job titles built around the art of asking AI the right question. And it worked — to a point.
But production AI systems kept failing in ways that better prompts couldn't fix. Chatbots would forget important details mid-conversation. RAG pipelines would retrieve the wrong chunks and confidently hallucinate. Agents would lose track of their task state. The model wasn't stupid; it was blind. It was answering the question you asked while missing the information it needed.
That gap between what the model knows and what it needs to know has a name now: context engineering. And in 2026, it has become the core competency separating developers who build AI systems that actually work from those who are still fighting with their system prompts.
> This post is part of the [AI Agent Engineering: Complete 2026 Guide](/2026/04/ai-agent-engineering-complete-2026-guide.html). Context engineering is one of the core layers of a production agent stack — see the guide for how it fits into the full picture.
The Problem: Why 70% of LLM Failures Aren't the Model's Fault
According to a 2026 analysis by The New Stack, over 70% of production LLM application failures trace back to context problems — not model limitations. The model was capable of answering correctly. It simply didn't have what it needed in its context window to do so.
This finding reshapes how we should think about debugging AI systems. When your chatbot gives a bad answer, the instinct is to improve the prompt. Rewrite the instruction. Add more examples. Make it more explicit. But if the real problem is that the model is missing relevant background information, has too much irrelevant noise competing for attention, or is working from stale data — no amount of prompt refinement will help.
Consider a customer support bot. A user asks: "Why was my last order delayed?" The model has a beautiful system prompt explaining its role, its tone, its capabilities. But the user's order history isn't in the context. The shipping status isn't there. The warehouse disruption from last Tuesday isn't there. The model hallucinates a generic answer about carrier delays. The prompt was fine. The context was empty.
There are four recurring failure modes that context engineering addresses:

- Context overflow — The context window fills up with accumulated conversation history, old tool outputs, or verbose retrieved documents. Important information gets pushed out. Attention dilutes. The model starts "forgetting" things that were mentioned earlier.
- Stale context — The model is working from information that was accurate when injected but is no longer current. Cached user profiles, outdated product descriptions, old document versions. The model answers confidently based on facts that have changed.
- Irrelevant context flooding — RAG retrieval returns chunks that are topically adjacent but not actually useful for the current question. The model's attention gets pulled toward noise. Answer quality degrades not because information is missing but because too much irrelevant information is present.
- Missing context — The simplest failure. The model simply doesn't have information it needs. No retrieval was triggered, the conversation history was truncated, or the relevant data was never surfaced into the window in the first place.
All four of these are engineering problems. They require systems thinking, not prompt tweaking.

Technical Breakdown: What Context Engineering Actually Is
Context engineering is the practice of strategically managing what goes into an LLM's context window — what information is included, how it's structured, in what order, at what level of compression, and when it's refreshed.
The context window is not a dump. It's a carefully curated working memory. Every token you put in the window costs attention and inference compute. Every irrelevant token competes with relevant ones. The model's ability to reason over a context degrades as that context becomes noisier, longer, or more redundant.
Context engineering operates across five primary dimensions:
1. Retrieval Strategy
Not all retrieval is equal. Naive RAG concatenates the top-K chunks by cosine similarity and calls it done. Engineered context retrieval asks: is similarity the right signal here? Maybe recency matters more. Maybe the user's stated intent should weight the retrieval differently than the literal query terms. Maybe you need to retrieve context from multiple knowledge bases and interleave them intelligently.
Hybrid retrieval (dense + sparse), re-ranking with a cross-encoder, query expansion, and HyDE (Hypothetical Document Embeddings) are all tools in the context engineer's toolkit. Each controls what information makes it into the window.
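As a concrete illustration, here is a minimal HyDE sketch. The `generate` and `embed` callables are placeholders for whatever LLM client and embedding model you use — they are assumptions for this example, not a specific library's API:

```python
import numpy as np
from typing import Callable

def hyde_rank(
    query: str,
    chunks: list[str],
    generate: Callable[[str], str],     # LLM call: query -> hypothetical answer
    embed: Callable[[str], np.ndarray], # Any embedding model
    top_k: int = 5,
) -> list[str]:
    """Rank chunks by similarity to a *hypothetical answer*, not the raw query.

    The intuition behind HyDE: a fabricated answer is often closer in embedding
    space to the real evidence than a short, underspecified question is.
    """
    hypothetical = generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    hyp_emb = embed(hypothetical)
    scored = []
    for chunk in chunks:
        c_emb = embed(chunk)
        sim = float(
            np.dot(c_emb, hyp_emb)
            / (np.linalg.norm(c_emb) * np.linalg.norm(hyp_emb))
        )
        scored.append((sim, chunk))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

In practice `generate` would be a cheap completion call and `embed` your production embedding model; the ranking logic stays the same.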
2. Context Compression
More tokens in the window does not mean more useful information. Long documents, verbose conversation history, and redundant retrieved chunks all need to be compressed before injection. Summarization, key-point extraction, and chunk deduplication reduce token consumption while preserving semantic density.
The goal is maximum information per token, not maximum tokens.
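One way to pursue that goal is extractive compression: score each sentence of a long document against the query and keep only the top scorers within a token budget. This sketch uses a naive regex sentence splitter and a rough 1.3-tokens-per-word estimate — both are simplifying assumptions, not a production tokenizer:

```python
import re
import numpy as np
from typing import Callable

def compress_extractive(
    text: str,
    query: str,
    embed: Callable[[str], np.ndarray],  # Any embedding model (placeholder)
    token_budget: int = 100,
) -> str:
    """Keep the sentences most relevant to the query, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    q_emb = embed(query)
    scored = []
    for i, s in enumerate(sentences):
        s_emb = embed(s)
        sim = float(
            np.dot(s_emb, q_emb)
            / (np.linalg.norm(s_emb) * np.linalg.norm(q_emb) + 1e-9)
        )
        scored.append((sim, i, s))
    scored.sort(reverse=True)  # Highest-similarity sentences first

    kept, used = [], 0
    for sim, i, s in scored:
        est = int(len(s.split()) * 1.3)  # Rough token estimate
        if used + est > token_budget:
            continue  # Skip sentences that would blow the budget
        kept.append((i, s))
        used += est
    kept.sort()  # Restore original sentence order for readability
    return " ".join(s for _, s in kept)
```

The same pattern extends to chunk deduplication and LLM-based summarization; what matters is that the selection criterion is relevance per token, not position in the document.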
3. Ordering and Recency Weighting
Where you put information in the context window matters. LLMs exhibit a "lost in the middle" phenomenon — information at the beginning and end of long contexts is better attended to than information buried in the middle. Critical context should be placed close to the query. Background information can go earlier. Ordering is not cosmetic.
Recency weighting applies to conversation history specifically. The last few turns are almost always more relevant than what was said ten turns ago. Selectively compressing old history while preserving recent turns maintains coherent conversation without blowing the token budget.
4. Hierarchical Context Architecture
Production systems benefit from thinking about context in layers:
- Global context: System-level information that applies to every interaction — persona, capabilities, hard constraints, domain knowledge. Typically 200-500 tokens.
- Session context: User-specific information that applies to this conversation — user profile, preferences, prior session summaries, account state. Injected at session start.
- Turn context: Information relevant to this specific query — retrieved documents, tool outputs, recent conversation history. Dynamically assembled per turn.
Each layer has different freshness requirements, different token budgets, and different strategies for compression and retrieval.
5. Relevance Filtering
Before injecting any retrieved content, filter it for actual relevance. A relevance score of 0.72 from your vector database doesn't tell you whether that chunk actually helps answer the current question. Post-retrieval filtering using LLM-as-judge, semantic similarity to the query intent (not just the query terms), or rule-based filters (e.g., date ranges, entity matching) can dramatically reduce context noise.
How Context Engineering Differs from Prompt Engineering
The confusion between the two is understandable — both deal with what you send to the model. But the distinction is fundamental.
Prompt engineering is about the instruction: the task description, the output format request, the few-shot examples, the chain-of-thought nudge. It assumes the model has what it needs and focuses on directing how the model should process and respond. Prompt engineering asks: "How do I phrase this?"
Context engineering is about the information: what background knowledge, retrieved documents, conversation history, user state, and domain data the model has available when it processes the prompt. It assumes the instruction is clear and focuses on ensuring the model has the right inputs. Context engineering asks: "What does the model need to know?"
In a well-architected system, both matter. But they have different leverage points. A mediocre prompt with excellent context often outperforms an excellent prompt with mediocre context. The model is fundamentally a reasoning engine — it reasons over what's in its window. The quality of the window determines the ceiling on answer quality.

| Dimension | Prompt Engineering | Context Engineering |
|-----------|-------------------|---------------------|
| Focus | How the model is instructed | What information the model has |
| Scope | System prompt + task framing | Retrieved docs, history, user state, dynamic injections |
| When it matters most | Simple tasks, instruction-following benchmarks | Multi-turn systems, RAG, agents, personalization |
| Primary failure mode | Unclear instructions, wrong format | Missing context, context overflow, stale data |
| Core skill | Writing clear instructions, few-shot examples | Retrieval design, compression, token budgeting |
| Tooling | Prompt templates, prompt versioning | Vector DBs, chunking pipelines, context managers |
| Iteration speed | Fast (text edits) | Slower (pipeline changes, eval frameworks) |
| Ceiling | Limited by information available | Limited by retrieval quality and token budget |
| 2026 relevance | Necessary but insufficient | The differentiating skill in production AI |
Implementation Guide
Building a Production ContextManager
The following Python class implements a complete context management system. It handles adding messages, compressing history, retrieving relevant documents, and assembling the final context window within a token budget.
```python
import tiktoken
from dataclasses import dataclass, field
from typing import Optional

from openai import OpenAI

# Token estimation using tiktoken. cl100k_base matches GPT-4-era models;
# treat counts as an approximation for GPT-4o (o200k_base) and Claude.
enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    """Estimate token count for a string."""
    return len(enc.encode(text))


@dataclass
class Message:
    """Represents a single turn in conversation history."""
    role: str  # "system", "user", or "assistant"
    content: str
    tokens: int = field(init=False)

    def __post_init__(self):
        self.tokens = count_tokens(self.content)


@dataclass
class ContextConfig:
    """Configuration for context window management."""
    max_tokens: int = 8000        # Hard limit for assembled context
    system_budget: int = 500      # Tokens reserved for system prompt
    session_budget: int = 400     # Tokens reserved for session context (user profile etc.)
    history_budget: int = 2000    # Tokens for conversation history
    retrieval_budget: int = 4000  # Tokens for retrieved documents
    recency_turns: int = 4        # Number of recent turns to always keep verbatim


class ContextManager:
    """
    Manages LLM context window assembly for production systems.

    Handles:
    - Sliding window conversation history with compression
    - Relevance-filtered document injection
    - Token budget enforcement across context layers
    - Global → Session → Turn context hierarchy
    """

    def __init__(
        self,
        system_prompt: str,
        config: Optional[ContextConfig] = None,
        llm_client: Optional[OpenAI] = None,
        session_context: Optional[str] = None,
    ):
        self.system_prompt = system_prompt
        self.config = config or ContextConfig()
        self.client = llm_client  # Used for summarization compression
        self.session_context = session_context or ""
        self.history: list[Message] = []
        self.compressed_summary: str = ""  # Rolling summary of old history
        self.retrieved_docs: list[str] = []

    def add(self, role: str, content: str) -> None:
        """
        Add a new message to conversation history.
        Automatically triggers compression if the history budget is exceeded.
        """
        msg = Message(role=role, content=content)
        self.history.append(msg)
        total_history_tokens = sum(m.tokens for m in self.history)
        if total_history_tokens > self.config.history_budget:
            self._compress_history()

    def _compress_history(self) -> None:
        """
        Compress older history turns into a rolling summary.
        Always preserves the most recent `recency_turns` verbatim.
        The rest is summarized via an LLM call and stored as compressed_summary.
        """
        # Split: keep recent turns verbatim, compress the rest
        recent = self.history[-self.config.recency_turns:]
        to_compress = self.history[:-self.config.recency_turns]
        if not to_compress:
            return

        # Build a text block from older turns for summarization
        history_text = "\n".join(
            f"{m.role.upper()}: {m.content}" for m in to_compress
        )
        if self.compressed_summary:
            # Fold new turns into the existing summary rather than replacing it
            summary_input = (
                f"Previous summary:\n{self.compressed_summary}\n\n"
                f"New turns to incorporate:\n{history_text}"
            )
        else:
            summary_input = history_text

        if self.client:
            # Use an LLM to generate a high-quality summary
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",  # Use a cheap model for compression
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "Summarize this conversation history concisely. "
                            "Preserve: key decisions, important facts stated by the user, "
                            "unresolved questions, and any commitments made. "
                            "Output 2-4 sentences maximum."
                        ),
                    },
                    {"role": "user", "content": summary_input},
                ],
                max_tokens=200,
            )
            self.compressed_summary = response.choices[0].message.content
        else:
            # Fallback: simple truncation (for testing / when no LLM is available)
            self.compressed_summary = (
                f"[Earlier conversation compressed. Key context: {history_text[:300]}...]"
            )

        # Replace history with just the recent turns
        self.history = recent

    def retrieve(self, docs: list[str], max_tokens: Optional[int] = None) -> None:
        """
        Inject retrieved documents into the context.
        Enforces the token budget — drops lower-priority docs once over budget.

        Args:
            docs: Document strings, ordered by relevance (most relevant first).
            max_tokens: Override the configured retrieval budget if provided.
        """
        budget = max_tokens or self.config.retrieval_budget
        self.retrieved_docs = []
        tokens_used = 0
        for doc in docs:
            doc_tokens = count_tokens(doc)
            if tokens_used + doc_tokens <= budget:
                self.retrieved_docs.append(doc)
                tokens_used += doc_tokens
            else:
                # Over budget — skip the remaining docs
                # (they rank lower anyway, since docs arrive sorted by relevance)
                break

    def get_window(self) -> list[dict]:
        """
        Assemble the final context window as a list of messages for the LLM API.

        Returns messages in this order:
        1. System prompt (global context)
        2. Session context (user profile, preferences) — appended to the system message
        3. Compressed history summary (if any)
        4. Retrieved documents (if any)
        5. Recent verbatim history
        (The caller appends the current user query as the final message.)
        """
        messages = []

        # Layer 1: Global system context
        system_content = self.system_prompt
        if self.session_context:
            # Append session-specific context to the system message
            system_content += f"\n\n## User Context\n{self.session_context}"
        messages.append({"role": "system", "content": system_content})

        # Layer 2: Compressed history summary (if it exists)
        if self.compressed_summary:
            messages.append({
                "role": "system",
                "content": f"## Earlier Conversation Summary\n{self.compressed_summary}",
            })

        # Layer 3: Retrieved documents
        if self.retrieved_docs:
            docs_block = "\n\n---\n\n".join(self.retrieved_docs)
            messages.append({
                "role": "system",
                "content": f"## Relevant Context\n{docs_block}",
            })

        # Layer 4: Recent verbatim history
        for msg in self.history:
            messages.append({"role": msg.role, "content": msg.content})

        return messages

    def token_usage(self) -> dict:
        """
        Return a breakdown of current token usage across context layers.
        Useful for monitoring and debugging context budget allocation.
        """
        usage = {
            "system": count_tokens(self.system_prompt + self.session_context),
            "compressed_summary": count_tokens(self.compressed_summary),
            "retrieved_docs": sum(count_tokens(d) for d in self.retrieved_docs),
            "history": sum(m.tokens for m in self.history),
        }
        usage["total"] = sum(usage.values())
        return usage
```
This class encapsulates the four core operations of context engineering: add() for managing history with automatic compression, _compress_history() for rolling summarization, retrieve() for budget-aware document injection, and get_window() for assembling the final layered context payload.
The key insight in this implementation is the separation of concerns. History compression is triggered automatically — the caller doesn't need to think about it. Retrieved documents are prioritized by order (most relevant first) and cut at the budget boundary. The final assembly follows a strict hierarchy that puts global context first, session context second, and turn-specific content last.
Engineered RAG Context Assembly vs Naive Concatenation
The second critical pattern is the difference between how naive RAG systems and engineered context systems assemble retrieved content. This example shows both approaches side by side, then demonstrates the quality gap.
```python
import numpy as np
from datetime import datetime, timedelta
from typing import NamedTuple

from sentence_transformers import SentenceTransformer, CrossEncoder

# Models for embedding and re-ranking
embedder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


class Chunk(NamedTuple):
    text: str
    source: str
    date: str     # ISO date string for recency weighting
    score: float  # Embedding similarity score


def naive_rag_context(query: str, chunks: list[Chunk], top_k: int = 5) -> str:
    """
    NAIVE APPROACH: Embed query, take top-K by cosine similarity, concatenate.

    Problems:
    - No re-ranking for actual query relevance
    - No deduplication of similar chunks
    - No token budget enforcement
    - No ordering strategy (important context may land in the "lost in the middle" zone)
    - All chunks treated equally regardless of recency
    """
    query_emb = embedder.encode(query)
    chunk_embs = embedder.encode([c.text for c in chunks])

    # Cosine similarity
    similarities = np.dot(chunk_embs, query_emb) / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb)
    )

    # Take top-K and concatenate — that's it
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    top_chunks = [chunks[i] for i in top_indices]

    # Naive: just dump them all in order of similarity
    return "\n\n".join(c.text for c in top_chunks)


def engineered_rag_context(
    query: str,
    chunks: list[Chunk],
    top_k_retrieval: int = 20,  # Retrieve more, then filter down
    top_k_final: int = 5,
    token_budget: int = 3000,
    recency_boost_days: int = 30,
) -> tuple[str, dict]:
    """
    ENGINEERED APPROACH: Multi-stage retrieval with re-ranking, deduplication,
    recency weighting, and budget-aware ordering.
    Returns the assembled context string plus metadata for monitoring.
    """
    # Stage 1: Broad retrieval — get more candidates than we need
    query_emb = embedder.encode(query)
    chunk_embs = embedder.encode([c.text for c in chunks])
    similarities = np.dot(chunk_embs, query_emb) / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb)
    )
    top_indices = np.argsort(similarities)[-top_k_retrieval:][::-1]
    candidates = [(chunks[i], float(similarities[i])) for i in top_indices]

    # Stage 2: Cross-encoder re-ranking for actual relevance.
    # Cross-encoders are slower but measure true query-document relevance.
    pairs = [[query, c.text] for c, _ in candidates]
    rerank_scores = cross_encoder.predict(pairs)

    # Combine embedding similarity (0.3) with cross-encoder score (0.7).
    # ms-marco cross-encoders emit raw logits, so /10 is a crude normalization.
    combined = []
    for (chunk, embed_score), rerank_score in zip(candidates, rerank_scores):
        combined_score = 0.3 * embed_score + 0.7 * (rerank_score / 10.0)
        combined.append((chunk, combined_score))

    # Stage 3: Recency boost — reward recent documents
    cutoff = datetime.now() - timedelta(days=recency_boost_days)
    boosted = []
    for chunk, score in combined:
        try:
            chunk_date = datetime.fromisoformat(chunk.date)
            if chunk_date > cutoff:
                score = score * 1.15  # 15% boost for recent content
        except (ValueError, AttributeError):
            pass
        boosted.append((chunk, score))

    # Stage 4: Sort by final score
    boosted.sort(key=lambda x: x[1], reverse=True)

    # Stage 5: Deduplicate similar chunks (cosine sim > 0.92 = near-duplicate)
    selected: list[Chunk] = []
    selected_embs = []
    for chunk, score in boosted:
        chunk_emb = embedder.encode(chunk.text)
        if selected_embs:
            sims = np.dot(selected_embs, chunk_emb) / (
                np.linalg.norm(selected_embs, axis=1) * np.linalg.norm(chunk_emb)
            )
            if np.max(sims) > 0.92:
                continue  # Near-duplicate of an already-selected chunk — skip
        selected.append(chunk)
        selected_embs.append(chunk_emb)
        if len(selected) >= top_k_final:
            break

    # Stage 6: Token budget enforcement + ordering strategy.
    # Enforce the budget in relevance order so the LEAST relevant chunks are
    # dropped first, then reverse so the most relevant material lands at the
    # END of the block — closest to the query, where LLM attention is strongest.
    token_count = 0.0
    final_chunks: list[Chunk] = []
    for chunk in selected:  # Most relevant first, for budget enforcement
        chunk_tokens = len(chunk.text.split()) * 1.3  # Rough token estimate
        if token_count + chunk_tokens > token_budget:
            break
        final_chunks.append(chunk)
        token_count += chunk_tokens
    final_chunks.reverse()  # Background first, most relevant last

    # Assemble with source attribution (helps the model weight information)
    assembled_parts = [
        f"[Source: {chunk.source} | Date: {chunk.date}]\n{chunk.text}"
        for chunk in final_chunks
    ]
    assembled_context = "\n\n---\n\n".join(assembled_parts)

    # Return context + metadata for monitoring
    metadata = {
        "chunks_retrieved": top_k_retrieval,
        "chunks_after_rerank": len(candidates),
        "chunks_after_dedup": len(selected),
        "chunks_final": len(final_chunks),
        "estimated_tokens": int(token_count),
        "sources": [c.source for c in final_chunks],
    }
    return assembled_context, metadata
```
The difference in output quality between these two functions is significant in practice. The naive approach frequently returns redundant chunks (three slightly different paragraphs from the same document saying the same thing) and misses the actual most-relevant content because embedding similarity and true relevance diverge for complex queries. The engineered approach routes through re-ranking, removes near-duplicates, rewards fresh content, and places the highest-relevance material where LLM attention is strongest.
Production Considerations
Token Budget Management
Every production context engineering system needs a token budget framework. Define hard limits per layer, build monitoring to track actual token consumption per request, and set up alerts when budgets are consistently exceeded or when retrieval is returning too few results within budget.
Practical budget allocation for a general-purpose assistant on a 16K context model:
- System prompt: 400-600 tokens
- Session context (user profile, preferences): 300-500 tokens
- Compressed history summary: 200-400 tokens
- Retrieved documents: 5,000-8,000 tokens (the largest budget)
- Recent verbatim history (last 4-6 turns): 1,500-2,500 tokens
- Current query: 50-500 tokens
- Output buffer: 1,000-2,000 tokens
The output buffer is easy to forget. Your context window is shared between input and output — if you fill 15,900 tokens of a 16K context with input, the model has 100 tokens to respond. Build in headroom.
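A guard like the following makes the headroom explicit. The function name and the 1,000-token floor are illustrative choices, not a standard API:

```python
def output_headroom(
    context_limit: int,   # Model's total context window (input + output)
    input_tokens: int,    # Tokens already consumed by the assembled context
    min_output: int = 1000,  # Floor for the response budget
) -> int:
    """Return tokens available for the response; fail loudly if below the floor."""
    headroom = context_limit - input_tokens
    if headroom < min_output:
        raise ValueError(
            f"Only {headroom} tokens left for output; compress the context first."
        )
    return headroom
```

Calling this before every request turns a silent truncated-response failure into a visible assembly-time error.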
Context Cache Warming
For latency-sensitive applications, context cache warming is a significant optimization. Many LLM providers (Anthropic's prompt caching, OpenAI's cached tokens) allow you to cache a prefix of the context and pay reduced rates for cache hits. Design your context assembly so that the static portions (system prompt, global knowledge base content that rarely changes) appear early and consistently — they'll get cached across requests. Dynamic, per-query content (retrieved documents, recent history) goes after the cache boundary.
This can reduce per-request latency by 30-60% and cost by 50-80% for cache hits on the static prefix.
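Structurally, that means assembling the request so the static prefix comes first and is marked cacheable. This sketch follows Anthropic's documented `cache_control` block format; the model ID and field layout are taken from their API, but treat it as an illustration rather than a drop-in client call:

```python
def build_cached_request(
    system_prompt: str,   # Static: identical across requests, cacheable
    knowledge_base: str,  # Static: rarely-changing domain content, cacheable
    retrieved_docs: str,  # Dynamic: changes per query, after the cache boundary
    user_query: str,
) -> dict:
    """Assemble an Anthropic-style request with a cacheable static prefix."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            # cache_control marks the END of the cacheable prefix;
            # everything before (and including) this block can be cached.
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [
            {
                "role": "user",
                "content": f"## Relevant Context\n{retrieved_docs}\n\n{user_query}",
            },
        ],
    }
```

The key discipline is byte-for-byte stability of the prefix: any change to the system prompt or knowledge base invalidates the cache, so keep dynamic values out of those blocks.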
Monitoring Context Quality
You can't improve what you don't measure. Build these metrics into your context engineering pipeline:

- Retrieval precision — Of the chunks injected into context, what percentage were actually cited or used in the model's response? Low precision indicates over-retrieval (too much irrelevant noise going in).
- Context utilization rate — What fraction of your token budget is being used? Consistently at 95%+ suggests compression is insufficient. Consistently at 30-40% suggests over-conservative retrieval that's leaving relevant information on the table.
- Answer grounding rate — For RAG systems, what percentage of factual claims in the response can be traced back to a specific injected chunk? Low grounding rates indicate hallucination despite good retrieval — often caused by poor ordering or context flooding.
- Compression ratio — How much does your history compression reduce tokens while preserving the information the model needs to answer subsequent questions? Evaluate by testing how often the model correctly answers questions that require information from compressed history.
- Context freshness lag — For systems with dynamic data, how old is the newest piece of context on average? A high lag indicates your retrieval system isn't surfacing recent enough data.
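Even a crude retrieval-precision proxy is better than none. This sketch counts a chunk as "used" if one of its n-grams reappears in the response — a rough stand-in for proper citation tracking, and an assumption of this example rather than an established metric implementation:

```python
def retrieval_precision(chunks: list[str], response: str, ngram: int = 5) -> float:
    """Fraction of injected chunks with at least one n-gram echoed in the response.

    Crude proxy: shared word n-grams suggest the model actually drew on a chunk.
    """
    resp = response.lower()
    used = 0
    for chunk in chunks:
        words = chunk.lower().split()
        grams = [
            " ".join(words[i:i + ngram])
            for i in range(len(words) - ngram + 1)
        ]
        if any(g in resp for g in grams):
            used += 1
    return used / len(chunks) if chunks else 0.0
```

Log this per request; a sustained drop flags over-retrieval long before users complain about answer quality.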
The Context Refresh Problem
Static context goes stale. A user profile injected at the start of a long session may be outdated by turn 30 if the user has updated their preferences mid-session. Product documentation injected at session start may reference prices or features that have changed. Build explicit context refresh triggers: after N turns, re-fetch session context; before answering questions about pricing or availability, always retrieve fresh data rather than relying on session-start injection.
The architectural principle is: treat context like a cache with TTLs. Every piece of injected context has an implicit freshness guarantee. When that guarantee expires, refresh it.
Conclusion
The shift from prompt engineering to context engineering reflects a maturation in how the industry builds AI systems. Prompt engineering was the right first skill — you had to learn to communicate with these models at all, and that took work. But it's a necessary precondition, not a sufficient one.
Context engineering is where the real leverage lives in 2026. The models are capable. GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 — these are genuinely powerful reasoning systems. The bottleneck in almost every production failure isn't the model's intelligence; it's the quality of the information it's reasoning over.
If you're building AI systems today, audit your context before you audit your prompts. Ask: is the model seeing everything it needs? Is it being flooded with irrelevant noise? Is critical information being pushed out by conversation history overflow? Is the retrieved content actually fresh and relevant, or is it a semantic similarity score that doesn't translate to real-world usefulness?
The answers to those questions will tell you where your system is failing — and the techniques in this post give you the tools to fix it. Sliding window compression, hierarchical context layers, multi-stage retrieval with re-ranking, token budget management, and context quality monitoring are no longer advanced topics. They are table stakes for any AI application that needs to work reliably in production.
Start with the ContextManager class. Instrument your retrieval pipeline with the metadata logging from the engineered_rag_context function. Add the five monitoring metrics to your dashboards. You'll have a clearer picture of your system's context health within a week, and actionable improvements to make within two.
The model is not the problem. What you feed it is.
Sources
- The New Stack, "Why LLM Applications Fail in Production" (2026) — context quality accounts for 70%+ of production failures
- Anthropic Documentation: Prompt Caching with Claude — context cache warming patterns
- OpenAI Cookbook: RAG Best Practices — retrieval pipeline design
- "Lost in the Middle: How Language Models Use Long Contexts" — Liu et al., position effects in long-context LLMs
- sentence-transformers documentation — `cross-encoder/ms-marco-MiniLM-L-6-v2` for re-ranking
- tiktoken library — token counting for OpenAI-compatible models
Enjoyed this post? Follow AmtocSoft for AI tutorials from beginner to professional.