
Introduction
I lost a full day last quarter to a bug that turned out to be a sorting problem. Our support agent had started giving subtly stale answers, quoting a refund policy we had retired months earlier. The retrieval was fine. The policy doc in the vector store was current. The model was the same one that had worked the week before. The bug was that our context assembler appended retrieved chunks in similarity order, and a high-similarity but outdated changelog snippet kept landing in the last few hundred tokens before the question, right where the model pays the most attention. The model was not wrong. It was answering the context we actually gave it, which was not the context I thought we were giving it.
That day reframed how I think about this work. I had spent weeks treating the prompt as the thing to tune, when the real artifact was the pipeline that decided what went into the prompt. That pipeline is what the field now calls context engineering, and in 2026 it has become the defining discipline of building with LLMs, the practice of architecting the entire information environment for a model rather than wordsmithing a single instruction (Sombra, AI Context Engineering 2026). Context quality, not context volume, is the limiting factor now (The New Stack, 2026).
This is a field guide to treating context as infrastructure: a pipeline you build, test, and monitor, with the same rigor you give any other production system.
The Problem: The Prompt Was Never the Artifact
Prompt engineering treated the model's input as a string to be crafted. That worked when the input was small and static. It stops working the moment the input is assembled at runtime from many sources: retrieved documents, conversation history, tool outputs, user profile, system rules. At that point the interesting decisions are no longer about wording. They are about selection, ordering, compression, and provenance.
Three failure modes show up once you cross that line, and none of them are fixable by editing the prompt text:
-
Position blindness. Models attend unevenly across their window. Critical facts buried in the middle of a long context get underweighted, a pattern robust enough that retrieval order materially changes answers. My stale-refund bug was exactly this.
-
Context dilution. Stuffing more into the window feels safer but is not. Every irrelevant token competes with the relevant ones for attention and pushes up cost and latency. Beyond a point, more context makes answers worse, not better.
-
Untraceable answers. When something goes wrong, you need to know which tokens produced the answer. If your assembly step keeps no record of what it put in the window and why, every incident becomes an archaeology dig instead of a log query.

The shift is from asking what I should say to the model toward asking what information environment I should construct for it, and how I know I constructed the right one. That second question is an engineering question, and it has engineering answers.
How It Works: The Assembly Pipeline
Treating context as infrastructure means there is a pipeline with named stages between your raw sources and the model call. Here is the shape of it.
The stage that earns its keep first is curation, because it is where you remove the noise that would otherwise dilute everything downstream. Deduplication and filtering before ranking mean the ranker is choosing among genuinely distinct, plausibly-relevant candidates rather than near-duplicate chunks that crowd each other out. Smart summarization that keeps the critical content while pruning redundancy is what separates a system that stays usable over long sessions from one that degrades (Digital Applied, Agent Reliability Playbook 2026).
The second load-bearing stage is assembly with hierarchy. Headers segment context into addressable units, and a model working through clearly-sectioned context navigates to what is relevant for the task (Packmind, Context Engineering Best Practices 2026). Order matters too: put the most decision-relevant material where the model attends most, which in practice means near the question, not buried in the middle.
Implementation Guide: Building the Pipeline
Let us build a small, real context assembler that respects a token budget, deduplicates, ranks, and keeps provenance. Start with the budget, because every other decision is a negotiation against it.
from dataclasses import dataclass, field
@dataclass
class Chunk:
source: str
text: str
score: float # relevance, 0..1
tokens: int
@dataclass
class AssemblyResult:
blocks: list[Chunk]
used_tokens: int
dropped: list[str] = field(default_factory=list)
def estimate_tokens(text: str) -> int:
# Rough heuristic: ~4 chars per token. Swap for a real tokenizer in prod.
return max(1, len(text) // 4)
Next, deduplicate near-identical chunks before ranking. The cheap, effective approach is shingled Jaccard similarity: if two chunks share most of their word-shingles, keep the higher-scored one.
def shingles(text: str, n: int = 5) -> set[str]:
words = text.lower().split()
return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
def dedupe(chunks: list[Chunk], threshold: float = 0.8) -> list[Chunk]:
kept: list[Chunk] = []
for c in sorted(chunks, key=lambda x: x.score, reverse=True):
c_sh = shingles(c.text)
dup = False
for k in kept:
k_sh = shingles(k.text)
if c_sh and k_sh:
jac = len(c_sh & k_sh) / len(c_sh | k_sh)
if jac >= threshold:
dup = True
break
if not dup:
kept.append(c)
return kept
Now the assembler: dedupe, rank, then greedily fill the budget with the highest-scoring chunks, recording what was dropped so the decision is auditable.
def assemble(chunks: list[Chunk], budget_tokens: int) -> AssemblyResult:
deduped = dedupe(chunks)
ranked = sorted(deduped, key=lambda c: c.score, reverse=True)
blocks: list[Chunk] = []
used = 0
dropped: list[str] = []
for c in ranked:
if used + c.tokens <= budget_tokens:
blocks.append(c)
used += c.tokens
else:
dropped.append(f"{c.source} (score={c.score:.2f}, {c.tokens} tok)")
# Position the highest-scoring block LAST, nearest the question.
blocks.sort(key=lambda c: c.score)
return AssemblyResult(blocks=blocks, used_tokens=used, dropped=dropped)
Run it against a mixed candidate set with a tight budget and the provenance falls out for free:
$ python assemble.py --budget 800
[assemble] 11 candidates -> 7 after dedupe -> 5 fit in 800 tokens
kept:
policy/refunds-v3.md score=0.94 120 tok (placed nearest question)
faq/refund-window.md score=0.88 140 tok
policy/shipping.md score=0.71 160 tok
kb/returns-process.md score=0.66 180 tok
chat/turn-14.md score=0.61 190 tok
dropped (over budget):
changelog/2025-q3.md score=0.83 220 tok <-- the stale snippet, correctly dropped
faq/refund-window.md (duplicate of kept chunk)
used 790/800 tokens
That changelog/2025-q3.md line is the bug from my introduction, now visible and handled. Because dedupe and the budget log every decision, the stale snippet either gets dropped or, if it does sneak in, shows up in a log I can grep instead of a mystery I have to reproduce.
Decision Flow: What Goes in the Window
Not every available token should be spent. The assembler needs a policy for what is worth including, and that policy is itself a guardrail against dilution.
The rule that does the most work is the relevance floor. A chunk that scores below the floor never enters the window even if there is budget to spare, because empty budget is cheaper than diluted budget. This is the counterintuitive heart of context engineering: leaving the window partly empty is often the right call. More tokens are not more help.
A Gotcha: When Compression Ate the Answer
The first compression stage I shipped was too clever and it cost us a wrong answer in front of a customer. To fit more into the budget, I summarized each retrieved chunk with a small model before assembly, on the theory that a 50-token summary of a 200-token doc let me fit four times as much. It worked in testing and then failed on a precise question.
The customer asked whether refunds applied to digital goods specifically. The relevant doc spelled out that refunds apply to all physical goods within the standard return window, and that digital goods are non-refundable. My summarizer compressed that down to a generic line about refunds applying within the return window, which is true in spirit and catastrophically wrong for this question. The summary dropped the exact qualifier the question hinged on.
$ python debug_answer.py --q "are digital goods refundable?"
retrieved: policy/refunds-v3.md (full): physical goods within return window;
digital goods are non-refundable.
assembled: policy/refunds-v3.md (summary): refunds apply within return window.
model answer: Yes, you can request a refund. <-- WRONG for digital goods
root cause: lossy summarization dropped the 'digital goods' exclusion
The fix was to stop summarizing eagerly and instead summarize only when a chunk exceeds a size threshold, and even then to preserve named entities and explicit exclusions verbatim. Better still, for high-stakes factual chunks, I now pass them through whole and spend the budget I save by dropping low-score chunks entirely. The lesson: compression is a tradeoff against fidelity, and the tokens you save mean nothing if you compress away the one clause the answer depended on. Test your compressor against precise, qualifier-heavy questions, not just broad ones.
Scoring Beyond Similarity
The pipeline so far treats score as a given, but where that number comes from is itself a context-engineering decision, and raw vector similarity is rarely the right answer on its own. Cosine similarity tells you a chunk is semantically near the query. It does not tell you the chunk is fresh, authoritative, or the kind of source this question needs. A high-similarity but stale changelog, the exact villain of my refund bug, scores well on similarity and badly on everything that actually matters.
A more honest score blends similarity with signals you already have. Recency, source authority, and a light penalty for length all push the ranker toward chunks that are not just topically close but actually trustworthy for the task.
import math
def blended_score(similarity: float, age_days: float,
authority: float, tokens: int) -> float:
# Decay relevance for stale docs; reward authoritative, concise sources.
recency = math.exp(-age_days / 180.0) # half-life ~6 months
length_penalty = 1.0 / (1.0 + tokens / 500) # gently disfavor bloat
return 0.6 * similarity + 0.25 * recency + 0.15 * authority * length_penalty
The weights are not sacred; they are a starting point you tune against your own eval set. What matters is that the score the assembler ranks on encodes more than topical nearness. Re-running the earlier example with blended scoring, the stale changelog falls below the relevance floor on its own, before the budget stage ever has to drop it.
$ python rank.py --query "are digital goods refundable?" --blended
policy/refunds-v3.md sim=0.91 age=12d auth=1.0 -> 0.93 keep
faq/refund-window.md sim=0.88 age=40d auth=0.8 -> 0.85 keep
changelog/2025-q3.md sim=0.83 age=240d auth=0.4 -> 0.61 below floor (0.65), dropped
floor=0.65: 1 stale chunk dropped before budget stage
This is the deeper point about context as infrastructure: the relevance floor and the scoring function are policy knobs, and like any policy they deserve to be explicit, versioned, and tested. A team that hardcodes top-k cosine similarity has made a scoring decision by accident. A team that writes blended_score has made one on purpose, and can change it deliberately when the data shifts. The difference shows up months later, when a stale source starts creeping into answers and one team can adjust a weight while the other is reverse-engineering why retrieval "suddenly got worse."
The same discipline extends to negative signals. If a source has been flagged as deprecated, the cleanest fix is not to delete it from the store but to give it an authority of zero so it can never outrank a live document, while still being available if a user explicitly asks about historical policy. Encoding that as a score is far more robust than hoping it never gets retrieved.
Comparison and Tradeoffs
How do the common context strategies compare in practice? Here is my scoring after a year of running this pipeline.
| Strategy | Controls dilution | Handles position | Traceable | Latency cost | Verdict |
|---|---|---|---|---|---|
| Stuff everything in the window | No | No | No | High | Feels safe, degrades quality |
| Tune the prompt wording only | No | No | No | None | Necessary, not the real lever |
| Top-k retrieval, raw order | Weak | No | Weak | Medium | The common default, leaves wins on the table |
| Dedupe + rank + budget | Yes | Partial | Yes | Low | The baseline worth building |
| Eager summarize-everything | Partial | No | Weak | Medium | Risks dropping the key clause |
| Curate + rank + position + provenance | Yes | Yes | Yes | Low | The pipeline you actually want |

The core tradeoff is fidelity versus density. Every compression and every dropped chunk buys you room and risks losing something. The discipline is to make those tradeoffs explicit and logged rather than implicit and invisible. A pipeline that records what it dropped and why turns a class of silent quality bugs into visible, debuggable events, which is the whole reason to treat context as infrastructure in the first place.
Production Considerations
A few things that matter once the pipeline is live.
Log provenance on every call. Record which chunks went into each window, their scores, and what was dropped. This is your single most useful artifact when an answer goes wrong, and it is nearly free to produce. Treat the context window like any other request you would trace.
Monitor budget utilization and drop rates. If you are constantly dropping high-score chunks, your budget is too small or your retrieval is too noisy. If your window is half empty on hard questions, your relevance floor may be too high. Both are dashboards, not guesses.
Version your assembly logic. Changing the ranker or the compressor changes every answer the system gives. Treat assembly changes like schema migrations: version them, and be able to replay old questions against a new pipeline to catch regressions before users do.
Test against qualifier-heavy questions. The questions that break context pipelines are the precise ones, where a single dropped clause flips the answer. Keep a suite of these and run it on every pipeline change.
Exploit the cache by ordering for stability. Most providers cache a common prefix of the input, so the layout of your window has a direct cost consequence. Put the stable material first, the system rules and long-lived reference docs that rarely change between requests, and the volatile material last, the retrieved chunks and the user turn. A pipeline that reshuffles its whole window on every request throws away the cache and pays full price each time; one that keeps a stable prefix can see large reductions in cost and latency on repeat traffic. This is a place where the context-as-infrastructure framing pays off directly: the same provenance log that tells you what went into the window also tells you how much of it was cacheable, which turns a vague sense that the LLM bill is high into a specific diagnosis: prefix stability is low, and here is the chunk that keeps invalidating it.
Conclusion
The prompt was never the real artifact. The pipeline that assembles what the model sees is, and in 2026 building that pipeline well is the skill that separates reliable LLM systems from flaky ones. Context engineering is infrastructure work: selection, ordering, compression, and provenance, each a stage you can build, test, and monitor.
Start with a budget and a provenance log, because together they make every assembly decision explicit and auditable. Add deduplication and a relevance floor to fight dilution. Position your strongest material where the model attends most. Compress carefully, and never compress away the clause the answer depends on. Do that, and the next time an answer goes stale you will find the cause in a log line instead of losing a day to it, which is exactly the trade I wish I had made before that refund bug.
Working code for the full assembler, the deduper, and a provenance-logging harness lives in the companion repo: github.com/amtocbot-droid/amtocbot-examples/tree/main/262-context-engineering.
Get the next one
I send a weekly engineering note with one production failure, the debug trail, and the code or checklist that came out of it. No spam, unsubscribe anytime.
Reader challenge: inspect one LLM request path in your own system and write down which chunks entered the context window, which chunks were dropped, and why. Reply to the email or comment with the failure mode you found.
Revision History
| Date | Summary | Old Version |
|---|---|---|
| 2026-06-07 | Added the newsletter signup and reader-challenge block so this recent context-engineering post feeds the owned audience funnel. | View previous version |
Sources
- AI Context Engineering in 2026: Why Prompt Engineering Is No Longer Enough — Sombra
- Context Engineering Best Practices for AI-Powered Dev Teams 2026 — Packmind
- Context Engineering: Agent Reliability Playbook 2026 — Digital Applied
- 5 Key Trends Shaping Agentic Development in 2026 — The New Stack
- Context Engineering: Complete 2026 Field Guide — Taskade
About the Author
Toc Am
Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.
Published: 2026-06-03 · Updated: 2026-06-07 · Written with AI assistance, reviewed by Toc Am.
Get These In Your Inbox
Weekly deep-dives on AI engineering, no fluff. Join the newsletter →
Or grab the book ($39, ~100 pages) · Buy me a coffee
☕ Buy Me a Coffee · 🔔 YouTube · 💼 LinkedIn · 🐦 X/Twitter
No comments:
Post a Comment