
I have learned to distrust replay systems that only answer whether the old run can be reproduced. Reproduction is the starting point. The harder production question is whether an old admission decision still deserves to participate in a new tool contract after policy, evidence, and runtime boundaries have moved. Blog 253 built the archive receipt that keeps registry metadata, artifact binding, verifier evidence, policy decision, and retention receipt together. Blog 254 is the next move: run that receipt through a replay rubric without rewriting the old decision.
The incident pattern is ordinary. An MCP server was admitted in a prior cycle. Its namespace authentication looked good. Its package digest was bound. Its attestation satisfied the policy the platform used at the time. Months later the policy changes because the tool gains access to a more sensitive system, a registry adapter changes its metadata shape, or the platform adopts a stricter provenance requirement. The archive receipt can still prove what happened then. It cannot answer by itself what should happen now.
That is the reason for a receipt-bound replay-rubric run. The run reads the old receipt as historical evidence and produces a new review result. It does not mutate the old receipt. It does not pretend a new policy was active at the old timestamp. It asks a narrower question: given the archived evidence packet and the current review policy, should the tool contract continue, be re-admitted with stronger evidence, be quarantined, or be retired?
This post continues the MCP server supply-chain integrity thread from blogs 249 through 253. Blog 249 opened signed-manifest discipline. Blog 250 added acknowledgement. Blog 251 added retention. Blog 252 added verification. Blog 253 added an archival spanning set. Blog 254 adds the replay run that turns the archive into a living governance surface.
Why Replay Is Not Update
The first design rule is that replay is not update. An update edits a current object. A replay reads an old object and emits a new result. If a platform lets a replay process patch the old admission decision in place, it loses the ability to explain what was true at the time of admission. That loss is subtle until an incident review asks why a tool was allowed during the old window and the only remaining row reflects a later policy.
The clean split is historical receipt plus review result. The receipt keeps its original registry snapshot, artifact digest, evidence bundle, policy digest, decision, and retention class. The review result has its own timestamp, replay policy digest, recheck findings, and disposition. The receipt is evidence. The review result is judgment.

```mermaid
flowchart LR
    O[Original archive receipt] --> R[Receipt-bound replay-rubric run]
    P[Current replay policy digest] --> R
    E[Evidence recheck result] --> R
    C[Tool contract impact surface] --> R
    R --> D{Replay disposition}
    D -- continue --> OK[Keep current contract]
    D -- re-admit --> RA[Require stronger evidence]
    D -- quarantine --> Q[Block pending review]
    D -- retire --> X[Retire contract]
```
The distinction also keeps auditors honest. A stronger future policy is allowed to say an old admission would not pass now. It should not say the old admission did not pass then. The platform learns by comparing policies, not by laundering history.
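The receipt-versus-result split can be sketched in a few lines. This is a minimal illustration, not the archive format from blog 253: the class names, field names, and placeholder digests are assumptions. The point it shows is that a replay reads a frozen receipt and returns a separate record, and that an in-place edit of the old decision is refused.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArchiveReceipt:
    """Historical evidence: written once at admission, never edited."""
    receipt_digest: str
    policy_digest: str   # policy that was active at admission time
    decision: str        # hypothetical value, e.g. "admitted"
    admitted_at: str


@dataclass(frozen=True)
class ReviewResult:
    """Judgment: a new record that only references the receipt."""
    receipt_digest: str
    replay_policy_digest: str
    disposition: str
    reviewed_at: str


def replay(receipt: ArchiveReceipt, replay_policy_digest: str, disposition: str) -> ReviewResult:
    # Read the old object, emit a new one. Nothing on `receipt` is assigned.
    return ReviewResult(
        receipt_digest=receipt.receipt_digest,
        replay_policy_digest=replay_policy_digest,
        disposition=disposition,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )


# Placeholder digests and timestamp, for illustration only.
receipt = ArchiveReceipt("sha256:aaa", "sha256:policy-v1", "admitted", "2025-11-03T10:00:00+00:00")
result = replay(receipt, "sha256:policy-v2", "re_admit")

try:
    receipt.decision = "rejected"  # an in-place "update" is refused
except FrozenInstanceError:
    print("receipt is immutable; decision is still", receipt.decision)
```

A frozen dataclass only protects the in-memory object. A real archive would need the same guarantee from its storage layer.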
The Four Inputs of a Receipt-Bound Replay Run
The replay run has four inputs. The first is the original archive receipt from blog 253. The receipt gives the replay worker stable joins into registry snapshot, artifact binding, evidence bundle, and policy decision. The second input is the current replay policy digest. This can differ from the original admission policy. The digest makes that difference explicit.
The third input is the evidence recheck result. A recheck may verify that an old artifact digest still has retrievable evidence, that a signature identity remains acceptable under current rules, or that an attestation predicate now fails a stricter gate. The recheck result should distinguish unavailable evidence from evidence that is available but no longer acceptable. Those states have different operational responses.
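The unavailable-versus-unacceptable distinction is easy to lose if the recheck returns a single boolean. A hedged sketch of a three-state classifier follows; `RecheckState` and `classify_recheck` are hypothetical names, and the verifier verdict is assumed to arrive as a plain boolean from whatever verifier the platform already runs.

```python
from enum import Enum
from typing import Optional


class RecheckState(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    UNAVAILABLE = "unavailable"


def classify_recheck(evidence: Optional[bytes], accepted: Optional[bool]) -> RecheckState:
    """Map a recheck outcome onto three states.

    `evidence` is None when the archived bundle could not be retrieved.
    `accepted` is the verifier's verdict under the current policy; it is
    only meaningful when evidence was actually retrieved.
    """
    if evidence is None:
        # Nothing to judge: the response is to restore or replace evidence.
        return RecheckState.UNAVAILABLE
    if accepted:
        return RecheckState.PASS
    # Evidence was judged and rejected: the response is remediation.
    return RecheckState.FAIL
```

Checking availability first means a stale verdict can never mask a missing bundle.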
The fourth input is the tool contract impact surface. A low-risk tool that reads documentation and a privileged connector that mutates production records do not need the same replay disposition. The replay run should read the current contract capability class, not only the historical artifact evidence. A tool can become higher risk because its contract changed even if its package evidence did not.
```mermaid
flowchart TB
    A[Archive receipt] --> H[Replay input envelope]
    B[Current policy digest] --> H
    C[Evidence recheck] --> H
    D[Tool contract impact] --> H
    H --> V[Replay evaluator]
    V --> O[Review result with reason codes]
```
This envelope is small enough to test. It is also explicit enough that a replay job can run in batch without asking a model to infer which parts mattered.
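One way to make the envelope testable is to give it a deterministic digest, so two batch runs over the same inputs can be compared. The sketch below is illustrative: the field names, the `build_envelope` helper, and the choice of canonical JSON plus SHA-256 are assumptions rather than part of any existing receipt format.

```python
import hashlib
import json


def build_envelope(
    receipt_digest: str,
    current_policy_digest: str,
    recheck_state: str,
    contract_impact: str,
) -> dict:
    """Assemble the four replay inputs and attach a deterministic digest."""
    envelope = {
        "receipt_digest": receipt_digest,
        "current_policy_digest": current_policy_digest,
        "recheck_state": recheck_state,
        "contract_impact": contract_impact,
    }
    # Sorted keys and fixed separators keep the byte form stable across runs.
    canonical = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = "sha256:" + hashlib.sha256(canonical).hexdigest()
    return {**envelope, "envelope_digest": digest}


# Placeholder digests, for illustration only.
low = build_envelope("sha256:r1", "sha256:p2", "pass", "low")
privileged = build_envelope("sha256:r1", "sha256:p2", "pass", "privileged")
print(low["envelope_digest"] != privileged["envelope_digest"])  # True
```

Because the contract impact is part of the hashed content, a contract change produces a different envelope even when the receipt and policy are unchanged.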
A Minimal Replay Evaluator
The evaluator below is intentionally small. It does not replace package verification. It composes verifier outputs and contract state into a review disposition. That boundary matters because a replay evaluator should not secretly become a second verifier with weaker parsing.
```python
from dataclasses import dataclass
from typing import Literal

Disposition = Literal["continue", "re_admit", "quarantine", "retire"]


@dataclass(frozen=True)
class ReplayInput:
    original_receipt_digest: str
    original_policy_digest: str
    current_policy_digest: str
    evidence_recheck: str
    contract_impact: str
    evidence_available: bool


def replay_disposition(item: ReplayInput) -> tuple[Disposition, list[str]]:
    reasons: list[str] = []

    # Missing evidence short-circuits: nothing else can be judged.
    if not item.evidence_available:
        return "quarantine", ["receipt_evidence_unavailable"]

    # Evidence exists but fails the current policy.
    if item.evidence_recheck == "fail":
        reasons.append("current_policy_evidence_fail")
        if item.contract_impact == "privileged":
            return "retire", reasons + ["privileged_contract"]
        return "re_admit", reasons

    # Evidence passes, but the policy has changed since admission.
    if item.original_policy_digest != item.current_policy_digest:
        reasons.append("policy_drift_detected")
        if item.contract_impact == "privileged":
            return "re_admit", reasons + ["privileged_contract_requires_fresh_receipt"]
        return "continue", reasons

    return "continue", ["receipt_still_within_policy"]
```
The important part is not the exact disposition table. It is the shape. Evidence unavailability blocks the shortcut. Evidence failure under current policy produces a remediation decision. Policy drift can produce either continue or re-admit depending on contract impact. The original receipt remains intact in every branch.
Here is a tiny output fixture from the same evaluator:
```text
$ python3 replay_rubric_demo.py
case=doc-helper disposition=continue reasons=['policy_drift_detected']
case=prod-write-tool disposition=re_admit reasons=['policy_drift_detected', 'privileged_contract_requires_fresh_receipt']
case=missing-evidence disposition=quarantine reasons=['receipt_evidence_unavailable']
```
The output shows why a single "stale" flag is not enough. The same policy drift can be acceptable for one contract and unacceptable for another.
The Gotcha: Rechecking the Registry Instead of the Receipt
The bug I hit while testing this shape was a classic optimistic shortcut. My first replay worker re-fetched registry metadata and compared it with the current policy. That seemed useful. It was also the wrong primary read. The replay question was not whether the current registry listing looked healthy. The replay question was whether the original receipt-bound evidence packet could support a current review disposition.
The failure appeared when a registry listing had been cleaned up after the original admission. The current metadata looked better than the old metadata because documentation fields had improved. My worker almost emitted a clean continue result. The archived receipt still pointed at an older package digest whose attestation did not satisfy the new policy. I had let a current discovery read shadow the historical artifact binding.
The fix was to make current registry discovery optional context and receipt-bound recheck the primary path. If current discovery disagrees with the old receipt, the run records drift. It does not substitute the new record for the old evidence. That one rule keeps replay from becoming a silent re-admission pipeline.
```mermaid
sequenceDiagram
    participant J as Replay job
    participant A as Archive receipt store
    participant V as Verifier
    participant R as Registry
    participant C as Contract registry
    J->>A: Load original receipt-bound evidence
    J->>V: Recheck artifact and attestation under current policy
    J->>C: Read current tool contract impact
    J->>R: Optional current metadata context
    R-->>J: Metadata drift note
    J-->>A: Append review result, do not mutate receipt
```
That ordering feels fussy until the first time it saves a review from a false clean result.
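The rule "registry is context, receipt is primary" can be enforced in the function signature. The sketch below is hypothetical: `ReceiptBinding`, `RegistryListing`, and the note codes are illustrative names. It shows a selector that always returns the receipt-bound digest and lets the current listing contribute only drift notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReceiptBinding:
    package_digest: str  # digest bound at original admission


@dataclass(frozen=True)
class RegistryListing:
    package_digest: str  # digest the registry advertises today


def select_recheck_target(
    binding: ReceiptBinding,
    current_listing: Optional[RegistryListing] = None,
) -> tuple[str, list[str]]:
    """Return the digest to recheck plus any drift notes.

    The digest always comes from the receipt. The registry listing can
    only add notes; it can never replace the receipt-bound digest.
    """
    notes: list[str] = []
    if current_listing is None:
        notes.append("registry_context_skipped")
    elif current_listing.package_digest != binding.package_digest:
        notes.append("registry_digest_drift")
    return binding.package_digest, notes


digest, notes = select_recheck_target(
    ReceiptBinding("sha256:old"), RegistryListing("sha256:new")
)
print(digest, notes)  # sha256:old ['registry_digest_drift']
```

Making the listing an optional argument also documents the failure mode: the replay still runs when the registry is unreachable.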
Decision Rubric
I use four replay dispositions.
| Disposition | Meaning | Typical action |
|---|---|---|
| Continue | Receipt-bound evidence still supports the current contract | Keep contract and append review result |
| Re-admit | Evidence is present, but current policy or contract class needs a fresh admission | Require new receipt before privileged use |
| Quarantine | Evidence is unavailable or incomplete for review | Block new invocations until evidence is restored or replaced |
| Retire | Evidence fails current policy for a contract class that cannot safely continue | Remove or replace the tool contract |
The difference between re-admit and quarantine is operationally important. Re-admit means the platform has enough old evidence to make a bounded transition decision, but wants a fresh current receipt. Quarantine means the platform cannot support the review question from retained evidence. Retire means the current policy and contract impact make continued use indefensible.

The disposition table should produce reason codes, not just labels. A reason code lets platform teams report why re-admission is increasing: policy drift, missing evidence, contract impact changes, or actual verifier failure. Without reason codes, the replay program turns into another dashboard with a red count and no repair path.
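A reason-code report can be as simple as a counter per disposition. The following is a small sketch under the assumption that review results are available as `(disposition, reason_codes)` pairs; `reason_code_report` is a hypothetical helper, and the sample rows reuse the reason codes from the evaluator.

```python
from collections import Counter
from typing import Iterable


def reason_code_report(results: Iterable[tuple[str, list[str]]]) -> dict[str, Counter]:
    """Count reason codes separately for each disposition."""
    report: dict[str, Counter] = {}
    for disposition, reasons in results:
        report.setdefault(disposition, Counter()).update(reasons)
    return report


# Illustrative review results, not real data.
results = [
    ("continue", ["policy_drift_detected"]),
    ("re_admit", ["policy_drift_detected", "privileged_contract_requires_fresh_receipt"]),
    ("re_admit", ["current_policy_evidence_fail"]),
    ("quarantine", ["receipt_evidence_unavailable"]),
]

report = reason_code_report(results)
for disposition, counts in sorted(report.items()):
    print(disposition, counts.most_common())
```

Grouping by disposition first keeps policy drift that led to `continue` separate from policy drift that forced re-admission.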
Storage Schema for Review Results
The review result deserves its own schema rather than a note appended to the original receipt. I usually model it as a small append-only record with five groups. The first group identifies the original receipt. The second identifies the replay policy. The third records the evidence recheck summary. The fourth records the tool contract impact class at review time. The fifth records disposition and reason codes.
That schema keeps the review result from becoming a second archive. The full original evidence remains in the archive receipt bundle. The review result only needs stable references and the replay outcome. A compact review record is easier to query, easier to retain for a longer policy-history window, and less likely to expose sensitive verifier logs to every dashboard reader.
```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ReplayReviewResult:
    receipt_digest: str
    replay_policy_digest: str
    contract_digest: str
    contract_impact: Literal["low", "standard", "privileged"]
    evidence_recheck_digest: str
    evidence_recheck_state: Literal["pass", "fail", "unavailable"]
    disposition: Literal["continue", "re_admit", "quarantine", "retire"]
    reason_codes: tuple[str, ...]
```
The contract_digest is as important as the receipt digest. A tool that stayed byte-identical can still become riskier when the application layer routes it into a broader contract. A replay result that only names the original package evidence will miss that risk expansion. The contract digest gives the review result a current application-layer anchor.
I also keep the evidence recheck digest separate from the original evidence bundle digest. That prevents a reader from confusing original evidence with current recheck evidence. The original digest says what admission used. The recheck digest says what the replay worker observed under the current verifier and current policy. If those values diverge, the review result can explain the divergence without pretending one digest replaced the other.
Here is the operational shape I want from a query:
```text
receipt=sha256:3e0c...6920
contract=sha256:bb94...11af
replay_policy=sha256:7aa1...04c2
evidence_recheck=pass
disposition=re_admit
reasons=policy_drift_detected,privileged_contract_requires_fresh_receipt
```
That output is short enough for an incident ticket and precise enough for an engineer to open the right receipt, policy, and contract.
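An append-only store plus a small formatter is enough to produce that shape. The sketch below is an assumption-heavy illustration: `ReviewLog` is an in-memory stand-in for whatever table or log the platform uses, `ReviewRecord` is a reduced version of the schema above, and the digests are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReviewRecord:
    # Stable references only; full evidence stays in the archive bundle.
    receipt_digest: str
    contract_digest: str
    replay_policy_digest: str
    evidence_recheck_state: str
    disposition: str
    reason_codes: tuple[str, ...]


@dataclass
class ReviewLog:
    """In-memory stand-in for an append-only table (hypothetical)."""
    _records: list[ReviewRecord] = field(default_factory=list)

    def append(self, record: ReviewRecord) -> None:
        self._records.append(record)  # no update or delete method exists

    def latest_for(self, receipt_digest: str) -> Optional[ReviewRecord]:
        for record in reversed(self._records):
            if record.receipt_digest == receipt_digest:
                return record
        return None


def format_record(record: ReviewRecord) -> str:
    return "\n".join([
        f"receipt={record.receipt_digest}",
        f"contract={record.contract_digest}",
        f"replay_policy={record.replay_policy_digest}",
        f"evidence_recheck={record.evidence_recheck_state}",
        f"disposition={record.disposition}",
        "reasons=" + ",".join(record.reason_codes),
    ])


log = ReviewLog()
log.append(ReviewRecord("sha256:receipt-a", "sha256:contract-a", "sha256:policy-v1",
                        "pass", "continue", ("receipt_still_within_policy",)))
log.append(ReviewRecord("sha256:receipt-a", "sha256:contract-b", "sha256:policy-v2",
                        "pass", "re_admit",
                        ("policy_drift_detected", "privileged_contract_requires_fresh_receipt")))

latest = log.latest_for("sha256:receipt-a")
if latest is not None:
    print(format_record(latest))
```

Both reviews of the same receipt stay in the log, so the earlier `continue` remains queryable after the later `re_admit`.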
Failure Modes Worth Testing
The first test case is missing historical evidence. Delete or hide one original evidence bundle from a fixture archive and verify that replay emits quarantine, not continue. The goal is to prove that the replay worker does not replace missing archive fields with current registry data just because current registry data is available.
The second test case is policy drift without evidence failure. Change the replay policy digest while leaving the recheck state at pass. A low-impact contract can continue with a reason code that records policy drift. A privileged contract should require re-admission. This test catches evaluators that treat policy drift as either harmless everywhere or fatal everywhere.
The third test case is evidence failure with a low-impact contract. That should usually produce re-admit, not immediate retire. The platform has evidence that the old receipt no longer satisfies the current policy, but the blast radius may allow a controlled migration path. For a privileged contract, the same evidence failure should produce retire or quarantine, depending on policy. The contract impact class keeps the response proportional.
The fourth test case is current registry improvement. Improve the registry metadata after the original receipt, then replay the old receipt. The review can note current metadata improvement, but the primary disposition should still be driven by the receipt-bound artifact and evidence recheck. This is the regression test for the bug from the gotcha section.
The fifth test case is contract expansion. Keep the receipt and evidence recheck unchanged, but move the tool contract from low-impact read-only use to privileged write access. A replay result should change because the application layer changed. That test proves the replay rubric is not only a supply-chain verifier. It is a federation rule that composes supply-chain evidence with current tool-contract impact.
These tests sound repetitive, and that is exactly why they belong in a replay suite. Supply-chain replay bugs rarely announce themselves with novel syntax errors. They show up when one join is accidentally treated as optional.
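Most of these cases translate directly into assertions against the evaluator. The sketch below restates the minimal evaluator in compact form so the block runs on its own; the fixture digests are placeholders. The fourth case (registry improvement) is left out because it exercises the worker's read path rather than the evaluator.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReplayInput:
    original_receipt_digest: str
    original_policy_digest: str
    current_policy_digest: str
    evidence_recheck: str
    contract_impact: str
    evidence_available: bool


def replay_disposition(item: ReplayInput) -> tuple[str, list[str]]:
    # Restated from the minimal evaluator so this block is self-contained.
    if not item.evidence_available:
        return "quarantine", ["receipt_evidence_unavailable"]
    if item.evidence_recheck == "fail":
        if item.contract_impact == "privileged":
            return "retire", ["current_policy_evidence_fail", "privileged_contract"]
        return "re_admit", ["current_policy_evidence_fail"]
    if item.original_policy_digest != item.current_policy_digest:
        if item.contract_impact == "privileged":
            return "re_admit", ["policy_drift_detected",
                                "privileged_contract_requires_fresh_receipt"]
        return "continue", ["policy_drift_detected"]
    return "continue", ["receipt_still_within_policy"]


# Baseline: evidence passes, contract is low impact, policy has not drifted.
BASE = ReplayInput("sha256:r1", "sha256:p1", "sha256:p1", "pass", "low", True)
DRIFTED = replace(BASE, current_policy_digest="sha256:p2")


def test_missing_evidence_quarantines() -> None:
    assert replay_disposition(replace(BASE, evidence_available=False))[0] == "quarantine"


def test_policy_drift_depends_on_impact() -> None:
    assert replay_disposition(DRIFTED)[0] == "continue"
    assert replay_disposition(replace(DRIFTED, contract_impact="privileged"))[0] == "re_admit"


def test_evidence_failure_is_proportional() -> None:
    failed = replace(BASE, evidence_recheck="fail")
    assert replay_disposition(failed)[0] == "re_admit"
    assert replay_disposition(replace(failed, contract_impact="privileged"))[0] == "retire"


def test_contract_expansion_changes_result() -> None:
    before = replay_disposition(DRIFTED)
    after = replay_disposition(replace(DRIFTED, contract_impact="privileged"))
    assert before != after


for case in (
    test_missing_evidence_quarantines,
    test_policy_drift_depends_on_impact,
    test_evidence_failure_is_proportional,
    test_contract_expansion_changes_result,
):
    case()
print("all replay cases passed")
```

With the minimal evaluator as written, contract expansion changes the disposition only when the policy digest has also drifted. With identical policy digests and passing evidence, a privileged contract still returns `continue`, so a production rubric would likely need an explicit contract-change branch.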
How This Fits the Content Waterfall and Metrics Layer
There is also a product-side reason to preserve replay reason codes. A content automation platform eventually needs to explain why a piece of content, a social variant, a video upload helper, or a publishing tool was blocked. If every block becomes a generic "automation failed" status, the metrics loop learns the wrong lesson. The content strategy may blame topic choice when the actual failure was a tool contract that needed re-admission.
For AmtocSoft's own pipeline, the same principle appears in the tracker URL policy. A real post URL is evidence. A profile URL is not. Writing FAILED when publication cannot be verified is more useful than writing a comforting placeholder. The MCP replay rubric follows the same discipline at a lower layer. If the replay worker cannot prove current admissibility from receipt-bound evidence, it should emit quarantine or re-admit with reason codes, not a reassuring green field.
That status vocabulary feeds future prioritization. If re-admission failures cluster around missing evidence, improve evidence retention. If they cluster around policy drift, schedule publisher outreach or automated re-verification. If they cluster around contract expansion, review who is granting broader tool permissions. The replay rubric gives the learning loop a cause surface instead of a pile of failed jobs.
Operating Cadence
A receipt-bound replay program should have event-driven runs and scheduled runs. Event-driven replay triggers when policy changes, a tool contract changes impact class, a verifier dependency changes behavior, or an incident names a specific receipt. Scheduled replay catches the quieter failures: stale evidence, disappearing artifacts, and contracts whose risk class no longer matches their real use.
I would start scheduled replay in report-only mode. Report-only does not mean toothless. It means the first output is a ranked remediation queue. A platform team can inspect which tools would quarantine, which would require re-admission, and which can continue. Once the reason-code distribution is understood, enforcement can start with privileged contracts.
The cadence should also include a replay-budget guard. Verification can be expensive if every run tries to fetch every external artifact and every attestation at once. A federation can batch by contract impact, last review age, and policy-change relevance. The archive receipt makes that scheduling possible because the replay worker can select candidates by receipt metadata before opening every evidence bundle.
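A budget guard of this kind can be sketched as a sort followed by a greedy fill. Everything here is an assumption for illustration: the `Candidate` fields, the abstract cost units, and the ordering (policy-relevant first, then higher impact, then oldest review).

```python
from dataclasses import dataclass

# Lower rank is scheduled earlier.
IMPACT_RANK = {"privileged": 0, "standard": 1, "low": 2}


@dataclass(frozen=True)
class Candidate:
    receipt_digest: str
    contract_impact: str
    days_since_review: int
    policy_change_relevant: bool
    recheck_cost: int  # hypothetical cost units per evidence recheck


def plan_batch(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Pick candidates from receipt metadata alone, within a cost budget."""
    ordered = sorted(
        candidates,
        key=lambda c: (
            not c.policy_change_relevant,    # relevant receipts first
            IMPACT_RANK[c.contract_impact],  # then higher impact
            -c.days_since_review,            # then oldest review
        ),
    )
    batch: list[Candidate] = []
    spent = 0
    for candidate in ordered:
        if spent + candidate.recheck_cost > budget:
            continue  # does not fit; a cheaper candidate later may still fit
        batch.append(candidate)
        spent += candidate.recheck_cost
    return batch


candidates = [
    Candidate("a", "low", 90, False, 1),
    Candidate("b", "privileged", 10, True, 3),
    Candidate("c", "standard", 40, True, 2),
    Candidate("d", "privileged", 200, False, 3),
]
batch = plan_batch(candidates, budget=6)
print([c.receipt_digest for c in batch])  # ['b', 'c', 'a']
```

Candidate `d` is skipped in this run because it does not fit the remaining budget. A real scheduler would likely carry skipped privileged receipts to the front of the next batch.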
The human review cadence matters too. A weekly report that only lists counts will be ignored. A useful report lists top reason codes, new privileged-contract re-admission candidates, oldest quarantines, and evidence classes that repeatedly go unavailable. That report gives security, platform, and application teams a shared work queue.
Boundaries With Runtime Observability
Runtime observability and receipt-bound replay should cooperate, but neither should impersonate the other. Traces can show that an agent invoked a tool, how long it took, what route it selected, and which application rule emitted the call. The archive receipt shows why the tool was admitted. A replay review result shows whether that admission still composes with current policy and current contract impact.
If those layers collapse, incident reports get muddy. A trace attribute can point to a receipt digest, but a trace should not be treated as the authoritative admission record. A receipt can point to a contract digest, but it should not pretend to know every future runtime route. A replay result can point to both, but it should remain a review result. Keeping those roles separate makes cross-layer debugging easier because each layer can answer its own question.
OpenTelemetry's generative AI semantic conventions are useful for the runtime side of that join. MCP's registry and specification materials are useful for discovery and tool identity. Sigstore, in-toto, and SLSA are useful for evidence and provenance. The receipt-bound replay rubric is the federation layer that composes those inputs into a governed re-admission decision.
Production Rollout
The safest rollout path is to start with read-only review results. Run the replay rubric against existing receipts, append review results, and do not block execution on the first pass. That lets the team measure which tools would be affected before the policy becomes enforcement. The measure should be partitioned by contract impact class and evidence gap type, not just total affected tools.
The second step is enforcement for privileged contracts. Once the team has a stable reason-code distribution, require re-admission or quarantine for tools that can write production data, read secrets, or call expensive external systems. Low-risk tools can remain in report-only mode a little longer while the publisher experience improves.
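The staged enforcement can be expressed as a gate that takes the set of enforced impact classes as a parameter. This is a hedged sketch: the `gate` helper is hypothetical, and treating `re_admit` as blocking until a fresh receipt exists is an assumption drawn from the disposition table.

```python
from typing import AbstractSet

# Dispositions that stop invocation once a contract class is enforced.
BLOCKING = frozenset({"re_admit", "quarantine", "retire"})


def gate(
    disposition: str,
    contract_impact: str,
    enforced_classes: AbstractSet[str] = frozenset({"privileged"}),
) -> tuple[bool, str]:
    """Return (allow_invocation, mode) for one tool contract."""
    if contract_impact not in enforced_classes:
        return True, "report_only"  # record the finding, do not block
    return disposition not in BLOCKING, "enforce"


print(gate("quarantine", "privileged"))  # (False, 'enforce')
print(gate("quarantine", "low"))         # (True, 'report_only')
```

Widening enforcement then becomes a configuration change, for example passing `{"privileged", "standard"}`, rather than a code change.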
The third step is recurring replay. Event-driven runs should fire when policy changes, when a contract changes impact class, or when an evidence-retention class changes. Scheduled runs should follow a normal review cadence. A scheduled run without a policy change is still useful because it catches evidence availability failures before an incident asks for the same receipt under stress.
The final step is linking replay results back into the application-execution layer. A task failure that depends on a quarantined MCP tool should not be summarized as a generic agent failure. It should point to the replay review result that blocked the tool. That join gives the application layer a precise cause and gives the federation layer a reason to improve evidence retention.
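That join can be carried by a dedicated exception type, so a blocked tool never collapses into a generic failure. The names below (`ReviewRef`, `ToolBlockedByReplay`, `invoke_tool`, `summarize_failure`) and the review identifier are hypothetical; the sketch only shows the shape of the link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRef:
    review_id: str
    disposition: str
    reason_codes: tuple[str, ...]


class ToolBlockedByReplay(Exception):
    """Raised instead of a generic failure when a replay result blocks a tool."""

    def __init__(self, tool_name: str, review: ReviewRef) -> None:
        super().__init__(f"{tool_name} blocked by replay review {review.review_id}")
        self.tool_name = tool_name
        self.review = review


def invoke_tool(tool_name: str, review: ReviewRef) -> str:
    if review.disposition in {"quarantine", "retire"}:
        raise ToolBlockedByReplay(tool_name, review)
    return f"invoked {tool_name}"


def summarize_failure(exc: Exception) -> dict:
    if isinstance(exc, ToolBlockedByReplay):
        # Precise cause: point at the review result that blocked the tool.
        return {
            "cause": "replay_review_block",
            "tool": exc.tool_name,
            "review_id": exc.review.review_id,
            "disposition": exc.review.disposition,
            "reason_codes": list(exc.review.reason_codes),
        }
    return {"cause": "agent_failure", "detail": str(exc)}


try:
    invoke_tool(
        "prod-write-tool",
        ReviewRef("review-0042", "quarantine", ("receipt_evidence_unavailable",)),
    )
except ToolBlockedByReplay as exc:
    summary = summarize_failure(exc)
    print(summary)
```

The summary carries the review identifier and reason codes, which is what lets a task-level report distinguish a governance block from a model or routing error.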
Conclusion
The archive receipt from blog 253 gives a federation memory. The replay-rubric run in blog 254 gives that memory a review discipline. It reads the old receipt, current policy, evidence recheck, and tool contract impact together. It emits a new result without mutating the old decision.
That separation is the difference between learning and rewriting. A federation can say the old admission passed under the old policy and also say the current contract now requires re-admission. Both statements can be true. The replay rubric exists so the platform can keep both truths visible while agents keep discovering and using tools at production speed.
Sources
- Model Context Protocol, "The MCP Registry," https://modelcontextprotocol.io/registry/about
- Model Context Protocol, "How to Authenticate When Publishing to the Official MCP Registry," https://modelcontextprotocol.io/registry/authentication
- Sigstore, "Verifying Signatures," https://docs.sigstore.dev/cosign/verifying/verify/
- in-toto, "Specifications," https://in-toto.io/docs/specs/
- SLSA, "SLSA Specification v1.2," https://slsa.dev/spec/latest/
- OpenTelemetry, "Semantic conventions for generative AI," https://opentelemetry.io/docs/specs/semconv/gen-ai/
About the Author
Toc Am
Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.
Published: 2026-05-22 · Written with AI assistance, reviewed by Toc Am.