Sunday, May 31, 2026

Guardrails-First: Making AI Agents Reliable at 3am

A pager going off next to a terminal showing an AI agent stuck in a retry loop

Introduction

At 3:14am on a Tuesday I got paged because our deployment agent had spent forty minutes "fixing" a failing migration. It had not fixed anything. It had run the same ALTER TABLE eleven times, each time getting the same lock-timeout error, each time deciding the right move was to try again with a slightly reworded SQL comment. The model was not broken. Every single step it took was locally reasonable. The system around it had no concept of "you have already tried this and it did not work," so it cheerfully kept going, convinced that attempt eleven would be different. We measured roughly 80,000 wasted tokens on a task that never moved an inch.

I sat there watching the log scroll and felt something flip in how I think about agents. I had spent weeks tuning the prompt. I had A/B tested system messages. I had picked the strongest model we could afford. None of it mattered, because the failure had nothing to do with the model's reasoning. The failure was that the loop around the model had no brakes. That is the realization this whole post is built on.

That night taught me the thing this post is about: a model that scores 87% on SWE-Bench Verified (Datadog State of AI Engineering, 2026) is not the same as an agent you can trust to run unattended for an hour. The gap between "works in the notebook" and "works reliably at 3am under load" has become the defining engineering problem of 2026 (The AI Agent Reliability Gap, DEV, 2026). Closing it is not about a smarter model. It is about the scaffolding you wrap around the model: the guardrails that decide what the agent is allowed to do, when it must stop, and how it recovers when a step fails.

This is a guardrails-first playbook. We will build up the patterns that turned our flaky overnight agents into ones I can actually sleep through.

The Problem: Local Reasonableness, Global Chaos

An LLM agent is a loop. It observes state, picks an action, executes it, observes the result, and repeats until it decides the task is done. Each iteration the model sees a prompt and emits the next step. The trouble is that the model optimizes one step at a time. It has no built-in memory of the trajectory unless you give it one, and no built-in sense of a budget unless you enforce one.

That produces three failure modes I see over and over in production logs:

  1. The retry spiral. A step fails for a reason the model cannot fix (a lock, a permission, a rate limit). The model retries, because retrying is usually a reasonable thing to do. Without a circuit breaker, "usually reasonable" becomes an infinite loop.

  2. Silent drift. The agent slowly wanders off the task. It was asked to update one config value, and forty steps later it is refactoring an unrelated module because each small step seemed like an improvement. Roughly two-thirds of production agent failures are this quiet kind, not loud crashes (New Stack, Agentic Development Trends 2026).

  3. Unbounded blast radius. The agent has a tool that can delete files or call an API, and nothing constrains which files or which API calls. One hallucinated argument and you are restoring from backups.

The common thread: none of these are model intelligence problems. They are systems problems. A guardrails-first design treats the model as a powerful but unreliable component and builds the reliability in the layer you control.

System diagram showing the model wrapped by budget, validation, and recovery layers

How It Works: The Guardrail Layers

Think of guardrails as concentric layers around the model call. The model proposes; the guardrails dispose. Here is the loop with the four layers that matter most.

flowchart TD
    A[Observe state] --> B{Budget check}
    B -->|exceeded| Z[Halt + escalate]
    B -->|ok| C[Model proposes action]
    C --> D{Validate action}
    D -->|invalid| E[Reject, feed error back]
    E --> C
    D -->|valid| F[Execute in sandbox]
    F --> G{Result ok?}
    G -->|yes| H{Task done?}
    G -->|no| I{Seen this failure before?}
    I -->|yes, twice| Z
    I -->|no| A
    H -->|no| A
    H -->|yes| Y[Return result]

The first layer is the budget. Every agent run gets a hard ceiling on iterations, tokens, wall-clock time, and money. This is the single highest-value guardrail, because it converts every other failure mode from "infinite" to "bounded." My 3am incident would have been a 6-minute annoyance instead of a 40-minute one if a budget had been in place.

from dataclasses import dataclass, field
import time

@dataclass
class Budget:
    max_steps: int = 25
    max_tokens: int = 200_000
    max_seconds: float = 300.0
    max_usd: float = 1.50
    started_at: float = field(default_factory=time.monotonic)
    steps: int = 0
    tokens: int = 0
    usd: float = 0.0

    def charge(self, tokens: int, usd: float) -> None:
        self.steps += 1
        self.tokens += tokens
        self.usd += usd

    def exceeded(self) -> str | None:
        if self.steps >= self.max_steps:
            return f"step limit {self.max_steps} reached"
        if self.tokens >= self.max_tokens:
            return f"token limit {self.max_tokens} reached"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return f"time limit {self.max_seconds}s reached"
        if self.usd >= self.max_usd:
            return f"cost limit ${self.max_usd} reached"
        return None

When the budget trips, the agent does not silently die. It escalates: it writes a structured handoff (what it was doing, what it tried, why it stopped) and pages a human or falls back to a safe default. Every guardrail halts through that same path. Here is what that escalation looks like in our logs when it works, in this case with the repeat-failure breaker tripping well before the step limit:

$ tail -f agent.log
[15:11:02] step=11 action=run_sql tokens=78201 usd=0.59
[15:11:02] GUARDRAIL halt: step limit 25 reached? no | repeat-failure: run_sql x3 identical error
[15:11:02] circuit_breaker tripped on signature 9f2c: 'lock timeout on ALTER TABLE orders'
[15:11:02] escalating: wrote handoff to /var/run/agent/handoff-9f2c.json, paged #oncall
[15:11:02] run halted cleanly after 11 steps, 0 destructive actions taken

The second layer is action validation. Before any tool runs, the proposed call is checked against a schema and a policy. Wrong shape, disallowed tool, argument outside the allowlist: rejected, with the reason fed back to the model so it can correct. Critically, a rejected action does not count as progress, and three rejections of the same kind trip the breaker. The remaining two layers, the repeat-failure circuit breaker and the typed recovery policy, are built out in the sections that follow.

Implementation Guide: Building the Guardrails

Let us assemble the pieces into something you can actually run. The first real guardrail beyond the budget is the circuit breaker on repeated failure. This is what would have saved me at 3am. The idea: hash the (action, error) pair into a signature, and if the same signature recurs, stop. Repeating an action that already failed identically is the clearest signal an agent is stuck.

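import hashlib

class RepeatFailureBreaker:
    """Trips when the same (action, error) pair fails more than `threshold` times."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts: dict[str, int] = {}   # signature -> failures seen

    def signature(self, action: str, error: str) -> str:
        raw = f"{action}|{error}".encode()
        # Four hex characters keep log lines short; use a longer prefix
        # if collisions between unrelated failures become a concern.
        return hashlib.sha256(raw).hexdigest()[:4]

    def record(self, action: str, error: str) -> bool:
        """Returns True if the breaker should trip."""
        sig = self.signature(action, error)
        self.counts[sig] = self.counts.get(sig, 0) + 1
        # With threshold=2, the third identical failure trips the breaker.
        return self.counts[sig] > self.threshold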

Notice the breaker keys on the error, not just the action. An agent legitimately calls run_sql many times in one task. What it must never do is call run_sql and get the identical lock-timeout three times. Keying on the pair lets normal work proceed while catching the spiral.

The second piece is the action validator with an allowlist. Never give an agent a raw shell tool in production. Give it narrow, typed tools whose arguments you can validate.

import re
from typing import Callable

ALLOWED_TABLES = {"orders", "customers", "line_items"}

def validate_run_sql(args: dict) -> str | None:
    """Return a rejection reason, or None if the SQL passes policy."""
    sql = str(args.get("sql") or "").strip().lower()
    # Match whole identifiers so "orders_archive" cannot pass as "orders"
    # and "somewhere" cannot pass as "where". Still a heuristic, not a parser.
    words = set(re.findall(r"\b[a-z_][a-z0-9_]*\b", sql))
    if not sql.startswith(("select", "update", "insert")):
        return "only SELECT/UPDATE/INSERT permitted, no DDL or DROP"
    if not words & ALLOWED_TABLES:
        return f"query must target an allowed table: {sorted(ALLOWED_TABLES)}"
    if sql.startswith("update") and "where" not in words:
        return "UPDATE without WHERE clause is blocked"
    return None

VALIDATORS: dict[str, Callable[[dict], str | None]] = {
    "run_sql": validate_run_sql,
}

def validate(tool: str, args: dict) -> str | None:
    # Unknown tools are rejected outright: deny by default.
    if tool not in VALIDATORS:
        return f"tool '{tool}' is not on the allowlist"
    return VALIDATORS[tool](args)

That UPDATE without WHERE check is not hypothetical. The first week we ran an unattended data-cleanup agent, it proposed exactly that, an UPDATE orders SET status = 'archived' with no WHERE clause, which would have archived every order in the table. The validator caught it, fed back the error, and the model corrected to a scoped query on its next step. No drama, because the guardrail did its job before the tool ran, not after.

Now the agent loop that ties budget, validation, and the breaker together:

def run_agent(task: str, propose, execute, budget: Budget) -> dict:
    # escalate() is assumed to write the handoff and page a human;
    # it is not defined in this snippet.
    breaker = RepeatFailureBreaker(threshold=2)
    history: list[dict] = []

    while True:
        # Budget first: checked before every model call.
        halt = budget.exceeded()
        if halt:
            return escalate(task, history, reason=halt)

        step = propose(task, history)          # model call
        budget.charge(step["tokens"], step["usd"])

        # Validate before anything executes. A rejection still consumes
        # budget but is fed back to the model instead of being run.
        err = validate(step["tool"], step["args"])
        if err:
            history.append({"rejected": step, "error": err})
            if breaker.record(step["tool"], err):
                return escalate(task, history, reason=f"repeated invalid: {err}")
            continue

        result = execute(step["tool"], step["args"])
        if not result["ok"]:
            # Failed execution: record the (tool, error) pair so an
            # identical repeat trips the breaker.
            history.append({"action": step, "error": result["error"]})
            if breaker.record(step["tool"], result["error"]):
                return escalate(task, history, reason=f"repeated failure: {result['error']}")
            continue

        history.append({"action": step, "result": result})
        if result.get("task_done"):
            return {"status": "done", "steps": budget.steps, "history": history}

Run it against the 3am scenario and the behavior is now bounded:

$ python run_migration_agent.py --task "apply pending migration"
step 1  run_sql        ok      (begin)
step 2  run_sql        FAIL    lock timeout on ALTER TABLE orders
step 3  run_sql        FAIL    lock timeout on ALTER TABLE orders
step 4  run_sql        FAIL    lock timeout on ALTER TABLE orders
breaker tripped: signature 9f2c seen 3x
ESCALATE: apply pending migration -> paged oncall after 4 steps (12s, $0.09)

Four steps and twelve seconds instead of eleven steps and forty minutes. Same model, same prompt. The only thing that changed is the scaffolding decided when to quit.

A Gotcha: When the Guardrail Fights the Model

The first version of the circuit breaker I shipped was too aggressive, and it broke a working agent in a way that took me an embarrassing afternoon to diagnose. I had keyed the breaker on the action name alone, not the (action, error) pair. The logic was that if the agent called the same tool three times in a row, it was probably stuck. It sounded sensible in my head.

It was wrong. A legitimate file-editing agent calls write_file dozens of times in a single task, once per file it touches. My over-eager breaker tripped on the fourth file every single time, halted the run, and paged on-call for an agent that was doing exactly what it was supposed to. The symptom in the logs was maddening, because each individual write_file succeeded:

$ grep -B3 -A1 breaker agent.log
[09:02:11] write_file ok  path=src/a.py
[09:02:14] write_file ok  path=src/b.py
[09:02:17] write_file ok  path=src/c.py
[09:02:20] breaker tripped: write_file called 3x  <-- WRONG, these all succeeded
[09:02:20] ESCALATE: refactor module -> paged oncall (false alarm)

The fix was the one-line change you saw earlier: key the signature on action|error, not action. A successful call produces no error, so it never contributes to a breaker count. Three identical failures trip it; three successes do not. The lesson generalizes past this one bug: a guardrail that fires on healthy behavior is worse than no guardrail, because it trains your team to ignore the pager. Tune guardrails against your real trajectories, watch the false-positive rate, and treat a guardrail that cries wolf as a production incident in its own right.

There is a subtler version of this trap. Once the breaker keys on the error string, near-identical errors with different row IDs or timestamps can dodge it. The messages "lock timeout on row 4471" and "lock timeout on row 4472" hash to different signatures, so the spiral slips through. The fix is to normalize the error before hashing: reduce UUIDs, timestamps, and digit runs to a stable template. We run errors through a small normalizer so that "lock timeout on row N" collapses to one signature regardless of which row triggered it.

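import re

def normalize_error(error: str) -> str:
    # Lowercase first so the patterns below also match uppercase hex
    # digits in UUIDs and the "T" separator in ISO 8601 timestamps.
    error = error.strip().lower()
    error = re.sub(r"\b[0-9a-f]{8}-[0-9a-f-]{27,}\b", "<uuid>", error)
    error = re.sub(r"\b\d{4}-\d{2}-\d{2}[t ][\d:.]+\b", "<ts>", error)
    # Collapse any remaining digit runs (row IDs, ports, counts) to "N".
    error = re.sub(r"\d+", "N", error)
    return error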

With normalization in place, the breaker sees the spiral for what it is rather than being fooled by cosmetic variation. This is the kind of detail that never shows up in a demo and always shows up at 3am.
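One way to wire the normalizer into the breaker is to normalize inside the signature function, so that callers cannot forget it. The class below is a sketch under that assumption. `NormalizingBreaker` is a hypothetical name, the 12-character hash prefix is an arbitrary choice, and the normalizer is repeated here, lowercasing before matching, so the block runs on its own.

```python
import hashlib
import re


def normalize_error(error: str) -> str:
    # Lowercase first so uppercase hex and "T" separators also match.
    error = error.strip().lower()
    error = re.sub(r"\b[0-9a-f]{8}-[0-9a-f-]{27,}\b", "<uuid>", error)
    error = re.sub(r"\b\d{4}-\d{2}-\d{2}[t ][\d:.]+\b", "<ts>", error)
    return re.sub(r"\d+", "N", error)


class NormalizingBreaker:
    """Repeat-failure breaker that hashes the normalized error template."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts: dict[str, int] = {}

    def signature(self, action: str, error: str) -> str:
        raw = f"{action}|{normalize_error(error)}".encode()
        return hashlib.sha256(raw).hexdigest()[:12]

    def record(self, action: str, error: str) -> bool:
        sig = self.signature(action, error)
        self.counts[sig] = self.counts.get(sig, 0) + 1
        return self.counts[sig] > self.threshold


breaker = NormalizingBreaker(threshold=2)
trips = [breaker.record("run_sql", f"lock timeout on row {row}")
         for row in (4471, 4472, 4473)]
print(trips)  # [False, False, True]
```

Three different row IDs now count as the same failure, so the third attempt trips the breaker.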

Decision Flow: Recover, Retry, or Escalate

Not every failure should trip the breaker immediately. A rate limit wants a backoff and retry. A validation error wants a corrective hint. A repeated identical failure wants escalation. The recovery policy is itself a guardrail, and getting it right is the difference between an agent that is resilient and one that is either brittle or runaway.

flowchart TD
    F[Step failed] --> T{Failure type}
    T -->|transient: rate limit, 5xx| R[Backoff + retry, max 2]
    T -->|correctable: bad args, schema| C[Feed error to model, re-propose]
    T -->|repeated identical| E[Trip breaker, escalate]
    T -->|destructive blocked| C
    R -->|still failing| E
    C -->|breaker threshold hit| E
    E --> H[Write handoff + page human]

The rule of thumb I use: transient failures get a bounded retry with exponential backoff, correctable failures get fed back to the model as context, and anything that repeats identically gets escalated. The model is good at the correctable case and useless at the repeated-identical case, so the system handles the latter on its behalf.
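That rule of thumb can be written down as a small typed policy. The sketch below is one possible shape, not the implementation from the companion repo: the enum names, the `TRANSIENT_MARKERS` strings, and the backoff parameters are all assumptions, and real classification would more likely key on status codes or exception types than on substrings.

```python
import random
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"        # rate limit, 5xx: back off and retry
    CORRECTABLE = "correctable"    # bad args, schema: feed back to model
    REPEATED = "repeated"          # identical failure again: escalate


class Recovery(Enum):
    RETRY = "retry"
    REPROPOSE = "repropose"
    ESCALATE = "escalate"


# Hypothetical substring markers, for illustration only.
TRANSIENT_MARKERS = ("rate limit", "429", "502", "503", "504")
MAX_TRANSIENT_RETRIES = 2  # the "max 2" from the decision flow


def classify(error: str, seen_before: int) -> FailureKind:
    """seen_before is how many times this same failure already occurred."""
    if seen_before >= 2:
        return FailureKind.REPEATED
    if any(marker in error.lower() for marker in TRANSIENT_MARKERS):
        return FailureKind.TRANSIENT
    return FailureKind.CORRECTABLE


def decide(kind: FailureKind, retries_used: int) -> Recovery:
    if kind is FailureKind.REPEATED:
        return Recovery.ESCALATE
    if kind is FailureKind.TRANSIENT:
        if retries_used < MAX_TRANSIENT_RETRIES:
            return Recovery.RETRY
        return Recovery.ESCALATE   # still failing after bounded retries
    return Recovery.REPROPOSE


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; attempt starts at 0."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Keeping `classify` and `decide` separate means the retry limits can be tuned without touching how failures are recognized, and both are trivial to unit test.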

Comparison and Tradeoffs

How do the common approaches to agent reliability stack up? Here is how I weigh them after a year of running agents in production.

| Approach | Stops retry spirals | Bounds blast radius | Catches drift | Cost overhead | Verdict |
|---|---|---|---|---|---|
| Bigger / smarter model only | No | No | No | High | Necessary, never sufficient |
| Prompt "be careful" instructions | Weak | Weak | Weak | None | Comfort blanket, not a guardrail |
| Budget + circuit breaker | Yes | Partial | Partial | Negligible | Highest value per line of code |
| Tool allowlist + arg validation | No | Yes | No | Low | Essential for any write access |
| Typed recovery policy | Yes | No | Partial | Low | Turns brittle agents resilient |
| Full guardrails-first stack | Yes | Yes | Yes | Low | What you actually want |

flowchart LR
    subgraph Before["Before: model-only"]
        M1[Model] --> M2[Tools] --> M3[Prod]
    end
    subgraph After["After: guardrails-first"]
        N1[Model] --> N2[Validate] --> N3[Budget] --> N4[Sandbox] --> N5[Recover] --> N6[Prod]
    end
    Before -.40 min runaway.-> After
    After -.12 sec halt.-> Done[Predictable]

Side-by-side comparison of a model-only stack versus a guardrails-first stack

The headline tradeoff is honesty versus theater. Prompt-level "be careful, do not delete anything" instructions feel like guardrails and cost nothing, which is exactly why they are dangerous. They work in the demo and evaporate under the one trajectory you did not test. Real guardrails live in code you control, where a blocked action is blocked by a function call, not by the model's good intentions.

The cost overhead of the real stack is genuinely small. A budget check is a few comparisons. A validator is a function call. The circuit breaker is a dictionary lookup. None of this competes with the model call for latency or cost. The DeepSeek "AI harness" team made the same bet in 2026 when they hired systems engineers to build deterministic scaffolding around their models rather than only training bigger ones (New Stack, 2026). The reliability is in the harness.

Production Considerations

A few things I learned the expensive way once these guardrails were in place.

Make escalation a first-class output. An agent that halts cleanly and hands off is more valuable than one that occasionally finishes a hard task but sometimes runs wild. Treat "I stopped and asked for help" as success, not failure, and your on-call rotation will trust the system.

Log every guardrail decision. When the breaker trips or the validator rejects, emit a structured event with the signature, the reason, and the trajectory so far. This is your debugging lifeline and your training data for tightening the policies. We feed rejected-action logs back into the validator rules weekly.

Scope the domain tightly. The narrower the agent's task and tool surface, the more reliable it is. A migration agent that can only touch three tables and run three statement types is far safer than a general "database assistant." Reliability and scope move together.

Test the failure paths, not just the happy path. Most agent test suites check that the agent completes the task. The guardrails-first suite checks that the agent stops correctly when the migration is locked, when the API is down, when the model proposes something destructive. Those are the trajectories that page you at 3am.
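A failure-path test can be small. The sketch below does not import this post's `run_agent`. It uses a deliberately tiny stand-in loop (`run_with_breaker`, a hypothetical name) and a stub tool that always returns the same lock timeout, then asserts that the loop halts instead of spinning.

```python
from collections import Counter


def run_with_breaker(execute, max_steps: int = 25, threshold: int = 2) -> dict:
    """Tiny stand-in for an agent loop: retry until success, breaker, or budget."""
    failures: Counter = Counter()
    for step in range(1, max_steps + 1):
        result = execute()
        if result["ok"]:
            return {"status": "done", "steps": step}
        failures[result["error"]] += 1
        if failures[result["error"]] > threshold:
            return {"status": "escalated", "steps": step,
                    "reason": result["error"]}
    return {"status": "escalated", "steps": max_steps, "reason": "step limit"}


def always_locked() -> dict:
    # Stub tool: simulates a migration that can never acquire its lock.
    return {"ok": False, "error": "lock timeout on ALTER TABLE orders"}


def test_halts_on_repeated_lock_timeout() -> None:
    outcome = run_with_breaker(always_locked)
    assert outcome["status"] == "escalated"
    assert outcome["steps"] == 3          # third identical failure trips it
    assert "lock timeout" in outcome["reason"]


def test_budget_bounds_varied_failures() -> None:
    counter = iter(range(1000))
    # Every error is distinct, so only the step budget can stop the loop.
    outcome = run_with_breaker(
        lambda: {"ok": False, "error": f"error {next(counter)}"}, max_steps=5)
    assert outcome == {"status": "escalated", "steps": 5, "reason": "step limit"}


test_halts_on_repeated_lock_timeout()
test_budget_bounds_varied_failures()
```

The second test is the one that is easy to forget: when every error looks different, the breaker never fires and the budget is the only thing standing between the agent and an unbounded run.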

Observability: Making Guardrail Decisions Visible

A guardrail you cannot see is a guardrail you cannot trust. Once we had the budget, breaker, and validator in place, the next problem was understanding why a given run halted, especially across hundreds of unattended runs a day. The answer was to emit one structured event per guardrail decision and ship them to the same place we keep application traces.

Each event carries the run ID, the step number, the guardrail that fired, the signature, and a compact slice of the trajectory. That last field matters: when an on-call engineer opens a handoff at 3am, the first question is always "what was it trying to do," and the trajectory answers it without making anyone replay the run.

import json

def guardrail_event(run_id: str, step: int, kind: str,
                    signature: str, reason: str, trajectory: list[dict]) -> None:
    event = {
        "run_id": run_id,
        "step": step,
        "guardrail": kind,           # budget | validate | breaker | recover
        "signature": signature,
        "reason": reason,
        "recent": trajectory[-3:],   # last three steps for context
    }
    print(json.dumps(event))         # ship to your log pipeline

With those events flowing, a single query answers the question that used to take an afternoon of log spelunking: which guardrail is firing most, and on what. Here is the weekly rollup from one of our agent fleets:

$ agent-stats --since 7d --group-by guardrail
guardrail   count   top_signature              top_reason
budget        312   -                          step limit 25 reached
breaker        47   9f2c                        lock timeout on ALTER TABLE orders
validate       29   c1a0                        UPDATE without WHERE clause is blocked
recover        18   -                           transient 5xx, retried and recovered

That table is gold for tightening the system. The 47 breaker trips on the same 9f2c signature told us the migration agent kept hitting the same lock, which was a real infrastructure problem, not an agent problem. We fixed the lock contention upstream and the breaker trips dropped to near zero the following week. The guardrail did not just keep the agent safe; it surfaced a bug we would otherwise have never seen, because the agent had been quietly papering over it with retries.
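A rollup like this needs little more than a counter over the JSON events. The sketch below assumes events shaped like the output of `guardrail_event` above, one JSON object per line. The `rollup` function and the sample lines are illustrative, not the real `agent-stats` implementation and not real production data.

```python
import json
from collections import Counter, defaultdict


def rollup(lines: list[str]) -> list[tuple[str, int, str, str]]:
    """Group events by guardrail; report count and most common signature/reason."""
    counts: Counter = Counter()
    top: dict = defaultdict(Counter)
    for line in lines:
        event = json.loads(line)
        kind = event["guardrail"]
        counts[kind] += 1
        top[kind][(event["signature"], event["reason"])] += 1
    rows = []
    for kind, count in counts.most_common():
        (signature, reason), _ = top[kind].most_common(1)[0]
        rows.append((kind, count, signature, reason))
    return rows


# Illustrative sample events only.
sample = [
    json.dumps({"guardrail": "breaker", "signature": "9f2c",
                "reason": "lock timeout on ALTER TABLE orders"}),
    json.dumps({"guardrail": "breaker", "signature": "9f2c",
                "reason": "lock timeout on ALTER TABLE orders"}),
    json.dumps({"guardrail": "validate", "signature": "c1a0",
                "reason": "UPDATE without WHERE clause is blocked"}),
]
for row in rollup(sample):
    print(*row, sep="\t")
```

Grouping on the (signature, reason) pair is what makes a single recurring problem, such as one contended lock, stand out from a long tail of one-off failures.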

This is the part people miss about guardrails-first design. The guardrails are not only a safety mechanism. They are an observability surface. Every time a guardrail fires, the system is telling you something true about where the agent and its environment disagree. Log those disagreements, aggregate them, and they become the highest-signal backlog you have for making the whole system more reliable.

Conclusion

The model is not the reliability bottleneck. The scaffolding is. A guardrails-first agent treats the LLM as a strong, fallible component and wraps it in four cheap layers: a hard budget, action validation, a repeat-failure circuit breaker, and a typed recovery policy. None of these require a smarter model, and together they convert every failure mode from unbounded to bounded.

Start with the budget, because it is one dataclass and it turns infinite into finite. Add the circuit breaker next, because repeated-identical failure is the clearest signal an agent is stuck. Then validate every tool call and give your recovery logic real types. Do that, and the difference shows up exactly where it matters: a 12-second clean halt instead of a 40-minute runaway, and a night where the pager stays quiet.

Working code for every snippet here, including the full agent loop and a test harness that simulates the 3am migration, lives in the companion repo: github.com/amtocbot-droid/amtocbot-examples/tree/main/260-guardrails-first.


Reader challenge: take one agent loop you already run and add only the hard budget first. Reply to the email or comment with what the budget exposed, especially if it surfaced a repeated failure you had stopped noticing.


Revision History

Date Summary Old Version
2026-06-07 Added the lead-magnet signup CTA and reader-challenge block so this Guardrails-First post feeds the owned audience funnel. View previous version

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

Published: 2026-06-01 · Updated: 2026-06-07 · Written with AI assistance, reviewed by Toc Am.
