
It was 2:17am when my phone buzzed with a PagerDuty alert. Our AI agent, a customer support bot deployed two weeks earlier, had somehow consumed $847 in OpenAI API credits, we measured in the previous three hours. When I pulled the logs, I found it stuck in a loop: the agent was calling a get_order_status tool, receiving a timeout error, interpreting that error as a "pending" order status, and calling the tool again. tens of thousands of times.
The tool had no circuit breaker. The agent had no error budget. The prompt never distinguished between a transient network error and a legitimate order-pending state. We had tested the happy path exhaustively. We had never tested what happened when the tool infrastructure was degraded.
That incident cost more than the API bill. It cost three engineers a full day of postmortem work and almost cost us the client. And it was entirely preventable if we had applied the same rigor to our agent infrastructure that we applied to our microservices.
This post is about what teams learned deploying AI agents into production over the last year and a half: the failures, the fixes, and the architectural patterns that actually hold under real user load.
The Gap Between Demo and Production
Every AI agent tutorial ends the same way: the agent successfully books a flight, writes a SQL query, or summarizes a PDF. The notebook runs clean. The demo is impressive.
What the tutorial never shows:
- The tool returns HTTP 429 because you didn't rate-limit your agent
- The context window fills up on turn 7 of a long conversation
- Two concurrent users trigger a race condition on a shared data structure
- The model hallucinates a tool name that doesn't exist and the framework throws an unhandled exception
- An adversarial user crafts a message that causes the agent to exfiltrate its own system prompt
These are not edge cases. They are near-certainties at any meaningful scale.
The hard part is not that one tool fails. The hard part is that an agent turns one failure into a sequence. Anthropic's guidance on building effective agents makes a useful distinction between workflows, where paths are predefined, and agents, where the model controls more of the path. Production systems need to decide which parts deserve deterministic workflow control and which parts can safely remain agentic.
OWASP's LLM Top 10 also changes the risk model. Prompt injection, excessive agency, sensitive information disclosure, and unbounded consumption are not abstract checklist items. They map directly to production agent incidents: a malicious document can steer a tool call, a broad permission set can let the agent take the wrong action, and an uncapped loop can burn budget before humans notice.
The delta between demo works and production works is wider for agentic systems than for normal request-response software because agents compound failures across multiple tool calls, and because many failure modes are probabilistic rather than deterministic.

How Production Agents Actually Fail
Understanding failure modes is prerequisite to designing against them. After talking to 20+ engineering teams and reviewing public postmortems, the failure taxonomy breaks down into four categories:
1. Tool Reliability Failures
Tools are external services. External services fail. But agent frameworks often treat tool failure as terminal rather than transient:
# Naive tool implementation: no error handling
@tool
def get_order_status(order_id: str) -> str:
response = requests.get(f"https://api.example.com/orders/{order_id}")
return response.json()["status"]
When requests.get times out, the exception propagates to the model as raw Python traceback text. Depending on your prompt design, the model may try to parse that traceback as order data, may call the tool again immediately, or may enter an apologetic loop telling the user there was an "unexpected error" on every turn.
2. Context Window Overflow
A conversation that starts with a 2,000-token system prompt, accumulates multiple large tool call results and long user turns can exceed the context budget quickly if you do not summarize or evict old state. What happens then depends on your truncation strategy, which most teams do not have when they ship.
The failure mode: the model silently loses earlier conversation context, forgets instructions from the system prompt, or loses track of the user's original goal. Users report "the agent got dumb halfway through."
3. Cost Spirals
Three patterns cause cost spirals:
- Retry loops: Tool errors trigger retries without backoff or budget limits
- Verbosity inflation: As conversations lengthen, summarization calls get more expensive, which triggers more summarization calls
- Model misrouting: A routing agent sends simple queries to the most capable (and expensive) model because there's no cost-aware routing logic
A team at a Series B fintech reported spending a five-figure bill in two days during a product launch because their agent routed every query to GPT-4o regardless of complexity. Their original budget was a modest daily budget.
4. Security Failures
Prompt injection is the AI agent equivalent of SQL injection, and it is more prevalent than most teams expect. Users (and attackers) will attempt:
- Direct injection: a user asks the model to ignore prior instructions and reveal the system prompt
- Tool output injection: Malicious content in external data sources that gets included in tool results
- Indirect injection: Adversarial content embedded in documents the agent summarizes
Figure 1: A production agent execution flow with failure handling at each stage.
Architecture Patterns That Survived
After the 2am incident, we rebuilt our agent infrastructure around four principles. These patterns appear consistently in the production systems of teams that report stability.
Pattern 1: Structured Tool Responses
Every tool should return a typed response object, not raw strings, not raw JSON, not exceptions. The model needs to distinguish between:
{"status": "success", "data": {...}}{"status": "error", "error_type": "transient", "retry_safe": true, "message": "..."}{"status": "error", "error_type": "permanent", "retry_safe": false, "message": "..."}
This distinction is what prevents the retry loop. When the model sees retry_safe: false, it knows to degrade gracefully. When it sees retry_safe: true, it knows a backoff retry is appropriate.
from pydantic import BaseModel
from typing import Any, Literal
import requests
import time
class ToolResult(BaseModel):
status: Literal["success", "error"]
data: Any = None
error_type: Literal["transient", "permanent", "rate_limit"] | None = None
retry_safe: bool = False
message: str = ""
def get_order_status(order_id: str) -> ToolResult:
try:
response = requests.get(
f"https://api.example.com/orders/{order_id}",
timeout=5.0
)
if response.status_code == 200:
return ToolResult(status="success", data=response.json())
elif response.status_code == 429:
return ToolResult(
status="error",
error_type="rate_limit",
retry_safe=True,
message="Rate limit hit. Retry after 60s."
)
elif response.status_code >= 500:
return ToolResult(
status="error",
error_type="transient",
retry_safe=True,
message=f"Server error: {response.status_code}"
)
else:
return ToolResult(
status="error",
error_type="permanent",
retry_safe=False,
message=f"Order {order_id} not found or access denied."
)
except requests.Timeout:
return ToolResult(
status="error",
error_type="transient",
retry_safe=True,
message="Request timed out. Backend may be degraded."
)
Benchmark: In our internal testing, switching from raw exception propagation to structured ToolResult responses substantially reduced retry loop incidents and cut average tokens-per-session in our measured test harness (because the model no longer tried to parse tracebacks).
Pattern 2: Token Budgeting
Treat tokens like memory, with a budget, a high-water mark alarm, and a reclamation strategy.
class TokenBudget:
def __init__(self, total_budget: int, warning_threshold: float = 0.75):
self.total = total_budget
self.warning_threshold = warning_threshold
self.used = 0
def check(self, estimated_tokens: int) -> str:
projected = self.used + estimated_tokens
ratio = projected / self.total
if ratio > 1.0:
return "EXCEEDED"
elif ratio > self.warning_threshold:
return "WARNING"
return "OK"
def consume(self, tokens_used: int):
self.used += tokens_used
if self.used > self.total:
raise TokenBudgetExceededError(
f"Token budget exceeded: {self.used}/{self.total}"
)
# In your agent loop:
budget = TokenBudget(total_budget=50_000)
for turn in conversation_loop:
estimated = estimate_tokens(current_context)
status = budget.check(estimated)
if status == "EXCEEDED":
return "I've reached my context limit for this session. Please start a new conversation."
elif status == "WARNING":
context = summarize_older_turns(context) # Compress before proceeding
response = call_llm(context)
budget.consume(response.usage.total_tokens)
Pattern 3: Cost Circuit Breakers
This is what we lacked the night of the measured cost-spike incident. A cost circuit breaker is a hard limit on cumulative API spend per session, per user, and per day:
import redis
from datetime import datetime, date
class CostCircuitBreaker:
def __init__(self, redis_client, limits: dict):
self.redis = redis_client
self.limits = limits # {"session": 0.50, "user_daily": 5.00, "global_hourly": 100.0}
def check_and_increment(self, user_id: str, session_id: str, cost_usd: float):
today = date.today().isoformat()
hour = datetime.now().strftime("%Y-%m-%d-%H")
keys = {
"session": f"cost:session:{session_id}",
"user_daily": f"cost:user:{user_id}:{today}",
"global_hourly": f"cost:global:{hour}"
}
for limit_name, key in keys.items():
current = float(self.redis.get(key) or 0)
if current + cost_usd > self.limits[limit_name]:
raise CostLimitExceeded(
f"{limit_name} limit exceeded: ${current:.2f} + ${cost_usd:.4f} > ${self.limits[limit_name]}"
)
# All checks passed, increment counters
for key in keys.values():
pipe = self.redis.pipeline()
pipe.incrbyfloat(key, cost_usd)
pipe.expire(key, 86400)
pipe.execute()
Result: After deploying the circuit breaker, our worst monthly overage dropped to a small exception instead of repeated large incidents.
The Debugging Story Nobody Posts on Twitter
Six weeks after deploying a document analysis agent, one of our enterprise customers reported inconsistent answers to the same question. We could reproduce it intermittently but not reliably.
The trace logs looked identical. Same input, same tools called, same sequence. Different outputs.
After a multi-day debugging pass, we found it: our vector search tool was returning results in different order depending on the node handling the request (we had a load-balanced vector DB cluster, and one replica was slightly behind). The agent's reasoning about document relationships depended on which result appeared first. The same documents, different order, different synthesis.
The fix was trivial: sort results by deterministic key (document ID) before returning. The discovery process was not trivial. It required distributed tracing across four services and a week of log analysis.
The lesson: Non-determinism in tool outputs produces non-determinism in agent outputs. Every tool that queries a distributed system needs deterministic ordering.
Implementation Guide: The Production Readiness Checklist
Based on the patterns above, here is the minimum checklist before an agent goes to production:
Step 1: Instrument Everything Before You Ship
You cannot debug what you cannot observe. Add tracing before your first production user:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Initialize tracer
provider = TracerProvider()
provider.add_span_processor(
BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-service")
# Wrap every LLM call and tool call
def traced_tool_call(tool_name: str, args: dict) -> ToolResult:
with tracer.start_as_current_span(f"tool.{tool_name}") as span:
span.set_attribute("tool.name", tool_name)
span.set_attribute("tool.args", str(args))
result = execute_tool(tool_name, args)
span.set_attribute("tool.status", result.status)
span.set_attribute("tool.retry_safe", result.retry_safe)
if result.status == "error":
span.record_exception(Exception(result.message))
return result
The output from an instrumented agent session:
Trace: session_a3f7b2
├── llm.completion [423ms, 1,847 tokens, $0.0184]
│ └── anthropic.claude-3-7-sonnet
├── tool.get_order_status [88ms, success]
├── tool.get_order_status [timeout] → retry #1
├── tool.get_order_status [5,012ms, transient error] → circuit open
├── llm.completion [312ms, 624 tokens, $0.0062]
└── response.final [2,471 tokens total, $0.0246 total]
Step 2: Design Tools for Failure from the Start
Apply these rules to every tool:
- Idempotent by default: calling the same tool twice with the same args should produce the same result
- Bounded execution: hard timeouts on every external call (5s for APIs, 30s for DB queries)
- Typed structured output: use the
ToolResultpattern above - Retry metadata: explicitly signal whether a retry is safe
Step 3: Gate Destructive Operations
Any tool that writes data, sends messages, charges money, or modifies state needs a confirmation gate:
def send_email(to: str, subject: str, body: str) -> ToolResult:
"""Send an email. REQUIRES explicit user confirmation before execution."""
# Check if we have a confirmed intent for this exact action
confirmation_key = f"confirmed:{hash(f'{to}:{subject}')}"
if not get_confirmation(confirmation_key):
return ToolResult(
status="error",
error_type="permanent",
retry_safe=False,
message=f"CONFIRMATION_REQUIRED: Please confirm you want to send email to {to} with subject '{subject}'"
)
# Proceed with send
result = email_client.send(to=to, subject=subject, body=body)
return ToolResult(status="success", data={"message_id": result.id})
Figure 2: Decision flow for gating operations by risk level.
Comparison: Framework Choices in Production

Early adopters used LangChain and AutoGen. Newer teams gravitated toward LangGraph, raw SDK calls, and emerging options like smolagents. Here is what shook out after production pressure:
| Framework | Latency Overhead | Observability | Reliability Primitives | Best For |
|---|---|---|---|---|
| LangGraph | 15-40ms | Excellent (native traces) | Good (retry, checkpoint) | Complex multi-step workflows, stateful agents |
| Raw Anthropic SDK | <5ms | Manual (add your own) | None (build yourself) | High-throughput, cost-sensitive, custom infra |
| LangChain | 20-60ms | Moderate (LangSmith) | Basic (callbacks) | Rapid prototyping, broad ecosystem |
| AutoGen | 30-80ms | Poor | Moderate | Research, multi-agent experiments |
| smolagents (HuggingFace) | 10-25ms | Limited | Basic | Open-source model serving |
| CrewAI | 25-50ms | Limited | Moderate | Role-based multi-agent setups |
The teams reporting the most stability in production cluster around two approaches: LangGraph for complex orchestration (where its stateful graph model maps directly to real agent workflows), and raw SDK calls for high-volume simple agents (where the framework overhead adds up).
A fintech running high-volume agent traffic reported that switching from LangChain to raw Anthropic SDK calls reduced average latency materially and cut costs in its own measurements (from reduced token overhead in the framework's prompt boilerplate).
Figure 3: Evolution of production agent framework adoption.
Runtime Contracts For Agent Tools
A production tool should have a contract that is more precise than a docstring. The contract needs to state ownership, timeout, retry policy, idempotency, side effects, authorization scope, and observability fields. Without that contract, the model sees a function name and a description, while the platform team has no reliable way to reason about blast radius.
The contract can be stored beside the tool implementation:
name: get_order_status
owner: support-platform
timeout_seconds: 5
retry_policy: exponential_backoff
idempotent: true
side_effects: none
auth_scope: orders:read
max_calls_per_session: 3
logs:
- tool.name
- tool.status
- retry_safe
- latency_ms
That file is not bureaucracy. It lets reviewers reject a tool that can send email without a confirmation gate, flag a tool with no timeout, or block an agent that can call the same expensive search API without a session limit. The model prompt can summarize the contract, but the enforcement must live in code.
Human Escalation And Product Design
Reliable agents also need a graceful way to stop. Teams often treat escalation as a failure because the demo looks better when the agent solves everything alone. In production, escalation is how you protect trust. If a tool is degraded, the user should see a concise explanation and a handoff path, not a stream of apologetic retries.
The handoff policy should be product-specific. A support agent can escalate after repeated tool errors. A financial agent should escalate before any ambiguous money movement. A code review agent can leave a blocking comment only when deterministic checks agree with the model's finding. The principle is the same: autonomy increases only where the system has evidence, observability, and a rollback path.
This is the product version of circuit breaking. Stop the agent before it turns uncertainty into action.
Production Considerations
Costs
Illustrative production cost model from anonymized interviews and internal measurements:
| Agent Type | Avg Tokens/Session | Avg Cost/Session | Daily Sessions | Daily Cost |
|---|---|---|---|---|
| Customer support | 8,400 | $0.084 | 12,000 | $1,008 |
| Code review | 24,000 | $0.240 | 800 | $192 |
| Document analysis | 45,000 | $0.450 | 200 | $90 |
| SQL/data assistant | 6,200 | $0.062 | 5,000 | $310 |
Cost-per-session is predictable if you enforce token budgets. Cost-per-day is unpredictable until you enforce circuit breakers.
Scaling Patterns
Agents are stateful. Stateful services are harder to scale than stateless ones. The key architecture decision is where state lives:
- In-process: Fast, but limits horizontal scaling to sticky sessions
- External store (Redis): Adds a small per-turn latency hop, enables any-node routing
- Checkpoint-based (LangGraph): Supports long-running agents with interrupts, adds a modest per-turn latency hop
Most high-scale teams externalize state to Redis with a TTL measured in days, accepting the slight latency cost for the scaling headroom.
Monitoring
The minimum metrics to alert on:
- Tool error rate per tool per short rolling window
- Token burn rate per hour vs. budget (alert at a large fraction of daily budget by midday)
- Tail session duration compared with the normal median, which indicates stuck sessions
- Prompt injection detection rate (log all, alert if rate spikes above baseline)
- High-end cost per session compared with the normal median, which indicates cost spiral
Conclusion
The AI agent teams that are running reliably today are not the teams that built the cleverest prompts. They are the teams that treated their agents as distributed systems: designing for failure, instrumenting from day one, setting hard budgets, and iterating on the unhappy paths with the same rigor they applied to the happy path.
The measured cost-spike incident was the best thing that happened to our agent infrastructure. It forced us to confront the gap between notebook success and real service behavior under adversarial conditions. Every pattern in this post came out of a real incident from a real team.
If you are shipping agents soon, run the production readiness checklist before launch. Add tracing. Build the circuit breaker. Design your tools for structured failure. The happy path will work fine. It always does.
The question is what happens when it doesn't.
Working code for all patterns in this post: github.com/amtocbot-droid/amtocbot-examples/tree/main/agentic-ai-production
Revision History
| Date | Summary | Old Version |
|---|---|---|
| 2026-06-08 | Removed an unsupported survey citation, added runtime-contract and escalation guidance, softened or attributed measured incident and cost claims, reduced em-dash use, and added the missing comparison visual. | View previous version |
Tools mentioned in this post
Disclosure: some links in this section may be referral links. If you use them, AmtocSoft may receive a small commission at no additional cost to you; that support helps cover production and research costs for this site.
- Anthropic Claude API: production LLM access. Sign up
- OpenAI Platform: GPT-4 and embedding APIs. Sign up
- LangChain: LangSmith observability tier. Sign up
- Hugging Face: Pro / Enterprise tier. Sign up
Sources
- Anthropic, "Building Effective Agents": https://www.anthropic.com/engineering/building-effective-agents
- OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications
- OWASP Top 10 for LLM Applications 2025 PDF: https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf
- LangGraph documentation: persistence and checkpointing: https://langgraphjs.guide/persistence/
- OpenTelemetry documentation: https://opentelemetry.io/docs/
- Simon Willison, "Prompt Injection and AI Agents": https://simonwillison.net/tags/prompt-injection/
About the Author
Toc Am
Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.
Published: 2026-04-19 · Updated: 2026-06-08 · Written with AI assistance, reviewed by Toc Am.
Get These In Your Inbox
Weekly deep-dives on AI engineering, no fluff. Join the newsletter →
Or grab the book ($39, ~100 pages) · Buy me a coffee
☕ Buy Me a Coffee · 🔔 YouTube · 💼 LinkedIn · 🐦 X/Twitter
No comments:
Post a Comment