AmtocSoft Tech Insights: Agentic AI in Production: Lessons from Early Adopters

Sunday, April 19, 2026

Agentic AI in Production: Lessons from Early Adopters

Hero: AI agent system with interconnected tools and monitoring dashboards in a production environment

It was 2:17am when my phone buzzed with a PagerDuty alert. Our AI agent, a customer support bot deployed two weeks earlier, had consumed $847 in OpenAI API credits in the previous three hours. When I pulled the logs, I found it stuck in a loop: the agent was calling a get_order_status tool, receiving a timeout error, interpreting that error as a "pending" order status, and calling the tool again, tens of thousands of times.

The tool had no circuit breaker. The agent had no error budget. The prompt never distinguished between a transient network error and a legitimate order-pending state. We had tested the happy path exhaustively. We had never tested what happened when the tool infrastructure was degraded.

That incident cost more than the API bill. It cost three engineers a full day of postmortem work and almost cost us the client. And it was entirely preventable if we had applied the same rigor to our agent infrastructure that we applied to our microservices.

This post is about what teams learned deploying AI agents into production over the last year and a half: the failures, the fixes, and the architectural patterns that actually hold under real user load.

The Gap Between Demo and Production

Every AI agent tutorial ends the same way: the agent successfully books a flight, writes a SQL query, or summarizes a PDF. The notebook runs clean. The demo is impressive.

What the tutorial never shows:

The tool returns HTTP 429 because you didn't rate-limit your agent
The context window fills up on turn 7 of a long conversation
Two concurrent users trigger a race condition on a shared data structure
The model hallucinates a tool name that doesn't exist and the framework throws an unhandled exception
An adversarial user crafts a message that causes the agent to exfiltrate its own system prompt

These are not edge cases. They are near-certainties at any meaningful scale.

The hard part is not that one tool fails. The hard part is that an agent turns one failure into a sequence. Anthropic's guidance on building effective agents makes a useful distinction between workflows, where paths are predefined, and agents, where the model controls more of the path. Production systems need to decide which parts deserve deterministic workflow control and which parts can safely remain agentic.

OWASP's LLM Top 10 also changes the risk model. Prompt injection, excessive agency, sensitive information disclosure, and unbounded consumption are not abstract checklist items. They map directly to production agent incidents: a malicious document can steer a tool call, a broad permission set can let the agent take the wrong action, and an uncapped loop can burn budget before humans notice.

The gap between "demo works" and "production works" is wider for agentic systems than for normal request-response software because agents compound failures across multiple tool calls, and because many failure modes are probabilistic rather than deterministic.

Architecture diagram: Production AI agent system with reliability, observability, and security layers

How Production Agents Actually Fail

Understanding failure modes is a prerequisite to designing against them. After talking to 20+ engineering teams and reviewing public postmortems, the failure taxonomy breaks down into four categories:

1. Tool Reliability Failures

Tools are external services. External services fail. But agent frameworks often treat tool failure as terminal rather than transient:

# Naive tool implementation: no error handling
@tool
def get_order_status(order_id: str) -> str:
    response = requests.get(f"https://api.example.com/orders/{order_id}")
    return response.json()["status"]

When requests.get times out, the exception propagates to the model as raw Python traceback text. Depending on your prompt design, the model may try to parse that traceback as order data, may call the tool again immediately, or may enter an apologetic loop telling the user there was an "unexpected error" on every turn.

2. Context Window Overflow

A conversation that starts with a 2,000-token system prompt and then accumulates several large tool results and long user turns can exceed the context budget quickly if you do not summarize or evict old state. What happens then depends on your truncation strategy, which most teams do not have when they ship.

The failure mode: the model silently loses earlier conversation context, forgets instructions from the system prompt, or loses track of the user's original goal. Users report "the agent got dumb halfway through."
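One possible truncation strategy is sketched below, as an illustration rather than a recommendation: always keep the system prompt, then keep as many of the newest turns as fit. The names trim_history and rough_token_count are hypothetical, and the four-characters-per-token estimate is an assumption you would replace with your provider's tokenizer.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def rough_token_count(text: str) -> int:
    """Very rough estimate (about 4 characters per token). Assumption only."""
    return max(1, len(text) // 4)


def trim_history(
    messages: list[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[Message]:
    """Keep system messages, then as many of the newest turns as fit."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    remaining = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept: list[Message] = []
    # Walk backwards so the most recent turns survive eviction.
    for message in reversed(turns):
        cost = count_tokens(message["content"])
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost

    # Restore chronological order, with the system prompt first.
    return system + list(reversed(kept))
```

Because the system prompt is reserved before any turn is counted, this sketch cannot silently drop instructions; it drops the oldest turns instead, which is where a summarization step could slot in.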

3. Cost Spirals

Three patterns cause cost spirals:

Retry loops: Tool errors trigger retries without backoff or budget limits
Verbosity inflation: As conversations lengthen, summarization calls get more expensive, which triggers more summarization calls
Model misrouting: A routing agent sends simple queries to the most capable (and expensive) model because there's no cost-aware routing logic

A team at a Series B fintech reported running up a five-figure bill in two days during a product launch because their agent routed every query to GPT-4o regardless of complexity. They had planned for a modest daily budget.
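Cost-aware routing does not need to be sophisticated to prevent that kind of bill. The sketch below is a deliberately crude heuristic under stated assumptions: the tier names, prices, and hint words are all hypothetical, and a real router would more likely use a classifier or the cheap model itself to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                 # hypothetical model identifier
    usd_per_1k_tokens: float  # illustrative price, not a real quote


CHEAP = ModelTier("small-model", 0.001)
CAPABLE = ModelTier("large-model", 0.010)

# Hypothetical signal words suggesting multi-step reasoning is needed.
COMPLEX_HINTS = ("why", "compare", "explain", "analyze", "reconcile")


def route(query: str, max_simple_words: int = 40) -> ModelTier:
    """Send short, lookup-style queries to the cheap tier; escalate the rest."""
    words = [w.strip(".,?!:;") for w in query.lower().split()]
    looks_complex = any(hint in words for hint in COMPLEX_HINTS)
    if len(words) <= max_simple_words and not looks_complex:
        return CHEAP
    return CAPABLE


print(route("Where is my order 1234?").name)
print(route("Compare my last two invoices and explain the difference").name)
```

The point of the design is the default: the expensive model has to be earned by the query, instead of being the fallback for everything.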

4. Security Failures

Prompt injection is the AI agent equivalent of SQL injection, and it is more prevalent than most teams expect. Users (and attackers) will attempt:

Direct injection: A user asks the model to ignore prior instructions and reveal the system prompt
Tool output injection: Malicious content in external data sources that gets included in tool results
Indirect injection: Adversarial content embedded in documents the agent summarizes

flowchart TD
    A[User Message] --> B{Input Validation}
    B -->|Passes| C[System Prompt + History]
    B -->|Suspicious| D[Flag + Log + Sanitize]
    C --> E[LLM Reasoning]
    E --> F{Tool Call?}
    F -->|Yes| G[Tool Execution]
    F -->|No| H[Response Generation]
    G --> I{Tool Success?}
    I -->|Success| J[Result to Context]
    I -->|Error| K{Retry Budget}
    K -->|Retries left| L[Exponential Backoff]
    L --> G
    K -->|Exhausted| M[Graceful Degradation]
    J --> E
    M --> H
    H --> N[Output Validation]
    N --> O[User Response]
    D --> P[Human Review Queue]
    style D fill:#ff6b6b,color:#fff
    style M fill:#ffd93d
    style K fill:#6bcb77

Figure 1: A production agent execution flow with failure handling at each stage.
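The retry branch of Figure 1 (retry budget, exponential backoff, graceful degradation) can be sketched in a few lines. This is a minimal illustration, not the implementation from our system: Outcome and call_with_retry_budget are hypothetical names, and the delays and retry count are placeholder values.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Outcome:
    ok: bool
    retry_safe: bool = False
    data: Any = None
    message: str = ""


def call_with_retry_budget(
    tool: Callable[[], Outcome],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Outcome:
    """Retry a tool while retries are safe and budget remains, then degrade."""
    attempt = 0
    while True:
        outcome = tool()
        if outcome.ok:
            return outcome
        if not outcome.retry_safe or attempt >= max_retries:
            # Permanent error or exhausted budget: stop calling the tool
            # and hand the agent a result it can degrade on.
            return Outcome(
                ok=False,
                message=f"Giving up after {attempt + 1} attempt(s): {outcome.message}",
            )
        sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s with the defaults
        attempt += 1
```

The sleep function is injectable so the loop can be tested without waiting. The important property is that the number of tool calls is bounded by max_retries + 1 no matter what the model decides.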

Architecture Patterns That Survived

After the 2am incident, we rebuilt our agent infrastructure around the three patterns below. These patterns appear consistently in the production systems of teams that report stability.

Pattern 1: Structured Tool Responses

Every tool should return a typed response object, not raw strings, not raw JSON, not exceptions. The model needs to distinguish between:

{"status": "success", "data": {...}}
{"status": "error", "error_type": "transient", "retry_safe": true, "message": "..."}
{"status": "error", "error_type": "permanent", "retry_safe": false, "message": "..."}

This distinction is what prevents the retry loop. When the model sees retry_safe: false, it knows to degrade gracefully. When it sees retry_safe: true, it knows a backoff retry is appropriate.

from pydantic import BaseModel
from typing import Any, Literal
import requests

class ToolResult(BaseModel):
    status: Literal["success", "error"]
    data: Any = None
    error_type: Literal["transient", "permanent", "rate_limit"] | None = None
    retry_safe: bool = False
    message: str = ""

def get_order_status(order_id: str) -> ToolResult:
    try:
        response = requests.get(
            f"https://api.example.com/orders/{order_id}",
            timeout=5.0
        )
        if response.status_code == 200:
            return ToolResult(status="success", data=response.json())
        elif response.status_code == 429:
            return ToolResult(
                status="error",
                error_type="rate_limit",
                retry_safe=True,
                message="Rate limit hit. Retry after 60s."
            )
        elif response.status_code >= 500:
            return ToolResult(
                status="error",
                error_type="transient",
                retry_safe=True,
                message=f"Server error: {response.status_code}"
            )
        else:
            return ToolResult(
                status="error",
                error_type="permanent",
                retry_safe=False,
                message=f"Order {order_id} not found or access denied."
            )
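    except requests.Timeout:
        return ToolResult(
            status="error",
            error_type="transient",
            retry_safe=True,
            message="Request timed out. Backend may be degraded."
        )
    except requests.ConnectionError:
        # DNS failures, refused or reset connections are also transient
        return ToolResult(
            status="error",
            error_type="transient",
            retry_safe=True,
            message="Connection failed. Backend may be unreachable."
        )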

Benchmark: In our internal test harness, switching from raw exception propagation to structured ToolResult responses substantially reduced retry-loop incidents and cut average tokens per session (because the model no longer tried to parse tracebacks).

Pattern 2: Token Budgeting

Treat tokens like memory, with a budget, a high-water mark alarm, and a reclamation strategy.

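class TokenBudgetExceededError(Exception):
    """Raised when cumulative token usage passes the session budget."""


class TokenBudget: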
    def __init__(self, total_budget: int, warning_threshold: float = 0.75):
        self.total = total_budget
        self.warning_threshold = warning_threshold
        self.used = 0

    def check(self, estimated_tokens: int) -> str:
        projected = self.used + estimated_tokens
        ratio = projected / self.total

        if ratio > 1.0:
            return "EXCEEDED"
        elif ratio > self.warning_threshold:
            return "WARNING"
        return "OK"

    def consume(self, tokens_used: int):
        self.used += tokens_used
        if self.used > self.total:
            raise TokenBudgetExceededError(
                f"Token budget exceeded: {self.used}/{self.total}"
            )

# In your agent loop. estimate_tokens, summarize_older_turns, and call_llm
# are placeholders for your own helpers; context is the running message history.
def run_conversation(conversation_loop, context, total_budget=50_000):
    budget = TokenBudget(total_budget=total_budget)

    for turn in conversation_loop:
        estimated = estimate_tokens(context)
        status = budget.check(estimated)

        if status == "EXCEEDED":
            return "I've reached my context limit for this session. Please start a new conversation."
        elif status == "WARNING":
            context = summarize_older_turns(context)  # Compress before proceeding

        response = call_llm(context)
        budget.consume(response.usage.total_tokens)

Pattern 3: Cost Circuit Breakers

This is what we lacked the night of the 2am cost-spike incident. A cost circuit breaker is a hard limit on cumulative API spend per session, per user, and per day:

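import redis
from datetime import datetime, date


class CostLimitExceeded(Exception):
    """Raised when a session, user, or global spend limit would be exceeded."""


class CostCircuitBreaker: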
    def __init__(self, redis_client, limits: dict):
        self.redis = redis_client
        self.limits = limits  # {"session": 0.50, "user_daily": 5.00, "global_hourly": 100.0}

    def check_and_increment(self, user_id: str, session_id: str, cost_usd: float):
        today = date.today().isoformat()
        hour = datetime.now().strftime("%Y-%m-%d-%H")

        keys = {
            "session": f"cost:session:{session_id}",
            "user_daily": f"cost:user:{user_id}:{today}",
            "global_hourly": f"cost:global:{hour}"
        }

        for limit_name, key in keys.items():
            current = float(self.redis.get(key) or 0)
            if current + cost_usd > self.limits[limit_name]:
                raise CostLimitExceeded(
                    f"{limit_name} limit exceeded: ${current:.2f} + ${cost_usd:.4f} > ${self.limits[limit_name]}"
                )

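        # All checks passed: increment every counter in one round trip.
        # The check above and this increment are not atomic, so concurrent
        # requests can briefly overshoot a limit.
        pipe = self.redis.pipeline()
        for key in keys.values():
            pipe.incrbyfloat(key, cost_usd)
            pipe.expire(key, 86400)
        pipe.execute()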

Result: After deploying the circuit breaker, our worst monthly overage was a small one-off exception rather than a repeat of the large incidents.

The Debugging Story Nobody Posts on Twitter

Six weeks after deploying a document analysis agent, one of our enterprise customers reported inconsistent answers to the same question. We could reproduce it intermittently but not reliably.

The trace logs looked identical. Same input, same tools called, same sequence. Different outputs.

After a multi-day debugging pass, we found it: our vector search tool was returning results in different order depending on the node handling the request (we had a load-balanced vector DB cluster, and one replica was slightly behind). The agent's reasoning about document relationships depended on which result appeared first. The same documents, different order, different synthesis.

The fix was trivial: sort results by a deterministic key (document ID) before returning. The discovery process was not trivial. It required distributed tracing across four services and a week of log analysis.

The lesson: Non-determinism in tool outputs produces non-determinism in agent outputs. Every tool that queries a distributed system needs deterministic ordering.
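The fix amounts to a single sort at the tool boundary. A minimal sketch, assuming each search hit is a dict with a doc_id field (the field name and the deterministic_hits helper are illustrative, not taken from our codebase):

```python
def deterministic_hits(hits: list[dict]) -> list[dict]:
    """Return search hits in a stable order, independent of replica or node."""
    return sorted(hits, key=lambda hit: hit["doc_id"])


# Two replicas returning the same documents in different orders.
replica_a = [{"doc_id": "doc-7", "text": "B"}, {"doc_id": "doc-2", "text": "A"}]
replica_b = [{"doc_id": "doc-2", "text": "A"}, {"doc_id": "doc-7", "text": "B"}]

assert deterministic_hits(replica_a) == deterministic_hits(replica_b)
```

If relevance order matters to the agent, a variation is to sort by score first and use the document ID only as a tie-breaker, at the cost of staying sensitive to score differences between replicas.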

Implementation Guide: The Production Readiness Checklist

Based on the patterns above, here is the minimum checklist before an agent goes to production:

Step 1: Instrument Everything Before You Ship

You cannot debug what you cannot observe. Add tracing before your first production user:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracer
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-service")

# Wrap every LLM call and tool call
def traced_tool_call(tool_name: str, args: dict) -> ToolResult:
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.args", str(args))

        result = execute_tool(tool_name, args)

        span.set_attribute("tool.status", result.status)
        span.set_attribute("tool.retry_safe", result.retry_safe)

        if result.status == "error":
            span.record_exception(Exception(result.message))

        return result

The output from an instrumented agent session:

Trace: session_a3f7b2
  ├── llm.completion [423ms, 1,847 tokens, $0.0184]
  │   └── anthropic.claude-3-7-sonnet
  ├── tool.get_order_status [88ms, success]
  ├── tool.get_order_status [timeout] → retry #1
  ├── tool.get_order_status [5,012ms, transient error] → circuit open
  ├── llm.completion [312ms, 624 tokens, $0.0062]
  └── response.final [2,471 tokens total, $0.0246 total]

Step 2: Design Tools for Failure from the Start

Apply these rules to every tool:

Idempotent by default: calling the same tool twice with the same args should produce the same result
Bounded execution: hard timeouts on every external call (5s for APIs, 30s for DB queries)
Typed structured output: use the ToolResult pattern above
Retry metadata: explicitly signal whether a retry is safe

Step 3: Gate Destructive Operations

Any tool that writes data, sends messages, charges money, or modifies state needs a confirmation gate:

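import hashlib

def send_email(to: str, subject: str, body: str) -> ToolResult:
    """Send an email. REQUIRES explicit user confirmation before execution."""

    # Check if we have a confirmed intent for this exact action.
    # hashlib gives a stable digest; built-in hash() of a str changes between processes.
    digest = hashlib.sha256(f"{to}:{subject}".encode()).hexdigest()
    confirmation_key = f"confirmed:{digest}"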

    if not get_confirmation(confirmation_key):
        return ToolResult(
            status="error",
            error_type="permanent",
            retry_safe=False,
            message=f"CONFIRMATION_REQUIRED: Please confirm you want to send email to {to} with subject '{subject}'"
        )

    # Proceed with send
    result = email_client.send(to=to, subject=subject, body=body)
    return ToolResult(status="success", data={"message_id": result.id})

flowchart LR
    A[Agent Decision] --> B{Operation Type}
    B -->|Read| C[Execute Directly]
    B -->|Write| D{Reversible?}
    B -->|Delete| E[Always Confirm]
    B -->|Financial| E
    D -->|Yes| F[Execute with Audit Log]
    D -->|No| G[Require Confirmation]
    C --> H[Return Result]
    F --> H
    G --> I[Pause + Request Confirmation]
    E --> I
    I --> J{User Confirms?}
    J -->|Yes| K[Execute with Double-Write Log]
    J -->|No| L[Cancel + Log Decline]
    K --> H
    L --> M[Inform Agent of Cancellation]
    style E fill:#ff6b6b,color:#fff
    style G fill:#ffd93d
    style K fill:#6bcb77

Figure 2: Decision flow for gating operations by risk level.
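The first half of Figure 2 is a pure function of operation type and reversibility, which makes it easy to enforce outside the model. A minimal sketch; the Op and Gate enums and the gate_for function are hypothetical names chosen for illustration:

```python
from enum import Enum


class Op(Enum):
    READ = "read"
    WRITE = "write"
    DELETE = "delete"
    FINANCIAL = "financial"


class Gate(Enum):
    EXECUTE = "execute_directly"
    EXECUTE_WITH_AUDIT = "execute_with_audit_log"
    CONFIRM = "require_confirmation"


def gate_for(op: Op, reversible: bool = False) -> Gate:
    """Map an operation to the gate Figure 2 assigns it."""
    if op is Op.READ:
        return Gate.EXECUTE
    if op is Op.WRITE and reversible:
        return Gate.EXECUTE_WITH_AUDIT
    # Irreversible writes, deletes, and financial operations always pause.
    return Gate.CONFIRM
```

Defaulting reversible to False means a tool author who forgets to classify a write gets the stricter gate, not the looser one.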

Comparison: Framework Choices in Production

Comparison visual showing production agent framework tradeoffs across observability, latency overhead, and reliability primitives.

Early adopters used LangChain and AutoGen. Newer teams gravitated toward LangGraph, raw SDK calls, and emerging options like smolagents. Here is what shook out after production pressure:

Framework	Latency Overhead	Observability	Reliability Primitives	Best For
LangGraph	15-40ms	Excellent (native traces)	Good (retry, checkpoint)	Complex multi-step workflows, stateful agents
Raw Anthropic SDK	<5ms	Manual (add your own)	None (build yourself)	High-throughput, cost-sensitive, custom infra
LangChain	20-60ms	Moderate (LangSmith)	Basic (callbacks)	Rapid prototyping, broad ecosystem
AutoGen	30-80ms	Poor	Moderate	Research, multi-agent experiments
smolagents (HuggingFace)	10-25ms	Limited	Basic	Open-source model serving
CrewAI	25-50ms	Limited	Moderate	Role-based multi-agent setups

The teams reporting the most stability in production cluster around two approaches: LangGraph for complex orchestration (where its stateful graph model maps directly to real agent workflows), and raw SDK calls for high-volume simple agents (where the framework overhead adds up).

A fintech running high-volume agent traffic reported that switching from LangChain to raw Anthropic SDK calls reduced average latency materially and cut costs in its own measurements (from reduced token overhead in the framework's prompt boilerplate).

timeline
    title AI Agent Framework Maturity in Production (2024-2026)
    2024 Q1 : LangChain dominates
            : AutoGen emerges
            : Production failures widespread
    2024 Q3 : LangGraph released
            : Teams start adding observability
            : Cost management becomes priority
    2025 Q1 : LangGraph matures
            : smolagents for open-source
            : Circuit breakers adopted
    2025 Q3 : Raw SDK patterns documented
            : OpenTelemetry integration standardizes
            : Multi-agent orchestration stabilizes
    2026 Q1 : Framework consolidation
            : Observability-first design
            : Security patterns formalized

Figure 3: Evolution of production agent framework adoption.

Runtime Contracts For Agent Tools

A production tool should have a contract that is more precise than a docstring. The contract needs to state ownership, timeout, retry policy, idempotency, side effects, authorization scope, and observability fields. Without that contract, the model sees a function name and a description, while the platform team has no reliable way to reason about blast radius.

The contract can be stored beside the tool implementation:

name: get_order_status
owner: support-platform
timeout_seconds: 5
retry_policy: exponential_backoff
idempotent: true
side_effects: none
auth_scope: orders:read
max_calls_per_session: 3
logs:
  - tool.name
  - tool.status
  - retry_safe
  - latency_ms

That file is not bureaucracy. It lets reviewers reject a tool that can send email without a confirmation gate, flag a tool with no timeout, or block an agent that can call the same expensive search API without a session limit. The model prompt can summarize the contract, but the enforcement must live in code.
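As one way to make "enforcement must live in code" concrete, the sketch below checks two contract fields before any tool runs: that the tool is registered at all, and that max_calls_per_session has not been reached. ToolContract, ContractEnforcer, and ContractViolation are hypothetical names, and loading the contract from the YAML file is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    name: str
    timeout_seconds: float
    max_calls_per_session: int
    side_effects: str = "none"


class ContractViolation(Exception):
    """Raised when a call would break the tool's declared contract."""


class ContractEnforcer:
    def __init__(self, contracts: list[ToolContract]):
        self._contracts = {c.name: c for c in contracts}
        self._calls = Counter()  # (session_id, tool_name) -> calls so far

    def authorize(self, session_id: str, tool_name: str) -> ToolContract:
        """Count the call and return the contract, or raise if not allowed."""
        contract = self._contracts.get(tool_name)
        if contract is None:
            # Also covers tool names the model hallucinated.
            raise ContractViolation(f"No contract registered for '{tool_name}'")
        key = (session_id, tool_name)
        if self._calls[key] >= contract.max_calls_per_session:
            raise ContractViolation(
                f"'{tool_name}' reached {contract.max_calls_per_session} "
                f"calls in session {session_id}"
            )
        self._calls[key] += 1
        return contract
```

The returned contract gives the caller the timeout to apply, so the limit the reviewer approved and the limit the runtime uses come from the same object.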

Human Escalation And Product Design

Reliable agents also need a graceful way to stop. Teams often treat escalation as a failure because the demo looks better when the agent solves everything alone. In production, escalation is how you protect trust. If a tool is degraded, the user should see a concise explanation and a handoff path, not a stream of apologetic retries.

The handoff policy should be product-specific. A support agent can escalate after repeated tool errors. A financial agent should escalate before any ambiguous money movement. A code review agent can leave a blocking comment only when deterministic checks agree with the model's finding. The principle is the same: autonomy increases only where the system has evidence, observability, and a rollback path.

This is the product version of circuit breaking. Stop the agent before it turns uncertainty into action.
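For the support-agent case, "escalate after repeated tool errors" can be as small as a counter. A minimal sketch, assuming a threshold of three consecutive errors (the threshold and the EscalationPolicy name are illustrative; the right value is product-specific):

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    max_consecutive_tool_errors: int = 3
    consecutive_errors: int = 0

    def record(self, tool_succeeded: bool) -> bool:
        """Record one tool outcome; return True when the session should hand off."""
        if tool_succeeded:
            self.consecutive_errors = 0
            return False
        self.consecutive_errors += 1
        return self.consecutive_errors >= self.max_consecutive_tool_errors


policy = EscalationPolicy()
outcomes = [False, True, False, False, False]
print([policy.record(ok) for ok in outcomes])
```

A success resets the counter, so an occasional transient failure does not push a healthy session to a human.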

Production Considerations

Costs

Illustrative production cost model from anonymized interviews and internal measurements:

Agent Type	Avg Tokens/Session	Avg Cost/Session	Daily Sessions	Daily Cost
Customer support	8,400	$0.084	12,000	$1,008
Code review	24,000	$0.240	800	$192
Document analysis	45,000	$0.450	200	$90
SQL/data assistant	6,200	$0.062	5,000	$310

Cost-per-session is predictable if you enforce token budgets. Cost-per-day is unpredictable until you enforce circuit breakers.
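The table's arithmetic is easy to reproduce. Every row works out to $0.01 per 1,000 tokens; that rate is inferred from the table and is not a quoted provider price. The daily_cost helper below is a hypothetical name used only for this check.

```python
USD_PER_1K_TOKENS = 0.01  # rate implied by the table rows, not a quoted price


def daily_cost(
    avg_tokens_per_session: int,
    daily_sessions: int,
    usd_per_1k: float = USD_PER_1K_TOKENS,
) -> float:
    """Daily spend = tokens per session / 1000 * price per 1K * sessions."""
    cost_per_session = avg_tokens_per_session / 1000 * usd_per_1k
    return round(cost_per_session * daily_sessions, 2)


assert daily_cost(8_400, 12_000) == 1008.0   # customer support
assert daily_cost(24_000, 800) == 192.0      # code review
assert daily_cost(45_000, 200) == 90.0       # document analysis
assert daily_cost(6_200, 5_000) == 310.0     # SQL/data assistant
```

The same function shows why retry loops matter: session volume multiplies any per-session inflation directly into the daily figure.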

Scaling Patterns

Agents are stateful. Stateful services are harder to scale than stateless ones. The key architecture decision is where state lives:

In-process: Fast, but limits horizontal scaling to sticky sessions
External store (Redis): Adds a small per-turn latency hop, enables any-node routing
Checkpoint-based (LangGraph): Supports long-running agents with interrupts, adds a modest per-turn latency hop

Most high-scale teams externalize state to Redis with a TTL measured in days, accepting the slight latency cost for the scaling headroom.

Monitoring

The minimum metrics to alert on:

Tool error rate per tool per short rolling window
Token burn rate per hour vs. budget (alert at a large fraction of daily budget by midday)
Tail session duration compared with the normal median, which indicates stuck sessions
Prompt injection detection rate (log all, alert if rate spikes above baseline)
High-end cost per session compared with the normal median, which indicates cost spiral
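The two tail-versus-median checks in this list (session duration and cost per session) share one shape. A sketch using only the standard library; the 95th percentile and the 5x ratio are placeholder thresholds, and tail_ratio and is_spiraling are hypothetical names:

```python
import statistics


def tail_ratio(values: list[float], percentile: int = 95) -> float:
    """Ratio of the given percentile to the median of the sample."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[percentile - 1] / statistics.median(values)


def is_spiraling(values: list[float], max_ratio: float = 5.0) -> bool:
    """True when the tail is far above the median (stuck session or cost spiral)."""
    if len(values) < 2 or statistics.median(values) <= 0:
        return False  # not enough signal to judge
    return tail_ratio(values) > max_ratio


# 19 normal sessions and one runaway, with costs in cents.
print(is_spiraling([10.0] * 19 + [1000.0]))
print(is_spiraling([10.0] * 20))
```

Comparing against the median instead of a fixed dollar or second threshold keeps the alert meaningful as traffic and pricing change.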

Conclusion

The AI agent teams that are running reliably today are not the teams that built the cleverest prompts. They are the teams that treated their agents as distributed systems: designing for failure, instrumenting from day one, setting hard budgets, and iterating on the unhappy paths with the same rigor they applied to the happy path.

The 2am cost-spike incident was the best thing that happened to our agent infrastructure. It forced us to confront the gap between notebook success and real service behavior under adversarial conditions. Every pattern in this post came out of a real incident from a real team.

If you are shipping agents soon, run the production readiness checklist before launch. Add tracing. Build the circuit breaker. Design your tools for structured failure. The happy path will work fine. It always does.

The question is what happens when it doesn't.

Working code for all patterns in this post: github.com/amtocbot-droid/amtocbot-examples/tree/main/agentic-ai-production

Revision History

Date	Summary	Old Version
2026-06-08	Removed an unsupported survey citation, added runtime-contract and escalation guidance, softened or attributed measured incident and cost claims, reduced em-dash use, and added the missing comparison visual.	View previous version

Sources

Anthropic, "Building Effective Agents": https://www.anthropic.com/engineering/building-effective-agents
OWASP Top 10 for Large Language Model Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications
OWASP Top 10 for LLM Applications 2025 PDF: https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf
LangGraph documentation: persistence and checkpointing: https://langgraphjs.guide/persistence/
OpenTelemetry documentation: https://opentelemetry.io/docs/
Simon Willison, "Prompt Injection and AI Agents": https://simonwillison.net/tags/prompt-injection/

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

Published: 2026-04-19 · Updated: 2026-06-08 · Written with AI assistance, reviewed by Toc Am.
