Building Production AI Agents: Tool Use, Memory, and Multi-Agent Orchestration

Introduction

If you have been paying attention to the AI engineering landscape in 2026, you have noticed a dramatic shift. Agents are no longer conference demos or weekend hackathon projects. They are running in production at scale, handling real workloads, and generating real revenue. The transition happened faster than most predicted, driven by a convergence of mature SDKs, better tool-use protocols, and hard-won lessons from early adopters who burned through millions in token costs learning what not to do.

The ecosystem has exploded. Anthropic shipped the Claude Agent SDK. OpenAI released the Agents SDK with built-in tracing and handoffs. Google launched the Agent Development Kit (ADK) with tight Vertex AI integration. Microsoft continued iterating on AutoGen, now in its third major version. LangGraph matured into a serious orchestration framework. CrewAI found its niche in role-based multi-agent setups. The tooling is finally catching up to the ambition.

But here is the thing that does not show up in the launch blog posts: building a production agent is fundamentally different from building a production API or a production web app. Agents are non-deterministic by nature. They make decisions at runtime about which tools to call, how to decompose tasks, and when to stop. This makes them powerful, but it also makes them unpredictable, expensive, and difficult to test.

This post is a deep technical guide to the three pillars that separate toy agents from production agents: tool use, memory, and multi-agent orchestration. We will cover how tool calling actually works under the hood, how to architect memory systems that give agents the context they need without blowing through your token budget, and how to coordinate multiple agents to handle complex workflows. Along the way, we will build real, working code using Python and the Anthropic SDK, compare the major frameworks head-to-head, and share the production patterns that the industry has converged on after two years of trial and error.

Whether you are an engineering lead evaluating whether agents are ready for your use case, or a senior developer about to build your first production agent system, this guide will give you the technical foundation to make sound architectural decisions.

The Problem: From Demo to Production

Every engineer who has built an agent demo has experienced the same arc. Day one: the agent answers questions, calls tools, and produces impressive results. Day two: you show it to your team and everyone is excited. Day three: you try to run it on real data at real scale, and everything falls apart.

The gap between a working demo and a production system is enormous, and it manifests in predictable ways.

Hallucinated tool calls are the most common failure mode. The LLM decides to call a tool that does not exist, or passes arguments that do not match the schema, or invents parameter values that look plausible but are completely wrong. In a demo, you catch these immediately and fix your prompt. In production, they happen at 3 AM on the 847th request of the day, and your error handling either catches them gracefully or your system crashes.

Infinite loops happen when the agent gets stuck in a cycle: it calls a tool, gets a result it does not understand, decides it needs to call the tool again with slightly different parameters, gets another confusing result, and repeats until you hit your token limit or your budget alarm fires. Without explicit loop detection and maximum iteration counts, this will happen eventually.
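One way to catch these cycles before the budget alarm fires is to track tool-call signatures and flag exact repeats. A sketch, assuming exact-match detection is enough (the common failure mode of near-identical arguments would additionally need fuzzy matching):

```python
import json


def is_repeated_call(history: list[tuple[str, dict]], name: str, args: dict,
                     threshold: int = 3) -> bool:
    """
    Flag when the same tool is about to be called with identical
    arguments for the `threshold`-th time. Complements, rather than
    replaces, a hard max-iteration cap.
    """
    signature = (name, json.dumps(args, sort_keys=True))
    prior = sum(
        1 for n, a in history
        if (n, json.dumps(a, sort_keys=True)) == signature
    )
    return prior + 1 >= threshold
```

When the guard trips, the cheapest recovery is usually to inject a message telling the agent the approach is not working and it should try something else.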

Cost explosions are the silent killer. A single agent interaction might require 5-10 LLM calls with tool use, each consuming thousands of tokens. Multiply that by thousands of requests per day, and you are looking at serious infrastructure costs. The problem is compounded by context window accumulation: each turn in the agent loop appends the previous tool results to the context, so every call re-sends a longer history and total input-token cost grows roughly quadratically with the number of turns.
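The accumulation effect is easy to quantify: because each call re-sends the entire history so far, total input tokens grow roughly quadratically with turn count. Illustrative arithmetic only; real costs also include output tokens and per-model pricing:

```python
def cumulative_input_tokens(turn_tokens: list[int]) -> int:
    """
    Total input tokens across an agent loop where each turn re-sends
    the entire history accumulated so far.
    """
    total = 0
    history = 0
    for tokens_added in turn_tokens:
        history += tokens_added   # context grows every turn
        total += history          # each call pays for the whole history
    return total
```

Five turns that each add 1,000 tokens cost 15,000 input tokens in total, not 5,000, which is why prompt caching matters so much for agent loops.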

Context window limits create a hard ceiling on agent capability. Even with 200K token context windows, a complex multi-step agent task can fill that window surprisingly quickly. When you hit the limit, you either truncate history (losing important context) or fail the request entirely. Neither is acceptable in production.

Lack of observability might be the most dangerous problem because you do not know you have it until something goes wrong. In a traditional API, you can trace a request through your system and understand exactly what happened. In an agent system, the decision path is emergent: the LLM chose to call these tools in this order with these arguments for reasons that are not always transparent. Without proper tracing, debugging a production agent failure is like debugging a distributed system with no logs.

The path to production requires solving all five of these problems simultaneously, and that is what the rest of this post is about.

How Tool Use Actually Works

Tool use (sometimes called function calling) is the mechanism that transforms an LLM from a text generator into an agent that can take actions in the world. Understanding how it works at a technical level is essential for building reliable agent systems.

The Tool Definition Schema

When you send a request to an LLM with tools enabled, you include a list of tool definitions alongside your messages. Each tool definition is a JSON Schema object that describes the tool's name, purpose, and parameters. The LLM uses these definitions to decide when and how to call tools.

Here is what a tool definition looks like for the Anthropic API:

tools = [
    {
        "name": "search_web",
        "description": (
            "Search the web for current information on a topic. "
            "Use this when the user asks about recent events, current data, "
            "or anything that may have changed after your training cutoff."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to execute"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (1-10)",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_url",
        "description": (
            "Fetch and read the content of a specific URL. "
            "Returns the main text content of the page."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL to fetch"
                }
            },
            "required": ["url"]
        }
    },
    {
        "name": "store_finding",
        "description": (
            "Store a research finding in the agent's memory for later synthesis. "
            "Use this to save important facts, quotes, or data points."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {
                    "type": "string",
                    "description": "A short label for this finding"
                },
                "content": {
                    "type": "string",
                    "description": "The finding content to store"
                },
                "source": {
                    "type": "string",
                    "description": "URL or reference where this was found"
                }
            },
            "required": ["key", "content"]
        }
    }
]

The quality of your tool descriptions directly impacts how reliably the LLM uses them. Vague descriptions lead to hallucinated calls. Overly specific descriptions lead to tools never being used. The sweet spot is clear, action-oriented descriptions that explain both what the tool does and when to use it.

The Tool-Use Loop


The fundamental pattern of tool use is a loop. You send messages to the LLM, it responds with either a final text answer or a request to use one or more tools, you execute those tools, send the results back, and repeat until the LLM produces a final answer.

Here is a complete implementation of the tool-use loop, suitable as the skeleton of a production system:

import anthropic
import json
from typing import Any

client = anthropic.Anthropic()

# Maximum iterations to prevent infinite loops
MAX_ITERATIONS = 15
MODEL = "claude-sonnet-4-20250514"


def execute_tool(name: str, args: dict) -> Any:
    """
    Route tool calls to their implementations.
    In production, each tool would be its own module with
    error handling, retries, and timeouts.
    """
    if name == "search_web":
        return search_web(args["query"], args.get("max_results", 5))
    elif name == "read_url":
        return read_url(args["url"])
    elif name == "store_finding":
        return store_finding(args["key"], args["content"], args.get("source"))
    else:
        return {"error": f"Unknown tool: {name}"}


def run_agent(user_message: str, system_prompt: str, tools: list) -> str:
    """
    Execute the full agent loop with tool use.
    
    Returns the final text response from the agent.
    Raises RuntimeError if max iterations exceeded.
    """
    messages = [{"role": "user", "content": user_message}]
    
    for iteration in range(MAX_ITERATIONS):
        # Call the LLM with current message history and tools
        response = client.messages.create(
            model=MODEL,
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        
        # Check if the model wants to use tools
        if response.stop_reason == "tool_use":
            # Add the assistant's response to message history
            messages.append({
                "role": "assistant",
                "content": response.content,
            })
            
            # Process each tool use block in the response
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  [Tool Call] {block.name}({json.dumps(block.input)[:100]}...)")
                    
                    # Execute the tool with error handling
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result) if not isinstance(result, str) else result,
                        })
                    except Exception as e:
                        # Return errors to the LLM so it can adapt
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"Error executing {block.name}: {str(e)}",
                            "is_error": True,
                        })
            
            # Send tool results back to the LLM
            messages.append({"role": "user", "content": tool_results})
        
        elif response.stop_reason == "end_turn":
            # Extract the final text response
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            return "\n".join(text_blocks)
        
        else:
            # Handle unexpected stop reasons
            return f"Agent stopped unexpectedly: {response.stop_reason}"
    
    raise RuntimeError(
        f"Agent exceeded maximum iterations ({MAX_ITERATIONS}). "
        "This usually indicates a loop in the agent's reasoning."
    )

Parallel vs Sequential Tool Calls

Modern LLMs can request multiple tool calls in a single response. For example, if the agent decides it needs to search for three different queries, it can emit all three tool_use blocks at once rather than waiting for each result sequentially. This is a significant performance optimization: three parallel web searches complete in the time of one.

Your agent loop needs to handle this correctly. The code above already does: it iterates over all tool_use blocks in the response and returns all results together. In production, you would execute these tool calls concurrently using asyncio.gather or a thread pool.
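A sketch of that concurrent execution with asyncio.gather; `execute_tool_async` here is a hypothetical async dispatcher standing in for real tool implementations (synchronous tools can be wrapped with `asyncio.to_thread`):

```python
import asyncio


async def execute_tool_async(name: str, args: dict) -> str:
    # Hypothetical async dispatcher standing in for real tools.
    await asyncio.sleep(0)  # stand-in for real I/O
    return f"result of {name}"


async def run_tools_concurrently(tool_blocks: list[dict]) -> list[dict]:
    """Execute all tool_use blocks from one response concurrently."""
    results = await asyncio.gather(
        *(execute_tool_async(b["name"], b["input"]) for b in tool_blocks),
        return_exceptions=True,  # one failed tool must not sink the rest
    )
    tool_results = []
    for block, result in zip(tool_blocks, results):
        if isinstance(result, Exception):
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": f"Error executing {block['name']}: {result}",
                "is_error": True,
            })
        else:
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": result,
            })
    return tool_results
```

Note that `return_exceptions=True` preserves the error-as-tool-result strategy: a failed tool becomes an error result for the LLM to reason about, not an exception that aborts the whole batch.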

Error Handling Strategy

The critical insight for production tool use is this: tool errors should be returned to the LLM, not raised as exceptions. When a tool fails, the LLM can often adapt by trying a different approach, using a different tool, or asking the user for clarification. Hard-crashing on tool errors throws away the LLM's ability to reason about failures.

The is_error: True flag in the tool result tells the LLM that something went wrong, and it should factor that into its next decision.

Memory Architectures for Agents

Without memory, every agent interaction starts from zero. The agent has no knowledge of previous conversations, no accumulated context, and no ability to build on past work. Memory is what transforms a stateless tool-calling loop into something that feels like an intelligent collaborator.


Three Tiers of Agent Memory

Short-term memory is the conversation context itself: the messages array that you send to the LLM on each turn. This is the simplest form of memory and the one every agent has by default. The limitation is the context window: once you exceed the model's token limit, you must start dropping older messages. Strategies for managing short-term memory include sliding window (drop the oldest messages), summarization (periodically compress the conversation into a summary), and selective retention (keep tool results but drop intermediate reasoning).
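The sliding-window strategy fits in a few lines. This sketch keeps the first message (which usually defines the task) plus as many of the most recent messages as fit; it uses a rough character budget as a stand-in for real token counting:

```python
def trim_history(messages: list[dict], max_chars: int = 40_000) -> list[dict]:
    """
    Sliding-window trim: always keep the first message, then keep as
    many of the most recent messages as fit the budget. Production
    code would count tokens with the model's tokenizer instead of
    characters.
    """
    if not messages:
        return messages
    head, tail = messages[0], messages[1:]
    budget = max_chars - len(str(head))
    kept: list[dict] = []
    for msg in reversed(tail):          # walk from newest to oldest
        cost = len(str(msg))
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [head] + kept[::-1]          # restore chronological order
```

One caveat when applying this to tool-use transcripts: a `tool_result` message must not be kept without the assistant message containing the matching `tool_use` block, so in practice you trim at turn boundaries rather than arbitrary messages.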

Working memory is a scratchpad that the agent uses during a single task. Think of it as the agent's notepad: a place to store intermediate results, track progress on multi-step tasks, and maintain state between tool calls. Working memory is typically implemented as a structured object (dictionary or class instance) that persists for the duration of the task but is discarded afterward.

Long-term memory is persistent storage that survives across conversations and tasks. This is where the agent stores learned facts, user preferences, past research results, and any other information that should be available in future sessions. Long-term memory is typically implemented using a vector database (for semantic search) or a traditional database (for structured data).

Comparison of Memory Approaches

| Approach | Persistence | Retrieval | Capacity | Latency | Cost | Best For |
|----------|-------------|-----------|----------|---------|------|----------|
| Context Window | None (per-turn) | Automatic | 100-200K tokens | None | Per-token | Short conversations |
| Sliding Window | None (per-session) | Automatic | Configurable | None | Per-token | Long conversations |
| Summarization | Per-session | Automatic | Compressed | LLM call | Moderate | Multi-hour sessions |
| Vector DB | Persistent | Semantic search | Unlimited | 10-50ms | Storage + embedding | Knowledge bases |
| SQL/KV Store | Persistent | Exact match | Unlimited | 1-10ms | Storage only | User prefs, structured data |
| Hybrid (Vector + KV) | Persistent | Both | Unlimited | 10-50ms | Combined | Production agents |

Implementation: A Memory Manager

Here is a working memory manager that combines all three tiers:

import hashlib
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryEntry:
    """A single memory entry with metadata."""
    key: str
    content: str
    source: Optional[str] = None
    timestamp: float = field(default_factory=time.time)
    access_count: int = 0
    
    def to_context_string(self) -> str:
        """Format this memory entry for inclusion in the LLM context."""
        parts = [f"[{self.key}]: {self.content}"]
        if self.source:
            parts.append(f"  Source: {self.source}")
        return "\n".join(parts)


class AgentMemory:
    """
    Three-tier memory system for production agents.
    
    - Short-term: managed externally via the messages array
    - Working memory: in-memory scratchpad for the current task
    - Long-term: persistent storage (vector DB or KV store)
    
    This implementation uses an in-memory dict for long-term storage
    as a demonstration. In production, replace with your vector DB
    client (Pinecone, Weaviate, ChromaDB, pgvector, etc).
    """
    
    def __init__(self, max_working_memory: int = 50):
        # Working memory: scratchpad for current task
        self.working: dict[str, MemoryEntry] = {}
        self.max_working = max_working_memory
        
        # Long-term memory: persistent store
        # Replace with vector DB in production
        self._long_term_store: dict[str, MemoryEntry] = {}
    
    def store_working(self, key: str, content: str, source: Optional[str] = None) -> str:
        """
        Store a finding in working memory for the current task.
        Evicts least-recently-accessed entries if at capacity.
        """
        if len(self.working) >= self.max_working:
            # Evict the entry with the lowest access count
            evict_key = min(
                self.working, 
                key=lambda k: self.working[k].access_count
            )
            del self.working[evict_key]
        
        entry = MemoryEntry(key=key, content=content, source=source)
        self.working[key] = entry
        return f"Stored in working memory: {key}"
    
    def retrieve_working(self, key: str) -> Optional[str]:
        """Retrieve a specific entry from working memory."""
        if key in self.working:
            self.working[key].access_count += 1
            return self.working[key].to_context_string()
        return None
    
    def get_working_context(self, max_tokens: int = 2000) -> str:
        """
        Get all working memory as a formatted string for
        injection into the LLM context. Respects a rough
        token budget (estimated at 4 chars per token).
        """
        entries = sorted(
            self.working.values(),
            key=lambda e: e.timestamp,
            reverse=True,
        )
        
        context_parts = ["## Current Working Memory"]
        char_budget = max_tokens * 4  # rough chars-per-token estimate
        char_count = 0
        
        for entry in entries:
            entry_str = entry.to_context_string()
            if char_count + len(entry_str) > char_budget:
                context_parts.append("... (older entries truncated)")
                break
            context_parts.append(entry_str)
            char_count += len(entry_str)
        
        return "\n".join(context_parts)
    
    def commit_to_long_term(self, key: str) -> str:
        """
        Move a working memory entry to long-term storage.
        In production, this would generate an embedding and
        upsert into your vector database.
        """
        if key not in self.working:
            return f"Key '{key}' not found in working memory"
        
        entry = self.working[key]
        # Generate a stable ID for deduplication
        content_hash = hashlib.sha256(entry.content.encode()).hexdigest()[:12]
        storage_key = f"{key}_{content_hash}"
        
        self._long_term_store[storage_key] = entry
        return f"Committed to long-term memory: {storage_key}"
    
    def search_long_term(self, query: str, limit: int = 5) -> list[str]:
        """
        Search long-term memory for relevant entries.
        
        This naive implementation does substring matching.
        In production, you would:
        1. Embed the query using your embedding model
        2. Search your vector DB for nearest neighbors
        3. Return the top-k results with similarity scores
        """
        results = []
        query_lower = query.lower()
        
        for entry in self._long_term_store.values():
            if (query_lower in entry.content.lower() 
                    or query_lower in entry.key.lower()):
                results.append(entry.to_context_string())
                if len(results) >= limit:
                    break
        
        return results
    
    def clear_working(self) -> str:
        """Clear all working memory. Call this between tasks."""
        count = len(self.working)
        self.working.clear()
        return f"Cleared {count} entries from working memory"

Memory in the Agent Loop

To integrate memory with the agent loop, inject the working memory context into the system prompt before each LLM call, and expose memory operations as tools. The store_finding tool we defined earlier writes to working memory. You can add recall_memory and search_memory tools that read from it.

The key design principle is that memory retrieval should be automatic for working memory (injected into every prompt) but tool-mediated for long-term memory (the agent decides when to search). This keeps the context window manageable while giving the agent access to its full knowledge base.
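The wiring described above can be sketched as follows. The `recall_memory` and `search_memory` tool definitions are ours (illustrative, not from any SDK), and `memory` is any object in the style of the AgentMemory class above:

```python
memory_tools = [
    {
        "name": "recall_memory",
        "description": "Retrieve a specific working-memory entry by its key.",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string", "description": "Label of the entry"}
            },
            "required": ["key"],
        },
    },
    {
        "name": "search_memory",
        "description": "Search long-term memory for entries relevant to a query.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"}
            },
            "required": ["query"],
        },
    },
]


def build_system_prompt(base_prompt: str, memory) -> str:
    """Inject working memory into the system prompt before each LLM call."""
    return f"{base_prompt}\n\n{memory.get_working_context(max_tokens=2000)}"


def execute_memory_tool(memory, name: str, args: dict):
    """Route the memory tools to an AgentMemory-style object."""
    if name == "recall_memory":
        return memory.retrieve_working(args["key"]) or "Not found"
    if name == "search_memory":
        return memory.search_long_term(args["query"]) or ["No matches"]
    return f"Unknown memory tool: {name}"
```

Working memory rides along automatically via `build_system_prompt`; long-term memory is only touched when the agent explicitly calls `search_memory`.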

Multi-Agent Orchestration Patterns

Once you have a single agent working reliably, the natural next step is composing multiple agents to handle complex workflows. Multi-agent orchestration is where agent systems start to deliver transformative value, but it is also where complexity grows fastest.


Pattern 1: Sequential Pipeline

The simplest multi-agent pattern is a pipeline where each agent processes the output of the previous one. Agent A does research, passes its findings to Agent B for analysis, which passes its analysis to Agent C for writing.

When to use: Linear workflows where each step has a clear input/output contract. Content generation pipelines, data processing chains, review workflows.

Limitation: No parallelism, no feedback loops. If Agent C finds a problem with Agent A's research, there is no mechanism to go back.
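The pattern itself is just a fold over the stages. A sketch where `call_agent(system_prompt, input_text)` is a stand-in for one complete tool-use loop (for instance, the run_agent function from earlier):

```python
from typing import Callable


def run_pipeline(
    task: str,
    stages: list[tuple[str, str]],
    call_agent: Callable[[str, str], str],
) -> tuple[str, dict[str, str]]:
    """
    Run each stage's agent on the previous stage's output.
    Intermediate outputs are kept for debugging, since pipeline
    failures are usually traced to one specific stage.
    """
    outputs: dict[str, str] = {}
    current = task
    for name, system_prompt in stages:
        current = call_agent(system_prompt, current)
        outputs[name] = current
    return current, outputs
```

Because each stage sees only the previous stage's output, the input/output contract between stages is the whole design: a weak contract (vague prose handoffs) degrades quality at every hop.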

Pattern 2: Router / Dispatcher

A lightweight routing agent examines incoming requests and dispatches them to specialized agents. The router does not do the work itself; it classifies the task and hands it off.

When to use: Customer support systems, multi-domain assistants, any system where different types of requests require fundamentally different handling.

Limitation: The router must be highly reliable. A misrouted request fails completely. Router agents should be fast and cheap (small model, few tokens).
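A sketch of a router with an explicit fallback, where `classify` stands in for the small-model LLM call that returns a label:

```python
from typing import Callable


def route_request(
    request: str,
    classify: Callable[[str], str],
    handlers: dict[str, Callable[[str], str]],
) -> str:
    """
    Classify the request with a fast, cheap model, then hand off to
    the matching specialist. Unrecognized labels fall back to a
    general handler instead of failing the request outright.
    """
    label = classify(request).strip().lower()
    handler = handlers.get(label, handlers["general"])
    return handler(request)
```

The normalization and the `general` fallback are the production-hardening details: small models occasionally return labels with stray casing or whitespace, and a fallback turns a misroute into a degraded answer rather than a dropped request.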

Pattern 3: Supervisor / Worker

A supervisor agent breaks complex tasks into subtasks, delegates them to worker agents, collects results, and synthesizes a final output. The supervisor can re-delegate, ask for revisions, and make judgment calls about quality.

When to use: Complex, multi-step tasks where the decomposition is not known in advance. Research projects, code generation with review, any task requiring judgment about completeness.

This is the most common production pattern. Here is a skeleton implementation; the worker tool lists and executors are assumed to be defined elsewhere:

import anthropic
import json
from typing import Any, Callable

client = anthropic.Anthropic()


def run_worker_agent(
    worker_name: str,
    task: str,
    tools: list,
    tool_executor: Callable[[str, dict], Any],
    model: str = "claude-sonnet-4-20250514",
    max_iterations: int = 10,
) -> str:
    """
    Run a specialized worker agent to completion.
    
    Each worker gets its own system prompt, tools, and message history.
    Workers are isolated from each other and from the supervisor.
    """
    system_prompt = (
        f"You are the {worker_name} agent. Complete the assigned task "
        f"thoroughly and return your findings. Be specific and factual."
    )
    
    messages = [{"role": "user", "content": task}]
    
    for _ in range(max_iterations):
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = tool_executor(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result) if not isinstance(result, str) else result,
                        })
                    except Exception as e:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"Error: {str(e)}",
                            "is_error": True,
                        })
            
            messages.append({"role": "user", "content": tool_results})
        else:
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            return "\n".join(text_blocks)
    
    return f"Worker {worker_name} exceeded max iterations."


def run_supervisor(user_task: str) -> str:
    """
    Supervisor agent that decomposes a task and delegates to workers.
    
    The supervisor uses tool calls to invoke worker agents,
    review their output, and synthesize a final result.
    """
    supervisor_tools = [
        {
            "name": "delegate_research",
            "description": "Delegate a research subtask to the Research Agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "task": {
                        "type": "string",
                        "description": "The research task to delegate"
                    }
                },
                "required": ["task"]
            }
        },
        {
            "name": "delegate_code",
            "description": "Delegate a coding subtask to the Code Agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "task": {
                        "type": "string",
                        "description": "The coding task to delegate"
                    }
                },
                "required": ["task"]
            }
        },
        {
            "name": "delegate_review",
            "description": "Delegate a review subtask to the Review Agent.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "task": {
                        "type": "string",
                        "description": "The content or code to review"
                    }
                },
                "required": ["task"]
            }
        },
    ]
    
    system_prompt = (
        "You are a Supervisor agent. Your job is to break complex tasks "
        "into subtasks and delegate them to specialized worker agents. "
        "You have three workers: Research (for information gathering), "
        "Code (for writing and executing code), and Review (for quality checks). "
        "Delegate work, collect results, and synthesize a final answer."
    )
    
    def execute_supervisor_tool(name: str, args: dict) -> str:
        if name == "delegate_research":
            return run_worker_agent(
                "Research",
                args["task"],
                tools=research_tools,       # defined elsewhere
                tool_executor=research_executor,
            )
        elif name == "delegate_code":
            return run_worker_agent(
                "Code",
                args["task"],
                tools=code_tools,
                tool_executor=code_executor,
            )
        elif name == "delegate_review":
            return run_worker_agent(
                "Review",
                args["task"],
                tools=review_tools,
                tool_executor=review_executor,
            )
        return f"Unknown delegation target: {name}"
    
    # Run the supervisor through the standard agent loop
    messages = [{"role": "user", "content": user_task}]
    
    for _ in range(20):  # supervisor gets more iterations
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=supervisor_tools,
            messages=messages,
        )
        
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_supervisor_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            
            messages.append({"role": "user", "content": tool_results})
        else:
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            return "\n".join(text_blocks)
    
    return "Supervisor exceeded maximum iterations."

Pattern 4: Peer-to-Peer

Agents communicate directly with each other without a central coordinator. Each agent can send messages to any other agent, creating a collaborative network.

When to use: Debate/adversarial setups, consensus-building, creative brainstorming.

Limitation: Hardest to debug and control. Without a supervisor, there is no single point of accountability. Use sparingly and with strict message budgets.
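A minimal sketch of a budgeted exchange, with each agent modeled as a plain callable standing in for an LLM call with its own system prompt:

```python
from typing import Callable


def run_debate(
    agents: dict[str, Callable[[str], str]],
    opening: str,
    max_messages: int = 6,
) -> list[tuple[str, str]]:
    """
    Agents take turns responding to the latest message until the
    message budget is exhausted. The hard cap on max_messages is the
    non-negotiable part: without it, peer-to-peer setups can chatter
    indefinitely.
    """
    transcript: list[tuple[str, str]] = []
    names = list(agents)
    message = opening
    for i in range(max_messages):
        speaker = names[i % len(names)]
        message = agents[speaker](message)
        transcript.append((speaker, message))
    return transcript
```

Keeping the full transcript is what makes this pattern debuggable at all; a judge step (human or LLM) typically reads it afterward to pick a winner or extract the consensus.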

Orchestration Pattern Comparison

| Pattern | Complexity | Parallelism | Feedback Loops | Debuggability | Best Use Case |
|---------|-----------|-------------|----------------|---------------|---------------|
| Sequential Pipeline | Low | None | None | High | Linear workflows |
| Router / Dispatcher | Low-Medium | Per-request | None | High | Multi-domain classification |
| Supervisor / Worker | Medium | Per-subtask | Via supervisor | Medium | Complex decomposable tasks |
| Peer-to-Peer | High | Full | Direct | Low | Debate, consensus |

Implementation Guide: Building a Research Agent

Let us put everything together and build a complete research agent. This agent takes a question, searches the web, reads relevant pages, stores findings in memory, and synthesizes a final answer.

import anthropic
import json
import httpx
from agent_memory import AgentMemory  # our memory class from earlier

client = anthropic.Anthropic()
memory = AgentMemory(max_working_memory=30)


# --- Tool implementations ---

def search_web(query: str, max_results: int = 5) -> dict:
    """
    Search the web using a search API.
    Replace with your preferred search provider
    (Brave Search, Tavily, SerpAPI, etc).
    """
    # Example using Brave Search API
    resp = httpx.get(
        "https://api.search.brave.com/res/v1/web/search",
        params={"q": query, "count": max_results},
        headers={"X-Subscription-Token": "YOUR_API_KEY"},
        timeout=10.0,
    )
    resp.raise_for_status()
    data = resp.json()
    
    results = []
    for item in data.get("web", {}).get("results", []):
        results.append({
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            "snippet": item.get("description", ""),
        })
    
    return {"results": results, "query": query}


def read_url(url: str) -> dict:
    """
    Fetch and extract text content from a URL.
    Uses a simple approach; in production, use a proper
    content extraction library like trafilatura or
    a headless browser for JS-rendered pages.
    """
    try:
        resp = httpx.get(
            url,
            timeout=15.0,
            follow_redirects=True,
            headers={"User-Agent": "ResearchAgent/1.0"},
        )
        resp.raise_for_status()
        
        # Naive text extraction - replace with proper parser
        from html.parser import HTMLParser
        
        class TextExtractor(HTMLParser):
            def __init__(self):
                super().__init__()
                self.text_parts = []
                self._skip = False
            
            def handle_starttag(self, tag, attrs):
                if tag in ("script", "style", "nav", "header", "footer"):
                    self._skip = True
            
            def handle_endtag(self, tag):
                if tag in ("script", "style", "nav", "header", "footer"):
                    self._skip = False
            
            def handle_data(self, data):
                if not self._skip and data.strip():
                    self.text_parts.append(data.strip())
        
        extractor = TextExtractor()
        extractor.feed(resp.text)
        text = " ".join(extractor.text_parts)
        
        # Truncate to avoid blowing the context window
        max_chars = 8000
        if len(text) > max_chars:
            text = text[:max_chars] + "... [truncated]"
        
        return {"url": url, "content": text, "status": "success"}
    
    except Exception as e:
        return {"url": url, "content": "", "status": f"error: {str(e)}"}


def store_finding(key: str, content: str, source: str | None = None) -> dict:
    """Store a research finding in working memory."""
    result = memory.store_working(key, content, source)
    return {"status": "stored", "key": key, "message": result}


def recall_findings() -> dict:
    """Retrieve all current working memory as context."""
    context = memory.get_working_context(max_tokens=3000)
    return {"memory": context, "entry_count": len(memory.working)}


# --- Tool definitions for the API ---

RESEARCH_TOOLS = [
    {
        "name": "search_web",
        "description": (
            "Search the web for current information. Use this to find "
            "relevant articles, papers, and sources on a topic."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "max_results": {"type": "integer", "description": "Max results (1-10)", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_url",
        "description": "Fetch and read the text content of a webpage.",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "URL to read"},
            },
            "required": ["url"],
        },
    },
    {
        "name": "store_finding",
        "description": (
            "Store an important finding in memory for later synthesis. "
            "Use this whenever you discover a key fact or data point."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string", "description": "Short label for this finding"},
                "content": {"type": "string", "description": "The finding to store"},
                "source": {"type": "string", "description": "Source URL"},
            },
            "required": ["key", "content"],
        },
    },
    {
        "name": "recall_findings",
        "description": (
            "Retrieve all stored findings from memory. Use this before "
            "writing your final synthesis to review what you have learned."
        ),
        "input_schema": {
            "type": "object",
            "properties": {},
        },
    },
]


def execute_research_tool(name: str, args: dict):
    """Route tool calls to implementations."""
    dispatch = {
        "search_web": lambda a: search_web(a["query"], a.get("max_results", 5)),
        "read_url": lambda a: read_url(a["url"]),
        "store_finding": lambda a: store_finding(a["key"], a["content"], a.get("source")),
        "recall_findings": lambda a: recall_findings(),
    }
    handler = dispatch.get(name)
    if handler:
        return handler(args)
    return {"error": f"Unknown tool: {name}"}


def research(question: str) -> str:
    """
    Run the full research agent on a question.
    
    The agent will:
    1. Search the web for relevant information
    2. Read promising sources
    3. Store key findings in memory
    4. Recall all findings
    5. Synthesize a comprehensive answer
    """
    memory.clear_working()  # fresh scratchpad for each research task
    
    system_prompt = (
        "You are a thorough research agent. Given a question, you must:\n"
        "1. Search the web for relevant, recent information\n"
        "2. Read at least 2-3 sources to cross-reference facts\n"
        "3. Store each important finding using store_finding\n"
        "4. Before writing your final answer, use recall_findings to review\n"
        "5. Synthesize a comprehensive, well-sourced answer\n\n"
        "Be thorough but efficient. Do not read more than 5 sources. "
        "Always cite your sources in the final answer."
    )
    
    messages = [{"role": "user", "content": question}]
    max_iterations = 15
    
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=RESEARCH_TOOLS,
            messages=messages,
        )
        
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  [{iteration}] {block.name}: {json.dumps(block.input)[:80]}")
                    try:
                        result = execute_research_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result),
                        })
                    except Exception as e:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"Error: {str(e)}",
                            "is_error": True,
                        })
            
            messages.append({"role": "user", "content": tool_results})
        else:
            text_blocks = [b.text for b in response.content if hasattr(b, "text")]
            final_answer = "\n".join(text_blocks)
            print(f"\n  Research complete after {iteration + 1} iterations")
            print(f"  Findings stored: {len(memory.working)}")
            return final_answer
    
    return "Research agent exceeded maximum iterations."


# --- Entry point ---

if __name__ == "__main__":
    question = "What are the latest developments in AI agent frameworks in 2026?"
    print(f"Researching: {question}\n")
    answer = research(question)
    print(f"\n{'='*60}\n{answer}")

This implementation demonstrates all three pillars working together. Tool use handles the web search and page reading. Memory stores and retrieves findings across multiple tool-use iterations. And the agent loop itself is the simplest form of orchestration: a single agent with a clear task decomposition strategy encoded in its system prompt.

Comparison: Agent Frameworks in 2026

The framework landscape has matured significantly. Here is a head-to-head comparison of the major options as of early 2026:

| Framework | Language | Tool Use | Multi-Agent | Memory | Observability | Production-Ready | Learning Curve |
|-----------|----------|----------|-------------|--------|---------------|-----------------|----------------|
| Claude Agent SDK | Python, TS | Native | Handoffs, delegation | Manual | Built-in tracing | High | Low |
| OpenAI Agents SDK | Python | Native | Handoffs, guardrails | Manual | Built-in tracing | High | Low |
| LangGraph | Python, JS | Via LangChain | Graph-based orchestration | Checkpointing | LangSmith | High | Medium-High |
| CrewAI | Python | Built-in | Role-based crews | Shared memory | Basic logging | Medium | Low |
| AutoGen (v3) | Python | Built-in | Conversation-based | Teachability | Basic | Medium | Medium |
| Google ADK | Python | Native (Vertex) | Agent-to-agent | Session-based | Cloud Trace | High (on GCP) | Medium |

Claude Agent SDK and OpenAI Agents SDK are the most straightforward choices if you are already committed to one provider's models. Both offer clean APIs for tool use, built-in tracing, and simple multi-agent patterns via handoffs. The main trade-off is provider lock-in: switching models later means rewriting your agent code.

LangGraph is the most flexible option for complex orchestration. Its graph-based approach lets you model arbitrary agent workflows with cycles, conditional branching, and persistent state via checkpointing. The trade-off is complexity: LangGraph has a steep learning curve and adds significant abstraction overhead.

CrewAI occupies a unique niche with its role-based approach. You define agents as "roles" (Researcher, Writer, Reviewer) and CrewAI handles the orchestration. It is the fastest path from zero to a working multi-agent system, but the abstraction can be limiting for custom workflows.

AutoGen from Microsoft focuses on conversation-based multi-agent patterns. Agents communicate via structured messages, which makes it natural for debate and review workflows. Version 3 improved production-readiness significantly, but it still lags behind the provider SDKs in observability.

Google ADK is the clear choice if you are building on Google Cloud. Tight integration with Vertex AI, Cloud Trace, and other GCP services makes it powerful in that ecosystem, but it is less portable than the alternatives.

The right choice depends on your constraints. For most teams starting out, the provider SDKs (Claude Agent SDK or OpenAI Agents SDK) offer the best balance of simplicity and capability. Graduate to LangGraph when you need complex orchestration that the simpler frameworks cannot express.

Production Considerations

Building a working agent is the easy part. Keeping it running reliably at scale is where the real engineering happens.

Cost management is the number one operational concern. Every agent interaction involves multiple LLM calls, and costs compound with context length. Implement token budgets per task (hard-fail if exceeded), use prompt caching aggressively (the Anthropic API supports automatic caching of repeated prefixes), and monitor cost per interaction in real time. Consider using smaller, cheaper models for simple subtasks and reserving frontier models for complex reasoning. A supervisor on Claude Sonnet delegating to workers on Haiku can cut costs by 80% with minimal quality impact.
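A per-task token budget with hard failure can be a small helper class. This is a minimal sketch, not a standard API: the `TokenBudget` class is hypothetical, and the usage numbers would come from each API response's usage field (`input_tokens` / `output_tokens` on the Anthropic API).

```python
# Sketch: a hard per-task token budget. The class is a hypothetical
# helper; usage numbers come from the API response's usage field.

class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record usage; hard-fail the task once the budget is spent."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.max_tokens} tokens"
            )

budget = TokenBudget(max_tokens=50_000)
# In the agent loop, after each API call:
#     budget.charge(response.usage.input_tokens, response.usage.output_tokens)
budget.charge(12_000, 3_000)  # within budget: 15,000 used
```

Catching `TokenBudgetExceeded` at the task boundary lets you return a partial result or an explicit failure instead of silently burning tokens.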

Observability and tracing are non-negotiable. Every agent run should produce a trace that shows the full sequence of LLM calls, tool invocations, and decision points. Both the Claude and OpenAI SDKs ship with built-in tracing. If you are building your own, emit structured logs for each turn: the messages sent, the response received, which tools were called, and the results. Store these traces and build dashboards that show success rates, latency distributions, cost per interaction, and common failure modes.
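If you are rolling your own tracing, one structured log line per turn is enough to start. A minimal stdlib-only sketch, with illustrative field names rather than any standard schema:

```python
# Sketch: structured per-turn trace logging using only the stdlib.
# Field names are illustrative, not a standard trace schema.
import json
import time
import uuid

def log_turn(run_id: str, turn: int, model: str,
             tool_calls: list[str], input_tokens: int,
             output_tokens: int, latency_ms: float) -> str:
    """Emit one structured JSON log line per agent turn."""
    record = {
        "run_id": run_id,
        "turn": turn,
        "timestamp": time.time(),
        "model": model,
        "tool_calls": tool_calls,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    print(line)  # swap for your log shipper of choice
    return line

run_id = str(uuid.uuid4())
log_turn(run_id, 0, "claude-sonnet-4-20250514",
         ["search_web"], 1200, 340, 850.0)
```

Because every line shares a `run_id`, a dashboard can reassemble the full trace of a run and aggregate cost and latency across runs.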

Error handling and circuit breakers protect your system from cascading failures. When a tool consistently fails (API down, rate limited), a circuit breaker stops calling it and returns a cached or default response. Implement retries with exponential backoff for transient failures, but set a maximum retry count. Distinguish between recoverable errors (tool timeout, rate limit) and unrecoverable errors (invalid schema, permission denied).
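The circuit breaker itself is a few lines of state. This sketch keeps only the open/closed states; a production implementation would also add a half-open state and a recovery timer, and the threshold and fallback here are illustrative.

```python
# Sketch: a minimal circuit breaker for a flaky tool. Real versions
# add a half-open state and a recovery timer.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn, *args, fallback=None):
        """Run fn; after repeated failures, stop calling and return fallback."""
        if self.open:
            return fallback
        try:
            result = fn(*args)
            self.failures = 0  # success resets the counter
            return result
        except Exception:
            self.failures += 1
            return fallback

breaker = CircuitBreaker(failure_threshold=2)

def flaky_tool(x):
    raise TimeoutError("tool is down")

breaker.call(flaky_tool, 1, fallback="cached")  # failure 1
breaker.call(flaky_tool, 1, fallback="cached")  # failure 2: breaker opens
```

Once open, the breaker returns the fallback immediately, so a dead dependency stops consuming retries, latency, and tokens.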

Rate limiting applies at multiple levels. Your LLM provider has rate limits on tokens per minute and requests per minute. Your tool endpoints (web search APIs, databases) have their own limits. And you should impose your own limits on agent iterations and concurrent tasks. Build a queuing system that respects all three layers of rate limiting.
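One token bucket per layer is a simple way to respect all three limits. A minimal sketch with illustrative rates; real deployments often use an async variant or a shared store for multi-process limits:

```python
# Sketch: a token-bucket limiter you can instantiate per layer
# (provider requests, tool endpoints, agent iterations).
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            time.sleep((cost - self.tokens) / self.rate)

# One bucket per layer, e.g. for the LLM provider's request limit:
llm_bucket = TokenBucket(rate_per_sec=2.0, capacity=4.0)
llm_bucket.acquire()  # returns immediately while the bucket has tokens
```

Calling `acquire()` before each API call and each tool call makes the limits composable: a request proceeds only when every layer's bucket has capacity.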

Testing agents is fundamentally different from testing deterministic code. You cannot write unit tests that assert exact outputs. Instead, build an evaluation framework that runs your agent against a curated set of tasks and scores the results on criteria like accuracy, completeness, tool efficiency, and cost. Track these eval scores over time and block deployments that regress beyond a threshold. Several open-source eval frameworks have matured in this space, including Braintrust, Promptfoo, and the built-in eval tooling in the provider SDKs.
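The skeleton of such an eval harness fits in a few lines. In this sketch, `run_agent` and the keyword-based scorer are hypothetical stand-ins; in practice scoring is often another LLM call or a task-specific checker.

```python
# Sketch: a minimal eval harness. run_agent and the scorer are
# hypothetical; real scoring is often LLM-based or task-specific.
def keyword_score(answer: str, required_keywords: list[str]) -> float:
    """Fraction of required keywords present in the answer."""
    hits = sum(1 for kw in required_keywords if kw.lower() in answer.lower())
    return hits / len(required_keywords)

def run_evals(run_agent, tasks: list[dict], threshold: float = 0.8) -> dict:
    """Run the agent on each task; pass/fail against a mean-score threshold."""
    scores = []
    for task in tasks:
        answer = run_agent(task["question"])
        scores.append(keyword_score(answer, task["required_keywords"]))
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "passed": mean >= threshold, "scores": scores}

# Stub agent standing in for the real research() function:
tasks = [
    {"question": "What is MCP?",
     "required_keywords": ["Model Context Protocol", "tool"]},
]
report = run_evals(
    lambda q: "MCP is the Model Context Protocol for tool use.",
    tasks,
)
```

Wiring `report["passed"]` into CI is what turns the eval set into a deployment gate: a regression below the threshold blocks the release.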

Security is the dimension most teams underinvest in. Tool sandboxing ensures that a code execution tool cannot access the file system outside its designated directory. Prompt injection defense prevents malicious user inputs from hijacking the agent's tool calls. Input validation on tool arguments catches hallucinated or malicious parameters before they reach your backend. The Model Context Protocol (MCP) is emerging as a standard for secure tool integration, and adopting it early pays dividends as your tool ecosystem grows.
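Argument validation can reuse the same input schema you already send to the model. This sketch hand-rolls the checks for a few JSON types; a library such as jsonschema can validate against the full schema directly.

```python
# Sketch: validating tool arguments against the tool's input schema
# before execution. Hand-rolled checks; jsonschema can do this from
# the schema directly.
def validate_tool_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors (empty means the args are OK)."""
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for field, value in args.items():
        if field not in props:
            errors.append(f"unexpected field: {field}")
            continue
        expected = props[field].get("type")
        checks = {"string": str, "integer": int, "number": (int, float)}
        if expected in checks and not isinstance(value, checks[expected]):
            errors.append(f"{field}: expected {expected}")
    return errors

schema = {
    "type": "object",
    "properties": {"url": {"type": "string"}},
    "required": ["url"],
}
validate_tool_args(schema, {"url": "https://example.com"})  # no errors
validate_tool_args(schema, {"path": "/etc/passwd"})         # rejected
```

Rejecting unexpected fields matters as much as checking required ones: a hallucinated `path` argument should never reach a filesystem-adjacent backend.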

Conclusion

The three pillars of production AI agents — tool use, memory, and multi-agent orchestration — are no longer cutting-edge research topics. They are engineering problems with known solutions, mature tooling, and growing community expertise.

Tool use is the mechanism that gives agents the ability to act. The key to reliability is clear tool definitions, robust error handling, and loop detection. Memory is what gives agents continuity and context. A three-tier architecture (short-term, working, long-term) covers the full spectrum of memory needs. Multi-agent orchestration is what gives agents the ability to handle complex tasks. The supervisor/worker pattern handles most production use cases; reach for more complex patterns only when you need them.

The frameworks are ready. The Claude Agent SDK, OpenAI Agents SDK, and LangGraph each provide solid foundations for building production agent systems. The choice between them is primarily about your existing ecosystem and the complexity of your orchestration needs.

Where is this heading? The industry is converging on a few key trends. MCP is becoming the standard protocol for tool integration, much like REST became the standard for web APIs. Agent-to-agent communication protocols are emerging to enable agents built on different frameworks to collaborate. And evaluation frameworks are getting sophisticated enough to enable continuous deployment of agent systems with confidence.

The gap between demo and production has not disappeared, but it has narrowed dramatically. The patterns in this post represent the current state of the art for building agents that work reliably at scale. The best time to start building was six months ago. The second best time is now.

*What agent architecture are you building? Share your patterns and pain points in the comments below, or find me on [LinkedIn](https://linkedin.com/in/toc-am-b301373b4/) and [X/Twitter](https://x.com/AmToc96282).*

