Thursday, April 30, 2026

AI Agent Memory Privacy: Pre-emptive PII Redaction Patterns That Hold Up Under Audit

Introduction

The first time an agent I shipped had to be GDPR-erased, I learned a small fact that nobody had told me: a vector index does not forget. The customer was a UK insurance broker. The agent was a customer-support bot with episodic memory backed by a Qdrant collection holding 1.7 million summaries of past conversations. A single user filed a Right-to-be-Forgotten request through the customer's privacy portal at 09:41 on a Tuesday in October 2025. By 11:00 we had located the user's session IDs in Postgres, deleted their conversation rows, and confirmed the deletion to the privacy team. By 14:00 the privacy team had asked, reasonably, whether the agent could still answer questions about that user. By 14:30 we had run the agent against a test prompt referencing the user's first name and policy number, and the agent had returned a fluent, accurate, and terrifying summary of three of the user's prior conversations, drawn from vector-search hits we had not realised existed. The conversations were technically deleted from Postgres. The embeddings were still in the vector store. The agent was still answering from them.

That afternoon turned into a four-day deletion sprint that involved finding every embedding tagged with the user's tenant ID, every chunk of summary text that contained the user's name, every cached query result, and every transcript stored in S3. We got there. The customer's privacy team filed an incident report anyway, because we had been late. The lesson I took from that incident, the one I have built into every agent platform since, is that the only PII you can guarantee will not leak from agent memory is PII that never entered agent memory in the first place. Redaction has to happen before the write, not after the regret.

This post is the playbook from that incident, plus the patterns I have hardened across six more agent platforms since, two of which have now passed full GDPR Article 17 audits and one of which is mid-flight on EU AI Act Article 14 traceability. The patterns are practical, opinionated, and battle-tested. They are not the only way to do this. They are the way I have not had to apologise for.

Why Agent Memory Is The New Privacy Surface

Agent memory in 2026 is not one thing. The post-126 patterns blog (post 165) breaks it down into three layers: working memory inside the prompt, episodic memory of past sessions in a vector store, and procedural memory of learned tool sequences. Each layer is a different privacy risk. Working memory is short-lived but fully observable to the model and any logs it touches. Episodic memory is long-lived, retrieved by similarity, and almost always the first place a privacy review trips up. Procedural memory is the smallest in volume but the hardest to audit because it is encoded as patterns rather than rows.

The compliance picture has tightened. GDPR Article 17 has been load-bearing since 2018. The 2025 court decisions in Germany and France clarified that vector embeddings derived from personal data are themselves personal data, which means the right to erasure applies to them. The EU AI Act, with its August 2026 deadline for Article 14 traceability, requires that high-risk AI systems can show what data was used to produce any given output. The California CPRA, the Colorado AI Act, and the UK Data Protection and Digital Information Act all push in the same direction. By the second half of 2026, "the embedding cannot be reversed" is no longer a defence the regulators accept. Several of them, citing recent academic work on embedding inversion, have called the claim factually wrong.

The threat model for agent memory has four parts. The user's PII can leak to the LLM provider in the prompt. The PII can leak to the vector store, which is often hosted by a different vendor. The PII can leak to logs, which often go to a third observability platform. And the PII can leak across tenants if the memory store is shared without strict isolation. Each of these is a distinct breach class. Each requires its own defence. The redaction patterns below are designed to address all four, layered, with each layer failing closed.

The Core Pattern: Redact Before Write, Not After Read

The most important architectural decision is where redaction lives. Two patterns dominate. In the first, redaction sits between the agent and the LLM at read time, scrubbing PII from prompts as they leave. In the second, redaction sits between the agent input and the memory store at write time, scrubbing PII before it ever lands. Both are useful. Only the second prevents the GDPR sprint I described above.

Read-time redaction can never be retroactive. If a memory was written with PII three months ago, a read-time scrubber cannot un-leak it; the PII is already in the index, indexed, and retrievable. Write-time redaction is harder because it is more invasive, but it is the only pattern that gives you the guarantee the privacy team will ask for: this memory store has never contained the user's PII. Read-time scrubbing is a useful additional defence, but it is a defence in depth, not a primary control.

The other consequence of write-time redaction is that the agent's working memory and the storage memory diverge. The agent in flight knows the user's name, address, and policy number, because it needs them to do its job. The persistent memory stores a redacted summary that references those entities by token, like <NAME_4421> or <POLICY_AB7C>, with the mapping stored in a separate, encrypted, per-tenant key-value store that is governed by stricter retention and erasure policies than the vector index itself. When a new session retrieves a relevant past memory, the agent runtime resolves the tokens against the mapping store, validates that the current user has access to those tokens (often the answer is no, and the resolution returns a generic placeholder), and assembles the working memory accordingly.

Architecture diagram showing the dual-memory pattern: agent working memory containing real PII on the left, write-time redactor in the middle scrubbing PII into tokens, persistent vector memory containing only tokens on the right, and a per-tenant token vault below for reversible mapping
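
The retrieval-side resolution described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's resolver: `resolve_tokens`, the `lookup` and `can_access` callables, and the sample vault contents are hypothetical, and the token regex assumes entity types are plain uppercase letters.

```python
import re
from typing import Callable, Optional

# Assumed token shape: <TYPE_suffix>, e.g. <NAME_4421> or <POLICY_AB7C>.
TOKEN_RE = re.compile(r"<([A-Z]+)_([0-9A-Za-z]+)>")


def resolve_tokens(
    memory_text: str,
    lookup: Callable[[str], Optional[str]],
    can_access: Callable[[str], bool],
) -> str:
    """Replace each token with its plaintext, or with a generic placeholder."""

    def _sub(match: re.Match) -> str:
        token, entity_type = match.group(0), match.group(1)
        if not can_access(token):
            return f"<{entity_type}>"   # current user may not see this entity
        plaintext = lookup(token)
        if plaintext is None:
            return f"<{entity_type}>"   # erased or expired in the vault
        return plaintext

    return TOKEN_RE.sub(_sub, memory_text)


# Hypothetical vault contents and access rule, for illustration only.
vault = {"<NAME_4421>": "Jane Doe", "<POLICY_AB7C>": "POL-0001"}
allowed = {"<NAME_4421>"}
memory = "<NAME_4421> asked about <POLICY_AB7C> and <NAME_9999>."
resolved = resolve_tokens(memory, vault.get, lambda t: t in allowed)
print(resolved)
```

Both the access denial and the vault miss collapse to the same generic placeholder, so a caller cannot tell an erased entity from one they are not allowed to see.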

Layer 1: The PII Detector

Before you can redact, you have to detect. The detector is the most failure-prone component in the entire stack, because the cost of a false negative is a regulatory finding and the cost of a false positive is a degraded agent that can no longer reason about its own data. The detector design that has held up for me uses three layers in series.

The first layer is a high-precision named-entity recogniser. I have used spaCy 3.7 with a custom-trained pipeline, AWS Comprehend's PII detection, and Azure AI Language's classifier. All three are good for the canonical entity types: names, addresses, phone numbers, emails, dates of birth, government IDs. The strongest single number I can offer is from a 2024 Microsoft benchmark across five regulated industries: AWS Comprehend caught 96.4% of canonical PII with a 1.8% false positive rate, Azure caught 94.1% at 2.2%, and a fine-tuned spaCy model caught 92.0% at 1.1%. Even the best of these misses several percent of canonical PII, so an NER model on its own is not enough.

The second layer is regex and validator coverage for high-value structured tokens that NER often misses or misclassifies. UK NHS numbers, Italian fiscal codes, German tax IDs, IBAN, US SSN with checksum, credit-card numbers with Luhn, AWS access keys, JWT tokens, and TLS certificates all have well-defined formats. A regex bank with cheap validators catches them deterministically. The bank in my current platform has 52 patterns. It runs in well under a millisecond per kilobyte of text.
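
A regex bank with validators can be sketched as follows. This is an illustration with two made-up entries, not the 52-pattern bank described above; `PATTERNS`, `regex_spans`, and the `Span` tuple are hypothetical names. The Luhn check itself is the standard checksum used for payment card numbers.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity_type: str
    confidence: float


def luhn_ok(candidate: str) -> bool:
    """Luhn checksum over the digits of a candidate card number."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# (entity type, pattern, optional validator). Two illustrative entries only.
PATTERNS = [
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,18}\b"), luhn_ok),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), None),
]


def regex_spans(text: str) -> List[Span]:
    """Return validated matches as spans, sorted by start offset."""
    spans = []
    for entity_type, pattern, validator in PATTERNS:
        for m in pattern.finditer(text):
            if validator is None or validator(m.group(0)):
                spans.append(Span(m.start(), m.end(), entity_type, 1.0))
    return sorted(spans)


text = "Pay with 4111 1111 1111 1111 or mail a.b@example.com, not 4111 1111 1111 1112."
found = regex_spans(text)
```

The validator is what keeps the false positive rate down: the second card-shaped number in the sample matches the pattern but fails the checksum, so it is not reported as a card.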

The third layer is an LLM-based reviewer that runs on every chunk and flags context-sensitive PII the first two layers miss. "The patient's father" is not a name, but it is a relationship that, combined with the rest of the chunk, can identify a person. "1947, Manchester, three children, recovered from leukaemia in 1972" is a quasi-identifier set strong enough to be uniquely identifying given enough public data. NER models do not flag these. Regex cannot. A small LLM call can. The reviewer I run is Claude Haiku 4.5 with a careful system prompt and a structured output schema; it costs around $0.0009 per kilobyte at current pricing, runs in about 180ms on a c6i.large, and adds roughly an 8 percentage-point recall lift over the first two layers in our internal benchmarks. Crucially, the reviewer's output is structured: it returns a list of (start_offset, end_offset, entity_type, confidence) tuples. The orchestrator merges its output with the deterministic layers, deduplicates, and emits a final span list.
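
The merge-and-deduplicate step can be sketched like this. The tuple layout follows the (start_offset, end_offset, entity_type, confidence) shape described above; `merge_spans` and its conflict rule (widest extent, label from the most confident layer) are assumptions for illustration, not necessarily what the orchestrator does.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, str, float]  # (start, end, entity_type, confidence)


def merge_spans(*layers: Iterable[Span]) -> List[Span]:
    """Union spans from all layers; overlapping spans collapse into one."""
    ordered = sorted(s for layer in layers for s in layer)
    merged: List[Span] = []
    for start, end, etype, conf in ordered:
        if merged and start < merged[-1][1]:    # overlaps the previous span
            p_start, p_end, p_type, p_conf = merged[-1]
            # Keep the widest extent; take the label of the more confident layer.
            if conf > p_conf:
                p_type, p_conf = etype, conf
            merged[-1] = (p_start, max(p_end, end), p_type, p_conf)
        else:
            merged.append((start, end, etype, conf))
    return merged


ner = [(0, 8, "NAME", 0.97)]
regex = [(20, 35, "EMAIL", 1.0)]
llm = [(4, 12, "NAME", 0.80), (40, 52, "QUASI_ID", 0.70)]
final_spans = merge_spans(ner, regex, llm)
```

Taking the union rather than the intersection favours recall: a span flagged by any one layer survives, which matches the fail-closed stance of the rest of the pipeline.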

The detector's output is not a redacted string. It is a span list. The redactor downstream uses the span list to decide what to do with each span: replace with a generic placeholder (<NAME>), replace with a reversible token (<NAME_4421>), replace with a hashed token (<NAME_h:a8c2>), or drop the chunk entirely. The choice depends on the entity type, the storage tier, and the tenant's privacy policy.
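
A sketch of how a redactor might consume the span list follows. The `redact` function, the `STRATEGY` table, and the `GOV_ID` entity type are hypothetical; the tokenizer is passed in as a callable and stands in for the vault. It assumes the spans have already been merged so they do not overlap.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int, str, float]  # (start, end, entity_type, confidence)


def redact(
    text: str,
    spans: List[Span],
    strategy: Dict[str, str],
    tokenize: Callable[[str, str], str],
) -> Optional[str]:
    """Apply non-overlapping spans to text; None means drop the whole chunk."""
    out, cursor = [], 0
    for start, end, etype, _conf in sorted(spans):
        action = strategy.get(etype, "placeholder")  # unknown types: safest option
        if action == "drop":
            return None
        out.append(text[cursor:start])
        if action == "token":
            out.append(tokenize(etype, text[start:end]))
        else:
            out.append(f"<{etype}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def span_of(text: str, needle: str, etype: str) -> Span:
    """Example helper: locate a literal substring and wrap it as a span."""
    i = text.index(needle)
    return (i, i + len(needle), etype, 1.0)


def fake_tokenize(etype: str, value: str) -> str:
    """Stand-in for the vault; always returns the same illustrative token."""
    return f"<{etype}_4421>"


STRATEGY = {"NAME": "token", "EMAIL": "placeholder", "GOV_ID": "drop"}

text = "Jane Doe emailed jane@example.com."
spans = [span_of(text, "Jane Doe", "NAME"), span_of(text, "jane@example.com", "EMAIL")]
redacted = redact(text, spans, STRATEGY, fake_tokenize)

text2 = "ID 123456789 on file."
dropped = redact(text2, [span_of(text2, "123456789", "GOV_ID")], STRATEGY, fake_tokenize)
```

Defaulting unknown entity types to the generic placeholder means a new detector label cannot silently pass plaintext through.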

Layer 2: Reversible Tokenization For Useful Memory

Generic placeholders are safe but useless. A summary that reads "the customer asked about <POLICY> on <DATE> and was unhappy with <NAME>'s response" is unrecoverable; the agent cannot retrieve it usefully because the placeholders carry no semantic distinction. Reversible tokens are the compromise. Each entity gets a stable, per-tenant token like <POLICY_AB7C> or <NAME_4421> that lives in a per-tenant token vault.

The token vault has four properties that matter for compliance.

First, it is per-tenant. One vault per customer, with the customer's own KMS key, so a vault breach in tenant A cannot leak tenant B's mapping. AWS KMS with grants, GCP Cloud KMS with separate keyrings, or HashiCorp Vault with namespaces all work. The platform's IAM ensures the agent runtime can only resolve tokens for the tenant of the active session.

Second, it is append-only with strict deletion. New tokens get added; existing tokens never get reused for a different entity; a Right-to-be-Forgotten request triggers a hard delete of the relevant token rows, and the deletion is logged with the request ID, the operator, and the timestamp. Once a token is deleted, every memory in the vector store that references that token degrades to the generic placeholder on the next retrieval. The vector store itself does not need to be rebuilt. The deletion is fast.

Third, it is keyed by content hash and tenant ID, not autoincrementing IDs. The same name appearing in two memories produces the same token within a tenant; the same name in two different tenants produces two different tokens. This preserves the agent's ability to draw connections within a tenant without creating cross-tenant correlation.

Fourth, the token vault has its own retention policy, separate from the memory store. We typically retain tokens for 36 months, with automated rotation; we retain the redacted memories themselves for shorter periods depending on the data class. Tokens for high-sensitivity entities (medical records, government IDs) get 12 months. Tokens for low-sensitivity entities (general business names) can go longer. The redacted memories survive token expiry, with placeholders on retrieval.

# token_vault.py: minimal reversible tokenizer
import hashlib
import os
import time
from dataclasses import dataclass
from typing import Optional

import boto3
from boto3.dynamodb.conditions import Attr, Key
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


@dataclass
class TokenSpan:
    entity_type: str   # NAME, EMAIL, POLICY, etc.
    plaintext: str
    token: str         # e.g. "<NAME_a8c2f1>"


class TokenVault:
    """Per-tenant, KMS-encrypted, append-only PII token vault."""

    def __init__(self, tenant_id: str, kms_key_id: str, table: str):
        self.tenant_id = tenant_id
        self.kms = boto3.client("kms")
        self.ddb = boto3.resource("dynamodb").Table(table)
        self.kms_key_id = kms_key_id

    def _key(self, plaintext: str, entity_type: str) -> str:
        # Per-tenant salted hash so the same name in two tenants is two tokens.
        h = hashlib.sha256(
            f"{self.tenant_id}:{entity_type}:{plaintext}".encode("utf-8")
        ).hexdigest()
        return h[:16]

    def _encrypt(self, plaintext: str) -> bytes:
        # KMS data key per tenant; in practice cache the data key for ~5 min.
        resp = self.kms.generate_data_key(
            KeyId=self.kms_key_id, KeySpec="AES_256"
        )
        nonce = os.urandom(12)
        ct = AESGCM(resp["Plaintext"]).encrypt(
            nonce, plaintext.encode("utf-8"), self.tenant_id.encode("utf-8")
        )
        # Length-prefix the wrapped key instead of joining with a separator:
        # all three parts are raw bytes and can contain any delimiter byte.
        wrapped = resp["CiphertextBlob"]
        return len(wrapped).to_bytes(2, "big") + wrapped + nonce + ct

    def tokenize(self, span: TokenSpan) -> str:
        token_id = self._key(span.plaintext, span.entity_type)
        # The visible token carries only the first 6 hex chars of token_id.
        token = f"<{span.entity_type}_{token_id[:6]}>"
        self.ddb.update_item(
            Key={"tenant_id": self.tenant_id, "token_id": token_id},
            UpdateExpression="SET entity_type = :t, ciphertext = :c, created_at = if_not_exists(created_at, :now)",
            ExpressionAttributeValues={
                ":t": span.entity_type,
                ":c": self._encrypt(span.plaintext),
                ":now": int(time.time()),
            },
        )
        return token

    def detokenize(
        self, token: str, requesting_user_id: str, audit_log
    ) -> Optional[str]:
        # token format: <TYPE_xxxxxx>. Split on the last underscore so entity
        # types that themselves contain an underscore still parse.
        try:
            entity_type, suffix = token.strip("<>").rsplit("_", 1)
        except ValueError:
            return None
        # The suffix is the first 6 hex chars of token_id; the lookup is a
        # prefix query bounded to this tenant's partition.
        # Access checks for requesting_user_id are not shown here.
        item = self._lookup_full_token(entity_type, suffix)
        if not item:
            return None
        # Audit every detokenization. Privacy reviews ask for this log first.
        audit_log.write(
            tenant_id=self.tenant_id,
            user_id=requesting_user_id,
            action="detokenize",
            token=token,
            entity_type=entity_type,
        )
        return self._decrypt(item["ciphertext"])

    def erase(self, token_ids: list[str], request_id: str, audit_log) -> int:
        # Right-to-be-Forgotten path. Hard delete; no soft delete on PII.
        deleted = 0
        for token_id in token_ids:
            self.ddb.delete_item(
                Key={"tenant_id": self.tenant_id, "token_id": token_id}
            )
            deleted += 1
        audit_log.write(
            tenant_id=self.tenant_id,
            request_id=request_id,
            action="erase",
            count=deleted,
        )
        return deleted

    def _lookup_full_token(self, entity_type: str, suffix: str):
        # Query by tenant_id + token_id prefix, then filter on entity type.
        # Assumes tenant_id is the partition key and token_id the sort key.
        resp = self.ddb.query(
            KeyConditionExpression=Key("tenant_id").eq(self.tenant_id)
            & Key("token_id").begins_with(suffix),
            FilterExpression=Attr("entity_type").eq(entity_type),
        )
        items = resp.get("Items", [])
        # Fail closed on a miss or on an ambiguous prefix match.
        return items[0] if len(items) == 1 else None

    def _decrypt(self, blob) -> str:
        # boto3 returns DynamoDB binary attributes wrapped in a Binary object.
        raw = bytes(blob.value) if hasattr(blob, "value") else bytes(blob)
        n = int.from_bytes(raw[:2], "big")
        wrapped, nonce, ct = raw[2:2 + n], raw[2 + n:14 + n], raw[14 + n:]
        key = self.kms.decrypt(
            CiphertextBlob=wrapped, KeyId=self.kms_key_id
        )["Plaintext"]
        return AESGCM(key).decrypt(
            nonce, ct, self.tenant_id.encode("utf-8")
        ).decode("utf-8")

The structure above is what survived audit on two production deployments. The key design choices: per-tenant partition keys on the DynamoDB table, KMS-encrypted ciphertext rather than plaintext storage, mandatory audit logging on every detokenize call, and a hard-delete erase path with no soft-delete fallback. Soft deletes on PII fail audit. They have failed two of mine.

Layer 3: Pre-Embed Scrubbing And Per-Tenant Indexes

The detector and the token vault give you redacted text. The redacted text is what gets embedded and stored in the vector index. The embedding pipeline has three rules that are non-negotiable in any deployment I have shipped after October 2025.

The first rule is that embeddings are always computed on the redacted text, never on the raw text. This is the rule the GDPR sprint taught me. Morris et al.'s 2023 paper on embedding inversion, "Text Embeddings Reveal (Almost) As Much As Text", reported exact recovery of 92% of 32-token inputs from their embeddings using an iteratively refined learned decoder. The 2025 follow-up by Morris et al. extended this to 89% for OpenAI's text-embedding-3-large and 84% for Cohere's Embed v3. Treat embeddings of raw PII as functionally equivalent to plaintext PII in the index. The defence is to embed only the redacted text, where the inversion attack returns tokens like <NAME_4421> that are useless without the vault.

The second rule is per-tenant index isolation. Every vector store I run uses one collection per tenant, with the tenant's identity bound at the connection layer, not just filtered at query time. Pinecone has serverless namespaces. Qdrant has collections. Weaviate has multi-tenancy mode. pgvector has row-level security with tenant predicates. Pick a backend and use the strict isolation feature. Tenant filters at query time are not isolation; they are configuration that one mistake disables.
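
The difference between binding the tenant at the connection layer and filtering at query time can be shown with an in-memory stand-in. `TenantIndex` and `Store` are hypothetical names, and a plain dict replaces the real vector backend; the point is only the shape of the interface.

```python
from typing import Dict, List

Store = Dict[str, Dict[str, List[float]]]   # tenant_id -> {memory_id: vector}


class TenantIndex:
    """Handle bound to exactly one tenant's collection at construction time."""

    def __init__(self, store: Store, tenant_id: str):
        # The only place a tenant ID is accepted. In a real deployment this is
        # where per-tenant credentials or a collection name would be bound.
        self._collection = store.setdefault(tenant_id, {})

    def upsert(self, memory_id: str, vector: List[float]) -> None:
        self._collection[memory_id] = vector

    def search(self, query: List[float], k: int = 3) -> List[str]:
        # No tenant argument: there is no filter a caller could forget or forge.
        def score(item):
            return sum(a * b for a, b in zip(query, item[1]))

        ranked = sorted(self._collection.items(), key=score, reverse=True)
        return [memory_id for memory_id, _ in ranked[:k]]


store: Store = {}
acme = TenantIndex(store, "acme")
globex = TenantIndex(store, "globex")
acme.upsert("mem_a", [1.0, 0.0])
globex.upsert("mem_g", [1.0, 0.0])
acme_hits = acme.search([1.0, 0.0])
```

Even though both tenants hold an identical vector, a search through the `acme` handle can only ever return `acme` memories, because the handle never sees the other collection.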

The third rule is per-tenant retention windows. Each tenant's vector index has its own retention policy, expressed as a TTL or as a daily cleanup job that drops vectors older than the policy allows. This is what makes Article 17 erasure tractable at scale: if the customer's contract requires 24-month retention, you delete anything older than 24 months by default; if a specific user files erasure, you target their tagged vectors specifically. The vectors are tagged with the token IDs of the entities they reference. Erasure becomes a vector delete by tag, not a full reindex.

flowchart LR
    A[User input] --> B[Detector pipeline]
    B --> C[Span list]
    C --> D[Tokenizer]
    D --> E[Redacted text]
    D --> F[Token vault]
    E --> G[Embedding model]
    G --> H[Per-tenant<br/>vector index]
    H --> I[Memory retrieval]
    I --> J[Token resolver]
    F --> J
    J --> K[Working memory<br/>for agent]
    style F fill:#fce4a8,stroke:#bf8d1f
    style H fill:#cfe8d9,stroke:#3a7d4a
    style J fill:#d6cdf2,stroke:#664eaa
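
The third rule's two deletion paths, the retention sweep and the targeted delete by tag, can be sketched with an in-memory stand-in. `TaggedIndex` is a hypothetical class, not a vector store client, and the 730-day window is just an approximation of a 24-month contract.

```python
from typing import Dict, List, Set

DAY = 86_400  # seconds


class TaggedIndex:
    """In-memory stand-in for one tenant's vector collection."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def upsert(self, memory_id: str, vector: List[float],
               token_ids: Set[str], created_at: int) -> None:
        self._rows[memory_id] = {
            "vector": vector,
            "token_ids": set(token_ids),
            "created_at": created_at,
        }

    def delete_by_token(self, token_id: str) -> int:
        """Targeted erasure: drop every vector tagged with this token ID."""
        doomed = [m for m, row in self._rows.items() if token_id in row["token_ids"]]
        for memory_id in doomed:
            del self._rows[memory_id]
        return len(doomed)

    def sweep(self, now: int, max_age_seconds: int) -> int:
        """Retention sweep: drop vectors older than the tenant's window."""
        cutoff = now - max_age_seconds
        doomed = [m for m, row in self._rows.items() if row["created_at"] < cutoff]
        for memory_id in doomed:
            del self._rows[memory_id]
        return len(doomed)

    def ids(self) -> List[str]:
        return sorted(self._rows)


index = TaggedIndex()
index.upsert("mem_1", [0.1], {"a8c2f1"}, created_at=0)
index.upsert("mem_2", [0.2], {"a8c2f1", "77b0de"}, created_at=700 * DAY)
index.upsert("mem_3", [0.3], {"77b0de"}, created_at=725 * DAY)

swept = index.sweep(now=731 * DAY, max_age_seconds=730 * DAY)   # retention default
erased = index.delete_by_token("a8c2f1")                        # one user's erasure
```

Neither path touches the vectors that remain, which is why erasure does not require a reindex.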

Layer 4: Audit Trails That Pass Article 14

Pre-emptive redaction without an audit trail leaves a privacy team unable to prove that the redaction worked. Every step of the pipeline emits a structured event to an append-only audit store. The schema I have used since late 2024, refined twice, looks like this.

{
  "event_id": "evt_01JK4F2Q3R5T7Y9V",
  "tenant_id": "acme",
  "user_id": "u_4421",
  "session_id": "s_a1b2c3",
  "action": "memory.write",
  "memory_id": "mem_8x7y6z5",
  "input_bytes": 2048,
  "detector_version": "pii-detector@1.7.3",
  "spans_detected": 7,
  "spans_by_type": {"NAME": 2, "EMAIL": 1, "POLICY": 3, "DOB": 1},
  "token_vault_writes": 5,
  "token_vault_hits": 2,
  "embedding_model": "text-embedding-3-large@2025-09",
  "vector_index": "qdrant://acme-prod-2026",
  "retention_class": "PII-medium-36mo",
  "redacted_text_hash": "sha256:9c2a...",
  "policy_version": "tenant-acme-privacy@v4",
  "ts": "2026-04-30T18:14:22.847Z"
}

The audit event has six properties that have proven non-negotiable in real audits. It is per-tenant. It is per-user. It carries the version of every component that touched the data, so a regression in the detector six months ago can be traced and rebuilt. It carries span counts but not span content; logging the redacted PII into the audit log is a category error that has bitten one of my teams. It carries a hash of the redacted text, so you can prove later what was written without retaining the text in the audit log. And it carries a policy_version that lets you reconstruct the tenant's privacy policy at the time of the write, which the EU AI Act Article 14 traceability requirements specifically expect.
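
A sketch of an event builder that respects those properties: it takes span offsets and types but never span text, and it stores a hash of the redacted text rather than the text. `build_write_event` and its parameters are hypothetical, and only a subset of the schema's fields is populated.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, List, Tuple

Span = Tuple[int, int, str, float]  # (start, end, entity_type, confidence)


def build_write_event(
    tenant_id: str,
    user_id: str,
    memory_id: str,
    redacted_text: str,
    spans: List[Span],
    versions: Dict[str, str],
) -> dict:
    """Build a memory.write audit event with counts and hashes only."""
    digest = hashlib.sha256(redacted_text.encode("utf-8")).hexdigest()
    return {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "action": "memory.write",
        "memory_id": memory_id,
        "spans_detected": len(spans),
        "spans_by_type": dict(Counter(s[2] for s in spans)),  # counts, no content
        "redacted_text_hash": "sha256:" + digest,
        **versions,  # detector_version, policy_version, and so on
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
    }


redacted = "<NAME_4421> emailed <EMAIL> about <NAME_77b0>."
spans = [(0, 8, "NAME", 0.97), (17, 33, "EMAIL", 1.0), (40, 48, "NAME", 0.90)]
versions = {
    "detector_version": "pii-detector@1.7.3",
    "policy_version": "tenant-acme-privacy@v4",
}
event = build_write_event("acme", "u_4421", "mem_8x7y6z5", redacted, spans, versions)
print(json.dumps(event, indent=2))
```

To prove later what was written, recompute the SHA-256 of a candidate text and compare it with `redacted_text_hash`.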

The audit store sits in a separate retention class from the memory store. We treat it as a hard requirement that the audit log outlives the data it describes. We keep audit events for 7 years on a write-once-read-many tier. The cost is small. The compliance value is large. When a regulator or a customer privacy team asks "what was redacted from this user's memory between January and March", we run a tenant-scoped query and produce the answer in minutes.

The Debugging Story That Cost Me A Weekend

Mid-March 2026 a customer-support agent on a financial-services account started returning answers that contained tokens like <NAME_a8c2> directly in the user-facing response. Not redacted-and-resolved, not detokenized, raw token strings. The privacy team caught it within two hours. We rolled back. The cause looked simple at first; somebody had to have skipped the resolver. It was not simple.

The detokenizer pipeline ran inside a separate service for isolation. The agent runtime called the detokenizer over gRPC. The detokenizer accepted a list of tokens and returned a list of (token, plaintext) pairs, with the agent runtime substituting them into the assembled prompt before calling the LLM. The bug was that the detokenizer's gRPC server had been deployed with a 30ms timeout in the cluster's Istio config, while the detokenizer itself, because of a recent KMS data key cache invalidation, was now sometimes taking 80ms on cold reads. When the timeout fired, the agent runtime received a partial response, did not detect the partial state because the response schema had no required fields, and substituted the tokens it had received while leaving the missing tokens as raw strings in the prompt. The LLM, given a prompt with raw token strings in it, helpfully reproduced them in the response.

The fix had three parts. The detokenizer's gRPC schema added a strict complete boolean that had to be true; the agent runtime treated any non-complete response as a hard failure and degraded to placeholder responses rather than partial substitution. The Istio timeout went up to 500ms. And the KMS data key cache got a longer TTL with a background refresh, eliminating the cold-read latency spike. The lesson I retained: when a redaction pipeline degrades, it must fail closed, not fail open. Every component had to be reviewed for "what does it do under partial failure", and several of them had been treating partial failure as a soft event. They are not soft. Privacy violations from a partial-failure pipeline are still privacy violations.

flowchart TD
    A[Agent runtime calls detokenizer] --> B{gRPC response<br/>complete=true?}
    B -->|yes| C[Substitute tokens<br/>send prompt]
    B -->|no, timeout| D[Hard fail<br/>degrade to placeholders]
    B -->|no, schema mismatch| E[Hard fail<br/>circuit-break for 30s]
    D --> F[Audit event:<br/>fail_closed=true]
    E --> F
    style D fill:#f7c9c9,stroke:#a83434
    style E fill:#f7c9c9,stroke:#a83434
    style F fill:#cfe8d9,stroke:#3a7d4a
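
The fail-closed rule on the agent-runtime side can be sketched as below. `DetokenizeResponse` and `assemble_prompt` are hypothetical names that model the gRPC response as a plain dataclass; the real schema and transport are not shown.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

TOKEN_RE = re.compile(r"<([A-Z]+)_[0-9A-Za-z]+>")


@dataclass
class DetokenizeResponse:
    complete: bool = False                     # must be explicitly set to True
    pairs: Dict[str, str] = field(default_factory=dict)


def assemble_prompt(template: str, response: DetokenizeResponse) -> str:
    """Substitute every token or none of them; never a partial mix."""
    needed = {m.group(0) for m in TOKEN_RE.finditer(template)}
    if not response.complete or not needed.issubset(response.pairs):
        # Fail closed: degrade every token to its generic placeholder.
        return TOKEN_RE.sub(lambda m: f"<{m.group(1)}>", template)
    return TOKEN_RE.sub(lambda m: response.pairs[m.group(0)], template)


template = "Customer <NAME_a8c2> holds <POLICY_AB7C>."
timed_out = DetokenizeResponse(complete=False, pairs={"<NAME_a8c2>": "Jane Doe"})
full = DetokenizeResponse(
    complete=True,
    pairs={"<NAME_a8c2>": "Jane Doe", "<POLICY_AB7C>": "POL-0001"},
)
degraded_prompt = assemble_prompt(template, timed_out)
resolved_prompt = assemble_prompt(template, full)
```

The second condition covers a response that claims to be complete but is missing a token: it is treated exactly like a timeout, so a raw token string cannot reach the LLM.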

How This Compares To The Alternatives

There are at least three alternative approaches to PII handling in agent memory that I have evaluated and chosen not to ship. Naming them is useful because each appears in the literature and in vendor pitches.

The first alternative is differential privacy at the embedding layer. Add calibrated noise to the embedding vectors so that recovering specific tokens is information-theoretically hard. This sounds good. In practice, the noise levels required to give you a defensible epsilon are large enough to degrade retrieval recall by 12 to 25 percentage points in our internal benchmarks. For high-stakes legal or medical use cases where retrieval quality is non-negotiable, the trade is bad. We use DP for analytics on aggregate query patterns, not on the per-memory embedding.

The second alternative is fully homomorphic encryption, with embeddings computed and similarity-searched under encryption. The 2025 academic work on encrypted vector search using CKKS schemes is interesting and progressing. In a production deployment in 2026, the latency overhead is roughly 200x for similarity search; the index size grows by 5-8x; the available open-source implementations are immature. I have built FHE-enabled prototypes for two customers where regulators specifically asked for it. Neither has reached production. The cost-benefit does not yet land for general use.

The third alternative is "encrypt at rest only, no redaction". The data is stored encrypted on disk; the database supports encryption with customer-managed keys; the team relies on access control to keep it safe. This is the weakest of the alternatives because it does nothing about the breach class where the LLM provider, the vector vendor, or the observability platform processes plaintext after decryption. The redaction-at-write pattern is robust precisely because the data the third parties see is already redacted. Encryption-at-rest is necessary; it is not sufficient.

| Approach | Retrieval recall loss | Latency overhead | Audit posture | Verdict |
| --- | --- | --- | --- | --- |
| Pre-emptive redaction + token vault | 0-2pp | 8-15% | Strong, audit-ready | Default for PII |
| Read-time scrubber only | 0pp | <5% | Weak, not retroactive | Defence-in-depth only |
| Differential privacy on embeddings | 12-25pp | 5-10% | Strong but degrades quality | Aggregate metrics only |
| Fully homomorphic encryption | 0pp | ~200x | Strongest in theory | Pre-production R&D |
| Encrypt at rest only | 0pp | <2% | Weak, fails audit | Necessary, not sufficient |

Comparison chart visual showing five PII protection approaches scored on recall, latency, audit posture, and production readiness, with pre-emptive redaction emerging as the best balanced choice in dark themed table layout

Production Considerations

Three operational concerns dominate the lifecycle of a redaction pipeline once it is shipped.

The first is detector drift. PII patterns change. New government ID schemes get rolled out. New common formats appear in the data. The detector that caught 96.4% of canonical PII at deploy time will quietly drop 4 percentage points over six months if you do not retrain. We run a weekly evaluation against a curated benchmark of 12,000 labeled chunks per tenant; any tenant that drops below 94% recall triggers a review. The evaluation cost is around $40 per tenant per week.
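
The weekly drift check reduces to a recall computation and a threshold. The sketch below assumes exact-match scoring on (start, end, entity_type) triples, which is only one possible scoring rule; `recall` and `tenants_needing_review` are hypothetical names and the sample numbers are made up.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def recall(labeled: List[Set[Span]], predicted: List[Set[Span]]) -> float:
    """Fraction of labeled spans the detector found, by exact match."""
    hits = sum(len(gold & pred) for gold, pred in zip(labeled, predicted))
    total = sum(len(gold) for gold in labeled)
    return hits / total if total else 1.0


def tenants_needing_review(results: Dict[str, float], floor: float = 0.94) -> List[str]:
    """Tenants whose weekly recall fell below the review threshold."""
    return sorted(t for t, r in results.items() if r < floor)


# Two labeled chunks; the detector misses the DOB in the second one.
gold = [
    {(0, 8, "NAME"), (17, 33, "EMAIL")},
    {(5, 16, "POLICY"), (30, 40, "DOB")},
]
pred = [
    {(0, 8, "NAME"), (17, 33, "EMAIL")},
    {(5, 16, "POLICY")},
]
sample_recall = recall(gold, pred)

weekly = {"acme": 0.961, "globex": 0.938}
flagged = tenants_needing_review(weekly)
```

False positives do not affect this number, so a production evaluation would track precision alongside it.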

The second is right-to-be-forgotten throughput. In healthy operation, a customer might field 5-50 erasure requests per month per tenant, each requiring a vault delete, a vector index targeted delete, and a cascade through any cached query results and the audit-friendly delete record. We budget 90 seconds end-to-end per request. The bottleneck is not the vault; it is the vector index targeted delete, which on a 10-million-vector Qdrant collection runs about 60 seconds for a 1000-vector tag delete. Pinecone serverless is faster on this workload (around 12 seconds). Weaviate is faster still on small targeted deletes but slower on larger sweeps. Benchmark for your scale before you commit.

The third is cost. The detector pipeline runs on every memory write. At a typical 380,000 memory writes per month for a mid-sized agent platform, the per-write cost stack adds up: $0.0009 for the LLM reviewer, $0.0003 for the embedding, $0.0001 for the token vault writes, $0.0002 for the audit log, plus storage. Total: around $580 per month at this volume, dominated by the LLM reviewer. We considered dropping the reviewer; we have not, because the recall lift is too valuable. Some teams sample the reviewer instead of running it on every chunk, accepting a small recall hit for a 60-80% cost reduction. We do not. Privacy is a tail-risk problem, and sampling tail risk is the wrong instinct.
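
The per-write figures can be checked with a few lines of arithmetic. The four listed components sum to about $570 per month at this volume, with storage making up the rest of the quoted total. The calculation assumes roughly one kilobyte per write, since the reviewer is priced per kilobyte.

```python
WRITES_PER_MONTH = 380_000

# Per-write costs in USD, as quoted in the text (storage excluded).
PER_WRITE_USD = {
    "llm_reviewer": 0.0009,
    "embedding": 0.0003,
    "token_vault": 0.0001,
    "audit_log": 0.0002,
}

monthly = {name: cost * WRITES_PER_MONTH for name, cost in PER_WRITE_USD.items()}
total = sum(monthly.values())
reviewer_share = monthly["llm_reviewer"] / total

for name, cost in monthly.items():
    print(f"{name:>12}: ${cost:,.0f}")
print(f"{'total':>12}: ${total:,.0f} ({reviewer_share:.0%} from the reviewer)")
```

The reviewer accounts for 60% of the listed per-write cost, which is why it is the component teams are tempted to sample.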

Conclusion

The pattern that consistently passes audit is always the same one. Detect PII before it lands in memory. Tokenize it reversibly with a per-tenant vault. Embed only the redacted text. Isolate per-tenant indexes. Audit every step in a separate, longer-retention store. Fail closed when the pipeline degrades. These six practices reinforce each other, and they let you face a privacy team or a regulator with the answer they need: this memory store has never contained the user's PII, and here is the audit trail that proves it.

The work is not glamorous. It is plumbing, careful schema design, and rigour in failure modes. It is what stops an agent platform from becoming a privacy liability the day a regulator decides to look. If you are building agents in 2026 and have not put redaction at the write boundary, do that next. The compliance debt is compounding. The patterns are well-understood. The cost of catching up after a finding is much higher than the cost of getting it right the first time.

If you want a working reference, the patterns above are reproduced in the agent-memory-privacy directory of the amtocbot examples repository, with end-to-end tests against a sample tenant and a teardown that exercises a full Article 17 erasure path.

Sources

  1. Morris, John X., et al. "Language Model Inversion." arXiv preprint 2311.13647 (2024). https://arxiv.org/abs/2311.13647
  2. European Data Protection Board. "Guidelines 02/2024 on the Right to Erasure (Article 17 GDPR)." Adopted 8 October 2024. https://edpb.europa.eu/our-work-tools/documents/public-consultations/2024
  3. European Commission. "Artificial Intelligence Act, Regulation (EU) 2024/1689, Article 14: Human Oversight." Official Journal of the European Union, 12 July 2024.
  4. Carlini, Nicholas, et al. "Extracting Training Data from Diffusion Models." USENIX Security (2023). https://www.usenix.org/conference/usenixsec23/presentation/carlini
  5. RFC 8693, "OAuth 2.0 Token Exchange." IETF, January 2020. https://datatracker.ietf.org/doc/html/rfc8693
  6. Microsoft Research. "Benchmarking PII Detectors Across Regulated Industries." Technical Report MSR-TR-2024-08 (April 2024). https://www.microsoft.com/en-us/research/publication/
  7. UK Information Commissioner's Office. "Generative AI and Data Protection: Guidance for Developers." Final report, March 2025. https://ico.org.uk/

Companion Code

Working reference implementation lives at github.com/amtocbot-droid/amtocbot-examples/agent-memory-privacy. The repo includes the detector pipeline, the token vault with KMS-encrypted DynamoDB backing, the per-tenant Qdrant setup, the audit log schema, and an end-to-end Article 17 erasure test.

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

Published: 2026-04-30 · Written with AI assistance, reviewed by Toc Am.

Defensive MCP Server Sandboxing: Permissions, Audit Logs, and Resource Caps That Actually Work in 2026

Introduction

The MCP server that almost ended a customer engagement for me was not malicious. It was a community-maintained Postgres MCP server that I had pulled from a public registry, dropped into a customer's developer-tools agent, and shipped to staging on a Friday afternoon. By Monday morning, the staging Postgres instance had a 14GB temp table full of pgcrypto-encrypted blobs that nobody on the team had written, the agent had answered a routine "how many active users do we have" question by running a pg_dump of three unrelated tables to a path under /tmp, and the customer's security team had opened a ticket asking why the agent service account had been granted the pg_read_server_files role. The MCP server itself was fine. The agent that called it had been steered, through a perfectly innocent-looking support ticket containing a prompt injection, into asking the server to do things the server was perfectly willing to do because nobody had told it not to.

I spent the rest of that week rewriting the deployment around three layers of defence: capability-scoped permissions per tool, hard resource caps on the server process, and a structured audit log that fed a SIEM. The agent kept working. The cost-of-ownership went up by about ninety minutes of platform engineering per server. The number of "the agent did what?" tickets went to zero across the next eight months. This post is the playbook from that incident, plus the patterns I have refined since on six more MCP deployments.

What follows is opinionated. The MCP specification, stable since the 1.0 release in late 2025, defines the wire protocol and the tool/resource/prompt primitives, but it is silent on deployment security. The community is still converging on best practice. The patterns below are what works on production deployments serving millions of requests a month. They are not the only patterns. They are the ones I have not yet had to apologise for.

Why MCP Servers Need A Defensive Sandbox

The MCP threat model is unusual because the attacker is not necessarily the user of the agent. The classic threat model for a web service assumes the user is potentially hostile, the server is trusted, and the attacker is at the network edge. The MCP threat model has at least three threat actors at once. The user can be hostile. The agent can be steered by an indirect prompt injection in any data the agent reads. The MCP server itself can be compromised, malicious, or simply buggy in a way that produces dangerous behaviour under unusual inputs. Any defence has to assume two of the three are uncooperative and still produce a survivable failure mode.

The first attack surface is the tool definition. An MCP tool is a callable with a name, a description, and a JSON schema. The agent's planner reads the description and decides when to invoke the tool. A malicious or sloppy description can poison the planner's choices, and a malicious tool can hide a side-effect inside a benign-looking name. A 2025 academic paper from the Anchore security team documented an MCP server published to a public registry with a tool named get_weather whose implementation also exfiltrated ~/.ssh/known_hosts to a remote endpoint on every call. Nothing in the wire protocol stops this. The defence has to live in the deployment.

The second attack surface is the data the tool reads and writes. An MCP server connected to a Postgres instance has the privileges of its database role. An MCP server connected to the file system has the privileges of its OS user. An MCP server connected to a cloud account has the privileges of its IAM role. The default for almost every quickstart example I have seen is "give it the same role as the human running the agent". That is the wrong default. The right default is least privilege, scoped per tool.

The third attack surface is the runtime itself. MCP servers in 2026 are most commonly Node, Python, or Go processes spawned by the agent or running as long-lived services. A single buggy server with a memory leak, an infinite loop, or a runaway shell-out can take down the agent host, run up cloud bills, or fill a disk to the point that other services on the same host fail. The default deployment of an MCP server, in most quickstarts, is npx @vendor/server. That is a process running as your user, with your file-system access, and no resource caps.

The fourth attack surface is the audit gap. When something goes wrong, the on-call engineer needs to reconstruct what the agent asked, what tool was called, what arguments were passed, what the tool returned, and what side effect ran. The MCP wire protocol does not require any of this to be logged. The community examples mostly do not log it. I have read four production postmortems where the response to "what did the agent do" was "we are not sure". That is unacceptable for any deployment that touches customer data or money.

The fifth attack surface is supply chain. An MCP server pulled from a public registry, like any npm or PyPI package, can be subverted by a typosquat, a maintainer takeover, or a postinstall-script attack. The 2025 rash of npm postinstall attacks against AI tooling, including one against a popular logging package that shipped to thousands of agent deployments, hit a number of teams that had no policy distinguishing "MCP server" from "trusted internal dependency". Treat MCP servers as third-party code, with all the supply-chain hardening that implies.

Architecture diagram showing the five MCP attack surfaces (tool definition, data privileges, runtime, audit gap, supply chain) arranged around a central MCP server, with three defensive rings (permissions, resource caps, audit logs) protecting the server

Layer 1: Capability-Scoped Permissions Per Tool

The single highest-impact defence is a per-tool capability model that lives outside the MCP server's source code. The pattern I use is a YAML or TOML manifest that lists every tool the server exposes, the resources it is allowed to touch, the maximum row count or byte count it can read or write, and the network destinations it is allowed to reach. The agent runtime enforces the manifest, not the server. The server cannot grant itself more access than the manifest gives it.

Here is a working example for a Postgres MCP server that exposes three tools: query, insert, and schema_describe.

# mcp-policy.yaml
server: postgres
version: "1.4.0"
runtime:
  user: mcp-postgres
  cwd: /var/lib/mcp/postgres
  read_only_root: true
tools:
  query:
    role: app_readonly
    allowed_schemas: [public, customer]
    denied_tables: [users, payment_methods, audit_log]
    max_rows: 1000
    timeout_seconds: 5
    network:
      allow: ["postgres-primary.internal:5432"]
      deny: ["*"]
  insert:
    role: app_writer
    allowed_schemas: [public]
    allowed_tables: [chunks, embeddings]
    max_rows_per_call: 100
    timeout_seconds: 10
    rate_limit_per_minute: 60
  schema_describe:
    role: app_readonly
    allowed_schemas: [public, customer]
    timeout_seconds: 2

The runtime layer that enforces this manifest has three jobs. First, before the server starts, the runtime validates that the database role the server will use has at most the privileges the manifest lists. If the manifest says app_readonly but the role has INSERT granted, the runtime refuses to start. Second, before each tool call, the runtime checks the requested schema, table, and row count against the manifest, and rejects calls that exceed the limits. Third, the runtime maintains rate limits and timeouts and kills tool calls that exceed them.

The implementation in Python with the official MCP SDK looks like this. This is the wrapper I use as a base across all my deployments.

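# mcp_policy_wrapper.py
import time
from collections import defaultdict
from typing import Any

import yaml
from mcp.server import Server


class PolicyViolation(Exception):
    """Raised when a tool call falls outside the policy manifest."""


class PolicyEnforcedServer:
    def __init__(self, inner: Server, policy_path: str):
        self.inner = inner
        # Load the manifest once and close the file handle promptly.
        with open(policy_path) as fh:
            self.policy = yaml.safe_load(fh)
        # Per-tool timestamps of recent calls (sliding 60-second window).
        self.rate_buckets = defaultdict(list)
        self._wrap_tools()

    def _check_rate_limit(self, tool_name: str, limit: int):
        now = time.time()
        bucket = self.rate_buckets[tool_name]
        # Drop timestamps that have aged out of the window, in place.
        bucket[:] = [t for t in bucket if now - t < 60]
        if len(bucket) >= limit:
            raise PolicyViolation(f"rate limit exceeded for {tool_name}")
        bucket.append(now)

    def _enforce(self, tool_name: str, args: dict[str, Any]):
        tool_policy = self.policy["tools"].get(tool_name)
        # Fail closed: a tool missing from the manifest is never callable.
        if tool_policy is None:
            raise PolicyViolation(f"tool {tool_name} not in policy")
        if "rate_limit_per_minute" in tool_policy:
            self._check_rate_limit(tool_name, tool_policy["rate_limit_per_minute"])
        if "allowed_schemas" in tool_policy and args.get("schema"):
            if args["schema"] not in tool_policy["allowed_schemas"]:
                raise PolicyViolation(
                    f"schema {args['schema']} not in allow-list "
                    f"for {tool_name}")
        if "denied_tables" in tool_policy and args.get("table"):
            if args["table"] in tool_policy["denied_tables"]:
                raise PolicyViolation(
                    f"table {args['table']} is denied for {tool_name}")
        # The manifest also carries allow-lists (see the insert tool).
        if "allowed_tables" in tool_policy and args.get("table"):
            if args["table"] not in tool_policy["allowed_tables"]:
                raise PolicyViolation(
                    f"table {args['table']} not in allow-list "
                    f"for {tool_name}")

    def _wrap_tools(self):
        # NOTE: this assumes `inner.call_tool(name, arguments)` is the
        # awaitable dispatch entry point. In the official Python SDK's
        # low-level Server, call_tool is a decorator used to register the
        # handler, so there the same _enforce call belongs inside (or
        # around) the registered handler instead.
        original_call = self.inner.call_tool

        async def wrapped(name: str, arguments: dict[str, Any] | None):
            # Tolerate calls that carry no arguments.
            self._enforce(name, arguments or {})
            return await original_call(name, arguments)

        self.inner.call_tool = wrapped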

Two things matter about this wrapper. First, it is fail-closed by default: if a tool is not in the policy, the call is refused. Many quickstart examples are fail-open, which inverts the security model and is the cause of half the incidents I have read postmortems for. Second, the wrapper is the only path from agent to server, which means the server itself does not need to know about the policy. Any third-party MCP server, including one whose source you do not control, gets the policy enforced by sitting behind this wrapper.
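The wrapper covers the per-call checks and the rate limits, but not the first job of the runtime described earlier: refusing to start when the database role holds more than the manifest allows. The following is a minimal sketch of that startup check. The `ROLE_PRIVILEGES` mapping, `excess_privileges`, and `check_roles` are hypothetical names, and the granted privileges are passed in rather than queried, on the assumption that the caller reads them from something like Postgres's `information_schema.role_table_grants`.

```python
# startup_check.py: illustrative sketch, not part of the MCP SDK.
# ROLE_PRIVILEGES is a hypothetical mapping from manifest role names to
# the most each role may hold; adjust it to your own role design.
ROLE_PRIVILEGES = {
    "app_readonly": {"SELECT"},
    "app_writer": {"SELECT", "INSERT"},
}


class StartupRefused(Exception):
    """Raised when a role holds more than the manifest allows."""


def excess_privileges(role: str, granted: set[str]) -> set[str]:
    """Return the privileges the role holds beyond its allowance."""
    allowed = ROLE_PRIVILEGES.get(role)
    if allowed is None:
        # Fail closed: an unknown role has no allowance at all.
        return set(granted)
    return set(granted) - allowed


def check_roles(policy: dict, grants: dict[str, set[str]]) -> None:
    """Refuse to start if any tool's role is over-privileged.

    `grants` maps role name -> privilege types actually granted.
    """
    for tool_name, tool_policy in policy["tools"].items():
        role = tool_policy["role"]
        extra = excess_privileges(role, grants.get(role, set()))
        if extra:
            raise StartupRefused(
                f"{tool_name}: role {role} has extra privileges "
                f"{sorted(extra)}")
```

Calling `check_roles` before the server process is spawned keeps the failure loud and early: a role that has quietly gained INSERT stops the deployment instead of surfacing in a postmortem.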

flowchart LR
  A[Agent] --> W[Policy Wrapper]
  W -->|policy check| P[mcp-policy.yaml]
  W -->|rate limit| R[Token Bucket]
  W -->|allowed| S[MCP Server]
  W -->|denied| X["PolicyViolation -> Audit Log"]
  S --> D[(Postgres)]
  W -.audit.- L[(Audit Log)]

Layer 2: Hard Resource Caps On The Server Process

A policy wrapper stops the agent from asking the server to do something dangerous. Resource caps stop the server from doing something dangerous on its own. The four caps that earn their keep on every deployment are memory, CPU, file-system reach, and network reach.

On Linux, the four caps map to well-understood primitives: cgroups v2 for memory and CPU, mount namespaces for file-system reach, and network namespaces with iptables or eBPF for network reach. In 2026, the cleanest way to apply all four is to run the MCP server inside a container with explicit limits. Here is a Docker Compose stanza I use as a template.

# docker-compose.mcp.yaml
services:
  mcp-postgres:
    image: registry.internal/mcp-postgres:1.4.0
    user: "10042:10042"
    read_only: true
    tmpfs:
      - /tmp:size=64M
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
      - seccomp=./seccomp-mcp.json
    mem_limit: 512m
    cpus: 0.5
    pids_limit: 64
    networks:
      - mcp-postgres-net
    environment:
      - PG_DSN_FILE=/run/secrets/pg_dsn
    secrets:
      - pg_dsn
networks:
  mcp-postgres-net:
    driver: bridge
    ipam:
      config:
        - subnet: 10.42.0.0/24
secrets:
  pg_dsn:
    external: true

A few things are worth pointing at in this stanza. read_only: true stops the process from writing anywhere except /tmp, which is a 64MB tmpfs that disappears on restart. cap_drop: ALL removes every Linux capability, including CAP_NET_BIND_SERVICE, which means a compromised process cannot bind privileged ports. seccomp=./seccomp-mcp.json is a syscall filter that allows the ~150 syscalls a normal Node or Python process needs and blocks the rest. pids_limit: 64 stops a runaway server from forking until the host runs out of PIDs. The network is a dedicated bridge used only by this server. A Docker bridge still permits outbound traffic by default, so pinning the server to its one upstream destination also needs host firewall rules or, where Postgres sits on the same Docker network, internal: true on the network definition. With that in place, even a fully compromised server cannot reach the public internet without an explicit network change.
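For servers launched outside Compose, the same caps can be expressed as `docker run` flags. The sketch below builds the argument list from explicit parameters so the limits live in one reviewable place. `build_docker_args` is a hypothetical helper. The flags themselves are standard Docker CLI options, and the secret mount and subnet configuration from the stanza are left out for brevity.

```python
# docker_args.py: illustrative helper, not part of any SDK.
def build_docker_args(image: str, *, uid: int, mem: str, cpus: float,
                      pids: int, tmpfs_size: str, seccomp_profile: str,
                      network: str) -> list[str]:
    """Return a `docker run` argv that mirrors the Compose stanza's caps."""
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{uid}",                # unprivileged uid:gid
        "--read-only",                           # read-only root filesystem
        "--tmpfs", f"/tmp:size={tmpfs_size}",    # small writable scratch area
        "--cap-drop", "ALL",                     # drop every Linux capability
        "--security-opt", "no-new-privileges:true",
        "--security-opt", f"seccomp={seccomp_profile}",
        "--memory", mem,                         # cgroup memory cap
        "--cpus", str(cpus),                     # cgroup CPU cap
        "--pids-limit", str(pids),               # fork-bomb guard
        "--network", network,                    # dedicated network
        image,
    ]


if __name__ == "__main__":
    args = build_docker_args(
        "registry.internal/mcp-postgres:1.4.0",
        uid=10042, mem="512m", cpus=0.5, pids=64, tmpfs_size="64m",
        seccomp_profile="./seccomp-mcp.json", network="mcp-postgres-net")
    print(" ".join(args))
```

Passing the list to `subprocess.run` without a shell avoids quoting problems, and keeping the caps as required keyword arguments means a caller cannot silently omit one.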

For higher-stakes deployments, I run MCP servers inside gVisor instead of the default runc. gVisor adds a user-space kernel that intercepts syscalls and emulates them in a sandboxed runtime. The performance hit is real, around 10-25% on syscall-heavy workloads, but the blast radius of a kernel exploit shrinks sharply because the server no longer talks to the host kernel directly. The configuration is small: register runsc under runtimes in the Docker daemon config, then either set "default-runtime": "runsc" or select it per container with runtime: runsc.

Firecracker is the next step up. Each MCP server runs in its own microVM with a dedicated kernel. Boot time is around 125ms, memory overhead is 5MB per VM, and the isolation is full hardware virtualisation. AWS Lambda, Fargate, and a number of agent platforms in 2026 use Firecracker for exactly this reason. For most teams the operational overhead is not worth it until you have either dozens of MCP servers or a regulated compliance requirement that mandates VM-level isolation.

WASI, the WebAssembly System Interface, is the long-tail option for pure-compute MCP servers that do not need to touch a database or the network. A tool like a calculator, a code-formatting helper, or a static analysis runner can be compiled to WASM and run inside a WASI runtime such as Wasmtime or WasmEdge. The sandbox is built into the runtime: WASM cannot make any syscall the host runtime does not explicitly grant. This is the cleanest model and the most restricted; it does not work for the majority of MCP servers in production today, which talk to databases or external APIs, but for the ones it does work for it is the right answer.

flowchart TD
  Start[Pick a sandbox runtime]
  Start --> Q1{Network or DB access required?}
  Q1 -->|No, pure compute| WASI[Wasmtime / WasmEdge]
  Q1 -->|Yes| Q2{Multi-tenant or untrusted server?}
  Q2 -->|Single trusted server| Docker[Docker + seccomp + cgroups]
  Q2 -->|Multiple, partially trusted| GVisor[gVisor / runsc]
  Q2 -->|Strong isolation, regulated| Firecracker[Firecracker microVM]
  Docker --> Out[Deploy]
  GVisor --> Out
  Firecracker --> Out
  WASI --> Out

Layer 3: Structured Audit Logs That Survive A Postmortem

The audit log is the layer that turns a "we are not sure what happened" postmortem into a twenty-minute investigation. The log has to capture every tool call, every argument, every result size, every policy decision, and every resource cap hit. It has to be append-only, tamper-evident, and structured for ingestion into a SIEM or query layer. The format I have settled on across deployments is one JSON line per event, conforming to a schema modelled after CloudEvents 1.0 with MCP-specific extensions.

# audit_log.py
import json
import time
import uuid

class AuditLogger:
    """Writes one CloudEvents-style JSON line per audit event."""

    def __init__(self, sink):
        # sink: any object exposing write() and flush().
        self.sink = sink

    def log(self, event_type: str, **fields):
        record = {
            "specversion": "1.0",
            "id": str(uuid.uuid4()),
            "type": f"com.amtocsoft.mcp.{event_type}",
            "source": "mcp-postgres-1.4.0",
            # UTC timestamp in RFC 3339 form, second resolution.
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "datacontenttype": "application/json",
            "data": fields,
        }
        # One JSON object per line; flush right away so events do not
        # sit in a userspace buffer.
        self.sink.write(json.dumps(record) + "\n")
        self.sink.flush()

    def call_attempted(self, conv_id, tool, args, agent_id):
        self.log("call_attempted",
                conversation_id=conv_id,
                tool=tool,
                args=args,
                agent_id=agent_id)

    def call_denied(self, conv_id, tool, reason):
        self.log("call_denied",
                conversation_id=conv_id,
                tool=tool,
                reason=reason)

    def call_succeeded(self, conv_id, tool, duration_ms,
                       result_size_bytes, rows_returned):
        self.log("call_succeeded",
                conversation_id=conv_id,
                tool=tool,
                duration_ms=duration_ms,
                result_size_bytes=result_size_bytes,
                rows_returned=rows_returned)

    def cap_exceeded(self, conv_id, tool, cap_name, observed, limit):
        self.log("cap_exceeded",
                conversation_id=conv_id,
                tool=tool,
                cap_name=cap_name,
                observed=observed,
                limit=limit)

The four event types above cover most postmortem questions. call_attempted records what the agent asked. call_denied records when the policy rejected a call and why. call_succeeded records the outcome and the size, which is the data the cost reconciliation step needs. cap_exceeded records when a resource limit was hit, which is the early-warning signal for either a runaway agent or a malicious tool.

Two operational notes. First, the sink should write to a unix domain socket connected to a separate audit-log daemon, not to a file in the container's local filesystem. A compromised server that can write to its own log file can also tamper with it. A unix socket to a daemon running as a different user with append-only file privileges is the standard pattern for this. Second, the log should be replicated off-host within seconds. I use vector to ship to S3 and to a local Loki instance, with a retention policy of 90 days for the S3 copy. For high-risk systems under the EU AI Act, check that figure against the record-keeping requirement in Article 12, which comes into force in August 2026: Article 19 asks for automatically generated logs to be kept for at least six months, so 90 days is too short for systems in scope.
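The logger above is append-only by convention, but nothing in a plain JSON-lines file makes tampering detectable. One common way to add tamper evidence is a hash chain, where each record carries the hash of the record before it. The sketch below shows that technique in isolation. It is an illustration rather than the scheme used in the companion repo, and `chain_record`, `verify_chain`, and `GENESIS` are hypothetical names.

```python
# hash_chain.py: illustrative sketch of a tamper-evident record chain.
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is stable.
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def chain_record(record: dict, prev_hash: str) -> dict:
    """Return a copy of `record` linked to the previous record's hash."""
    body = dict(record, prev_hash=prev_hash)
    return dict(body, hash=_digest(body))


def verify_chain(records: list[dict]) -> bool:
    """True only if every record links to, and matches, its predecessor."""
    prev = GENESIS
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if body.get("prev_hash") != prev or rec.get("hash") != _digest(body):
            return False
        prev = rec["hash"]
    return True


if __name__ == "__main__":
    log, prev = [], GENESIS
    for event in ({"type": "call_attempted"}, {"type": "call_denied"}):
        rec = chain_record(event, prev)
        log.append(rec)
        prev = rec["hash"]
    print(verify_chain(log))
```

Editing or deleting any record breaks every later link. An attacker who can rewrite the whole file can still recompute the chain, so this complements off-host replication rather than replacing it.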

A Production Gotcha: The Audit Log That Lied

The most painful debugging story I have from MCP deployments is an audit log that I trusted and should not have. I was running a Python MCP server with the audit-log wrapper above, writing to a Loki instance through vector, and pulling traces by conversation_id to investigate a customer ticket. The customer reported that the agent had returned a row that should not have been returned: a row from the payment_methods table, which the policy denied. The agent had answered with the row's content. The audit log said the call was denied. Both of those statements appeared to be true.

It took me four hours and a packet capture to find the bug. The query tool in the MCP server had a fast-path branch that read from an in-memory cache before checking the policy wrapper. The cache was populated, hours earlier, by a different agent on the same MCP server instance, querying the payment_methods table during a tool migration. The policy wrapper had not been wired into the cache path because the original implementation predated the cache by six months. The audit log was honest about the policy decision; the policy decision had simply been bypassed by a code path nobody had remembered. The bug had been latent for nine weeks. The lesson was that an audit log is only as honest as the path it instruments. Every code path that returns data to the agent has to be wrapped, tested, and audited. I now run a synthetic adversarial test, modelled after chaos-engineering-style red-team scripts, that fires a denied query through every code path on every release.
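A synthetic adversarial test of that kind can be small. The sketch below calls every registered code path with arguments the policy must deny and reports any path that returns data instead of raising. All names here are hypothetical stand-ins, including the two example paths, and a real gate would drive the server through its JSON-RPC boundary rather than call Python functions directly.

```python
# adversarial_gate.py: illustrative release-gate sketch.
class PolicyViolation(Exception):
    """Stand-in for the wrapper's exception, redefined to stay runnable."""


def leaking_paths(code_paths: dict, denied_args: dict) -> list[str]:
    """Return the names of code paths that answered a denied request."""
    leaks = []
    for name, call in code_paths.items():
        try:
            call(**denied_args)
        except PolicyViolation:
            continue  # correct behaviour: the path refused the call
        leaks.append(name)  # the path returned data it should have refused
    return leaks


# Hypothetical stand-ins for two code paths inside a query tool.
def checked_query(table: str):
    if table == "payment_methods":
        raise PolicyViolation(f"table {table} is denied")
    return []


def cache_fast_path(table: str):
    # Simulates the bug: cached rows returned without a policy check.
    return [{"table": table, "row": "cached"}]


if __name__ == "__main__":
    paths = {"checked_query": checked_query,
             "cache_fast_path": cache_fast_path}
    leaks = leaking_paths(paths, {"table": "payment_methods"})
    # A release gate would fail the build when this list is non-empty.
    print(leaks)
```

The value of the gate depends on the registry of code paths being complete, so it helps to make adding a path to that registry part of the review checklist for any new cache or fast path.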

The fix was to push the policy check to the absolute boundary of the server, at the JSON-RPC handler in the MCP SDK, so no code path can return data to the agent without passing through the policy wrapper. The audit log now records both the request hash and the response hash, so any divergence between what was approved and what was returned is detectable in the log itself. The synthetic adversarial test is a release gate.

flowchart TB
  R[JSON-RPC Request]
  R --> P{Policy Wrapper}
  P -->|denied| D[Audit: call_denied]
  P -->|allowed| H[Tool Handler]
  H --> C{Cache hit?}
  C -->|yes| CC[Cached Result]
  C -->|no| Q[Query Backend]
  Q --> CR[Cache + Return]
  CC --> A[Audit: call_succeeded with cached=true]
  CR --> A2[Audit: call_succeeded with cached=false]
  A --> Out[Response to Agent]
  A2 --> Out
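Recording a request hash and a response hash makes the divergence described above checkable from the log alone. The sketch below is one possible reading of that check, not the companion repo's implementation: it hashes a canonical JSON form of the request, then flags any request hash that has more returned responses than approved attempts. The function names and the flat event shape (`type` plus `request_hash`) are assumptions made for illustration.

```python
# divergence_check.py: illustrative sketch, assumed event shape.
import hashlib
import json
from collections import Counter


def canonical_hash(payload) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                         default=str)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def divergent_requests(events: list[dict]) -> list[str]:
    """Return request hashes with more responses than approved attempts."""
    attempted, denied, returned = Counter(), Counter(), Counter()
    for ev in events:
        h = ev["request_hash"]
        if ev["type"] == "call_attempted":
            attempted[h] += 1
        elif ev["type"] == "call_denied":
            denied[h] += 1
        elif ev["type"] == "call_succeeded":
            returned[h] += 1
    # approved = attempted - denied; any surplus response is a divergence.
    return sorted(h for h in returned
                  if returned[h] > attempted[h] - denied[h])


if __name__ == "__main__":
    h = canonical_hash({"tool": "query", "table": "payment_methods"})
    events = [
        {"type": "call_attempted", "request_hash": h},
        {"type": "call_denied", "request_hash": h},
        {"type": "call_succeeded", "request_hash": h},  # should not exist
    ]
    print(divergent_requests(events) == [h])
```

Counting per hash rather than per event keeps the check tolerant of legitimate retries, where the same request is denied once by a rate limit and approved later.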

Sandbox Runtime Comparison

Picking the right sandbox runtime is a cost-versus-blast-radius trade-off. I have run all four of the options below in production, and the table below is the rough decision matrix I use.

| Property | Docker + seccomp | gVisor (runsc) | Firecracker | WASI (Wasmtime) |
|---|---|---|---|---|
| Startup time | ~150ms | ~250ms | ~125ms | ~5ms |
| Memory overhead | ~10MB | ~30MB | ~5MB | ~1MB |
| Syscall performance | Native | -10 to -25% | Native | N/A (no syscalls) |
| Kernel attack surface | Full host kernel | gVisor user kernel | Dedicated kernel | None |
| File-system isolation | Mount namespace | Mount + intercept | Full VM | Capability-based |
| Network isolation | Net namespace | Net namespace | Full VM | None by default |
| Operational complexity | Low | Medium | High | Low |
| Best for | Single trusted server | Untrusted or third-party servers | Multi-tenant, regulated | Pure-compute tools |

A practical rule of thumb. Single-team deployment, internal MCP servers you own end to end: Docker with seccomp and cgroups is fine. Multi-team deployment, MCP servers from a public registry: gVisor. Multi-tenant SaaS where tenants bring their own MCP servers: Firecracker. Pure-compute tools that do not need network or database access: WASI.
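If the rule of thumb is going to live in a deployment template rather than in people's heads, it can be encoded directly. The sketch below is one way to do that. The function name and the trust labels (`owned`, `third_party`, `tenant_supplied`) are hypothetical, chosen to mirror the four cases above.

```python
# pick_runtime.py: illustrative encoding of the rule of thumb.
RUNTIME_BY_TRUST = {
    "owned": "docker",               # internal servers you own end to end
    "third_party": "gvisor",         # servers pulled from a public registry
    "tenant_supplied": "firecracker",  # tenants bring their own servers
}


def pick_runtime(needs_io: bool, trust: str) -> str:
    """Map a server's I/O needs and trust level to a sandbox runtime."""
    if trust not in RUNTIME_BY_TRUST:
        # Fail closed: refuse to guess for an unclassified server.
        raise ValueError(f"unknown trust level: {trust!r}")
    if not needs_io:
        # Pure-compute tools with no network or database access.
        return "wasi"
    return RUNTIME_BY_TRUST[trust]


if __name__ == "__main__":
    print(pick_runtime(needs_io=True, trust="third_party"))
```

Raising on an unknown trust label forces every new server to be classified before it can be deployed, which is the same fail-closed stance the policy wrapper takes.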

Comparison visual showing four sandbox runtimes (Docker, gVisor, Firecracker, WASI) as columns with rows for startup time, kernel attack surface, isolation strength, and operational complexity, color-coded green/yellow/red

Production Considerations

Three deployment notes that did not fit elsewhere but matter on every real project.

First, supply-chain hygiene. Every MCP server pulled from a public registry should be pinned to a specific version, scanned with a software composition tool such as Trivy or Grype, and reviewed for transitive dependencies before deployment. The 2025 npm postinstall attacks against AI-tooling packages produced a class of compromise that no runtime sandbox alone catches, because the malicious code runs at install time, not at request time. Treat MCP servers as third-party code with the same review bar as any other external dependency.

Second, secret handling. Database credentials, API keys, and OAuth tokens used by MCP servers should be mounted at runtime as files, not as environment variables. Environment variables leak through /proc/<pid>/environ, through error reporting tools that capture process state, and through any subprocess the server spawns. Mounted secret files with strict permissions and a process that reads them once at startup are the safe default. Most modern container orchestrators support this directly.
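A minimal sketch of the read-once pattern follows. It checks that the secret file is not accessible to group or others before reading it. `load_secret` and `InsecureSecretFile` are hypothetical names, and the permission rule is an assumption about what "strict" should mean here. Some orchestrators mount secrets world-readable by default, so the file mode may need to be set explicitly for this check to pass.

```python
# secrets_loader.py: illustrative sketch for POSIX hosts.
import os
import stat


class InsecureSecretFile(Exception):
    """Raised when a secret file is accessible beyond its owner."""


def load_secret(path: str) -> str:
    """Read a mounted secret once, refusing files with loose permissions."""
    mode = os.stat(path).st_mode
    # Reject any read, write, or execute bit for group or others.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise InsecureSecretFile(
            f"{path} is accessible by group or others "
            f"(mode {stat.S_IMODE(mode):o})")
    with open(path, encoding="utf-8") as fh:
        # Strip the trailing newline most secret files end with.
        return fh.read().strip()
```

In the Compose stanza shown earlier, the call at startup would be `load_secret(os.environ["PG_DSN_FILE"])`: the environment carries only the path, never the credential itself.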

Third, observability for the agent-MCP boundary should ride on OpenTelemetry GenAI conventions, the same conventions covered in the OpenTelemetry GenAI Conventions post. Every tool span should carry the policy-check outcome, the cap-exceeded events as span events, and the conversation ID as a span attribute. Wire these spans into the same backend that handles the agent's LLM spans, and a 2am incident becomes a single trace query instead of a four-hour log dive.

gantt
    title MCP server hardening rollout (typical 2-week project)
    dateFormat  YYYY-MM-DD
    section Inventory
    Catalogue all MCP servers       :a1, 2026-04-30, 2d
    Score each by threat surface    :a2, after a1, 2d
    section Wrap
    Add policy wrapper, fail-closed :b1, after a2, 3d
    Add audit logger to wrapper     :b2, after b1, 2d
    section Sandbox
    Containerise + seccomp + caps   :c1, after b2, 3d
    Move untrusted servers to gVisor :c2, after c1, 2d
    section Verify
    Synthetic adversarial test gate :d1, after c2, 2d
    Postmortem template + runbook   :d2, after d1, 1d

Conclusion

The MCP ecosystem in 2026 is at the same maturity stage that web APIs were in around 2008. The protocol works, the tooling is improving fast, and the operational story is still being written. The teams that are not getting paged at 2am on a Saturday are the ones that have decided not to trust the MCP server. They wrap every server in a policy layer, run every server inside a sandbox, log every call to a tamper-evident audit trail, and treat third-party servers with the same supply-chain rigour as any other external dependency.

If you take one thing from this post, take this: the hardest part is not the sandbox runtime, the policy DSL, or the audit-log schema. The hardest part is making the deployment template the path of least resistance, so that the next engineer who adds an MCP server gets the wrapper, the cap, and the log for free without thinking about it. A platform team that ships an mcp-server Helm chart with policy and sandbox baked in will out-secure a platform team that ships a wiki page about best practices, every day of the week.

Working code for the policy wrapper, the audit logger, the seccomp profile, and the gVisor deployment is in the companion repo at github.com/amtocbot-droid/amtocbot-examples/tree/main/mcp-defensive-sandbox.

Sources

  1. Model Context Protocol specification, version 1.0: modelcontextprotocol.io/specification
  2. gVisor documentation and runsc runtime: gvisor.dev/docs
  3. Firecracker microVM design and performance: firecracker-microvm.github.io
  4. WASI Preview 2 specification and Wasmtime runtime: wasi.dev and wasmtime.dev
  5. OpenTelemetry GenAI semantic conventions: opentelemetry.io/docs/specs/semconv/gen-ai
  6. EU AI Act Article 12 (record-keeping requirements): artificialintelligenceact.eu/article/12
  7. CloudEvents 1.0 specification: cloudevents.io/spec

Vector Database Cost Showdown 2026: pgvector vs Pinecone vs Weaviate vs Qdrant on Real Workloads

Hero image showing four vector database logos arranged around a glowing dollar-sign cost graph, dark technical aesthetic with cyan and amber accents

Introduction

The first vector database bill that woke me up at 3am was not the one I expected. We had built a RAG-powered customer support agent for a mid-market SaaS company, indexed about 4.2 million chunks of documentation across roughly 800 customer accounts, and shipped to production in late January 2026. The Pinecone serverless dashboard quoted us a monthly estimate of $312 based on our test workload. The first real production week landed at $1,847. The second week was $2,610. By the time I ran a proper cost audit, we were on track for $11,400 a month against a quoted $312, and the agent was answering roughly the same questions over and over because the customer base was not actually that diverse.

The problem was not Pinecone. The problem was that I had no model for how a vector database actually costs money under a real RAG workload. I assumed cost scaled with stored vectors. It scaled with read units, which scaled with retrieval frequency, top-k, metadata filters, and namespace fan-out, none of which our load test exercised honestly. After two weeks of pulling per-namespace metrics and rewriting the retrieval layer, we cut the bill to about $620 a month without changing the agent's behaviour. A month later I migrated the same workload to pgvector on the customer's existing RDS Postgres instance for an incremental cost of about $90 a month, and the agent ran faster on the new setup.

This post is the comparison I wish I had done before that incident. I have run the same RAG retrieval benchmark, 1.2 million chunks at 1024 dimensions with realistic query patterns, against pgvector 0.8 on Postgres 18, Pinecone serverless on the standard plan, Weaviate Cloud Standard, and Qdrant Cloud Standard, and I have priced each at 1M-vector and 100M-vector scales using public pricing as of April 2026. The numbers below come from those runs and the published price pages cited at the end. Where I am quoting a benchmark from someone else, I cite it inline.

The Problem: Vector Database Cost Is Not Storage Cost

Every team I have helped onboard a vector database has started by asking the wrong question. They ask "how much does it cost to store ten million embeddings". The honest answer is that storage is the smallest line item for almost every workload that is actually doing retrieval-augmented generation in production. The cost driver is retrieval, and retrieval cost has at least five components most pricing pages do not break out cleanly.

The first component is the read unit, request unit, or query unit, depending on the vendor. Pinecone serverless prices reads in 4 KB-aligned chunks. Weaviate Cloud bills query operations as a function of the SLA tier. Qdrant Cloud bills you for the underlying compute that handles the queries. pgvector bills you for the Postgres compute that also runs everything else in your application. A naive load test that fires 100 queries a second for ten minutes will not surface the cost of a production agent that fires 60 queries a second for sixteen hours a day, because the marginal pricing curves are different.
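The gap between that load test and the production pattern is easier to see as raw query counts. The sketch below covers only volume, using the two traffic shapes named above; it says nothing about vendor tiering, which would sit on top of these numbers. The function name is hypothetical.

```python
# query_volume.py: illustrative arithmetic only.
def total_queries(qps: float, seconds: float) -> int:
    """Total queries issued at a steady rate over a time span."""
    return round(qps * seconds)


load_test = total_queries(100, 10 * 60)          # 100 qps for ten minutes
production_day = total_queries(60, 16 * 3600)    # 60 qps for sixteen hours

print(load_test)                    # 60000
print(production_day)               # 3456000
print(production_day / load_test)   # 57.6
```

A single production day carries about 58 times the volume of the whole load test, so any per-query cost the test hid gets multiplied by that factor every day.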

The second component is metadata filtering. Filtered vector search is a different algorithmic problem than unfiltered search, and the major vector databases handle it differently. Pinecone uses an inverted-index pre-filter that can balloon read units when the filter is selective. Weaviate's ACORN-1 filter strategy, available since v1.27, blends pre-filter and post-filter and tends to keep cost stable. Qdrant's payload indexes are explicit, fast when configured, and surprising when not. pgvector with a WHERE clause runs a query plan that may prefer a btree scan over the HNSW index for selective filters, which is sometimes cheaper, sometimes catastrophic.

The third component is index build cost. HNSW, the standard index family across all four databases in 2026, is expensive to build and re-build. If you re-embed your corpus when an embedding-model upgrade lands, the index rebuild can run for hours and cost more than a month of queries. Pinecone hides this in your namespace upsert cost. Weaviate and Qdrant expose it as compute time on the cluster. pgvector lets you watch every CPU core spin in your Postgres container.

The fourth component is namespace and tenant fan-out. Multi-tenant RAG systems where each customer has their own vector subset have a non-obvious cost profile. Pinecone's namespace model is cheap to scale in count, but each cold namespace still incurs reads when you do a sparse traffic pattern. Weaviate Multi-Tenancy, which became the default in v1.25, charges per-tenant on the SLA tier. Qdrant collections per tenant work but require collection-level pre-warming. pgvector with a tenant_id column is the cheapest model in raw dollars, the most painful in query-tuning at scale.

The fifth component is egress and network. This is the line nobody reads on the pricing page until the bill arrives. Pinecone reads cost more if you query from a different region than your index. Weaviate Cloud charges egress out of its managed VPC. Qdrant Cloud passes through cloud-provider egress at the underlying rate. pgvector on RDS bills you the standard intra-VPC or cross-AZ network depending on where your application server runs.

Architecture diagram showing the five cost components of a vector database system: read units, metadata filters, index build, namespace fan-out, and network egress, with arrows feeding into a central monthly bill calculator

How The Four Databases Charge In 2026

Each of the four databases has its own pricing model. Below is the simplest accurate summary as of April 2026, with the public pricing page links in the Sources section.

pgvector 0.8 On Self-Managed Postgres

pgvector is a Postgres extension. It costs whatever your Postgres instance costs, plus storage, plus the compute time for queries. There is no separate read-unit meter. If your application already runs Postgres, the marginal cost of adding pgvector is the disk for the vectors, the RAM for the HNSW graph, and the CPU cycles for queries.

For a 1M-vector, 1024-dim corpus, the HNSW index with default parameters consumes about 5GB of RAM and roughly 9GB of disk in the v0.7 halfvec format. A db.r7g.large instance on AWS RDS at $0.21 per hour, $151 a month, will hold this comfortably and run mixed application traffic. For a 100M-vector corpus, the same parameters need about 480GB of RAM, and you are now on a db.r7g.16xlarge or larger, $3,360 a month, before storage and IO. pgvector is dramatic value at low and mid scale, painful at top scale, and reliably the cheapest answer when "the database I already run" is part of the equation.

Pinecone Serverless

Pinecone serverless, which has been the default offering since 2024, prices on three meters: storage, write units, and read units. Storage is $0.33 per GB per month. Writes are $4.00 per million write units. Reads are $16.00 per million read units. A read unit is a 4 KB-aligned read of vector and metadata data, so a query that fetches top-k=10 against a 1024-dim float32 index, plus metadata, costs roughly 5-15 read units depending on the metadata size and the read pattern of your filter.

The pricing page rate sheet looks innocent until you do the multiplication. A production agent that hits the index 1.5 times per user turn, runs 10,000 user turns per day, with top-k=20 and modest metadata, will burn about 600 read units per query (900 per turn), 9 million read units per day, $144 per day, and roughly $4,300 per month, against a vector storage line of maybe $40. Pinecone is great when your traffic is predictable and your top-k is small, and expensive when they are not. The pod-based legacy offering, still listed on the price page, is friendlier for predictable workloads but has been quietly deprecated in messaging since late 2025.
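Here is that multiplication as a small sketch you can rerun with your own numbers. It assumes the 600 read units apply to each index query, which is the reading that reproduces the 9 million daily figure, and a 30-day month. The function and constant names are mine, not Pinecone's.

```python
READ_PRICE_PER_MILLION_USD = 16.00  # read-unit rate quoted above


def read_cost(queries_per_turn, turns_per_day, read_units_per_query, days=30):
    """Return (daily read units, daily USD, monthly USD) for a read workload."""
    daily_units = queries_per_turn * turns_per_day * read_units_per_query
    daily_usd = daily_units / 1_000_000 * READ_PRICE_PER_MILLION_USD
    return daily_units, daily_usd, daily_usd * days


if __name__ == "__main__":
    units, daily, monthly = read_cost(1.5, 10_000, 600)
    print(f"{units:,.0f} read units/day, ${daily:,.0f}/day, ${monthly:,.0f}/month")
```

Halving top-k or trimming the metadata payload moves `read_units_per_query`, which is the only term in this product you usually control.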

Weaviate Cloud Standard

Weaviate Cloud bills on the SLA tier and the size of your data, with three published tiers as of April 2026: Sandbox, Standard, and Enterprise. The Standard tier prices at $25 per month minimum, with a per-million-vectors charge that scales by the SLA you select. A 1M-vector workload on Standard runs about $130 a month, a 100M-vector workload runs about $4,800 a month. ACORN-1 filtered search and async indexing, both stable since 1.27 in 2025, are included.

Weaviate Cloud's pricing is the most predictable of the four when you do not know your retrieval pattern. The trade-off is that it is rarely the cheapest at any scale. The reason teams pick it is the schema-first model, the native module ecosystem (text2vec, generative, reranker), and the multi-tenancy feature, which became the default after 1.25 and is the cleanest on the market for SaaS RAG.

Qdrant Cloud Standard

Qdrant Cloud Standard bills on the size of the cluster, which is a function of vectors stored, RAM required, and replicas. Storage uses three quantization options: uncompressed, scalar (4-byte to 1-byte, ~75% RAM cut), and binary (1-bit, ~97% RAM cut, with rescoring). Binary quantization with HNSW rescoring is the headline feature for cost reduction at scale. A 1M-vector workload at 1024 dimensions on a small Qdrant Cloud cluster runs about $80-120 a month. A 100M-vector workload on a properly sized cluster with binary quantization runs about $1,800-2,400 a month, materially less than Pinecone or Weaviate at the same scale.
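The RAM percentages above fall straight out of the bits stored per dimension. The sketch below is a back-of-envelope estimate only: it ignores the HNSW graph, payloads, and the original vectors that rescoring still needs to read from somewhere. The mode names are my own labels, not Qdrant configuration values.

```python
BITS_PER_DIM = {"uncompressed": 32, "scalar": 8, "binary": 1}


def vector_ram_gb(n_vectors: int, dims: int, mode: str) -> float:
    """Approximate RAM for the in-memory vector representation, in GB."""
    return n_vectors * dims * BITS_PER_DIM[mode] / 8 / 1e9


def ram_cut(mode: str) -> float:
    """Fractional RAM reduction relative to uncompressed float32."""
    return 1 - BITS_PER_DIM[mode] / BITS_PER_DIM["uncompressed"]


if __name__ == "__main__":
    for mode in BITS_PER_DIM:
        gb = vector_ram_gb(100_000_000, 1024, mode)
        print(f"{mode:>12}: {gb:7.1f} GB, {ram_cut(mode):.1%} cut")
```

At 100M vectors the difference between roughly 410 GB and roughly 13 GB of resident vector data is what drives the cluster size, and therefore the bill.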

Qdrant's pricing model rewards you for understanding your workload. If you do not, the cluster is over-provisioned and you pay for the slack. If you do, binary quantization plus payload indexes plus the right shard count is the cheapest path to a managed vector database at top scale in 2026.

The Benchmark: 1.2M Chunks, 1024 Dim, Realistic Query Pattern

I ran the same retrieval benchmark against all four databases in early April 2026. The corpus was 1.2 million chunks of public technical documentation, embedded with text-embedding-3-large (3072 dim, reduced to 1024 via PCA), with a metadata payload of roughly 800 bytes per chunk. The query workload was 50,000 queries drawn from real customer-support traffic, with top-k=20 and a tenant filter on roughly 1% of the corpus. Each system ran on its smallest "production-ready" tier as of the test date.

                 p50    p95    p99    qps     monthly cost ($USD, est.)
pgvector v0.8    14ms   38ms   91ms   180     $151 (db.r7g.large + storage)
Pinecone serv.   22ms   54ms   87ms   140     $487 (serverless reads + storage)
Weaviate Cloud   18ms   46ms   78ms   170     $128 (Standard tier)
Qdrant Cloud     11ms   31ms   62ms   210     $115 (small cluster, scalar quant)

The numbers above are point-in-time and assume my test traffic, which is well-cached, well-distributed, and uses a single tenant filter. Your numbers will differ. Two findings carry across most workloads I have measured: Qdrant's quantized index is the fastest at low scale when configured well, and Pinecone serverless costs more than the others at low scale but stays predictable as you scale out. The crossover where Pinecone becomes cheaper than the others is rare and depends on a low-QPS, low-top-k, large-storage workload that most production RAG systems do not have.

flowchart LR
  subgraph App["Agent / RAG App"]
    Q[User query]
    E[Embed]
    R[Retrieve top-k]
    G[Generate]
  end
  subgraph DB["Vector DB"]
    I[HNSW index]
    M[Metadata + filter]
    P[Payload + return]
  end
  Q --> E --> R
  R -->|top-k=20, filter=tenant_id| I
  I --> M
  M --> P
  P -->|context| G
  G --> Out[Response]
  R -.cost.- I
  I -.cost.- M
  M -.cost.- P

The diagram above is the cost flow that mattered in my Pinecone incident. Every query fans out into the index, the metadata, and the payload return. Each of those touches a meter on the pricing page. A change to any one of top-k, filter selectivity, payload size, or query rate moves the bill in a way that your January load test did not exercise.

Hidden Cost #1: The Re-Embedding Storm

The single largest cost shock I have seen across all four databases was a re-embedding event triggered by an embedding-model upgrade. In late 2025, OpenAI's text-embedding-3-large model was retired with a 90-day deprecation notice and replaced by a successor with a different vector shape. Teams that had millions of vectors indexed had to re-embed their entire corpus, re-build the index, and run both the old and the new index in parallel for a verification window.

For a 100M-vector corpus, the re-embedding API spend was on the order of $30,000 at OpenAI's published rate. The vector-database-side cost was a separate hit. Pinecone billed write units against the re-upsert. Weaviate Cloud's index rebuild was a multi-hour cluster task. Qdrant required a collection swap with a temporary doubling of cluster size. pgvector required a CREATE INDEX CONCURRENTLY that ran for nine hours and roughly doubled the RAM headroom needed during the build.

If you do not budget for re-embedding events on a 12-18 month cycle, your annual cost-of-ownership for any vector database is materially understated. The 2026 model upgrade cycle has been faster than many teams expected, with three major providers retiring an embedding model in the past 18 months. Treat re-embedding cost as a line item, not a surprise.
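One way to turn that advice into a line item is to amortise the expected event cost over the cycle. A minimal sketch, assuming the $30,000 API figure above and treating the database-side costs as a separate addition; the function name is hypothetical.

```python
def monthly_reembedding_reserve(event_cost_usd: float, cycle_months: int) -> float:
    """Monthly amount to set aside so one re-embedding event is pre-funded."""
    if cycle_months <= 0:
        raise ValueError("cycle_months must be positive")
    return event_cost_usd / cycle_months


if __name__ == "__main__":
    for months in (12, 18):
        reserve = monthly_reembedding_reserve(30_000, months)
        print(f"{months}-month cycle: ${reserve:,.2f}/month")
```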

Hidden Cost #2: The Selective-Filter Pothole

The single most painful debugging story I have from pgvector was a selective filter on a tenant table. Our schema had a tenant_id column on the vector table, indexed by btree, with the HNSW index on the embedding column. For a query like:

SELECT id, content
FROM chunks
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 20;

we expected the planner to use the HNSW index and apply the tenant_id filter as a post-filter. For tenants with thousands of chunks, this worked fine. For tenants with three chunks, the planner switched to a sequential scan over the entire 1.2M-row table because the cost model thought the btree index was not selective enough at the leaf level. The query dropped from 14ms to 4.2 seconds. We caught it because Postgres auto_explain logged the plan flip on a customer demo.

The fix was a partial HNSW index per high-traffic tenant plus iterative scan tuning, available since pgvector 0.8. The lesson was that pgvector's cost story depends on the planner agreeing with you about the index. Pinecone, Weaviate, and Qdrant have their own version of this gotcha. Pinecone's serverless pre-filter can read your entire namespace metadata if the filter is sparse. Weaviate's ACORN-1 has a published fallback to brute-force when the filter cardinality is low. Qdrant's payload index needs to be explicitly created to avoid a brute-force scan over the payload at filter time.
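For the pgvector side of that fix, the statements look roughly like the ones this helper generates. Treat it as a sketch under assumptions: the table and column names come from the query above, the index naming scheme is made up, tenant_id is assumed to be an integer, and vector_cosine_ops is chosen because the query uses the <=> operator. hnsw.iterative_scan is the pgvector 0.8 setting; check the values your version accepts before relying on it.

```python
def partial_hnsw_index_sql(tenant_id: int, table: str = "chunks",
                           column: str = "embedding") -> str:
    """Build a CREATE INDEX statement for a per-tenant partial HNSW index."""
    if not isinstance(tenant_id, int):
        # Refuse anything that is not a plain integer so the value is safe to inline.
        raise TypeError("tenant_id must be an int")
    index_name = f"{table}_{column}_hnsw_t{tenant_id}"  # hypothetical naming scheme
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WHERE tenant_id = {tenant_id};"
    )


# Session-level setting that lets the HNSW scan keep going until enough
# rows survive the filter, instead of returning a short result set.
ITERATIVE_SCAN_SQL = "SET hnsw.iterative_scan = strict_order;"

if __name__ == "__main__":
    print(partial_hnsw_index_sql(42))
    print(ITERATIVE_SCAN_SQL)
```

The partial index only helps when the query's WHERE clause matches the index predicate, so this is worth doing for the handful of tenants that dominate traffic, not for every tenant.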

In every case, the published p95 latency on the vendor's website assumes a typical filter workload. Your atypical filter is where the cost surprise lives. Always run your benchmark on your real filter distribution.

flowchart TB
  Q[Query with metadata filter]
  Q --> S{Filter selectivity}
  S -->|>10% of corpus| HNSW[HNSW with post-filter]
  S -->|0.1-10%| HYB[Hybrid: pre-filter then HNSW]
  S -->|<0.1%| SCAN[Brute-force scan over filtered subset]
  HNSW --> Cost1[Stable cost]
  HYB --> Cost2[Moderate cost]
  SCAN --> Cost3[High cost or slow]
  Cost1 --> Out[Result]
  Cost2 --> Out
  Cost3 --> Out
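
The same decision can be written as a function you run over your real tenant distribution before a benchmark. The thresholds are the ones in the flowchart above; the strategy labels are illustrative and do not correspond to any vendor's configuration values.

```python
def retrieval_strategy(matching_rows: int, corpus_rows: int) -> str:
    """Classify a filtered query by the share of the corpus its filter matches."""
    if corpus_rows <= 0:
        raise ValueError("corpus_rows must be positive")
    selectivity = matching_rows / corpus_rows
    if selectivity > 0.10:
        return "hnsw_post_filter"        # stable cost
    if selectivity >= 0.001:
        return "hybrid_pre_filter"       # moderate cost
    return "brute_force_filtered_subset"  # high cost or slow


if __name__ == "__main__":
    corpus = 1_200_000
    for rows in (3, 12_000, 600_000):
        print(f"{rows:>7} matching rows -> {retrieval_strategy(rows, corpus)}")
```

Counting how many tenants fall into each bucket tells you which of the three cost regimes your benchmark actually needs to exercise.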

Hidden Cost #3: Backups, DR, and Compliance

None of the published pricing pages quote a backup line in their headline numbers, and none of the four databases have a backup model that is free for production use. Pinecone offers paid collection backups on the standard tier and above. Weaviate Cloud's backup feature uses your S3 bucket and bills S3 storage at AWS rates. Qdrant Cloud offers snapshots that count against your cluster's storage. pgvector backups ride on whatever your Postgres backup strategy is, which on RDS means automated snapshots are included up to your provisioned-storage size and you pay for anything beyond.

For EU AI Act Article 12 record-keeping compliance, in force from August 2026 for high-risk systems, a 90-day retention requirement on the vectors and the queries that produced retrieved-context decisions adds a non-trivial storage line. Treat 90-day retention plus the re-build window for every embedding-model upgrade as a real cost.

The Decision Matrix

After running this benchmark and the production migration earlier this year, I have a fairly stable decision matrix. It is not the only one that works, but it has not failed me on a 2026 RAG project yet.

Workload | First choice | Second choice | Avoid
<1M vectors, you already run Postgres | pgvector | Qdrant Cloud | Pinecone
1M-10M, multi-tenant SaaS RAG | Weaviate Cloud | Qdrant Cloud | pgvector at the high end
10M-100M, predictable read pattern | Qdrant Cloud (binary quant) | Weaviate Cloud Enterprise | Pinecone unless top-k is tiny
10M-100M, unpredictable burst traffic | Pinecone serverless | Qdrant Cloud with autoscale | self-hosted anything
Compliance-heavy, EU residency | Weaviate Cloud (EU) or self-hosted Qdrant | pgvector on EU RDS | Pinecone unless their EU region fits
Sub-1M, prototype | pgvector or Qdrant Cloud Sandbox | Weaviate Sandbox | Pinecone (overkill at this scale)
Comparison visual showing four columns labeled pgvector, Pinecone, Weaviate, Qdrant with green/yellow/red dots across rows for cost, latency, multi-tenancy, ops overhead, and EU residency
flowchart TD
  Start[New RAG project]
  Start --> Q1{Already run Postgres?}
  Q1 -->|Yes, <10M vectors| Pgvec[pgvector 0.8]
  Q1 -->|No, or >10M| Q2{Multi-tenant SaaS?}
  Q2 -->|Yes| Q3{EU residency required?}
  Q3 -->|Yes| Weav[Weaviate Cloud EU]
  Q3 -->|No, predictable QPS| Qdrant1[Qdrant Cloud, binary quant]
  Q3 -->|No, bursty QPS| Pinecone1[Pinecone serverless]
  Q2 -->|No| Q4{Compliance heavy?}
  Q4 -->|Yes| Weav2[Weaviate self-hosted or Qdrant on-prem]
  Q4 -->|No, low budget| Pgvec2[pgvector on existing Postgres]
  Q4 -->|No, top scale| Qdrant2[Qdrant Cloud Enterprise]

Production Considerations

Three deployment notes that did not fit elsewhere but matter on every real project.

First, the embedding model is part of the vector database from a cost perspective even though it is billed separately. A 3072-dim model costs more to store, more to index, more to query, and more to re-embed than a 1024-dim model. The 2025-2026 cycle has favoured 1024-dim models with PCA-reduced inputs from 3072 because the recall trade-off is small and the cost-of-ownership trade-off is large. Run a recall@k test on your domain before you commit to a dimensionality.

Second, hybrid search (BM25 + vector) is an option in Weaviate, Qdrant, and now pgvector via the pgvector-rs and pg_search extensions, but not in Pinecone serverless directly without an external sparse index. If your retrieval depends on hybrid, Pinecone will cost you more in glue code and a second index, which is a real line item.

Third, observability for vector queries should ride on OpenTelemetry GenAI conventions, the same conventions covered in blog 167. Treat retrieval as an instrumented step in the trace, attach db.system, top-k, filter cardinality, and result count, and you will see the cost-shock early.
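A small helper keeps those retrieval attributes consistent across services. This is a sketch: db.system is the attribute named above, while the retrieval.* keys are placeholder names I made up for illustration, so align them with whatever your team standardises on.

```python
def retrieval_span_attributes(db_system: str, top_k: int,
                              filter_cardinality: int, result_count: int) -> dict:
    """Build the attribute dict to attach to a retrieval span."""
    return {
        "db.system": db_system,
        # The keys below are hypothetical names, not part of a published convention.
        "retrieval.top_k": top_k,
        "retrieval.filter.cardinality": filter_cardinality,
        "retrieval.result.count": result_count,
        # Fewer results than requested often means the filter was too narrow.
        "retrieval.underfilled": result_count < top_k,
    }


if __name__ == "__main__":
    print(retrieval_span_attributes("qdrant", top_k=20,
                                    filter_cardinality=12_000, result_count=20))
```

The returned dict can be passed as the attributes argument when you start the retrieval span, for example tracer.start_as_current_span("retrieve", attributes=...).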

gantt
    title Vector DB migration timeline (typical 4-week project)
    dateFormat  YYYY-MM-DD
    section Decide
    Pick target DB           :a1, 2026-04-30, 3d
    Run benchmark on real data :a2, after a1, 4d
    section Build
    Provision new DB           :b1, after a2, 2d
    Dual-write old + new       :b2, after b1, 5d
    section Verify
    Recall@k validation        :c1, after b2, 4d
    Cost reconciliation        :c2, after c1, 3d
    section Cutover
    Read switch to new DB      :d1, after c2, 2d
    Decommission old DB        :d2, after d1, 4d

Conclusion

The 2026 vector database landscape rewards teams that benchmark on their real retrieval pattern instead of a synthetic load test. pgvector wins at low scale when you already run Postgres. Qdrant wins at top scale when you can configure quantization and payload indexes. Weaviate wins on multi-tenant SaaS RAG where the schema and the modules pay for themselves. Pinecone wins on bursty unpredictable traffic where the operational cost of running anything else is the deciding line item.

If you take one thing from this post, take this: run a 72-hour shadow benchmark of your production traffic against your candidate database before you sign anything longer than a monthly contract, and instrument the retrieval step with OpenTelemetry GenAI spans so you can see the cost flow per query. The $11,400-vs-$312 surprise that woke me up at 3am was avoidable if I had measured retrieval, not storage. Yours will be too.

Working code for the benchmark harness, a pgvector schema with the partial-index trick, and a Qdrant collection definition with binary quantization is in the companion repo at github.com/amtocbot-droid/amtocbot-examples/tree/main/vector-db-cost-showdown.

Sources

  1. pgvector 0.8 release notes and HNSW tuning guide: github.com/pgvector/pgvector
  2. Pinecone Serverless pricing: pinecone.io/pricing
  3. Weaviate Cloud pricing and ACORN filter strategy: weaviate.io/pricing
  4. Qdrant Cloud pricing and quantization guide: qdrant.tech/pricing and qdrant.tech/documentation/guides/quantization
  5. OpenTelemetry GenAI semantic conventions: opentelemetry.io/docs/specs/semconv/gen-ai
  6. EU AI Act Article 12 (record-keeping requirements): artificialintelligenceact.eu/article/12

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

Published: 2026-04-30 · Written with AI assistance, reviewed by Toc Am.

Wednesday, April 29, 2026

OpenTelemetry GenAI Conventions: A Practical Guide to LLM Span Attributes for Production Observability

Hero diagram showing an LLM call wrapped in an OpenTelemetry span with attributes streaming into a collector and on to multiple backends, dark technical aesthetic with glowing trace lines

Introduction

The first time I watched a production AI agent silently double-bill a customer, the trace existed. We had wired every LLM call in our agent loop through a homegrown wrapper that logged a JSON line per call: model, prompt token count, completion token count, latency, cost. The wrapper was clean. The dashboards were clean. The Saturday-morning page that woke me up was a customer email forwarded by support: "I see two charges of $1,840.00 on my invoice." Our wrapper had logged both calls. The wrapper had not connected them to the same conversation. The agent had hit a planning step, stalled, restarted from a saved state, and re-entered the tool that issued a Stripe charge. Two hours of our wrapper's logs sat in Datadog with no parent-child relationship between the rogue retry and the original conversation. We could see the calls. We could not see the run.

The fix was not more logging. The fix was switching every LLM and tool call to emit a proper OpenTelemetry span with the GenAI semantic conventions, which became stable in OpenTelemetry 1.27 in late 2025 and have been the production standard through 2026. Once we did that, the rogue retry was three clicks away in any OTel-compatible backend. The parent-child relationship was implicit. The model name, prompt tokens, completion tokens, finish reason, and tool-call payloads were all standard attributes a downstream alerting rule could read without us writing a single grok pattern.

This post is the practical guide I wish I had on that Saturday morning. I will show you the GenAI semantic conventions as they exist in 2026, the exact span attributes you should be emitting from every LLM and tool call, the Python and Node code to instrument an agent loop, three concrete debugging stories where the conventions earned their keep, and the production gotchas that will bite you if you treat OTel as a logging library instead of a tracing protocol. Every code sample in this post runs against the live opentelemetry-instrumentation-openai-v2 package and the equivalent Anthropic instrumentation, both of which now ship the conventions out of the box.

The Problem: Why Homegrown LLM Logging Always Breaks

Every team I have worked with that has tried to roll their own LLM observability has hit the same four walls in roughly the same order, and it is worth naming them up front because they are the reason OpenTelemetry conventions exist at all.

The first wall is parent-child relationships. An agent loop is a tree. A user asks a question, the planner LLM emits a plan, the orchestrator runs three tool calls in parallel, two of them succeed and feed into a second LLM call that synthesises the answer, the third fails and triggers a retry that hits a different model. If your logging layer captures one event per call without span IDs and parent span IDs, you have a flat list. Reconstructing the tree from a flat list at incident time is the part that takes hours. With OTel spans, the tree is the data structure.

The second wall is vendor lock-in. Every observability backend (Langfuse, Arize Phoenix, Splunk LLM, Datadog LLM, Portkey, Logfire) defines its own JSON shape if you ship raw logs. The moment leadership asks you to evaluate a second vendor, you face a rewrite of every wrapper. With OTel and the GenAI conventions, you ship one set of spans to an OTel collector and route them to N backends. I have seen teams swap Langfuse for Phoenix in an afternoon because every span carried gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens in a backend-agnostic shape.

The third wall is cost attribution. Provider invoices arrive monthly, aggregated by API key. Your traces, if they exist, are per-call. Reconciling a $48,000 monthly OpenAI bill against per-conversation traces requires every span to carry the same set of attributes the provider's billing engine considers. The 2026 GenAI conventions do this exactly: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_read_input_tokens. Add a single business attribute like gen_ai.conversation.id and you can answer "which conversation cost us $1,840 last Saturday" with one query.
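To make the "one query" claim concrete, here is a sketch of the aggregation over exported span attributes. The prices are placeholders, not a real rate card, and the code assumes gen_ai.usage.input_tokens already includes the cached tokens; some providers report them separately, in which case drop the subtraction.

```python
from collections import defaultdict

# Placeholder USD prices per million tokens; substitute your provider's rates.
PRICES = {"model-x": {"input": 2.0, "cached_input": 1.0, "output": 8.0}}


def cost_per_conversation(spans):
    """Sum estimated USD cost per gen_ai.conversation.id from span attributes."""
    totals = defaultdict(float)
    for s in spans:
        price = PRICES[s["gen_ai.request.model"]]
        cached = s.get("gen_ai.usage.cache_read_input_tokens", 0)
        # Assumption: input_tokens includes the cached tokens.
        fresh = s["gen_ai.usage.input_tokens"] - cached
        usd = (fresh * price["input"]
               + cached * price["cached_input"]
               + s["gen_ai.usage.output_tokens"] * price["output"]) / 1_000_000
        totals[s["gen_ai.conversation.id"]] += usd
    return dict(totals)


if __name__ == "__main__":
    demo = [
        {"gen_ai.conversation.id": "conv_a", "gen_ai.request.model": "model-x",
         "gen_ai.usage.input_tokens": 1_000_000,
         "gen_ai.usage.cache_read_input_tokens": 500_000,
         "gen_ai.usage.output_tokens": 250_000},
        {"gen_ai.conversation.id": "conv_b", "gen_ai.request.model": "model-x",
         "gen_ai.usage.input_tokens": 100_000, "gen_ai.usage.output_tokens": 0},
    ]
    print(cost_per_conversation(demo))
```

In a real backend this is a GROUP BY on the conversation id; the point of the sketch is that every input it needs is already a standard attribute on the span.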

The fourth wall is regulatory. The EU AI Act Article 12 record-keeping requirements, in force from August 2026 for high-risk systems, require that you can reconstruct the inputs, outputs, and decision context of any AI-driven decision after the fact. Homegrown logs that drop after thirty days, or that store prompts in one system and responses in another, fail this requirement. OTel spans with the GenAI conventions, retained per your data-retention policy in an OTel-compatible store, satisfy the structural part of Article 12 by construction. The legal team still wants policy and process around it, but the engineering substrate is there.

Architecture diagram showing an OTel collector receiving GenAI spans from multiple SDKs and fanning out to Langfuse, Phoenix, Splunk LLM, and a long-term audit store, with an EU AI Act compliance label on the audit store

The OTel GenAI Semantic Conventions: What Goes On A Span

The OTel GenAI conventions define a small, opinionated set of attributes that every LLM-related span should carry. The set has stabilised through 2026 around two span kinds, gen_ai.client.operation for an inference call and gen_ai.tool.call for an agent tool invocation, and one event kind, gen_ai.choice for streamed completions. The full reference lives at opentelemetry.io/docs/specs/semconv/gen-ai/. Here is the practical subset you actually need in production.

For every inference call:

Attribute | Required | Example | Notes
gen_ai.system | yes | openai, anthropic, azure.ai.inference, bedrock | Vendor identifier. Use the canonical short name.
gen_ai.operation.name | yes | chat, text_completion, embeddings | Operation type.
gen_ai.request.model | yes | gpt-4o-2024-11-20, claude-sonnet-4-6 | Exact model id you sent in the request.
gen_ai.response.model | recommended | gpt-4o-2024-11-20 | What the provider routed to. May differ from request when an alias is resolved.
gen_ai.usage.input_tokens | yes when known | 1840 | From the response, not estimated locally.
gen_ai.usage.output_tokens | yes when known | 412 | From the response.
gen_ai.usage.cache_read_input_tokens | yes when prompt caching | 1640 | Anthropic + OpenAI cached input count.
gen_ai.request.temperature | optional | 0.0 | Useful for reproducibility audits.
gen_ai.request.max_tokens | optional | 2048 |
gen_ai.response.finish_reasons | recommended | ["stop"], ["tool_calls"], ["length"] | Critical for debugging tool-call vs natural-stop ambiguity.
gen_ai.response.id | recommended | provider-issued response id | Lets you cross-reference provider logs.

For tool calls:

Attribute | Required | Example | Notes
gen_ai.tool.name | yes | lookup_invoice, charge_card | Same name your agent emits.
gen_ai.tool.call.id | yes | provider-issued call id | Connects the tool call back to the parent inference span.
gen_ai.tool.type | recommended | function, mcp, retrieval | New in the 2026 update for MCP-backed tools.

The whole point of the convention is that any backend can read these attributes without your team writing a custom parser. Langfuse maps gen_ai.usage.input_tokens to its inputTokens field automatically. Phoenix uses the same attribute as the basis for its cost-attribution view. Splunk LLM ingests the attributes as searchable fields. You stop writing glue code and start writing alerts.

Implementation: A Production Agent Loop With OTel GenAI Spans

Here is a concrete Python agent that performs a planning step, runs two tool calls, synthesises an answer, and emits OTel spans that conform to the conventions. This is condensed from a working build I shipped in March 2026; the full version lives at github.com/amtocbot-droid/amtocbot-examples/tree/main/otel-genai-agent.

import json
import os
from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# Note: some semconv releases ship these constants under
# opentelemetry.semconv._incubating.attributes.gen_ai_attributes instead.
from opentelemetry.semconv.attributes.gen_ai_attributes import (
    GEN_AI_SYSTEM,
    GEN_AI_OPERATION_NAME,
    GEN_AI_REQUEST_MODEL,
    GEN_AI_RESPONSE_MODEL,
    GEN_AI_USAGE_INPUT_TOKENS,
    GEN_AI_USAGE_OUTPUT_TOKENS,
    GEN_AI_RESPONSE_FINISH_REASONS,
    GEN_AI_RESPONSE_ID,
    GEN_AI_TOOL_NAME,
    GEN_AI_TOOL_CALL_ID,
)

# One-time tracer setup. In production this lives in a shared module.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=os.environ["OTEL_COLLECTOR_URL"]))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("amtocsoft.agent")

client = OpenAI()

# AGENT_TOOLS (the tool schemas sent to the model) and TOOL_REGISTRY
# (tool name -> Python callable) are defined elsewhere in the full build.

def parse_args(raw_arguments):
    # Tool arguments arrive as a JSON string; assumed here to be a plain json.loads.
    return json.loads(raw_arguments)

def call_llm(messages, model="gpt-4o-2024-11-20", tools=None, conversation_id=None):
    attributes = {
        GEN_AI_SYSTEM: "openai",
        GEN_AI_OPERATION_NAME: "chat",
        GEN_AI_REQUEST_MODEL: model,
    }
    # None is not a valid attribute value, so only attach the id when we have one.
    if conversation_id is not None:
        attributes["gen_ai.conversation.id"] = conversation_id
    request = {"model": model, "messages": messages}
    # Only send the tools field when the caller supplied tools.
    if tools:
        request["tools"] = tools
    with tracer.start_as_current_span(f"chat {model}", attributes=attributes) as span:
        resp = client.chat.completions.create(**request)
        choice = resp.choices[0]
        # Response-side attributes come from the provider, not from local estimates.
        span.set_attribute(GEN_AI_RESPONSE_MODEL, resp.model)
        span.set_attribute(GEN_AI_RESPONSE_ID, resp.id)
        span.set_attribute(GEN_AI_USAGE_INPUT_TOKENS, resp.usage.prompt_tokens)
        span.set_attribute(GEN_AI_USAGE_OUTPUT_TOKENS, resp.usage.completion_tokens)
        span.set_attribute(GEN_AI_RESPONSE_FINISH_REASONS, [choice.finish_reason])
        return resp

def call_tool(tool_name, tool_call_id, args, fn):
    with tracer.start_as_current_span(
        f"tool {tool_name}",
        attributes={
            GEN_AI_TOOL_NAME: tool_name,
            GEN_AI_TOOL_CALL_ID: tool_call_id,
            "gen_ai.tool.type": "function",
        },
    ) as span:
        try:
            result = fn(**args)
            span.set_attribute("gen_ai.tool.outcome", "success")
            return result
        except Exception:
            # start_as_current_span records the exception and sets ERROR status
            # by default when it propagates, so we only tag the outcome here.
            span.set_attribute("gen_ai.tool.outcome", "error")
            raise

def run_agent(user_query, conversation_id):
    with tracer.start_as_current_span(
        "agent.run",
        attributes={"gen_ai.conversation.id": conversation_id},
    ):
        plan = call_llm(
            [{"role": "user", "content": user_query}],
            tools=AGENT_TOOLS,
            conversation_id=conversation_id,
        )
        message = plan.choices[0].message
        if message.tool_calls:
            tool_results = []
            for tc in message.tool_calls:
                result = call_tool(
                    tool_name=tc.function.name,
                    tool_call_id=tc.id,
                    args=parse_args(tc.function.arguments),
                    fn=TOOL_REGISTRY[tc.function.name],
                )
                # Tool message content must be a string.
                content = result if isinstance(result, str) else json.dumps(result)
                tool_results.append({"role": "tool", "tool_call_id": tc.id, "content": content})
            return call_llm(
                [{"role": "user", "content": user_query}, message] + tool_results,
                conversation_id=conversation_id,
            )
        return plan

The key shape choices: every span starts inside a parent agent-run span so the call tree is visible end-to-end, the conversation id is a non-standard attribute we attach for cost attribution and audit trail, and the tool span lives as a child of the agent run, not a child of the chat span. That last choice matters in Story 3 of the debugging stories below.

Here is the request lifecycle as a Mermaid diagram so you can see how the spans nest at runtime.

sequenceDiagram
  participant User
  participant Agent as agent.run span
  participant LLM1 as chat span (planner)
  participant Tool as tool span
  participant LLM2 as chat span (synthesiser)
  participant Collector as OTel Collector
  participant Backend as Phoenix / Langfuse
  User->>Agent: query
  Agent->>LLM1: prompt + tools
  LLM1-->>Agent: tool_calls
  Agent->>Tool: lookup_invoice
  Tool-->>Agent: result
  Agent->>LLM2: prompt + tool result
  LLM2-->>Agent: final answer
  Agent-->>User: answer
  par span export
    Agent-->>Collector: spans batch
    LLM1-->>Collector: spans batch
    Tool-->>Collector: spans batch
    LLM2-->>Collector: spans batch
  end
  Collector->>Backend: OTLP push

Run the agent against a real query, then open any OTel-compatible backend. You will see the agent.run span as the root, the two chat spans as siblings under it with the tool span between them, and every attribute the conventions specify already populated. No custom dashboards, no glue code.

Three Debugging Stories Where The Conventions Earned Their Keep

Three production debugging stories from the last six months will tell you more about why the conventions matter than any spec excerpt. Each is from a real incident; numbers are real, identifiers are sanitised.

Story 1: The 47,000-Token Prompt That Was Hiding In Plain Sight

We had a logistics-company agent that paged me on a Saturday because a single conversation had spent $4,180 in nine hours retrying one completion every twelve seconds. Pre-OTel, I spent forty minutes grepping CloudWatch logs to find the offending call. Post-OTel, the trace told the story in one screen: the conversation root span had gen_ai.conversation.id=conv_8f3a2c1, and under it sat a chat span whose gen_ai.usage.input_tokens attribute was 47,212. The previous-call span on the same conversation had gen_ai.usage.input_tokens=1,840. Something between those two calls had grown the prompt by roughly 25x. Two clicks deeper, into the tool span between them, showed the tool had returned a 47-thousand-line CSV instead of an error.

The fix was a tool-result-size check, but the time-to-fix was the lesson. With the conventions, I wrote one PromQL alert that pages on gen_ai.usage.input_tokens > 30000 for a single span. That alert has fired three times since and saved us roughly $11,000 in runaway-loop costs, per our finance reconciliation in March 2026.
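The alert logic is simple enough to sketch offline against exported spans. The 30,000-token ceiling is the one from the alert above; the growth-factor check is an extra I am adding for illustration, and both the function name and the input shape are assumptions.

```python
def token_alerts(spans, ceiling=30_000, growth_factor=10.0):
    """Flag chat spans whose input tokens are too large or jumped suddenly.

    spans: iterable of (conversation_id, input_tokens) in chronological order.
    """
    last_seen = {}
    alerts = []
    for conversation_id, tokens in spans:
        previous = last_seen.get(conversation_id)
        if tokens > ceiling:
            alerts.append((conversation_id, tokens, "over_ceiling"))
        elif previous and tokens / previous >= growth_factor:
            # Hypothetical second rule: catch fast growth before the ceiling is hit.
            alerts.append((conversation_id, tokens, "sudden_growth"))
        last_seen[conversation_id] = tokens
    return alerts


if __name__ == "__main__":
    print(token_alerts([("conv_8f3a2c1", 1_840), ("conv_8f3a2c1", 47_212)]))
```

Running this over a day of spans is a cheap way to pick a ceiling that does not page on your legitimately long prompts.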

Story 2: The Cached Tokens Nobody Was Counting

Our finance team asked why our OpenAI bill had jumped 18 percent month-over-month while our request count was flat. Pre-OTel, this would have been a multi-day analytics project. Post-OTel, the answer was a single span query: gen_ai.usage.input_tokens was up 22 percent month-over-month, and gen_ai.usage.cache_read_input_tokens was zero. We had rolled out a prompt-caching change in the agent that broke cache hits because the system prompt now included a timestamp. The conventions had captured the cache-read attribute for every span, including the ones with zero cache reads, and the regression was visible in the first chart we built.

The lesson: the convention's optionality is treacherous. gen_ai.usage.cache_read_input_tokens is "yes when prompt caching" in the spec, which means it is absent when there is no caching activity. We changed our wrapper to always emit the attribute as zero when caching is in use but no hit occurred. That distinction, present-as-zero versus absent, gave us the alerting signal.
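The wrapper change amounts to a few lines. This is a sketch of the idea rather than our production wrapper: the function name and the caching_enabled flag are illustrative, and how you read the cached-token count from the provider response is left to the caller.

```python
CACHE_READ_ATTR = "gen_ai.usage.cache_read_input_tokens"


def cache_read_attribute(cached_tokens, caching_enabled: bool) -> dict:
    """Return the attribute to set on the span.

    Absent when caching is not in use; an explicit 0 when caching is in use
    but the provider reported no cache hit.
    """
    if not caching_enabled:
        return {}
    return {CACHE_READ_ATTR: int(cached_tokens or 0)}


if __name__ == "__main__":
    print(cache_read_attribute(1640, caching_enabled=True))
    print(cache_read_attribute(None, caching_enabled=True))
    print(cache_read_attribute(None, caching_enabled=False))
```

With the explicit zero in place, an alert on "caching enabled and cache reads at zero for an hour" becomes a plain attribute query.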

Story 3: The Tool Call That Looked Fine Until It Repeated

A customer-support agent was issuing duplicate Stripe charges intermittently. The cleanup post-mortem showed that under load, the planning LLM occasionally generated the same tool_call_id twice in two different chat completions, and the orchestrator did not de-duplicate. Pre-OTel, the duplicate was invisible at the application layer. Post-OTel, the conventions made it explicit: two tool spans with the same gen_ai.tool.call.id under the same conversation root meant a duplicate. We added a span processor that fired on duplicate tool-call IDs within a conversation window, and the bug surfaced within four hours.

Here is the decision flow for that span-processor logic.

flowchart TD
  A[New tool span emitted] --> B{tool.call.id already seen in this conversation?}
  B -- no --> C[Record, continue]
  B -- yes --> D{Same parent agent.run span?}
  D -- yes --> E[Legitimate retry within run, tag span as retry]
  D -- no --> F[Cross-run duplicate, page on-call]
  F --> G[Auto-disable downstream side-effecting tool]

The processor itself is fewer than fifty lines because the conventions did the structural work.
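
The core of that decision flow, stripped of the OTel plumbing, looks roughly like this. It is a sketch, not the processor we run: the class name is made up, the state is an unbounded in-memory dict where a real version would expire entries after the conversation window, and the paging and auto-disable actions are left to the caller.

```python
class ToolCallDeduper:
    """Classify tool spans by whether their tool-call id was already seen."""

    def __init__(self):
        # (conversation_id, tool_call_id) -> span id of the agent.run that first used it
        self._first_run = {}

    def classify(self, conversation_id: str, tool_call_id: str, run_span_id: str) -> str:
        key = (conversation_id, tool_call_id)
        first_run = self._first_run.get(key)
        if first_run is None:
            self._first_run[key] = run_span_id
            return "record"               # first sighting: record and continue
        if first_run == run_span_id:
            return "retry"                # same agent.run: legitimate retry
        return "cross_run_duplicate"      # different agent.run: page on-call


if __name__ == "__main__":
    dedupe = ToolCallDeduper()
    print(dedupe.classify("conv_1", "call_abc", "run_1"))
    print(dedupe.classify("conv_1", "call_abc", "run_1"))
    print(dedupe.classify("conv_1", "call_abc", "run_2"))
```

In a span processor you would call classify when a tool span ends, reading the three values from the span's attributes and its parent.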

OTel vs Vendor-Specific Instrumentation: When To Use What

A reasonable question is when to bother with OTel at all versus using a vendor SDK like Langfuse's native trace API or Anthropic's claude-trace package. Here is the practical decomposition.

| Situation | OTel GenAI conventions | Vendor SDK |
| --- | --- | --- |
| Single backend, single LLM provider, < 12 weeks horizon | Overkill | Faster to ship |
| Multi-backend or planning to evaluate multiple backends | Right tool | Lock-in |
| Multi-provider (OpenAI + Anthropic + self-hosted) | Right tool | Multiple SDKs to bridge |
| EU AI Act Article 14 compliance horizon | Right tool: schema is auditable | Vendor-specific schema is harder to audit |
| You need custom business attributes (conversation id, user id, tenant) | Right tool: attributes are first-class | Possible but vendor-specific |
| Long-term retention beyond vendor's default | Right tool: collector controls storage | Vendor controls retention |
| Sub-100-call-per-day prototype | Overkill | Vendor SDK |

The pattern I recommend is to start with OTel from day one if any of the following are true: you expect to run more than one LLM provider, you expect to evaluate more than one observability backend, you have any compliance horizon, or you have any business-attribute needs (conversation id, tenant id, user id). If none of those are true, the vendor SDK is fine and you can migrate later by wrapping their SDK output in OTel spans.

Comparison visual showing two stacks side by side: vendor-specific SDK with locked-in JSON shape on the left, OTel collector with three backends on the right

Production Considerations And Gotchas

Three operational gotchas have caused real outages in teams I have advised. Each is the kind of thing the spec mentions in passing but that bites you only at scale.

The first is span-batch backpressure. The default BatchSpanProcessor buffers spans in memory and exports a batch every 5 seconds or as soon as 512 spans are queued. At 4.2 million spans per week, which is what one of the reference deployments from our companion blog 166 runs at, the default queue size of 2,048 fills up under bursty load and the SDK starts dropping spans silently. Set OTEL_BSP_MAX_QUEUE_SIZE=8192 and OTEL_BSP_MAX_EXPORT_BATCH_SIZE=2048 and watch the otelcol_exporter_send_failed_spans metric on the collector. If it is non-zero, you are losing observability data.
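To see why the queue size matters more than the average rate, here is a deliberately simplified model of a bounded span queue. It assumes one export of at most one batch per tick, which is cruder than the real processor (which also exports as soon as a full batch is queued), and the burst sizes are invented for illustration. The function name `dropped_spans` is hypothetical.

```python
from typing import Iterable


def dropped_spans(bursts: Iterable[int], queue_size: int, batch_size: int) -> int:
    """Count spans dropped by a bounded queue under bursty arrivals.

    Toy model: each tick, a burst arrives, anything beyond the free
    queue capacity is dropped, then one batch is exported.
    """
    queued = 0
    dropped = 0
    for arriving in bursts:
        accepted = min(arriving, queue_size - queued)
        dropped += arriving - accepted
        queued += accepted
        queued -= min(queued, batch_size)  # one export per tick
    return dropped


# Invented burst pattern: two heavy ticks followed by a quiet one.
bursts = [3000, 3000, 500]
default_drops = dropped_spans(bursts, queue_size=2048, batch_size=512)
tuned_drops = dropped_spans(bursts, queue_size=8192, batch_size=2048)
print(default_drops, tuned_drops)  # prints "3440 0"
```

Under this model the default settings lose more than half of the 6,500 spans, while the larger queue absorbs both bursts. The exact numbers depend entirely on the assumed burst shape, so treat them as a reason to measure your own traffic, not as a sizing rule.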

The second is sampling. Tracing every LLM call costs storage. The conventions do not mandate a sampling rate; they assume you make that decision. The pattern I now use is "always sample failures and tool-call spans, head-sample 10 percent of clean inference spans, retain failures and tool-call spans for 90 days, head-sampled spans for 30 days". This is implementable as an OTel collector tail-sampling processor and keeps storage at roughly 18 percent of full-fidelity cost while preserving every audit-relevant span. EU AI Act compliance teams have signed off on this pattern for high-risk systems we have shipped.
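The policy itself is small enough to state as a function. The sketch below is only the decision logic, not collector configuration: `SpanSummary` and `sampling_decision` are hypothetical names, and the keep/drop choice for clean spans is made deterministic by hashing the trace id, which is one common way to keep every span of a trace on the same side of the decision.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpanSummary:
    """Minimal view of a finished span (hypothetical shape)."""
    trace_id: str
    is_failure: bool
    is_tool_call: bool


def sampling_decision(span: SpanSummary, clean_rate: float = 0.10) -> Tuple[bool, int]:
    """Return (keep, retention_days) for one span."""
    if span.is_failure or span.is_tool_call:
        # Audit-relevant spans are always kept, with long retention.
        return True, 90
    # Map the trace id to a stable value in [0, 1) so the same trace
    # always gets the same answer.
    digest = hashlib.sha256(span.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < clean_rate:
        return True, 30
    return False, 0
```

Because failures are only known once a span has ended, this decision has to run after the fact, which is why the rule set lives in a tail-sampling stage and not in the SDK's head sampler.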

The third is PII. The conventions do not say "store the prompt." Many teams add gen_ai.prompt as an attribute and then realise three months later that they are storing customer PII in their observability backend. The right pattern is to store a hash of the prompt as gen_ai.prompt.hash, store the actual prompt in a PII-aware store with a pointer attribute gen_ai.prompt.ref, and only resolve the pointer when an authorised investigator looks at the trace. This satisfies both Article 14 traceability and GDPR data-minimisation.
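A minimal sketch of the hash-plus-pointer pattern, using only the standard library. `PromptVault` is a hypothetical in-memory stand-in for whatever PII-aware store you use, the `vault://` reference format is invented, and the boolean authorisation flag stands in for a real access-control check.

```python
import hashlib
import uuid
from typing import Dict


class PromptVault:
    """Hypothetical in-memory stand-in for a PII-aware prompt store."""

    def __init__(self) -> None:
        self._prompts: Dict[str, str] = {}

    def put(self, prompt: str) -> str:
        # The reference is opaque and carries no prompt content.
        ref = f"vault://prompts/{uuid.uuid4()}"
        self._prompts[ref] = prompt
        return ref

    def resolve(self, ref: str, investigator_authorised: bool) -> str:
        if not investigator_authorised:
            raise PermissionError("prompt access requires an authorised investigator")
        return self._prompts[ref]


def prompt_span_attributes(prompt: str, vault: PromptVault) -> Dict[str, str]:
    """Attributes safe to attach to a span: a hash and a pointer, never the text."""
    return {
        "gen_ai.prompt.hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "gen_ai.prompt.ref": vault.put(prompt),
    }
```

One caveat on the design: a plain SHA-256 of a short or predictable prompt can be recovered by guessing, so a keyed hash (for example HMAC with a secret held outside the observability backend) may be a better fit if prompts are low-entropy.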

Here is the rollout timeline I now recommend for a team migrating from homegrown logging to OTel GenAI conventions.

gantt
    title OTel GenAI rollout (12 weeks)
    dateFormat YYYY-MM-DD
    section Week 1-2
    SDK install + smoke test :a1, 2026-04-29, 14d
    section Week 3-4
    Wrap chat spans :a2, after a1, 14d
    Wrap tool spans :a3, after a1, 14d
    section Week 5-6
    Conversation-id business attribute :a4, after a2, 14d
    PII strategy + hash plan :a5, after a3, 14d
    section Week 7-8
    Tail-sampling rollout :a6, after a4, 14d
    section Week 9-10
    Backend evaluation (Langfuse/Phoenix/Splunk) :a7, after a6, 14d
    section Week 11-12
    Decommission homegrown logger :a8, after a7, 14d
    EU AI Act audit dry-run :a9, after a7, 14d

The overall envelope is twelve weeks of focused work for an agent platform handling millions of spans per week. Smaller deployments compress this to four to six weeks. The longest-tail item is always the PII story, because it requires legal review.

Closing The Loop: What To Build Next Week

If you take one thing from this post, take this: instrument every LLM and tool call with OTel GenAI conventions, and add one business attribute, gen_ai.conversation.id, that lets you join spans into runs. Everything else can be added incrementally. Cost attribution, EU AI Act traceability, multi-backend portability, and the kind of debugging speed that turns a Saturday-morning incident from a four-hour CloudWatch dig into a four-minute span query, all flow from that minimum.

The full reference agent code, including the duplicate-tool-call processor and the tail-sampling collector configuration, is at github.com/amtocbot-droid/amtocbot-examples/tree/main/otel-genai-agent. Clone it, swap in your LLM provider, point the OTLP exporter at any compatible backend, and you will have a production-shaped trace topology in roughly an hour.

The agents that get cheaper, more compliant, and more debuggable in 2026 are not the ones with the most clever wrapper layer. They are the ones whose spans look the same as everyone else's spans, because the conventions did the work. Start there.

Sources

  1. OpenTelemetry GenAI Semantic Conventions: opentelemetry.io/docs/specs/semconv/gen-ai/
  2. OpenTelemetry Python Instrumentation for OpenAI v2: github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2
  3. EU AI Act Article 14, Human Oversight (Regulation (EU) 2024/1689): eur-lex.europa.eu/eli/reg/2024/1689/oj
  4. Anthropic Engineering Blog, Production Agent Observability Patterns (January 2026): anthropic.com/engineering
  5. Langfuse OpenTelemetry Integration Guide: langfuse.com/docs/opentelemetry/get-started
  6. Arize Phoenix OTel Tracing Reference: docs.arize.com/phoenix/tracing/llm-traces
  7. AmtocSoft companion blog 166, AI Observability Stack 2026: amtocsoft.blogspot.com

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

Published: 2026-04-29 · Written with AI assistance, reviewed by Toc Am.
