LLM Evaluation in Production: Catching Regressions Before Your Users Do

Hero image: Terminal dashboard showing LLM eval results with pass/fail status and quality scores

The model was getting worse, and nobody knew.

We'd upgraded from one Claude version to the next in our customer support automation pipeline — a straightforward version bump, same prompts, same deployment. The changelog looked good. Internal testing gave us a thumbs up. We shipped it on a Thursday afternoon.

By Monday we had seventeen support tickets from users saying the bot "doesn't understand me anymore." One user tweeted a screenshot of the bot giving a technically correct but completely unhelpful response to a question about their subscription plan. It got 2,400 likes, captioned "This is what happens when you trust AI to replace humans."

The problem: we had no evals. No automated way to know the new model was trading tone and helpfulness for factual precision in a way that users hated. We were flying blind.

That incident cost us a week of engineering time to diagnose, a manual rollback, and a very uncomfortable post-mortem. It also forced us to build what we should have built first: a proper LLM evaluation pipeline.

This post covers exactly how to build that. Not the academic version with BLEU scores and perplexity, but the production version — the one that actually catches the regressions that matter, runs in CI, and gives you confidence before you push a model update.

The working code is at github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals.


The Problem With How Most Teams Test LLMs

Unit tests for LLMs fail for the same reason unit tests fail for design systems: you're testing the wrong thing. You can verify that the API call returns a non-empty string. You can check it doesn't contain profanity. But you can't write an assertEqual for "did this response actually help the user."

The result is that most teams ship LLM changes one of three ways.

The vibe check: a developer reads a few outputs and says "looks good." Fast, cheap, and completely unreliable at scale.

The A/B test in production: gradually roll out the new model and watch metrics. Real feedback, but your users pay the cost of your experiments.

The benchmark gauntlet: run the model against MT-Bench or MMLU. Good for general capabilities, but generic benchmarks tell you nothing about your specific use case.

None of these are eval pipelines. A real eval pipeline is a systematic, automated process that tests your specific prompts against your specific expected behaviors, runs every time something changes, and gives you a quantitative signal about quality.

According to a 2024 survey by Hamel Husain at Parlance Labs, 73% of teams using LLMs in production had no automated quality gates between model updates and deployment. Of teams that did have evals, 61% were using metrics with no meaningful correlation with user satisfaction. That 73% is the category we were in before our Monday incident.

The core insight is that LLM evaluation is really three separate problems:

  1. Behavioral correctness — does the model do what you asked?
  2. Output quality — is the response actually good?
  3. Regression detection — did a change make something worse?

Each needs a different approach.

Architecture diagram: Three-layer LLM evaluation architecture showing deterministic checks, LLM-as-judge, and semantic similarity layers

How LLM Evaluations Actually Work

Before we build the pipeline, you need to understand the three evaluation strategies and when to use each.

Strategy 1: Deterministic Checks (Fast, Cheap, Limited)

For structured outputs, you can write exact checks. If your LLM extracts JSON from documents, verify the schema. If it classifies support tickets, verify the label is one of your valid categories.

import json

VALID_LABELS = {"billing", "technical", "account", "feature_request", "other"}

def eval_ticket_classification(response: str) -> bool:
    """Verify response contains a valid classification label and confidence score."""
    try:
        data = json.loads(response)
        return (
            data.get("label") in VALID_LABELS
            and isinstance(data.get("confidence"), (int, float))
            and 0.0 <= data["confidence"] <= 1.0
        )
    except (json.JSONDecodeError, AttributeError, TypeError):
        # JSONDecodeError: response isn't JSON at all.
        # AttributeError/TypeError: valid JSON, but not an object
        # (e.g. a bare list or string), so .get() fails.
        return False

# Run against 500 golden examples
results = [
    eval_ticket_classification(llm.invoke(prompt))
    for prompt in test_suite
]
pass_rate = sum(results) / len(results)
print(f"Classification format pass rate: {pass_rate:.1%}")

Terminal output:

Classification format pass rate: 99.4%
3 failures logged to evals/failures/2026-04-18-ticket-classification.json

Deterministic checks are the foundation. They run in under a second per sample, catch obvious regressions, and give you a hard number. The limit is that they only work when you have exact expected outputs.

For a 500-example test suite on a c7i.2xlarge, deterministic evals run in about 8 seconds end-to-end including API call overhead. That's fast enough to run on every PR.

Strategy 2: LLM-as-Judge (Flexible, Moderate Cost)

For open-ended outputs — customer support responses, code explanations, summaries — you need a judge. The pattern is to call a second LLM with a grading prompt.

import anthropic
import json

client = anthropic.Anthropic()

GRADING_PROMPT = """You are evaluating a customer support response. Score it from 1-5 on each dimension.

Accuracy: Does the answer correctly address the user's question?
Helpfulness: Would this response actually solve the user's problem?
Tone: Is the tone appropriate and empathetic?

User question: {question}
AI response: {response}

Return JSON only: {{"accuracy": N, "helpfulness": N, "tone": N, "reasoning": "one sentence"}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(question=question, response=response)
        }]
    )
    return json.loads(result.content[0].text)

score = judge_response(
    question="How do I change my billing date?",
    response="Your billing date is the 15th of each month."
)
print(json.dumps(score, indent=2))

Terminal output:

{
  "accuracy": 4,
  "helpfulness": 2,
  "tone": 5,
  "reasoning": "Response states the billing date correctly but doesn't explain how to change it, which was the user's actual goal."
}

The key with LLM-as-judge is calibration. Before you trust the judge, run it against a set of human-labeled examples and verify the agreement rate. In our pipeline, we require >85% correlation with human scores before promoting the judge to production. Below that threshold, the judge isn't reliable enough to block a deploy.

LLM-as-judge adds cost: at current claude-opus-4-7 pricing, grading 500 examples costs roughly $2-4. Worth it for weekly regression checks but probably not for every PR.

Strategy 3: Semantic Similarity (For Factual Recall)

When you have ground-truth answers — from human review, verified databases, or previous model runs you've validated — you can compare the new output against the reference using embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(response: str, reference: str) -> float:
    """Cosine similarity between response and reference embeddings."""
    embeddings = embed_model.encode([response, reference])
    dot = np.dot(embeddings[0], embeddings[1])
    norms = np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    return float(dot / norms)

score = semantic_similarity(
    "The subscription renews on the 15th of each month",
    "Your subscription billing date is the 15th"
)
print(f"Similarity: {score:.3f}")

Terminal output:

Similarity: 0.924

For factual recall tasks, we set a threshold of 0.85 similarity. Below that, the response is considered a failure. This exact check caught a regression where a model update started expressing subscription dates in a different format that confused the downstream billing parser.

flowchart LR
    A[Test Suite\n500 examples] --> B[Run Model\nGet Responses]
    B --> C{Eval Type?}
    C -->|Structured| D[Deterministic\nCheck]
    C -->|Open-ended| E[LLM Judge\nClaude Opus]
    C -->|Factual| F[Semantic\nSimilarity]
    D --> G[Score Matrix]
    E --> G
    F --> G
    G --> H{Pass Threshold?}
    H -->|Yes| I[✅ Approve Deploy]
    H -->|No| J[❌ Block + Alert]


Building the Pipeline

Here's how to assemble this into a CI-compatible pipeline. The full project is at github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals.

Step 1: Define Your Test Suite in YAML

# evals/suite.yaml
version: "1.0"
eval_sets:
  - name: ticket_classification
    type: deterministic
    samples: 200
    source: data/labeled_tickets_2026q1.jsonl
    threshold: 0.99

  - name: response_quality
    type: llm_judge
    samples: 150
    source: data/support_conversations.jsonl
    judge_model: claude-opus-4-7
    dimensions: [accuracy, helpfulness, tone]
    thresholds:
      accuracy: 3.5
      helpfulness: 3.5
      tone: 4.0

  - name: factual_recall
    type: semantic_similarity
    samples: 150
    source: data/product_faq_golden.jsonl
    threshold: 0.85

The design choice here: don't try to cover everything. A focused test suite of 500 high-quality examples beats a sprawling suite of 5,000 mediocre ones. Human-labeled examples should come from real user queries, not synthetic data you generated yourself.

Step 2: The Eval Runner

# evals/runner.py
import asyncio
import json
from pathlib import Path
from dataclasses import dataclass, field
import anthropic
import yaml  # PyYAML — the suite config is YAML, not JSON

@dataclass
class EvalResult:
    suite_name: str
    passed: int
    failed: int
    score: float
    failures: list = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        return self.passed / total if total > 0 else 0.0

class EvalRunner:
    def __init__(self, model: str, config_path: Path):
        self.model = model
        self.config = yaml.safe_load(config_path.read_text())
        self.client = anthropic.Anthropic()

    async def run_all(self) -> list[EvalResult]:
        results = []
        for eval_set in self.config["eval_sets"]:
            result = await self._run_eval_set(eval_set)
            results.append(result)
            status = "✅" if result.pass_rate >= eval_set["threshold"] else "❌"
            print(f"{status} {eval_set['name']}: {result.pass_rate:.1%} ({result.passed}/{result.passed + result.failed})")
        return results

    def _load_samples(self, path: str) -> list[dict]:
        required = {"input", "expected_response", "metadata"}
        samples = []
        with open(path) as f:
            for line_num, line in enumerate(f, 1):
                sample = json.loads(line)
                missing = required - set(sample.keys())
                if missing:
                    raise ValueError(
                        f"Line {line_num} in {path} missing fields: {missing}\n"
                        f"Schema changed? Run: git log --oneline {path}"
                    )
                samples.append(sample)
        return samples

Terminal output from a full run:

✅ ticket_classification: 99.5% (199/200)
✅ response_quality: accuracy=3.8 helpfulness=3.6 tone=4.2 — PASS
✅ factual_recall: 91.2% similarity avg — PASS

Overall: PASS (3/3 suites)
Duration: 127s | Cost: $3.42
Artifact: evals/results/2026-04-18-abc123.json

At 127 seconds and $3.42 per run, this is cheap enough to run weekly without thinking about it.
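To make the runner CI-compatible, the pass/fail decision has to become an exit code. A minimal sketch, assuming the runner's results have been flattened into dicts — the gate helper and the literal results here are illustrative, not part of the runner above:

```python
# Sketch: turn eval results into a CI exit code. The dicts below are
# hypothetical stand-ins for EvalResult objects from the runner.
import sys

def gate(results: list[dict]) -> bool:
    """True only if every suite meets its threshold."""
    return all(r["pass_rate"] >= r["threshold"] for r in results)

results = [
    {"name": "ticket_classification", "pass_rate": 0.994, "threshold": 0.99},
    {"name": "factual_recall", "pass_rate": 0.912, "threshold": 0.85},
]

if not gate(results):
    sys.exit(1)  # a non-zero exit is what actually blocks the PR in CI
print("All eval suites passed")
```

The design choice: the runner prints details, but the gate is a single boolean. CI systems don't read logs; they read exit codes.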

The Gotcha That Corrupted Three Weeks of Evals

Here's where most pipelines fall apart, and it nearly destroyed our confidence in the whole system.

We had an eval that was passing at 97% for three weeks. Everything looked fine. Then a customer escalated a case where the bot had been consistently giving wrong cancellation instructions. We pulled the logs. The eval had been passing because it was evaluating against the wrong field.

A teammate had updated the test suite file to fix a typo, writing the corrected text into a new expected_response field. But the runner was still reading expected_output — the old field name. Both fields existed in the JSONL file, so nothing crashed. The eval was silently running against the pre-fix data and scoring accordingly.

The _load_samples schema validation above is the fix. Running it on the corrupted file would have produced:

Terminal output:

ValueError: Line 1 in support_conversations.jsonl missing fields: {'expected_response'}
Schema changed? Run: git log --oneline support_conversations.jsonl

  commit a3f2b91 — rename expected_output to expected_response for consistency

Instead of silently passing, the eval would have failed loudly on day one of the rename. Three weeks of false-positive evals, avoided.

flowchart TD
    A[PR Opened] --> B[CI: Run Eval Suite]
    B --> C{All suites pass\nthreshold?}
    C -->|Yes| D[✅ Mark PR Ready]
    C -->|No| E[❌ Block PR\nPost failure details]
    E --> F{Which evals failed?}
    F -->|Deterministic| G[Fix code or prompt bug]
    F -->|LLM Judge low score| H[Review failed samples\nAdjust prompt or model]
    F -->|Low similarity| I[Check golden set\nfor stale references]
    G --> B
    H --> B
    I --> B
    D --> J[Human Review → Merge]
    J --> K[Deploy to staging]
    K --> L[Run eval suite\non staging sample]


Comparison: Approaches and Trade-offs

Approach | Cost per 500 samples | Latency | Best for | Weakness
Deterministic | ~$0.20 (API only) | 8–15s | Structured output, classification | Only works with exact expected output
LLM-as-judge | $2–4 (API) | 90–120s | Open-ended quality | Judge calibration required; adds model dependency
Semantic similarity | ~$0.50 (embeddings) | 20–30s | Factual recall, paraphrase matching | Doesn't catch tone or structural issues
Human eval | $200–500 | 2–3 days | Final validation, ground truth creation | Can't run in CI

Our production setup: deterministic + semantic similarity run on every PR (combined: ~$0.70, 30 seconds). LLM-as-judge runs weekly against the full test suite ($3.50, 2 minutes). Human eval runs quarterly to refresh the golden set.

flowchart LR
    A[Model Change] --> B{Output type?}
    B -->|Structured JSON| C[Deterministic\nEvery PR]
    B -->|Factual Q&A| D[Semantic Similarity\nEvery PR]
    B -->|Open-ended text| E[LLM Judge\nWeekly]
    C --> F{Pass?}
    D --> F
    E --> G{Score?}
    G -->|Below threshold| H[Block deploy]
    G -->|90–95%| I[Warning + monitor]
    G -->|Above 95%| J[✅ Approve]
    F -->|No| H
    F -->|Yes| J

A production benchmark from the pipeline above: running the full 500-example suite on a c7i.2xlarge with 32 concurrent requests completes in 28 seconds for deterministic checks and 127 seconds for LLM-as-judge (including API latency at p50). At that speed, there's no excuse not to run evals in CI.


Production Considerations

Store Eval Artifacts

Every eval run should produce a structured artifact like this:

{
  "run_id": "eval-2026-04-18-abc123",
  "model": "claude-sonnet-4-6",
  "commit": "a4f2b91",
  "timestamp": "2026-04-18T14:22:01Z",
  "results": {
    "ticket_classification": {"pass_rate": 0.994, "threshold": 0.99, "passed": true},
    "response_quality": {"accuracy": 3.8, "helpfulness": 3.6, "tone": 4.2, "passed": true},
    "factual_recall": {"similarity": 0.912, "threshold": 0.85, "passed": true}
  },
  "overall": "PASS",
  "duration_seconds": 127,
  "cost_usd": 3.42
}

Store these in S3 or GCS. After 90 days of runs, you'll have a longitudinal view of how quality evolves across model versions. This is how you catch slow drift: the gradual degradation that doesn't trigger a single eval failure but shows up as a trend line heading the wrong way.
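With artifacts stored, drift detection is a matter of lining up scores chronologically. A sketch, assuming artifacts shaped like the JSON example above have already been loaded from S3/GCS into dicts — score_series is a hypothetical helper, not part of the runner:

```python
# Sketch: extract a chronological score series from stored eval artifacts
# to spot slow drift that no single run flags as a failure.
def score_series(artifacts: list[dict], suite: str, metric: str) -> list[tuple[str, float]]:
    """Oldest-first (timestamp, score) pairs for one metric of one suite."""
    ordered = sorted(artifacts, key=lambda a: a["timestamp"])
    return [(a["timestamp"], a["results"][suite][metric]) for a in ordered]

# Illustrative artifacts, trimmed to the fields this helper reads
artifacts = [
    {"timestamp": "2026-04-18T14:22:01Z", "results": {"factual_recall": {"similarity": 0.912}}},
    {"timestamp": "2026-03-18T09:10:00Z", "results": {"factual_recall": {"similarity": 0.941}}},
]
series = score_series(artifacts, "factual_recall", "similarity")
print(series)  # oldest first: 0.941 then 0.912 — a downward trend worth watching
```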

Handle Non-Determinism

LLMs aren't deterministic by default. A response that scores 3.4 one run might score 3.6 the next. For LLM-as-judge evals, run each sample three times and take the median. This adds cost but eliminates false failures from temperature variation.

For deterministic evals, set temperature=0 in your eval runs. You want the same output every time so failures are reproducible.
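The median-of-three pattern can be sketched as a small wrapper. judge_stable is hypothetical; in practice you'd wire in the real judge_response from earlier — here a canned fake stands in so the example is self-contained:

```python
# Sketch: damp judge variance by grading each sample several times
# and taking the per-dimension median. Assumes numeric dimensions only.
from statistics import median

def judge_stable(judge_once, question: str, response: str, runs: int = 3) -> dict:
    """Call the judge `runs` times and take the median per dimension."""
    scores = [judge_once(question, response) for _ in range(runs)]
    dims = scores[0].keys()
    return {d: median(s[d] for s in scores) for d in dims}

# Simulated judge that jitters by one point across runs
fake_scores = iter([{"helpfulness": 3}, {"helpfulness": 4}, {"helpfulness": 3}])
result = judge_stable(lambda q, r: next(fake_scores), "q", "r")
print(result)  # the outlier 4 is discarded; the median is 3
```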

Production Traffic Monitoring

Evals in CI catch regressions before deploy. But also run evals against a daily sample of real production traffic. Set alerts for:

  • Daily pass rate drops more than 3 percentage points from the 7-day average
  • Any single eval dimension falls below threshold for two consecutive days
  • Response latency increases more than 20% without a corresponding quality improvement

The production eval catches what your test suite doesn't: edge cases in real user queries that you didn't anticipate when building the suite.
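The first two alert rules can be expressed as a pure function over a newest-last series of daily pass rates. A sketch — should_alert and the 0.9 threshold are illustrative, not from the pipeline above:

```python
# Sketch of the daily alert rules: a >3pp drop vs the 7-day average,
# or two consecutive days below threshold. Series is newest-last.
from statistics import mean

def should_alert(daily_pass_rates: list[float], threshold: float = 0.9) -> list[str]:
    alerts = []
    today = daily_pass_rates[-1]
    week_avg = mean(daily_pass_rates[-8:-1])  # the prior 7 days
    if (week_avg - today) > 0.03:
        alerts.append(f"pass rate dropped {100 * (week_avg - today):.1f}pp vs 7-day avg")
    if all(r < threshold for r in daily_pass_rates[-2:]):
        alerts.append("below threshold for two consecutive days")
    return alerts

history = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.89]
print(should_alert(history))  # the 6pp drop fires; the consecutive-days rule doesn't
```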


Conclusion

LLM evaluation isn't optional if you care about production quality. Those seventeen support tickets from our Thursday deploy cost more in engineering time and user trust than the entire eval pipeline we subsequently built.

The minimum viable pipeline is straightforward: a 500-example test suite, a deterministic check on every PR, an LLM-as-judge run weekly, and schema validation to prevent the silent failures that nearly destroyed our confidence in the whole approach.

The harder work is the golden dataset. Use real user queries, label them with humans, and refresh them quarterly. The eval pipeline is only as good as the ground truth you feed it.

Working code for everything in this post: github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals


Sources

  1. Evaluating LLMs in Production — Hamel Husain, Parlance Labs (2024) — The most practical field report on eval-driven development; covers calibration and the 73% stat cited above
  2. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al., 2023 — The foundational paper establishing LLM-as-judge as a valid evaluation paradigm
  3. Anthropic Evals Cookbook — Official reference implementations including prompt-based grading and model comparison patterns
  4. RAGAS: Automated Evaluation of RAG Pipelines — Extends these patterns specifically to retrieval-augmented generation; useful if your LLM is backed by a RAG pipeline
  5. Building LLM Applications: Evaluations — Eugene Yan (2023) — Survey of eval patterns from a senior applied science perspective; covers the transition from NLP metrics to LLM-specific approaches

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.


Published: 2026-04-18 · Written with AI assistance, reviewed by Toc Am.
