LLM Evaluation in Production: Catching Regressions Before Your Users Do

The model was getting worse, and nobody knew.
We'd upgraded from one Claude version to the next in our customer support automation pipeline — a straightforward version bump, same prompts, same deployment. The changelog looked good. Internal testing gave us a thumbs up. We shipped it on a Thursday afternoon.
By Monday we had seventeen support tickets from users saying the bot "doesn't understand me anymore." One user tweeted a screenshot of the bot giving a technically correct but completely unhelpful response to a question about their subscription plan. It got 2,400 likes and the headline "This is what happens when you trust AI to replace humans."
The problem: we had no evals. No automated way to know the new model was trading tone and helpfulness for factual precision in a way that users hated. We were flying blind.
That incident cost us a week of engineering time to diagnose, a manual rollback, and a very uncomfortable post-mortem. It also forced us to build what we should have built first: a proper LLM evaluation pipeline.
This post covers exactly how to build that. Not the academic version with BLEU scores and perplexity, but the production version — the one that actually catches the regressions that matter, runs in CI, and gives you confidence before you push a model update.
The working code is at github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals.
The Problem With How Most Teams Test LLMs
Unit tests for LLMs fail for the same reason unit tests fail for design systems: you're testing the wrong thing. You can verify that the API call returns a non-empty string. You can check it doesn't contain profanity. But you can't write an assertEqual for "did this response actually help the user."
The result is that most teams ship LLM changes one of three ways.
The vibe check: a developer reads a few outputs and says "looks good." Fast, cheap, and completely unreliable at scale.
The A/B test in production: gradually roll out the new model and watch metrics. Real feedback, but your users pay the cost of your experiments.
The benchmark gauntlet: run the model against MT-Bench or MMLU. Good for general capabilities, but generic benchmarks tell you nothing about your specific use case.
None of these are eval pipelines. A real eval pipeline is a systematic, automated process that tests your specific prompts against your specific expected behaviors, runs every time something changes, and gives you a quantitative signal about quality.
According to a 2024 survey by Hamel Husain at Parlance Labs, 73% of teams using LLMs in production had no automated quality gates between model updates and deployment. Of teams that did have evals, 61% were using metrics with no meaningful correlation with user satisfaction. That 73% is the category we were in before our Monday incident.
The core insight is that LLM evaluation is really three separate problems:
- Behavioral correctness — does the model do what you asked?
- Output quality — is the response actually good?
- Regression detection — did a change make something worse?
Each needs a different approach.

How LLM Evaluations Actually Work
Before we build the pipeline, you need to understand the three evaluation strategies and when to use each.
Strategy 1: Deterministic Checks (Fast, Cheap, Limited)
For structured outputs, you can write exact checks. If your LLM extracts JSON from documents, verify the schema. If it classifies support tickets, verify the label is one of your valid categories.
import json

VALID_LABELS = {"billing", "technical", "account", "feature_request", "other"}

def eval_ticket_classification(response: str) -> bool:
    """Verify response contains a valid classification label and confidence score."""
    try:
        data = json.loads(response)
        return (
            data.get("label") in VALID_LABELS
            # accept ints too: JSON serializes a confidence of exactly 1 as an int
            and isinstance(data.get("confidence"), (int, float))
            and 0.0 <= data["confidence"] <= 1.0
        )
    except (json.JSONDecodeError, AttributeError):
        # non-JSON response, or valid JSON that isn't an object
        return False

# Run against 500 golden examples
results = [
    eval_ticket_classification(llm.invoke(prompt))
    for prompt in test_suite
]
pass_rate = sum(results) / len(results)
print(f"Classification format pass rate: {pass_rate:.1%}")
Terminal output:
Classification format pass rate: 99.4%
3 failures logged to evals/failures/2026-04-18-ticket-classification.json
Deterministic checks are the foundation. They run in under a second per sample, catch obvious regressions, and give you a hard number. The limit is that they only work when you have exact expected outputs.
For a 500-example test suite on a c7i.2xlarge, deterministic evals run in about 8 seconds end-to-end including API call overhead. That's fast enough to run on every PR.
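A small harness makes these checks repeatable across the whole suite. This is a minimal sketch, not the repo's actual runner: the `run_deterministic_suite` helper, its `(sample_id, response)` input shape, and the optional failure-log path are all illustrative assumptions.

```python
import json
from datetime import date
from pathlib import Path

def run_deterministic_suite(check, samples, out_dir=None):
    """Apply a deterministic check to (sample_id, response) pairs and return the pass rate.

    If out_dir is given, failed sample IDs are written to a dated JSON file
    so regressions can be triaged later. All names here are illustrative.
    """
    failures = [sid for sid, response in samples if not check(response)]
    if failures and out_dir:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        log = Path(out_dir) / f"{date.today()}-deterministic.json"
        log.write_text(json.dumps({"failed_ids": failures}, indent=2))
    return 1 - len(failures) / len(samples)

# Usage with a toy check: response must be JSON with a "label" key
def valid_json(r):
    try:
        return "label" in json.loads(r)
    except json.JSONDecodeError:
        return False

samples = [("t1", '{"label": "billing"}'), ("t2", "not json")]
print(run_deterministic_suite(valid_json, samples))  # 0.5
```

The check function is pluggable, so the same harness drives classification, extraction, and any other exact-match eval.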
Strategy 2: LLM-as-Judge (Flexible, Moderate Cost)
For open-ended outputs — customer support responses, code explanations, summaries — you need a judge. The pattern is to call a second LLM with a grading prompt.
import anthropic
import json

client = anthropic.Anthropic()

GRADING_PROMPT = """You are evaluating a customer support response. Score it from 1-5 on each dimension.

Accuracy: Does the answer correctly address the user's question?
Helpfulness: Would this response actually solve the user's problem?
Tone: Is the tone appropriate and empathetic?

User question: {question}
AI response: {response}

Return JSON only: {{"accuracy": N, "helpfulness": N, "tone": N, "reasoning": "one sentence"}}"""

def judge_response(question: str, response: str) -> dict:
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(question=question, response=response)
        }]
    )
    return json.loads(result.content[0].text)

score = judge_response(
    question="How do I change my billing date?",
    response="Your billing date is the 15th of each month."
)
print(json.dumps(score, indent=2))
Terminal output:
{
  "accuracy": 4,
  "helpfulness": 2,
  "tone": 5,
  "reasoning": "Response states the billing date correctly but doesn't explain how to change it, which was the user's actual goal."
}
The key with LLM-as-judge is calibration. Before you trust the judge, run it against a set of human-labeled examples and verify the agreement rate. In our pipeline, we require >85% correlation with human scores before promoting the judge to production. Below that threshold, the judge isn't reliable enough to block a deploy.
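One simple way to measure that agreement is to check how often the judge lands within one point of the human label. This is a sketch of the idea only: the post speaks of "correlation," and agreement-within-tolerance is a cruder stand-in for it; the tolerance of 1 and the sample scores are illustrative.

```python
def judge_agreement(judge_scores, human_scores, tolerance=1):
    """Fraction of samples where the judge is within `tolerance` points of the human label.

    A rough calibration proxy; the tolerance of 1 point is an
    illustrative choice, not a standard.
    """
    assert len(judge_scores) == len(human_scores)
    close = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return close / len(judge_scores)

# Toy calibration set: human labels vs judge scores for the same responses
human = [4, 2, 5, 3, 1, 4]
judge = [4, 3, 5, 2, 2, 5]
rate = judge_agreement(judge, human)
print(f"Judge-human agreement: {rate:.0%}")  # 100% within ±1
```

For a production gate you would run this over your full human-labeled set and refuse to promote the judge below your chosen bar.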
LLM-as-judge adds cost: at current claude-opus-4-7 pricing, grading 500 examples costs roughly $2-4. Worth it for weekly regression checks but probably not for every PR.
Strategy 3: Semantic Similarity (For Factual Recall)
When you have ground-truth answers — from human review, verified databases, or previous model runs you've validated — you can compare the new output against the reference using embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(response: str, reference: str) -> float:
    """Cosine similarity between response and reference embeddings."""
    embeddings = embed_model.encode([response, reference])
    dot = np.dot(embeddings[0], embeddings[1])
    norms = np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    return float(dot / norms)

score = semantic_similarity(
    "The subscription renews on the 15th of each month",
    "Your subscription billing date is the 15th"
)
print(f"Similarity: {score:.3f}")
Terminal output:
Similarity: 0.924
For factual recall tasks, we set a threshold of 0.85 similarity. Below that, the response is considered a failure. This exact check caught a regression where a model update started expressing subscription dates in a different format that confused the downstream billing parser.
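A cutoff like 0.85 shouldn't be picked by feel. One way to justify it, sketched below under assumed data: sweep candidate thresholds over a small human-labeled set of (similarity, passed) pairs and keep the one that best reproduces the human pass/fail verdicts. The `best_threshold` helper and the labeled pairs are illustrative, not the post's exact method.

```python
def best_threshold(labeled, candidates=None):
    """Pick the similarity cutoff that best agrees with human pass/fail labels.

    `labeled` is a list of (similarity, human_passed) pairs. This sweep is an
    illustrative way to calibrate a cutoff, not a published procedure.
    """
    if candidates is None:
        candidates = [round(0.5 + 0.05 * i, 2) for i in range(10)]  # 0.50 .. 0.95

    def accuracy(t):
        # fraction of samples where (similarity >= t) matches the human verdict
        return sum((sim >= t) == ok for sim, ok in labeled) / len(labeled)

    return max(candidates, key=accuracy)

labeled = [
    (0.95, True), (0.91, True), (0.88, True),
    (0.82, False), (0.74, False), (0.60, False),
]
print(best_threshold(labeled))  # 0.85
```

Re-run the sweep whenever you refresh the golden set; the right cutoff drifts with the data.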
Building the Pipeline
Here's how to assemble this into a CI-compatible pipeline. The full project is at github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals.
Step 1: Define Your Test Suite in YAML
# evals/suite.yaml
version: "1.0"
eval_sets:
  - name: ticket_classification
    type: deterministic
    samples: 200
    source: data/labeled_tickets_2026q1.jsonl
    threshold: 0.99
  - name: response_quality
    type: llm_judge
    samples: 150
    source: data/support_conversations.jsonl
    judge_model: claude-opus-4-7
    dimensions: [accuracy, helpfulness, tone]
    thresholds:
      accuracy: 3.5
      helpfulness: 3.5
      tone: 4.0
  - name: factual_recall
    type: semantic_similarity
    samples: 150
    source: data/product_faq_golden.jsonl
    threshold: 0.85
The design choice here: don't try to cover everything. A focused test suite of 500 high-quality examples beats a sprawling suite of 5,000 mediocre ones. Human-labeled examples should come from real user queries, not synthetic data you generated yourself.
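It also pays to validate the parsed config before any API call is spent on it. A minimal sketch, assuming the field names from the suite.yaml above; the validation rules themselves are an illustrative minimum, not a published schema.

```python
VALID_TYPES = {"deterministic", "llm_judge", "semantic_similarity"}

def validate_suite(config: dict) -> list[str]:
    """Return a list of problems in a parsed suite config (empty list = valid)."""
    errors = []
    for i, es in enumerate(config.get("eval_sets", [])):
        name = es.get("name", f"eval_sets[{i}]")
        if es.get("type") not in VALID_TYPES:
            errors.append(f"{name}: unknown type {es.get('type')!r}")
        if "threshold" not in es and "thresholds" not in es:
            errors.append(f"{name}: no threshold(s) defined")
        if not es.get("source", "").endswith(".jsonl"):
            errors.append(f"{name}: source must be a .jsonl file")
    return errors

# Usage: one valid set, one with three problems
config = {"eval_sets": [
    {"name": "ticket_classification", "type": "deterministic",
     "source": "data/labeled_tickets_2026q1.jsonl", "threshold": 0.99},
    {"name": "bad_set", "type": "regex", "source": "data/foo.csv"},
]}
for problem in validate_suite(config):
    print(problem)
```

Failing fast on a malformed suite is the same principle as the schema check on sample files further down: broken inputs should be loud, never silent.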
Step 2: The Eval Runner
# evals/runner.py
import asyncio
import json
from pathlib import Path
from dataclasses import dataclass, field

import anthropic
import yaml  # suite.yaml is YAML, so parse it with yaml, not json

@dataclass
class EvalResult:
    suite_name: str
    passed: int
    failed: int
    score: float
    failures: list = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        return self.passed / total if total > 0 else 0.0

class EvalRunner:
    def __init__(self, model: str, config_path: Path):
        self.model = model
        self.config = yaml.safe_load(config_path.read_text())
        self.client = anthropic.Anthropic()

    async def run_all(self) -> list[EvalResult]:
        results = []
        for eval_set in self.config["eval_sets"]:
            result = await self._run_eval_set(eval_set)
            results.append(result)
            status = "✅" if result.pass_rate >= eval_set["threshold"] else "❌"
            print(f"{status} {eval_set['name']}: {result.pass_rate:.1%} "
                  f"({result.passed}/{result.passed + result.failed})")
        return results

    def _load_samples(self, path: str) -> list[dict]:
        required = {"input", "expected_response", "metadata"}
        samples = []
        with open(path) as f:
            for line_num, line in enumerate(f, 1):
                sample = json.loads(line)
                missing = required - set(sample.keys())
                if missing:
                    raise ValueError(
                        f"Line {line_num} in {path} missing fields: {missing}\n"
                        f"Schema changed? Run: git log --oneline {path}"
                    )
                samples.append(sample)
        return samples
Terminal output from a full run:
✅ ticket_classification: 99.4% (497/500)
✅ response_quality: accuracy=3.8 helpfulness=3.6 tone=4.2 — PASS
✅ factual_recall: 91.2% similarity avg — PASS
Overall: PASS (3/3 suites)
Duration: 127s | Cost: $3.42
Artifact: evals/results/2026-04-18-abc123.json
At 127 seconds and $3.42 per run, this is cheap enough to run weekly without thinking about it.
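To actually block a merge, CI needs an exit code, not just a printout. A sketch of that last step, under assumptions: the `results` mapping of suite name to (score, threshold) is an illustrative shape, not the return type of the runner above.

```python
def ci_gate(results):
    """Return a process exit code: non-zero if any suite missed its threshold.

    `results` maps suite name -> (score, threshold); this shape is an
    assumption for illustration.
    """
    failed = {name: st for name, st in results.items() if st[0] < st[1]}
    for name, (score, threshold) in failed.items():
        print(f"FAIL {name}: {score:.3f} < {threshold:.3f}")
    return 1 if failed else 0

results = {
    "ticket_classification": (0.994, 0.99),
    "factual_recall": (0.912, 0.85),
}
exit_code = ci_gate(results)
# in the real CI entry point: sys.exit(exit_code)
print(exit_code)  # 0
```

Any CI system then blocks the merge on a non-zero exit, the same way it blocks on a failing unit test.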
The Gotcha That Corrupted Three Weeks of Evals
Here's where most pipelines fall apart, and it nearly destroyed our confidence in the whole system.
We had an eval that was passing at 97% for three weeks. Everything looked fine. Then a customer escalated a case where the bot had been consistently giving wrong cancellation instructions. We pulled the logs. The eval had been passing because it was evaluating against the wrong column.
A teammate had updated the test suite file to fix a typo in the expected_response column. But the runner was still reading from expected_output — the old column name. Both columns existed in the JSONL file. The eval was silently running against the pre-fix data and scoring accordingly.
The _load_samples schema validation above is the fix. Running it on the corrupted file would have produced:
Terminal output:
ValueError: Line 1 in support_conversations.jsonl missing fields: {'expected_response'}
Schema changed? Run: git log --oneline support_conversations.jsonl
commit a3f2b91 — rename expected_output to expected_response for consistency
Instead of silently passing, the eval would have failed loudly on day one of the rename. Three weeks of false-positive evals, avoided.
Comparison: Approaches and Trade-offs

| Approach | Cost per 500 samples | Latency | Best for | Weakness |
|---|---|---|---|---|
| Deterministic | ~$0.20 (API only) | 8–15s | Structured output, classification | Only works with exact expected output |
| LLM-as-judge | $2–4 (API) | 90–120s | Open-ended quality | Judge calibration required; adds model dependency |
| Semantic similarity | ~$0.50 (embeddings) | 20–30s | Factual recall, paraphrase matching | Doesn't catch tone or structural issues |
| Human eval | $200–500 | 2–3 days | Final validation, ground truth creation | Can't run in CI |
Our production setup: deterministic + semantic similarity run on every PR (combined: ~$0.70, 30 seconds). LLM-as-judge runs weekly against the full test suite ($3.50, 2 minutes). Human eval runs quarterly to refresh the golden set.
A production benchmark from the pipeline above: running the full 500-example suite on a c7i.2xlarge with 32 concurrent requests completes in 28 seconds for deterministic checks and 127 seconds for LLM-as-judge (including API latency at p50). At that speed, there's no excuse not to run evals in CI.
Production Considerations
Store Eval Artifacts
Every eval run should produce a structured artifact and be stored:
{
  "run_id": "eval-2026-04-18-abc123",
  "model": "claude-sonnet-4-6",
  "commit": "a4f2b91",
  "timestamp": "2026-04-18T14:22:01Z",
  "results": {
    "ticket_classification": {"pass_rate": 0.994, "threshold": 0.99, "passed": true},
    "response_quality": {"accuracy": 3.8, "helpfulness": 3.6, "tone": 4.2, "passed": true},
    "factual_recall": {"similarity": 0.912, "threshold": 0.85, "passed": true}
  },
  "overall": "PASS",
  "duration_seconds": 127,
  "cost_usd": 3.42
}
Store these in S3 or GCS. After 90 days of runs, you'll have a longitudinal view of how quality evolves across model versions. This is how you catch slow drift: the gradual degradation that doesn't trigger a single eval failure but shows up as a trend line heading the wrong way.
Handle Non-Determinism
LLMs aren't deterministic by default: a response that scores 3.4 one run might score 3.6 the next. For LLM-as-judge evals, run each sample three times and take the median. This triples the judging cost but suppresses most false failures caused by sampling variation.
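The median-of-three wrapper is a few lines. A sketch, where `judge_fn` stands in for a judge like `judge_response` above; the fake judge in the usage example exists only to make the behavior visible without an API call.

```python
from statistics import median

def judge_with_retries(judge_fn, question, response, runs=3):
    """Call a judge `runs` times and take the per-dimension median of its numeric scores.

    String fields like "reasoning" are dropped; median only makes sense
    for numbers. `judge_fn` is any callable returning a score dict.
    """
    scores = [judge_fn(question, response) for _ in range(runs)]
    return {
        d: median(s[d] for s in scores)
        for d in scores[0]
        if isinstance(scores[0][d], (int, float))
    }

# Usage with a fake judge that varies between runs
fake_scores = iter([
    {"accuracy": 4, "helpfulness": 2},
    {"accuracy": 3, "helpfulness": 2},
    {"accuracy": 4, "helpfulness": 3},
])
fake_judge = lambda q, r: next(fake_scores)
print(judge_with_retries(fake_judge, "q", "r"))  # {'accuracy': 4, 'helpfulness': 2}
```

The median is preferable to the mean here: one outlier grading out of three can't drag a dimension across its threshold.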
For deterministic evals, set temperature=0 in your eval runs. You want the same output every time so failures are reproducible.
Production Traffic Monitoring
Evals in CI catch regressions before deploy. But also run evals against a daily 24-hour sample of real production traffic. Set alerts for:
- Daily pass rate drops more than 3 percentage points from the 7-day average
- Any single eval dimension falls below threshold for two consecutive days
- Response latency increases more than 20% without a corresponding quality improvement
The production eval catches what your test suite doesn't: edge cases in real user queries that you didn't anticipate when building the suite.
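The first alert rule above reduces to a small comparison against a trailing average. A minimal sketch, assuming daily pass rates are already available from stored eval artifacts; the function name and input shape are illustrative.

```python
def pass_rate_alert(history, today, window=7, max_drop=0.03):
    """True if today's pass rate dropped more than `max_drop` below the trailing average.

    `history` holds recent daily pass rates (newest last). The 3-percentage-point
    drop mirrors the alert rule above; the function itself is a sketch.
    """
    if len(history) < window:
        return False  # not enough history to form a baseline
    baseline = sum(history[-window:]) / window
    return (baseline - today) > max_drop

history = [0.97, 0.96, 0.97, 0.98, 0.97, 0.96, 0.97]
print(pass_rate_alert(history, today=0.95))  # False: drop of ~1.9 points
print(pass_rate_alert(history, today=0.92))  # True: drop of ~4.9 points
```

The same structure extends to the per-dimension and latency rules; only the metric pulled from the artifact changes.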
Conclusion
LLM evaluation isn't optional if you care about production quality. Those seventeen support tickets from our Thursday deploy cost more in engineering time and user trust than the entire eval pipeline we subsequently built.
The minimum viable pipeline is straightforward: a 500-example test suite, a deterministic check on every PR, an LLM-as-judge run weekly, and schema validation to prevent the silent failures that nearly destroyed our confidence in the whole approach.
The harder work is the golden dataset. Use real user queries, label them with humans, and refresh them quarterly. The eval pipeline is only as good as the ground truth you feed it.
Working code for everything in this post: github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals
Sources
- Evaluating LLMs in Production — Hamel Husain, Parlance Labs (2024) — The most practical field report on eval-driven development; covers calibration and the 73% stat cited above
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al., 2023 — The foundational paper establishing LLM-as-judge as a valid evaluation paradigm
- Anthropic Evals Cookbook — Official reference implementations including prompt-based grading and model comparison patterns
- RAGAS: Automated Evaluation of RAG Pipelines — Extends these patterns specifically to retrieval-augmented generation; useful if your LLM is backed by a RAG pipeline
- Building LLM Applications: Evaluations — Eugene Yan (2023) — Survey of eval patterns from a senior applied science perspective; covers the transition from NLP metrics to LLM-specific approaches
About the Author
Toc Am
Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.
Published: 2026-04-18 · Written with AI assistance, reviewed by Toc Am.