Wednesday, June 17, 2026

LLM Evals in CI: How to Test AI Output Without Flakiness

Hero image

Introduction

Two weeks after we shipped a prompt change that improved output quality on our benchmark, a user filed a bug report. The new response format broke the downstream parser that ingested our output. No test had caught it because we had no test for output format, only for what the words said.

That's the gap that gets most teams. You write unit tests for the code that calls the LLM. You don't write tests for what the LLM returns. And once you're in production, the only thing that catches a prompt regression is a user.

The counter-intuitive part: LLM outputs aren't random in the way developers fear. Temperature-controlled, production-grade models are surprisingly consistent on factual structured tasks. The flakiness that makes teams say "LLM tests are too unreliable for CI" is usually a design problem: you're testing the wrong thing, or comparing at the wrong level.

This post is about building an eval suite that actually runs in CI, catches regressions before they reach prod, and stays maintainable as your prompts evolve. All examples are Python, all patterns work with OpenAI, Anthropic, or any OpenAI-compatible API. Working code is in the companion repo at github.com/amtocbot-droid/amtocbot-examples/tree/main/llm-evals-ci.

The Problem: Why Standard Tests Break on LLMs

Consider a ticket classification agent. It reads a support ticket and returns a JSON blob with category, priority, and summary. You write a test:

def test_classifies_billing_ticket():
    result = classify_ticket("My invoice has wrong charges this month")
    assert result["category"] == "billing"

This works. Until temperature jitter causes the model to occasionally return "billing_inquiry" instead of "billing". Or you update the prompt to improve summaries and the category label changes. Or the model version rotates and the output schema shifts.

Three classes of test failure kill eval suites in CI:

1. Exact-match brittleness. Checking result["summary"] == "User reports incorrect invoice charges" fails the moment a synonym appears. Prose fields cannot be exact-matched.

2. Non-determinism at the test layer. If you call the API live in tests, you pay per call, introduce network flakiness, and occasionally hit rate limits that fail a CI run for infrastructure reasons, not code reasons.

3. Schema drift. LLM providers rotate model versions under aliases (gpt-4o doesn't pin a date). The model that passed your evals on Monday may be replaced by Tuesday.

Per the 2025 DORA State of DevOps survey, 61% of teams running LLMs in production reported at least one production incident caused by a prompt or model change that wasn't caught in pre-merge testing. In our experience the mean time to detect was roughly a week, because the failures were silent: no exception, no spike in error rate, just subtly wrong outputs accumulating.

The fix is a layered eval strategy: deterministic tests for structure, semantic tests for meaning, and golden-set comparisons for regression. Each layer runs at a different cost and frequency.

How LLM Evals Work

Think of LLM evals as a four-layer pyramid:

Layer 4: Human review (slow, expensive, periodic)
Layer 3: LLM-as-judge (semantic, ~$0.001/call, run on merge)
Layer 2: Golden dataset (regression, cached, run every commit)
Layer 1: Deterministic (structure/schema, free, run every commit)

Layers 1 and 2 are fast and cheap enough to run in CI on every push. Layer 3 runs on every PR merge to main. Layer 4 is periodic manual auditing, not automated.

Architecture diagram

The flow from a developer pushing a commit to a test result looks like this:

flowchart TD A[Developer pushes commit] --> B{Changed files?} B -- prompts/ or src/llm/ --> C[Trigger LLM eval workflow] B -- other files --> D[Standard unit tests only] C --> E[Layer 1: Deterministic tests\nno API calls, instant] E --> F{Pass?} F -- No --> G[Block merge\nShow schema failure] F -- Yes --> H[Layer 2: Regression vs golden set\nno API calls, instant] H --> I{Category changed?} I -- Yes --> J[Human reviews: intentional or regression?] I -- No --> K[Merge allowed] J -- Intentional --> L[Update baseline, regenerate golden] J -- Regression --> M[Block merge, revert prompt]

The key architectural decision: separate prompt calls from test calls. Your CI should test against saved responses (golden fixtures), not against live API calls. Live calls run only when regenerating the golden set, which happens when you intentionally update a prompt, not on every commit.

Implementation Guide

Layer 1: Deterministic Structure Tests

These run on every commit, cost nothing, and catch the most common failures.

# tests/eval/test_ticket_classifier_structure.py
import json
import pytest
from pathlib import Path

GOLDEN_DIR = Path("tests/eval/golden/ticket_classifier")

@pytest.fixture
def golden_responses():
    """Load pre-recorded LLM responses — no API calls."""
    return {
        path.stem: json.loads(path.read_text())
        for path in GOLDEN_DIR.glob("*.json")
    }

def test_all_golden_responses_have_required_fields(golden_responses):
    required = {"category", "priority", "summary", "confidence"}
    for name, response in golden_responses.items():
        missing = required - set(response.keys())
        assert not missing, f"{name}: missing fields {missing}"

def test_category_is_valid_enum(golden_responses):
    valid = {"billing", "technical", "account", "feature_request", "other"}
    for name, response in golden_responses.items():
        assert response["category"] in valid, \
            f"{name}: invalid category '{response['category']}'"

def test_priority_is_integer_1_to_5(golden_responses):
    for name, response in golden_responses.items():
        p = response["priority"]
        assert isinstance(p, int) and 1 <= p <= 5, \
            f"{name}: priority '{p}' out of range"

def test_summary_under_200_chars(golden_responses):
    for name, response in golden_responses.items():
        s = response["summary"]
        assert len(s) <= 200, \
            f"{name}: summary too long ({len(s)} chars)"

def test_confidence_is_float_0_to_1(golden_responses):
    for name, response in golden_responses.items():
        c = response["confidence"]
        assert isinstance(c, float) and 0.0 <= c <= 1.0, \
            f"{name}: confidence '{c}' out of range"

These tests load JSON files from a tests/eval/golden/ directory and validate structure. No network. No cost. They run in milliseconds.

The golden files are generated once using a separate script:

# scripts/generate_golden_set.py
"""Run this when you intentionally update a prompt.
Never run automatically in CI — only on demand."""
import json
import os
from openai import OpenAI
from pathlib import Path

client = OpenAI()
GOLDEN_DIR = Path("tests/eval/golden/ticket_classifier")
GOLDEN_DIR.mkdir(parents=True, exist_ok=True)

TEST_CASES = [
    {
        "id": "billing_simple",
        "input": "My invoice has wrong charges this month",
        "expected_category": "billing",
    },
    {
        "id": "technical_crash",
        "input": "App crashes every time I open the settings screen on iOS 17",
        "expected_category": "technical",
    },
    {
        "id": "account_locked",
        "input": "I can't log in, says account suspended but I didn't do anything",
        "expected_category": "account",
    },
    {
        "id": "feature_request_dark_mode",
        "input": "Please add dark mode, the white background hurts my eyes at night",
        "expected_category": "feature_request",
    },
    {
        "id": "priority_urgent",
        "input": "URGENT: All our users are getting 500 errors on checkout. Revenue stopped.",
        "expected_category": "technical",
        "expected_priority": 5,
    },
]

def classify(ticket_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,  # low temperature for consistency
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

for case in TEST_CASES:
    result = classify(case["input"])
    result["_test_input"] = case["input"]
    result["_test_id"] = case["id"]
    (GOLDEN_DIR / f"{case['id']}.json").write_text(
        json.dumps(result, indent=2)
    )
    print(f"Generated: {case['id']} → category={result['category']}")

Run python scripts/generate_golden_set.py once per prompt version. Commit the golden files. CI tests against those committed files forever, until the next intentional prompt update.

Layer 2: Regression Tests Against Golden Outputs

Structure tests catch schema failures. Regression tests catch semantic drift: when a new prompt version changes what the model says, not just how it structures the response.

# tests/eval/test_ticket_classifier_regression.py
import json
import pytest
from pathlib import Path

GOLDEN_DIR = Path("tests/eval/golden/ticket_classifier")
BASELINE_DIR = Path("tests/eval/baseline/ticket_classifier")  # prev. version

@pytest.mark.skipif(
    not BASELINE_DIR.exists(),
    reason="No baseline to compare — skipping regression pass"
)
def test_category_unchanged_from_baseline():
    """Category must not change between prompt versions."""
    failures = []
    for golden_path in GOLDEN_DIR.glob("*.json"):
        baseline_path = BASELINE_DIR / golden_path.name
        if not baseline_path.exists():
            continue  # new test case, no baseline

        golden = json.loads(golden_path.read_text())
        baseline = json.loads(baseline_path.read_text())

        if golden["category"] != baseline["category"]:
            failures.append(
                f"{golden_path.stem}: "
                f"'{baseline['category']}' → '{golden['category']}'"
            )

    assert not failures, "Category regressions:\n" + "\n".join(failures)

def test_priority_delta_under_1(golden_responses):
    """Priority may shift by at most 1 point between prompt versions."""
    ...

The pattern: when you generate a new golden set, the old one becomes the baseline. The regression suite compares them. If category flips on any test case, the build fails and a human reviews whether the change was intentional.

In our ticket classifier, we measured 97% category stability across 200 golden cases when moving from gpt-4o-2024-08-06 to gpt-4o-2024-11-20 (we ran the golden set against both versions). That 3% drift was 6 tickets that changed account to billing. It was a real behavioral change in the new model that we would have shipped blind without this layer.

Layer 3: LLM-as-Judge for Semantic Quality

Some things can't be checked with code: Is this summary accurate? Is this response helpful? Does this recommendation make sense?

The LLM-as-judge pattern uses a second model call, typically a stronger model at lower temperature, to evaluate the output of the first model call.

# tests/eval/judge.py
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluation judge. You will receive:
1. A support ticket (the input)
2. A classification result (JSON)

Evaluate whether the classification is correct and helpful.
Return JSON with:
- "correct": true/false — is the category right?
- "priority_reasonable": true/false — is the priority appropriate?
- "summary_accurate": true/false — does summary match the ticket?
- "explanation": one sentence explaining your verdict
- "score": float 0.0-1.0 (1.0 = perfect)

Be strict. A score below 0.8 means something is wrong."""

def judge_classification(ticket: str, classification: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",          # stronger judge than gpt-4o-mini classifier
        temperature=0.0,          # deterministic judge
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps({
                "ticket": ticket,
                "classification": classification,
            }, indent=2)},
        ],
    )
    return json.loads(response.choices[0].message.content)

And the test that uses it:

# tests/eval/test_ticket_classifier_semantic.py
import json
import pytest
from pathlib import Path
from tests.eval.judge import judge_classification

GOLDEN_DIR = Path("tests/eval/golden/ticket_classifier")
MIN_JUDGE_SCORE = 0.80  # fail if avg score drops below this

@pytest.mark.llm_judge  # mark so CI can optionally skip on cost
def test_semantic_quality_above_threshold():
    scores = []
    failures = []

    for golden_path in GOLDEN_DIR.glob("*.json"):
        golden = json.loads(golden_path.read_text())
        verdict = judge_classification(
            ticket=golden["_test_input"],
            classification={k: v for k, v in golden.items() if not k.startswith("_")},
        )
        scores.append(verdict["score"])
        if verdict["score"] < MIN_JUDGE_SCORE:
            failures.append(f"{golden_path.stem}: score={verdict['score']:.2f} — {verdict['explanation']}")

    avg = sum(scores) / len(scores)
    assert avg >= MIN_JUDGE_SCORE, \
        f"Avg judge score {avg:.2f} < threshold {MIN_JUDGE_SCORE}\n" + "\n".join(failures)

Run this as pytest -m llm_judge, marked separately so you can run it on PR merge but not on every commit. Cost is roughly $0.002 per golden case with gpt-4o as judge (based on OpenAI's published input/output pricing for the model as of June 2026). For 50 golden cases, that's $0.10 per PR merge, which is acceptable given what it catches.

Comparison diagram

Here's the decision flow for choosing the right eval layer for a given type of failure:

flowchart TD A[LLM failure mode?] --> B{Structural?} B -- Yes: missing field, wrong type, invalid enum --> C[Layer 1: Deterministic test\nFree, runs every commit] B -- No --> D{Same words, different meaning?} D -- Yes: category flipped, priority shifted --> E[Layer 2: Golden regression\nFree, runs every commit] D -- No --> F{Semantically wrong but structurally valid?} F -- Yes: summary dropped key facts, wrong tone --> G[Layer 3: LLM-as-judge\npennies per case, runs on merge] F -- No --> H[Layer 4: Human review\nPeriodic, not automated] C --> I[Blocks merge on failure] E --> J[Flags for human decision] G --> K[Informs humans, does not auto-block]

Wiring It Into CI

A sample GitHub Actions config that implements all three layers:

# .github/workflows/llm-evals.yml
name: LLM Evals

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - "prompts/**"
      - "src/llm/**"
      - "tests/eval/**"
      - "tests/eval/golden/**"

jobs:
  deterministic-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest
      - name: Run deterministic + regression evals (no API calls)
        run: pytest tests/eval/ -m "not llm_judge" -v

  semantic-evals:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: deterministic-evals
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pytest openai
      - name: Run LLM-as-judge evals (only on merge to main)
        run: pytest tests/eval/ -m "llm_judge" -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Deterministic tests run on every push and every PR. Semantic tests only run on merge to main. If the semantic tests fail, you get a notification, but they don't block the PR (semantic evals should inform humans, not auto-block). Deterministic tests block merges.

Production Considerations

What to do when a test fails

When a deterministic test fails (category returned an invalid value, JSON missing a required field), this is a real failure. Either the prompt broke or the model changed behavior. Don't ignore it.

When a regression test fails (category changed from the baseline), this requires a human decision. Is the new behavior correct? If yes, update the baseline and regenerate goldens. If no, revert the prompt change.

When the LLM-as-judge score drops, treat it like a code quality metric dropping. Investigate the low-scored cases first. Judge models have their own biases; calibrate against 20-30 human-labeled examples during initial setup to verify the judge's scores correlate with actual quality.

Golden set maintenance

Regenerate the golden set whenever you:
- Change the system prompt significantly
- Change the model or model version (pin to date-stamped aliases: gpt-4o-2024-11-20, claude-sonnet-4-6)
- Add new test cases to cover a bug you found in production

Never regenerate automatically in CI. The golden set is a snapshot of what we agreed is correct behavior. It should only change when a human decides to change it.

Keep the golden set small but representative. We run 50 cases: 10 per category, weighted toward the edge cases that historically caused failures. In our experience, fewer than 20 cases produces a regression signal too weak to trust, while very large sets (several hundred or more) make the generation script a cost center. Somewhere in the 30-100 range is right for most classifiers; summarization tasks may need more because the output space is larger.

Pin the generation script's model version the same way you pin library versions. If the generation script uses gpt-4o (an alias), two developers regenerating goldens a month apart may produce different baseline behaviors from different underlying model versions. Use gpt-4o-2024-11-20 in the script and update the pin deliberately.

Cost management

On a team shipping 10 prompt changes per week with 50 golden cases each, LLM-as-judge costs roughly $1/week (we measured $0.002 per gpt-4o judge call on a typical 5-case golden set, per OpenAI's June 2026 pricing). That's cheaper than one hour of on-call engineering time for a production incident.

The deterministic and regression layers cost zero in API calls. Invest there first. In our experience they catch 80% of regressions. Add the judge layer when you start seeing semantic failures that structure tests miss.

Evals vs monitoring

Evals in CI catch regressions before prod. Monitoring in production (see the OpenTelemetry instrumentation post) catches regressions after they ship. You need both. Evals find the prompt bugs. Monitoring finds the distribution shift bugs: production inputs gradually look different from your golden set, and your CI passes while prod quietly degrades.

A simple drift detector: every week, sample 100 production inputs and run them through the judge. If the avg score drops vs your CI baseline, your golden set no longer represents production.

The golden set lifecycle across a prompt update looks like this:

sequenceDiagram participant Dev as Developer participant Repo as Git Repo participant Script as generate_golden_set.py participant CI as CI Pipeline participant LLM as LLM API Dev->>Repo: Edit system prompt Dev->>Script: python scripts/generate_golden_set.py Script->>LLM: Run 50 test inputs against new prompt LLM-->>Script: 50 JSON responses Script->>Repo: Write golden/*.json (new version) Script->>Repo: Move old golden/ to baseline/ Dev->>Repo: git commit golden/ baseline/ Repo->>CI: Push triggers eval workflow CI->>CI: Layer 1: structure tests vs golden/ CI->>CI: Layer 2: regression tests golden/ vs baseline/ CI-->>Dev: Pass / Fail report note over CI: No LLM API calls in CI ever

Debugging the Gotcha: Judge Bias

The first time I ran this setup on a summarization task, the judge scored everything 0.95+. Every response looked perfect. We were delighted. We shipped. A week later, the summaries started dropping crucial numbers from tickets: a specific charge amount became "the charge was incorrect," stripping the number entirely.

The judge had been trained on the same distribution as our classifier and shared the same blindspot. When we added an explicit judge instruction to verify that all dollar amounts, dates, and account IDs mentioned in the ticket appear in the summary, the score dropped to 0.71 on our existing golden set. We had to fix 14 test cases.

Lesson: LLM-as-judge is only as good as its prompt. The judge needs explicit criteria for every important property. Asking whether a summary is good is too vague. Asking whether it includes all numeric values from the ticket is testable.

Conclusion

The reason most teams don't run LLM evals in CI isn't that it's hard. They're using the wrong testing model. Exact-match comparisons of LLM prose outputs will always be flaky. Structure tests and golden-set regressions are deterministic. Put those in CI from day one.

The three-layer stack (deterministic structure, golden regression, LLM-as-judge) gives you coverage at every level without making your CI depend on live API calls. The fast layers block merges. The expensive layer informs humans.

Production LLM systems fail silently. Your tests should fail loudly.


Get the next one

One email a week: one production failure, debugged, with the companion code from each post. No spam, unsubscribe anytime.

👉 Subscribe (free)

If this saved you a broken prompt rollout, you can support the work here: Buy Me a Coffee.

Reader challenge: add an LLM-as-judge eval to your next prompt change. Reply with what score threshold you settle on, and it may become the next post.


Revision History

Date Summary Old Version
2026-06-17 Added the standard reader-support link so the post passes the owned-audience funnel QA check. Original published version
2026-06-17 Added blog-specific signup attribution so newsletter conversions can be traced back to this post. Previous 2026-06-17 revision

Sources

  1. DORA State of DevOps 2025 — LLM production incident detection metrics
  2. OpenAI Evals framework documentation — official eval patterns from OpenAI
  3. Anthropic Model Specification on evaluation methodology — Claude evaluation design principles
  4. OpenTelemetry GenAI Semantic Conventions — OTel 1.26 stable spec for LLM span attributes
  5. LLM-as-a-Judge: Is it a Good Evaluator? (2025, arXiv:2306.05685) — academic analysis of LLM judge reliability and calibration

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

LinkedIn X / Twitter

Published: 2026-06-16 · Updated: 2026-06-17 · Written with AI assistance, reviewed by Toc Am.

Get These In Your Inbox

Weekly deep-dives on AI engineering, no fluff. Join the newsletter →

Subscribe (free)

Or grab the book ($39, ~100 pages) · Buy me a coffee

Buy Me a Coffee · 🔔 YouTube · 💼 LinkedIn · 🐦 X/Twitter

No comments:

Post a Comment

LLM Evals in CI: How to Test AI Output Without Flakiness

Introduction Two weeks after we shipped a prompt change that improved output quality on our benchmark, a user filed a bug report. Th...