Posts

Showing posts from May, 2026

LLM SLOs in Production: Latency, Quality, Cost, and Availability Targets That Actually Move Decisions

Introduction

The first time I argued for LLM SLOs at our weekly platform review, the head of product told me the number I wanted to track was "user happiness." I laughed politely and asked how we measured user happiness today. He said the customer-success team had a feeling. The team had a feeling because the dashboards we had spent two quarters building did not answer the one question the product owner cared about: whether the model was getting better or worse for actual users this week. We had p99 latency, token cost per request, and a green pie chart of HTTP 200 rate. None of those moved when the model regressed. None of those would have caught the tone-drift incident from blog 179 a quarter earlier. None of those gave a CTO a number to put in a board update. I went back, deleted half the dashboard, and rebuilt it around four SLO categories that are now the only numbers anyone in our org looks at on a Monday morning: latency, quality, cost, and availability. ...
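The four categories in the excerpt can be made concrete as explicit targets with a simple pass/fail check. This is a minimal sketch; all field names and target numbers below are hypothetical, not the post's actual values.

```python
# Sketch: the four SLO categories as explicit targets with a weekly check.
# All thresholds and field names are hypothetical illustrations.

SLO_TARGETS = {
    "latency_p95_ms": 2500,       # 95th-percentile end-to-end latency
    "quality_pass_rate": 0.92,    # share of sampled responses passing the eval rubric
    "cost_per_request_usd": 0.04,
    "availability": 0.995,        # successful-completion rate, not just HTTP 200
}

def evaluate_slos(observed: dict) -> dict:
    """Return per-SLO status: latency and cost must stay at or below target,
    quality and availability at or above it."""
    lower_is_better = {"latency_p95_ms", "cost_per_request_usd"}
    results = {}
    for name, target in SLO_TARGETS.items():
        value = observed[name]
        ok = value <= target if name in lower_is_better else value >= target
        results[name] = {"value": value, "target": target, "ok": ok}
    return results

week = {
    "latency_p95_ms": 2100,
    "quality_pass_rate": 0.89,    # quality regressed below target this week
    "cost_per_request_usd": 0.031,
    "availability": 0.997,
}
report = evaluate_slos(week)
breaches = [name for name, r in report.items() if not r["ok"]]
print(breaches)  # ['quality_pass_rate']
```

The point of the structure is that a quality regression surfaces as a breach even while latency, cost, and availability stay green, which is exactly the failure mode the dashboard rebuild was meant to catch.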

Production LLM Canary Deployments: Shadow Mode, Traffic Splits, and Safe Model Rollouts

Introduction

The Tuesday I migrated our customer-support copilot from one frontier model to another, the team burned six engineering hours rolling back a deploy that, on paper, looked fine. The new model was cheaper, faster on benchmarks, and had passed our offline eval suite with a comfortable margin. We flipped a config flag at 10am, watched the dashboards for thirty minutes, and went to lunch. By 2pm the support managers were on a call asking why ticket-handling time had gone up by 40 seconds per ticket and why our agents were copy-pasting model outputs into a separate text editor to "clean them up before sending." The new model was technically correct on every test we had written. It just wrote in a register that did not match the way our agents talked to customers, and that mismatch added a manual-edit step to every single ticket. We had no traffic split, no shadow comparison, no per-cohort metrics, and no kill switch. The rollback was a frantic config-flag revers...
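Two of the missing controls named in the excerpt, a traffic split and a kill switch, can be sketched in a few lines. This is an illustrative sketch, not the post's implementation; the model names and the 5 percent starting share are hypothetical.

```python
# Sketch of a deterministic canary traffic split with a kill switch.
# Hashing a stable user id pins each user to one model, so cohort
# metrics stay comparable. Names and percentages are hypothetical.
import hashlib

CANARY_PERCENT = 5        # start small; widen only when cohort metrics hold
KILL_SWITCH_ON = False    # flipping this routes 100% of traffic to the old model

def route_model(user_id: str) -> str:
    if KILL_SWITCH_ON:
        return "model-old"
    # Stable hash -> bucket 0..99, so a user never flip-flops between models
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-new" if bucket < CANARY_PERCENT else "model-old"

assignments = [route_model(f"user-{i}") for i in range(10_000)]
canary_share = assignments.count("model-new") / len(assignments)
print(round(canary_share, 3))  # roughly 0.05
```

With this in place, the 40-seconds-per-ticket regression would have shown up in the canary cohort's handling-time metric while 95 percent of tickets were still on the old model, and the rollback is one boolean instead of a frantic revert.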

LLM-as-a-Judge in Production: Why Your Eval Is Lying to You and How to Build One That Doesn't

Introduction

We shipped a prompt change that our LLM judge said was 31 percent better. Customers told us it was worse. The product was a customer-support summariser that took a long ticket thread and produced a one-paragraph summary for the agent to read before responding. We had built an LLM-as-a-judge eval pipeline two months earlier, hooked it into CI, and used it to gate prompt deploys. The new prompt scored 8.4 out of 10 on the judge's rubric versus 6.4 for the old one. Its average win rate in pairwise battles was 78 percent. The graphs were green. We deployed on a Tuesday morning. By Thursday, the support team's CSAT score had dropped by 11 points and three account managers were on a call with me asking what had changed. The new summaries were longer, more flowery, and consistently buried the actual customer issue under three sentences of preamble. We took the change down on Friday and spent the next sprint doing forensics on the judge. The new prompt told the model to produce ...
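One standard guard against inflated pairwise win rates like the 78 percent above is a position swap: run each battle twice with the answers in both orders and only count a win when the judge is consistent. A minimal sketch, where `judge` is a hypothetical stand-in for the LLM call, not the post's actual pipeline:

```python
# Sketch: pairwise judging with a position swap, a guard against the
# position bias that can inflate win rates. `judge` stands in for an
# LLM call and returns "first" or "second".

def pairwise_verdict(judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Run the battle in both orders; count a win only when the judge
    prefers the same underlying answer both times, else call a tie."""
    order_ab = judge(prompt, answer_a, answer_b)
    order_ba = judge(prompt, answer_b, answer_a)
    if order_ab == "first" and order_ba == "second":
        return "a"
    if order_ab == "second" and order_ba == "first":
        return "b"
    return "tie"

# A toy biased judge that always prefers whichever answer it sees first
# gets neutralised into ties instead of producing a fake win rate:
biased = lambda prompt, x, y: "first"
print(pairwise_verdict(biased, "q", "ans1", "ans2"))  # tie
```

Swapping positions does not fix rubric problems like rewarding length and flowery preamble, but it removes one of the cheapest ways a judge can lie to you.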

Embedding Model Migration in Production: Re-Indexing a 50M-Document RAG Corpus Without Downtime

Introduction

The first time we tried to swap embedding models on a live RAG system, the rollback took eleven hours and we lost a customer. The product was a legal-document search system with about 38 million paragraphs indexed in pgvector, embedded with text-embedding-ada-002. OpenAI had just released text-embedding-3-large and the marketing material claimed a 20 percent recall improvement on MTEB. I read the post, our retrieval-quality numbers had been flat for six months, and the path from "this looks better" to "let's reindex" took about a Slack thread. We started the re-embed run on a Wednesday afternoon. By Thursday morning we had a partially re-indexed corpus, a queue of 2.3 million paragraphs that had failed silently because the new model returned 3072 dimensions and our pgvector column was capped at 1536, an active customer who could not find their own contract because their query embedding now lived in a different vector space than the documents, and ...
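The 3072-versus-1536 failure described above is catchable with a one-vector pre-flight check before any re-embedding starts. A sketch under stated assumptions: `embed` is a hypothetical stand-in for a real embeddings client, and the column width would come from the actual schema rather than a constant.

```python
# Sketch: verify the new model's embedding width against the pgvector
# column before re-embedding anything. `embed` is a stand-in for a real
# embeddings API; the column width is hard-coded here for illustration.

PGVECTOR_COLUMN_DIM = 1536  # in practice, read this from the table schema

def embed(texts, model):
    # Stand-in for the embeddings call; widths match the models in the post
    dims = {"text-embedding-ada-002": 1536, "text-embedding-3-large": 3072}
    return [[0.0] * dims[model] for _ in texts]

def preflight(model: str) -> None:
    probe = embed(["dimension probe"], model)[0]
    if len(probe) != PGVECTOR_COLUMN_DIM:
        raise ValueError(
            f"{model} returns {len(probe)}-dim vectors but the pgvector "
            f"column is vector({PGVECTOR_COLUMN_DIM}); migrate the schema "
            "or reduce the embedding dimension before re-embedding."
        )

preflight("text-embedding-ada-002")    # passes silently
# preflight("text-embedding-3-large")  # raises ValueError on the mismatch
```

A check like this turns a silent 2.3-million-paragraph failure queue into a loud error before the first write, independent of whatever dual-index or versioned-namespace strategy the migration itself uses.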