Designing eval suites that survive contact with users
We share the template we use to grow a regression suite from a dozen examples to a few thousand without losing signal.
Research
What we are learning about evaluating models, designing safer interfaces, and operating AI-assisted software in real organisations.

A framework for comparing automation options that includes the human review time you still need to budget for. Includes a worksheet, two case studies, and the spreadsheet we actually use with clients.

How to author and grow regression suites that survive contact with users.
When to retrieve, what to retrieve, and how to evaluate the retriever separately from the model.
Reading traces, debugging multi-step runs, and instrumenting tool calls.
Refusals, redaction, and the policy surface around production AI.
A framework for comparing automation options that includes the human review time you still need to budget for.
Three patterns where injecting context made our systems worse, and how we now decide before reaching for RAG.
A short guide to inspecting a multi-step run when something goes wrong in production at 4 a.m.
Treating model refusals as a UX decision, not just a safety setting, with examples from regulated deployments.
How we keep rollouts uneventful: feature flags, shadow modes and the discipline of one variable at a time.

Roughly one note per month. Long-form, technical, no marketing. Easy to unsubscribe.