Grospan

Research

Notes from the workbench.

What we are learning about evaluating models, designing safer interfaces, and operating AI-assisted software in real organisations.

Topographic lines
Featured paper

Cost-per-decision: an honest metric for production AI.

A framework for comparing automation options that includes the human review time you still need to budget for. Includes a worksheet, two case studies, and the spreadsheet we actually use with clients.

Topographic layers

Topics we follow.

12 notes

Evaluation

How to author and grow regression suites that survive contact with users.

8 notes

Retrieval

When to retrieve, what to retrieve, and how to evaluate the retriever separately from the model.

6 notes

Observability

Reading traces, debugging multi-step runs, and instrumenting tool calls.

9 notes

Safety

Refusals, redaction, and the policy surface around production AI.

All notes.

Note · May 2026

Designing eval suites that survive contact with users

We share the template we use to grow a regression suite from a dozen examples to a few thousand without losing signal.

Read note →
Paper · April 2026

Cost-per-decision: an honest metric for production AI

A framework for comparing automation options that includes the human review time you still need to budget for.

Read note →
Note · March 2026

When not to retrieve

Three patterns where injecting context made our systems worse, and how we now decide before reaching for RAG.

Read note →
Note · February 2026

Reading traces: a practice

A short guide to inspecting a multi-step run when something goes wrong in production at 4 a.m.

Read note →
Paper · January 2026

Refusals as product surface

Treating model refusals as a UX decision, not just a safety setting — with examples from regulated deployments.

Read note →
Note · December 2025

Boring deployments, on purpose

How we keep rollouts uneventful: feature flags, shadow modes and the discipline of one variable at a time.

Read note →

Get the next note in your inbox.

Roughly one note per month. Long-form, technical, no marketing. Easy to unsubscribe.