The number that fooled everyone
Anomaly detection papers report image-level AUROC. Production anomaly detection lives or dies on per-image false-reject rate at a fixed threshold. These are not the same number, and confusing them is the #1 reason a "99 % accurate" model gets unplugged on day three.

Here's the framework we actually use to evaluate.
Pixel AUROC vs image AUROC
- Image AUROC — given a binary label per image, how well the score ranks anomalous images above good ones. Optimistic, because most images are easy.
- Pixel AUROC — given a per-pixel ground-truth mask, how well the heatmap ranks anomalous pixels above good ones. Less optimistic, because most pixels are easy in the other direction (most pixels are good even in anomalous images).
- PRO score (per-region overlap) — penalises detectors that score a single pixel correctly but miss the rest of the defect. Closer to what operators care about.
We track all three during evaluation. We pick thresholds based on a fourth metric.
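As a concrete reference, here is a minimal sketch of the first two metrics using scikit-learn; the array names and shapes are assumptions, and PRO (which needs per-region connected-component analysis) is omitted:

```python
# Hedged sketch: image- and pixel-level AUROC with scikit-learn.
# Assumed inputs: per-image scores, per-pixel heatmaps, binary image
# labels (0 = good, 1 = anomalous), and ground-truth masks of shape (N, H, W).
import numpy as np
from sklearn.metrics import roc_auc_score

def image_auroc(image_labels: np.ndarray, image_scores: np.ndarray) -> float:
    # One score and one binary label per image.
    return roc_auc_score(image_labels, image_scores)

def pixel_auroc(masks: np.ndarray, heatmaps: np.ndarray) -> float:
    # Flatten so every pixel becomes one sample.
    return roc_auc_score(masks.reshape(-1).astype(int), heatmaps.reshape(-1))
```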
The metric that ops cares about: FRR at fixed FAR
On the line, the customer specifies one of:
- "Maximum 0.1 % false accept rate" — for safety-critical defects
- "Maximum 2 % false reject rate" — for cost-of-quality applications
You then optimise the other one at the fixed value of the first. AUROC summarises the whole curve, but ops works at one operating point. Reporting AUROC = 99 % when the FRR at FAR = 0.1 % is 8 % is dishonest, even if it's accidental.
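A minimal sketch of that operating-point calculation, assuming you have NumPy arrays of anomaly scores for good and anomalous parts (the names are illustrative):

```python
import numpy as np

def frr_at_far(good_scores: np.ndarray, bad_scores: np.ndarray,
               max_far: float = 0.001) -> float:
    # Reject when score >= threshold. A false accept is an anomalous part
    # scoring below the threshold, so set the threshold at the max_far
    # quantile of the anomalous scores.
    threshold = np.quantile(bad_scores, max_far)
    # FRR: fraction of good parts that would be rejected at that threshold.
    return float(np.mean(good_scores >= threshold))
```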
Threshold selection: where most projects leak performance
- Don't pick the threshold that maximises F1 on the test set. That's test-set leakage, and your production threshold will be wrong.
- Don't pick the threshold once and ship it. Lighting drifts, cameras drift, operators drift. Threshold drifts.
- Do pick the threshold on a held-out validation set, then verify on a never-touched test set.
- Do recalibrate the threshold weekly using the operator-confirmed feedback from production.
We run two thresholds in production: a hard threshold (score above it → reject, no operator review) and a soft threshold (score between the two → operator review queue, decision logged for retraining).
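A sketch of that routing logic; the function and parameter names are placeholders, and the soft threshold is assumed to sit below the hard one, with both values coming from the calibration steps above:

```python
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept"   # below the soft threshold: pass, no review
    REVIEW = "review"   # between soft and hard: operator review, logged for retraining
    REJECT = "reject"   # at or above the hard threshold: auto-reject

def decide(score: float, soft_threshold: float, hard_threshold: float) -> Decision:
    if score >= hard_threshold:
        return Decision.REJECT
    if score >= soft_threshold:
        return Decision.REVIEW
    return Decision.ACCEPT
```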
Drift over time — the metric that should be on every dashboard
None of this is static. Build a dashboard that tracks:
- Daily mean and 95th percentile anomaly score on good parts
- Daily reject rate
- Daily operator override rate
- Distribution divergence (KL or MMD) between this week's embeddings and the reference week
When any of those moves more than 2σ from its baseline, you've drifted. Investigate before the customer notices.
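A minimal sketch of the 2σ check and the per-day statistics that feed it; the array names are assumptions, and the divergence term (KL or MMD on embeddings) is left out for brevity:

```python
import numpy as np

def daily_stats(good_part_scores: np.ndarray) -> dict:
    # The per-day numbers for one line/camera that go on the dashboard.
    return {
        "mean_score": float(good_part_scores.mean()),
        "p95_score": float(np.percentile(good_part_scores, 95)),
    }

def two_sigma_alert(baseline_daily_values: np.ndarray, today_value: float) -> bool:
    # baseline_daily_values: one tracked statistic (e.g. daily mean score on
    # good parts) over the reference period; flag if today is > 2 sigma away.
    mu, sigma = baseline_daily_values.mean(), baseline_daily_values.std()
    return abs(today_value - mu) > 2 * sigma
```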
The honest evaluation
The honest report we send a customer at end-of-PoC includes:
- Image AUROC, pixel AUROC, PRO — on a held-out test set, with confidence intervals from bootstrap
- FRR @ customer-specified FAR — on the same test set
- Failure mode breakdown — which defect types we miss, with example images
- Drift sensitivity estimate — how much accuracy degrades after 2 weeks of simulated lighting drift
The customer cares about the last two more than the first two. Always.
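For those bootstrap confidence intervals in the first bullet, a minimal sketch, assuming per-image labels and scores as NumPy arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(labels: np.ndarray, scores: np.ndarray,
                       n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    # Resample images with replacement, recompute AUROC each time, and
    # report the empirical (alpha/2, 1 - alpha/2) percentiles.
    rng = np.random.default_rng(seed)
    n = len(labels)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue  # skip resamples that contain only one class
        aurocs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```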
What metrics make it onto your dashboard?