MVTec AD is a benchmark, not a dataset
Every anomaly detection paper tops out near 99 % image-AUROC on MVTec AD. That number is the reason teams confidently deploy a model and then watch it fail in production. MVTec AD is small (~5k images), pristine (lab lighting, clean backgrounds, single object), and curated (anomalies are visible to a human in <1 s). Your factory floor is none of those things.

If you're collecting your own dataset, here's what we've learned the hard way.
Class imbalance is the whole problem
Anomalies are rare by definition. A line that produces a 1 % defect rate gives you, in a typical week, a hundred or so anomalies and tens of thousands of good parts. This isn't a "balance the loss" problem. It's a "you don't have enough anomalies for supervised learning, ever" problem. Hence unsupervised methods.

But: that 1 % includes maybe twenty distinct defect types. If you train on the data you have, you'll cover the common defects fine and miss the rare ones — the rare ones being, of course, the most expensive ones to miss.
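To make the scarcity concrete, here's a back-of-envelope sketch; the throughput number and the uniform defect split are illustrative assumptions, not measurements:

```python
# Back-of-envelope: how many examples of each defect type you collect.
# All numbers below are assumptions for illustration.
parts_per_week = 15_000   # assumed line throughput
defect_rate = 0.01        # the 1 % overall defect rate from above
n_defect_types = 20       # distinct defect modes on the line

anomalies_per_week = parts_per_week * defect_rate           # ~150
per_type_if_uniform = anomalies_per_week / n_defect_types   # ~7.5

# Real defect splits are long-tailed, so the rare modes show up far
# less often than this "uniform" average; some weeks, not at all.
print(f"{anomalies_per_week:.0f} anomalies/week, "
      f"~{per_type_if_uniform:.1f} per type if the split were uniform")
```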
The cold start problem
On day one of a project you have no good images and no anomaly images. Two weeks of data collection later, you have a few hundred good images and zero confirmed anomalies. The decision: deploy a "good only" anomaly detector now and find out what it flags, or wait until a couple of confirmed anomalies show up?

We've converged on: deploy in shadow mode at the end of week 2. Use the operator's manual rejections as anomaly labels. Don't trust the labels until you've reviewed them.
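A minimal sketch of what shadow-mode logging can look like; the file layout and function name are ours, and the scoring model is whichever one you already have:

```python
# Shadow mode: the model scores every part but never gates the line.
# The operator's manual rejections become provisional anomaly labels.
import csv
import time
from pathlib import Path

LOG = Path("shadow_log.csv")

def log_shadow(image_path: str, score: float, operator_rejected: bool) -> None:
    """Append one inspection record: model score next to operator decision."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["ts", "image", "score", "operator_rejected"])
        writer.writerow([time.time(), image_path, score, int(operator_rejected)])
```

Review the logged rejections before treating them as labels; operators sometimes reject for reasons the camera never saw.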
Active learning loops that actually work
- Run inference on every part. Log score + image.
- Human reviewer queues: highest-score good parts (potential false rejects), lowest-score bad parts (potential false accepts).
- Operator labels in <30 s per image, in a UI built for it. Not a spreadsheet.
- Daily delta: 50-100 new labels, weekly retrain on the cumulative set.
This is the pattern that took our worst-performing project from 92 % to 99.4 % image-AUROC over six weeks of production. No new model architecture; just a better dataset.
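For concreteness, a sketch of the queueing step from the loop above, assuming each logged record carries the model score and the operator's accept/reject decision (all names illustrative):

```python
# Build the two daily review queues from the shadow/inference log.
from dataclasses import dataclass

@dataclass
class Record:
    image: str
    score: float     # higher = more anomalous
    rejected: bool   # the operator's decision on the line

def review_queues(records: list[Record], k: int = 50):
    """Return the day's two review queues.

    - accepted parts with the highest scores: potential false rejects
    - rejected parts with the lowest scores: potential false accepts
    """
    accepted = sorted((r for r in records if not r.rejected),
                      key=lambda r: r.score, reverse=True)
    rejected = sorted((r for r in records if r.rejected),
                      key=lambda r: r.score)
    return accepted[:k], rejected[:k]
```

Reviewing only the two tails is what keeps the 30-second-per-image budget realistic: most of the log never needs a human.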
Synthetic anomalies (CutPaste, DRAEM)
A surprisingly strong tool. The trick: paste random crops from the same image (CutPaste) or simulate Perlin-noise-driven structural anomalies (DRAEM-style). The model learns "this region is statistically inconsistent" rather than "this looks like the anomalies I've seen". Generalises better to unseen defect types than naive supervised approaches.

We don't ship synthetic-only models. We ship models trained on real good samples + synthetic perturbations.
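A minimal CutPaste-style sketch, assuming NumPy images in (H, W, C) layout. The patch-area and aspect ranges roughly follow the CutPaste paper, but treat them as starting points:

```python
# CutPaste-style perturbation: cut a random patch from a good image
# and paste it back at a random location, creating a local inconsistency.
import numpy as np

def cutpaste(img: np.ndarray, rng: np.random.Generator,
             area_ratio: tuple[float, float] = (0.02, 0.15)) -> np.ndarray:
    """Return a copy of `img` (H, W, C) with one patch pasted elsewhere."""
    h, w = img.shape[:2]
    area = rng.uniform(*area_ratio) * h * w   # patch area vs. image area
    aspect = rng.uniform(0.3, 3.3)            # patch aspect ratio
    ph = max(1, min(int(np.sqrt(area / aspect)), h - 1))
    pw = max(1, min(int(np.sqrt(area * aspect)), w - 1))
    # Source and destination corners are sampled independently.
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out = img.copy()
    out[dy:dy + ph, dx:dx + pw] = img[sy:sy + ph, sx:sx + pw]
    return out
```

The usual recipe trains a self-supervised classifier to tell originals from perturbed copies, then scores anomalies with the representation it learns.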
Things to actually capture, beyond the image
- Camera ID, lens config, lighting state — different cameras drift differently
- Shift, operator ID, line speed — operator-driven variance is a real signal
- Upstream process variables (temperature, pressure) when available — sometimes the anomaly is upstream
- Material lot — different supplier batches look different to the camera
These are the columns that let you debug a regression three months later instead of staring at a confusion matrix.
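If it helps, a sketch of what one per-image record might look like; the field names are our convention, not a standard schema, and the example values are made up:

```python
# One metadata record per captured image, stored alongside the PNG itself.
import json
from dataclasses import dataclass, asdict

@dataclass
class CaptureMeta:
    image: str
    camera_id: str
    lens_config: str
    lighting_state: str
    shift: str
    operator_id: str
    line_speed: float                   # e.g. parts per minute
    upstream_temp_c: float | None       # None when the PLC feed is down
    upstream_pressure_bar: float | None
    material_lot: str

# Illustrative values only.
meta = CaptureMeta("img_000123.png", "cam-03", "25mm-f4", "ring-75pct",
                   "night", "op-117", 42.0, 181.5, 2.4, "LOT-2024-118")
print(json.dumps(asdict(meta)))
```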
One last thing
Don't compress your training images. Lossy JPEG compression hides exactly the kind of low-amplitude defects you're trying to detect. Keep the raw PNGs in cold storage and downsample at training time if the model needs it.
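A minimal sketch of the training-time half, assuming Pillow and PNG originals; the size cap is an arbitrary example:

```python
# Keep the lossless original on disk; downsample only in memory,
# at load time, if the model needs a smaller input.
from PIL import Image

def load_for_training(path: str, max_side: int = 1024) -> Image.Image:
    """Load a raw PNG and downsample it for the model if needed.

    The PNG in cold storage stays untouched, so the low-amplitude
    detail survives even if today's model trains at low resolution.
    """
    img = Image.open(path)
    scale = max_side / max(img.size)
    if scale < 1.0:
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img
```

What's your dataset cadence? Weekly retrain, monthly, on-demand only?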