Three families, three tradeoffs
The unsupervised anomaly detection space has more or less converged on three architectural families. Picking between them is less about which one tops the MVTec leaderboard and more about which one fits the constraints of your cell — memory budget, retrain cadence, latency, and how exotic your anomalies are.

Here's the practical decision matrix we use.
Memory-bank methods (PaDiM, PatchCore)
PaDiM was the breakthrough; PatchCore is the refinement that actually shipped. Both extract patch features from a frozen backbone (WideResNet-50, EfficientNet) on good samples only. PaDiM fits a per-position multivariate Gaussian to those features and scores by Mahalanobis distance; PatchCore keeps the features themselves in a memory bank, and at inference the test patch's distance to its nearest neighbor in the bank is the anomaly score.

Where they win: small datasets (200-500 good samples), high accuracy on subtle defects, no training in the classical sense (just bank construction or statistics estimation).
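The memory-bank scoring idea fits in a few lines. A toy sketch in pure NumPy — random vectors stand in for backbone patch features, and `anomaly_score` is our own name, not anything from the papers:

```python
import numpy as np

def anomaly_score(patch_feat, bank):
    """Score a test patch by its distance to the nearest 'good' patch.

    patch_feat: (d,) feature vector from a frozen backbone.
    bank: (n, d) memory bank built from good samples only.
    """
    dists = np.linalg.norm(bank - patch_feat, axis=1)
    return dists.min()

rng = np.random.default_rng(0)
bank = rng.normal(0.0, 1.0, size=(500, 64))  # toy "good" patch features
good = rng.normal(0.0, 1.0, size=64)         # in-distribution patch
bad = rng.normal(5.0, 1.0, size=64)          # shifted => anomalous

assert anomaly_score(bad, bank) > anomaly_score(good, bank)
```

In a real deployment the brute-force distance computation is what FAISS replaces; the logic is otherwise the same.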
Where they hurt: the memory bank gets large fast — a 5000-image bank at full resolution can blow past 4 GB. PatchCore with greedy coreset subsampling is the workable version. Inference latency is bounded by nearest-neighbor search; FAISS speeds up the lookup but doesn't remove the dependence on bank size.
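The greedy coreset trick is the reason PatchCore's bank stays manageable. A minimal sketch of greedy k-center selection (the same family of algorithm PatchCore uses, though this NumPy version is ours and unoptimized):

```python
import numpy as np

def greedy_coreset(feats, m):
    """Greedy k-center coreset: keep m of n features while preserving coverage.

    Repeatedly adds the feature farthest from everything selected so far,
    so the shrunken bank doesn't leave big 'holes' in feature space.
    """
    selected = [0]  # seed with an arbitrary point
    min_d = np.linalg.norm(feats - feats[0], axis=1)
    while len(selected) < m:
        idx = int(np.argmax(min_d))          # farthest remaining point
        selected.append(idx)
        d = np.linalg.norm(feats - feats[idx], axis=1)
        min_d = np.minimum(min_d, d)         # distance to nearest selected
    return feats[selected]

rng = np.random.default_rng(1)
feats = rng.normal(size=(1000, 32))
bank = greedy_coreset(feats, 100)  # 10x smaller memory bank

assert bank.shape == (100, 32)
```

PatchCore typically keeps 1-10% of patches this way with little accuracy loss.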
Distillation (EfficientAD, RD4AD)
A student network is trained to match a frozen teacher's features on good samples only. At inference, wherever student and teacher disagree → anomaly.

Where they win: inference latency. EfficientAD on a 3060 runs at <10 ms per image. Memory footprint is fixed (it's just the student weights).
Where they hurt: slightly lower top-line accuracy on the harder MVTec categories, and training is more sensitive to hyperparameters than the paper makes it sound.
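The student-teacher mechanism can be demonstrated without any deep nets. In this toy NumPy sketch a frozen nonlinear "teacher" is approximated by a linear "student" fit on good samples only; far from the good-data region the approximation breaks down, which is exactly the disagreement the method scores (everything here — `W`, `score`, the distributions — is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(scale=0.25, size=(16, 16))

def teacher(x):
    # frozen nonlinear "teacher" features
    return np.tanh(x @ W)

# train a linear "student" to match the teacher on GOOD samples only
good = rng.normal(0.0, 0.5, size=(2000, 16))
S, *_ = np.linalg.lstsq(good, teacher(good), rcond=None)

def score(x):
    # anomaly score = student/teacher disagreement
    return np.linalg.norm(teacher(x) - x @ S, axis=1)

in_dist = rng.normal(0.0, 0.5, size=(100, 16))   # looks like training data
out_dist = rng.normal(0.0, 3.0, size=(100, 16))  # far from it

assert score(out_dist).mean() > score(in_dist).mean()
```

The student only learned the teacher's behavior where good data lives, so disagreement is a proxy for "not like anything seen in training" — the same logic EfficientAD and RD4AD apply per spatial location with CNN features.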
Reconstruction & flow (FastFlow, DRAEM)
Older lineage. Autoencoders, normalizing flows, or diffusion models that learn the distribution of "good" and flag deviations.

Where they win: textured surfaces (fabric, wood, leather). DRAEM in particular handles defect types not seen at training time better than any memory-bank method.
Where they hurt: training stability, hyperparameter sensitivity, and a tendency to memorize anomalies if your "good" set is dirty.
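The reconstruction idea in miniature: a linear autoencoder (PCA) standing in for the real autoencoder/flow on synthetic data. Good samples live near a low-dimensional structure the model can learn to reconstruct; anomalies don't, so their reconstruction error is large. All names and data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
# "good" data lies near a 4-dimensional subspace of a 32-d space
basis = rng.normal(size=(4, 32))
good = rng.normal(size=(500, 4)) @ basis + 0.05 * rng.normal(size=(500, 32))

# fit the principal subspace of the good set (a linear autoencoder)
mean = good.mean(axis=0)
_, _, vt = np.linalg.svd(good - mean, full_matrices=False)
comp = vt[:4]  # keep 4 components

def recon_error(x):
    # encode (project), decode (reconstruct), score what's left over
    z = (x - mean) @ comp.T
    recon = z @ comp + mean
    return np.linalg.norm(x - recon, axis=1)

normal = rng.normal(size=(50, 4)) @ basis + 0.05 * rng.normal(size=(50, 32))
anomalous = rng.normal(size=(50, 32))  # no subspace structure

assert recon_error(anomalous).mean() > recon_error(normal).mean()
```

The failure mode in the "where they hurt" list also shows up here: if anomalous samples leak into `good`, the fitted subspace bends toward them and they reconstruct cleanly — that's the memorization problem.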
How we pick
- Small dataset, sharp defects, latency >100 ms is fine → PatchCore
- Latency budget < 30 ms → EfficientAD
- Textured surface, expecting unseen defect types → DRAEM
- "Just give me the highest score on a benchmark" → PatchCore (and that's exactly why it doesn't always ship)
The thing the leaderboards don't tell you
None of these models handle drift. Train them in March, deploy them, and by August the lighting in your factory has shifted, your operators are loading parts slightly differently, and your anomaly score distribution has drifted half a standard deviation. The leaderboard model isn't the right model — the model with the cleanest retrain story is.

What's your default? Curious whether anyone has converged on a single architecture across multiple use cases.
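One cheap guardrail, whatever the architecture: log anomaly scores in production and alarm when the rolling distribution moves away from the deployment-time baseline. A minimal sketch (the function name and the 0.3-sigma alarm threshold are our own choices):

```python
import numpy as np

def drift_in_sigmas(baseline_scores, recent_scores):
    """How far (in baseline std devs) the recent score mean has drifted."""
    mu, sigma = baseline_scores.mean(), baseline_scores.std()
    return abs(recent_scores.mean() - mu) / sigma

rng = np.random.default_rng(4)
march = rng.normal(1.0, 0.2, size=5000)   # scores logged at deployment
august = rng.normal(1.1, 0.2, size=5000)  # lighting shifted since then

assert drift_in_sigmas(march, march) < 0.1
assert drift_in_sigmas(march, august) > 0.3  # time to retrain
```

This doesn't fix drift, but it turns "the model quietly got worse" into a retrain trigger — which is most of what the retrain story needs.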