The number that fooled everyone
Anomaly detection papers report image-level AUROC. Production anomaly detection lives or dies on per-image false-reject rate at a fixed threshold. These are not the same number, and confusing them is the #1 reason a "99 % accurate" model gets unplugged on day three.

Here's the framework we actually use to evaluate.
Pixel AUROC vs image AUROC
- Image AUROC — given a binary label per image, how well the score ranks anomalous images above good ones. Optimistic, because most images are easy.
- Pixel AUROC — given a per-pixel ground-truth mask, how well the heatmap ranks anomalous pixels above good ones. Less optimistic, because most pixels are easy in the other direction (most pixels are good even in anomalous images).
- PRO score (per-region overlap) — penalises detectors that score a single pixel correctly but miss the rest of the defect. Closer to what operators care about.
We track all three during evaluation. We pick thresholds based on a fourth metric.
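As a concrete reference, here is a minimal sketch of the first two metrics using scikit-learn; the array names and shapes are assumptions, and PRO (which needs per-region connected-component analysis) is omitted:

```python
# Hedged sketch: image- and pixel-level AUROC with scikit-learn.
# Assumed inputs: per-image scores, per-pixel heatmaps, binary image
# labels (0 = good, 1 = anomalous), and ground-truth masks of shape (N, H, W).
import numpy as np
from sklearn.metrics import roc_auc_score

def image_auroc(image_labels: np.ndarray, image_scores: np.ndarray) -> float:
    # One score and one binary label per image.
    return roc_auc_score(image_labels, image_scores)

def pixel_auroc(masks: np.ndarray, heatmaps: np.ndarray) -> float:
    # Flatten so every pixel becomes one sample.
    return roc_auc_score(masks.reshape(-1).astype(int), heatmaps.reshape(-1))
```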
The metric that ops cares about: FRR at fixed FAR
On the line, the customer specifies one of:
- "Maximum 0.1 % false accept rate" — for safety-critical defects
- "Maximum 2 % false reject rate" — for cost-of-quality applications
You then optimise the other one at the fixed value of the first. AUROC summarises the whole curve, but ops works at one operating point. Reporting AUROC = 99 % when the FRR at FAR = 0.1 % is 8 % is dishonest, even if it's accidental.
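A minimal sketch of that operating-point calculation, assuming you have NumPy arrays of anomaly scores for good and anomalous parts (the names are illustrative):

```python
import numpy as np

def frr_at_far(good_scores: np.ndarray, bad_scores: np.ndarray,
               max_far: float = 0.001) -> float:
    # Reject when score >= threshold. A false accept is an anomalous part
    # scoring below the threshold, so set the threshold at the max_far
    # quantile of the anomalous scores.
    threshold = np.quantile(bad_scores, max_far)
    # FRR: fraction of good parts that would be rejected at that threshold.
    return float(np.mean(good_scores >= threshold))
```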
Threshold selection: where most projects leak performance
- Don't pick the threshold that maximises F1 on the test set. That's test-set leakage, and your production threshold will be wrong.
- Don't pick the threshold once and ship it. Lighting drifts, cameras drift, operators drift. Threshold drifts.
- Do pick the threshold on a held-out validation set, then verify on a never-touched test set.
- Do recalibrate the threshold weekly using the operator-confirmed feedback from production.
We run two thresholds in production: a hard threshold (score above it → reject, no operator review) and a soft threshold (score between the two → operator review queue, decision logged for retraining).
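A sketch of that routing logic; the function and parameter names are placeholders, and the soft threshold is assumed to sit below the hard one, with both values coming from the calibration steps above:

```python
from enum import Enum

class Decision(Enum):
    ACCEPT = "accept"   # below the soft threshold: pass, no review
    REVIEW = "review"   # between soft and hard: operator review, logged for retraining
    REJECT = "reject"   # at or above the hard threshold: auto-reject

def decide(score: float, soft_threshold: float, hard_threshold: float) -> Decision:
    if score >= hard_threshold:
        return Decision.REJECT
    if score >= soft_threshold:
        return Decision.REVIEW
    return Decision.ACCEPT
```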
Drift over time — the metric that should be on every dashboard
None of this is static. Build a dashboard that tracks:
- Daily mean and 95th percentile anomaly score on good parts
- Daily reject rate
- Daily operator override rate
- Distribution divergence (KL or MMD) between this week's embeddings and the reference week
When any of those moves more than 2σ from its baseline, you've drifted. Investigate before the customer notices.
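A minimal sketch of the 2σ check and the per-day statistics that feed it; the array names are assumptions, and the divergence term (KL or MMD on embeddings) is left out for brevity:

```python
import numpy as np

def daily_stats(good_part_scores: np.ndarray) -> dict:
    # The per-day numbers for one line/camera that go on the dashboard.
    return {
        "mean_score": float(good_part_scores.mean()),
        "p95_score": float(np.percentile(good_part_scores, 95)),
    }

def two_sigma_alert(baseline_daily_values: np.ndarray, today_value: float) -> bool:
    # baseline_daily_values: one tracked statistic (e.g. daily mean score on
    # good parts) over the reference period; flag if today is > 2 sigma away.
    mu, sigma = baseline_daily_values.mean(), baseline_daily_values.std()
    return abs(today_value - mu) > 2 * sigma
```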
The honest evaluation
The honest report we send a customer at end-of-PoC includes:
- Image AUROC, pixel AUROC, PRO — on a held-out test set, with confidence intervals from bootstrap
- FRR @ customer-specified FAR — on the same test set
- Failure mode breakdown — which defect types we miss, with example images
- Drift sensitivity estimate — how much accuracy degrades after 2 weeks of simulated lighting drift
The customer cares about the last two more than the first two. Always.
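For those bootstrap confidence intervals in the first bullet, a minimal sketch, assuming per-image labels and scores as NumPy arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(labels: np.ndarray, scores: np.ndarray,
                       n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    # Resample images with replacement, recompute AUROC each time, and
    # report the empirical (alpha/2, 1 - alpha/2) percentiles.
    rng = np.random.default_rng(seed)
    n = len(labels)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue  # skip resamples that contain only one class
        aurocs.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```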
What metrics make it onto your dashboard?