Hailo-8 in production: lessons from shipping eight stations on the same p

Aior · Apr 30, 2026

Why Hailo

We started shipping Hailo-8 accelerators about two years ago, after testing them head-to-head against a Jetson Xavier NX on a vision inspection workload. The headline numbers were clear: comparable inference performance at roughly a quarter of the power, with a much smaller thermal envelope. After eight production stations, here's what we know that isn't in the marketing material.

The toolchain workflow, in practice

The path from a PyTorch model to a Hailo-deployed binary is:

1. Train in PyTorch / TensorFlow.
2. Export to ONNX.
3. Optimize with the Hailo Dataflow Compiler (DFC); this includes quantization to INT8.
4. Compile to a Hailo Executable Format (HEF) targeting the specific chip (Hailo-8 / 8L / 15).
5. Deploy via HailoRT runtime.

Steps 3 and 4 are where the real work happens. The DFC needs a representative calibration dataset captured under production conditions: at least 64 images, ideally 512. Calibration is the difference between "almost the same accuracy as FP32" and "an embarrassing accuracy regression we have to explain to the customer".
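A minimal sketch of how a calibration set could be drawn from production captures, assuming the images are already on disk and tagged with a capture condition (shift, lighting, part variant). The function name, the condition labels, and the round-robin strategy are illustrative assumptions, not part of the DFC; only the 64 and 512 figures come from the text above.

```python
import random
from collections import defaultdict

MIN_CALIB = 64      # floor quoted in the post
TARGET_CALIB = 512  # preferred size quoted in the post


def pick_calibration_set(images, target=TARGET_CALIB, seed=0):
    """Pick up to `target` images, spread evenly across capture conditions.

    `images` is a list of (path, condition) pairs; `condition` is any
    hashable label such as "night-shift" (hypothetical labels).
    """
    if len(images) < MIN_CALIB:
        raise ValueError(f"need at least {MIN_CALIB} images, got {len(images)}")

    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for path, condition in images:
        by_condition[condition].append(path)
    for paths in by_condition.values():
        rng.shuffle(paths)

    # Round-robin over conditions so no single condition dominates the set.
    picked = []
    buckets = list(by_condition.values())
    while len(picked) < target and any(buckets):
        for paths in buckets:
            if paths and len(picked) < target:
                picked.append(paths.pop())
    return picked
```

The round-robin is one way to keep a rare condition (say, a handful of night-shift captures) from being drowned out by the dominant one; whether that matches your production mix is a judgment call.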

Quantization sensitivity is real

Some architectures quantize cleanly. Others don't.

ResNet, MobileNet, YOLO families: INT8 with <1% accuracy regression. No drama.
Transformers (ViT, DETR): sensitive. They often need per-channel quantization, and sometimes partial FP16 retention on the attention heads.
Anomaly detection (PatchCore, EfficientAD): distance-based scoring is sensitive to quantization noise. We spent a week recovering 2% AUROC on EfficientAD with quantization-aware training (QAT) before deciding to keep it on a Jetson Orin Nano instead.

The pragmatic rule: if your model has unusual numerics (cosine similarity in the loss, distance-based scoring, custom layer norms), assume quantization will cost you 1-3% accuracy and budget for QAT.
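To see why distance-based scoring is more exposed than classification, here is a toy sketch in plain Python. It assumes symmetric per-tensor INT8 rounding, which is a simplification and not necessarily what the DFC does, and all names and parameters are hypothetical. It compares a small L2 distance between two nearly identical feature vectors before and after rounding to the INT8 grid.

```python
import math
import random


def fake_quant_int8(vec, scale):
    """Round to a symmetric INT8 grid with step `scale`, then dequantize."""
    return [max(-127, min(127, round(x / scale))) * scale for x in vec]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_error(dim=256, gap=0.01, seed=0):
    """Return (fp_distance, quantized_distance, relative_error)."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # A near-duplicate: the kind of small offset an anomaly score must resolve.
    sample = [x + rng.gauss(0.0, gap) for x in normal]

    # One scale for both vectors, set by the largest magnitude (per-tensor).
    scale = max(abs(x) for x in normal + sample) / 127
    d_fp = l2(normal, sample)
    d_q = l2(fake_quant_int8(normal, scale), fake_quant_int8(sample, scale))
    return d_fp, d_q, abs(d_q - d_fp) / d_fp


if __name__ == "__main__":
    d_fp, d_q, rel = distance_error()
    print(f"FP distance {d_fp:.4f}, INT8 distance {d_q:.4f}, relative error {rel:.1%}")
```

In this setup the grid step is set by the largest activation, so it is typically coarser than the per-element gap the score is trying to measure. An argmax over class logits can shrug that off; a raw distance cannot.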

Memory & multi-model deployments

Hailo-8 has 20 MB of on-chip SRAM. A typical YOLOv8s is around 12 MB after quantization; YOLOv8m is around 25 MB and does not fit on its own. The chip then "context switches", loading partial graphs from host RAM, which costs latency.
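The arithmetic above as a back-of-envelope check. This is a sketch under a strong assumption: it treats quantized model size as the only thing that matters, whereas the compiler's real partitioning also depends on activations and layer layout. The function names are hypothetical; the 20, 12, and 25 MB figures are the ones quoted above.

```python
import math

HAILO8_SRAM_MB = 20.0  # figure quoted in the post


def fits_on_chip(model_mb, sram_mb=HAILO8_SRAM_MB):
    """True if the quantized model fits in on-chip SRAM in one piece."""
    return model_mb <= sram_mb


def min_contexts(model_mb, sram_mb=HAILO8_SRAM_MB):
    """Lower bound on contexts, assuming each one may fill the whole SRAM."""
    return max(1, math.ceil(model_mb / sram_mb))


if __name__ == "__main__":
    for name, size_mb in {"YOLOv8s": 12.0, "YOLOv8m": 25.0}.items():
        print(name, "fits:", fits_on_chip(size_mb), "contexts >=", min_contexts(size_mb))
```

Treat the context count as a floor, not a prediction: anything above one means you are paying for host-RAM loads on every frame.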

For multi-model deployments (e.g. detection + classification + OCR on the same chip), HailoRT supports model swapping between frames, but it is measurably slower than running a single model. We size for a single model where latency matters, and go multi-model only where the use case can tolerate 30-50 ms swap penalties.
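A rough way to sanity-check the sizing decision, assuming models run back to back within a frame and each model change costs one swap. The per-model inference times below are made-up placeholders, and the swap back to the first model is assumed to overlap with waiting for the next frame; only the 30-50 ms swap range comes from the text above.

```python
def frame_latency_ms(infer_ms, swap_ms=0.0):
    """Per-frame latency when models run back to back on one chip.

    `infer_ms` lists per-model inference times (hypothetical numbers);
    one swap is charged for every model after the first.
    """
    swaps = max(0, len(infer_ms) - 1)
    return sum(infer_ms) + swaps * swap_ms


def fits_budget(infer_ms, budget_ms, swap_ms=50.0):
    """Check against the pessimistic end of the 30-50 ms swap range."""
    return frame_latency_ms(infer_ms, swap_ms) <= budget_ms


if __name__ == "__main__":
    pipeline = [8.0, 4.0, 6.0]  # detection, classification, OCR (placeholders)
    print("latency:", frame_latency_ms(pipeline, swap_ms=50.0), "ms")
    print("fits 100 ms cycle:", fits_budget(pipeline, budget_ms=100.0))
```

With numbers like these the swaps dominate the inference time, which is why the single-model versus multi-model call usually comes down to the station's cycle time rather than the models themselves.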

Hailo-15 vs Hailo-8 — when to upgrade

Hailo-15 is the newer SoC-style chip with built-in ISP, video codec, and more compute. We use it when:

The cell is space-constrained and we want camera + accelerator on a single board.
We need >1 stream at production resolution.
Multi-model deployments stop fitting on Hailo-8.

For a single-camera, single-model station, the Hailo-8 M.2 is still the cheapest path.

One thing we'd warn about

The Hailo ecosystem is excellent if you're building one camera-to-decision pipeline per chip. It is less ergonomic if you're building a heterogeneous data pipeline with 20 transforms and 3 conditional models — for that you want CPU + Hailo, not Hailo alone.

Anyone running Hailo-15 in real cells yet? Curious about the ISP integration story and whether it actually replaces a discrete camera ASIC.
