ONNX, INT8, QAT: what actually breaks when you quantize a model


Aior · Administrator, Staff member · Joined Apr 2, 2023


The conversion pipeline, in three steps that always go wrong

Going from a research-trained PyTorch model to an INT8 binary running on edge hardware looks simple in the docs. In practice, every step has its own failure modes. Here's the version with the warnings included.

Step 1 — PyTorch → ONNX

ONNX export is mostly a solved problem for vanilla CNNs. Where it goes wrong:
  • Dynamic shapes. If your model uses dynamic input dimensions, you have to export with explicit dynamic_axes. Skipping this and discovering it three weeks later in deployment is a rite of passage.
  • Custom ops. Anything from torchvision that hasn't been upstreamed (DeformableConv, certain NMS variants) needs a custom symbolic function or a workaround at the model level.
  • Control flow. If-statements that depend on tensor values become Loop / If nodes that not all backends support. The fix: refactor your forward pass to be statically traceable.
  • Opset version. Many target runtimes lag the latest opset by 1-2 versions. Export targeting the lowest opset your runtime supports, not the latest your PyTorch supports.

We always validate the ONNX export against the PyTorch model on 10-100 images before moving on. Outputs should match to within 1e-5. If they don't, the export is wrong — and quantization will magnify the error.
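The 1e-5 check is mechanical enough to script. A pure-Python sketch of the comparison (on flattened outputs; in practice you'd feed the same images through PyTorch and ONNX Runtime and compare the resulting arrays):

```python
def max_abs_diff(a, b):
    """Element-wise max |a - b| over two flat lists of floats."""
    assert len(a) == len(b), "output shapes must match"
    return max(abs(x - y) for x, y in zip(a, b))

def outputs_match(pytorch_out, onnx_out, tol=1e-5):
    """True if the exported model reproduces the PyTorch outputs."""
    return max_abs_diff(pytorch_out, onnx_out) <= tol

# Toy check: 1e-6-level differences pass, 1e-3-level differences fail.
print(outputs_match([0.5, 1.25], [0.5000009, 1.25]))  # True
print(outputs_match([0.5, 1.25], [0.501, 1.25]))      # False
```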

Step 2 — ONNX → optimized graph

ONNX Runtime, TensorRT, OpenVINO, and Hailo DFC all do graph-level optimization (operator fusion, constant folding, layout transforms). Mostly transparent. Where it bites:
  • Operator support gaps. TensorRT's coverage is excellent; Hailo's is more selective. Always run a "can this model compile" check before promising the customer a deployment date.
  • Layout transforms. NCHW vs NHWC matters. Some runtimes pick the wrong default and you eat a 30 % perf penalty.
  • Dynamic batch sizes. Most production deployments are batch=1; build the engine for batch=1 specifically. Generic engines are slower.
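The "can this model compile" check is essentially a set difference between the ops in the graph and the ops the target supports. A sketch with a hypothetical supported-op list (the real list comes from the vendor's docs; the op names would come from the graph, e.g. `[n.op_type for n in onnx.load(path).graph.node]`):

```python
# Hypothetical supported-op set for an edge target -- not any real
# vendor's actual coverage list.
TARGET_SUPPORTED = {
    "Conv", "Relu", "MaxPool", "GlobalAveragePool", "Gemm",
    "Add", "Concat", "Resize", "Sigmoid",
}

def unsupported_ops(model_op_types):
    """Ops present in the model that the target cannot compile."""
    return sorted(set(model_op_types) - TARGET_SUPPORTED)

ops = ["Conv", "Relu", "DeformConv", "Conv", "GridSample", "Gemm"]
print(unsupported_ops(ops))  # ['DeformConv', 'GridSample']
```

Run this before quoting a deployment date; an empty list is necessary but not sufficient, since per-op attribute restrictions can still reject the graph at compile time.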

Step 3 — quantization to INT8

This is where the accuracy regression lives. Two flavors:

Post-training quantization (PTQ). Calibrate the activation distributions on a representative dataset, choose scale/zero-point per tensor, done. Fast (minutes). Often loses 0.5-2 % accuracy on common architectures.

Quantization-aware training (QAT). Inject fake-quantization ops into the training graph, fine-tune for a few epochs. Slow (hours, sometimes days). Recovers most of the PTQ accuracy regression.
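The "choose scale/zero-point per tensor" step in PTQ is just min/max arithmetic. A minimal sketch of asymmetric uint8 quantization, with toy calibration values chosen so the numbers come out round:

```python
def calibrate(samples):
    """Per-tensor asymmetric uint8 params from calibration min/max."""
    lo, hi = min(samples), max(samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # range must include zero
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(x, scale, zp):
    return max(0, min(255, round(x / scale) + zp))

def dequantize(q, scale, zp):
    return (q - zp) * scale

acts = [-1.0, 0.0, 0.5, 4.1]        # toy calibration activations
scale, zp = calibrate(acts)          # scale = 5.1/255 = 0.02, zp = 50
q = quantize(0.5, scale, zp)         # round(0.5/0.02) + 50 = 75
print(round(dequantize(q, scale, zp), 3))  # 0.5
```

QAT uses the same quantize/dequantize round trip, but inserted into the forward pass during fine-tuning so the weights learn to live with the rounding.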

When PTQ is enough vs when QAT is needed:
  • ResNet-style backbones, large models, well-distributed activations → PTQ is fine
  • Mobile-style architectures (MobileNet, EfficientNet), already-compressed models → QAT often needed
  • Anything with attention, custom normalization, or cosine-similarity scoring → QAT, almost always
  • Production target with fixed accuracy SLA → run QAT regardless, accept the time cost

Calibration dataset selection — the underrated step

The calibration set determines the activation ranges the quantizer sees. Use:
  • At least 256 images, ideally 1024
  • Captured from production, not from the original training set
  • Covering the variation in lighting, products, and time-of-day you actually expect
  • No augmentation. Quantize for what the model will see, not what it was trained on.

A bad calibration set is the difference between INT8 within 1 % of FP32 and INT8 with 5 % regression. We've seen this go the wrong way enough times that we now treat calibration set construction as a separate engineering task with its own review.
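You can see the failure mode in a few lines of arithmetic. The numbers below are toy values: a production activation of 6.0 quantized under a calibration range that saw the distribution's tail versus one that missed it:

```python
def quant_error(x, lo, hi):
    """Round-trip error for x under an asymmetric uint8 range [lo, hi]."""
    scale = (hi - lo) / 255.0
    zp = round(-lo / scale)
    q = max(0, min(255, round(x / scale) + zp))
    return abs((q - zp) * scale - x)

good = quant_error(6.0, 0.0, 6.4)   # calibration set covered the tail
bad = quant_error(6.0, 0.0, 3.2)    # calibration set missed it
print(round(good, 4))  # 0.0016 -> well under one quantization step
print(round(bad, 4))   # 2.8    -> the activation is clipped to 3.2
```

The quantizer never sees an error signal for this: calibration only looks at the ranges you gave it, so the clipping shows up later as silent accuracy loss.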

Validation, end-to-end

The final binary should be validated on a held-out test set, not on the calibration set. Compare:
  • FP32 PyTorch accuracy (baseline)
  • ONNX FP32 accuracy (should match PyTorch)
  • ONNX INT8 accuracy (target: within 1 % of FP32)
  • On-device INT8 accuracy (should match ONNX INT8)

If any step diverges by more than the noise floor, you have a bug. Don't ship.
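The four-way comparison is worth automating as a release gate. A sketch with made-up accuracy numbers (substitute your own measurements; the 1e-3 "noise floor" and 1 % INT8 budget are the tolerances assumed here):

```python
def validate_ladder(acc, fp32_tol=1e-3, int8_tol=0.01):
    """Gate a release on the four-stage accuracy ladder.
    `acc` maps stage name -> accuracy on the held-out test set."""
    failures = []
    if abs(acc["onnx_fp32"] - acc["pytorch_fp32"]) > fp32_tol:
        failures.append("ONNX FP32 diverges from PyTorch: bad export")
    if acc["pytorch_fp32"] - acc["onnx_int8"] > int8_tol:
        failures.append("INT8 regression exceeds budget: bad quantization")
    if abs(acc["device_int8"] - acc["onnx_int8"]) > fp32_tol:
        failures.append("on-device INT8 diverges: bad compilation")
    return failures

# Hypothetical accuracies for illustration only.
acc = {"pytorch_fp32": 0.914, "onnx_fp32": 0.914,
       "onnx_int8": 0.907, "device_int8": 0.9068}
print(validate_ladder(acc) or "ship it")
```

Each failure message points at a different stage of the pipeline, which is the whole value of measuring all four rungs instead of just FP32-vs-final.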

What's your conversion stack? Anyone using ONNX-Runtime + DirectML or sticking strictly to vendor toolchains?
 
