The conversion pipeline, in three steps that always go wrong
Going from a research-trained PyTorch model to an INT8 binary running on edge hardware looks simple in the docs. In practice, every step has its own failure modes. Here's the version with the warnings included.

Step 1 — PyTorch → ONNX
ONNX export is mostly a solved problem for vanilla CNNs. Where it goes wrong:
- Dynamic shapes. If your model uses dynamic input dimensions, you have to export with explicit dynamic_axes (see the export sketch after this list). Skipping this and discovering it three weeks later in deployment is a rite of passage.
- Custom ops. Anything from torchvision that hasn't been upstreamed (DeformableConv, certain NMS variants) needs a custom symbolic function or a workaround at the model level.
- Control flow. If-statements that depend on tensor values become Loop / If nodes that not all backends support. The fix: refactor your forward pass to be statically traceable.
- Opset version. Many target runtimes lag the latest opset by 1-2 versions. Export targeting the lowest opset your runtime supports, not the latest your PyTorch supports.
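A minimal export sketch covering the dynamic-axes and opset points above. The model, input shape, and the names images/logits are placeholders for whatever your network actually uses:

```python
import torch

# Assumes `model` is your trained nn.Module taking a 3x640x640 input;
# adjust names and shapes to your network.
model.eval()
dummy = torch.randn(1, 3, 640, 640)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["images"],
    output_names=["logits"],
    # Mark only the axes that genuinely vary; everything else stays static.
    dynamic_axes={"images": {0: "batch"}, "logits": {0: "batch"}},
    # Target the lowest opset your runtime supports, not the newest available.
    opset_version=13,
)
```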
We always validate the ONNX export against the PyTorch model on 10-100 images before moving on. Outputs should match to within 1e-5. If they don't, the export is wrong — and quantization will magnify the error.
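A quick parity check along those lines with onnxruntime, assuming a single-tensor output and a list of preprocessed validation tensors (names and counts are illustrative):

```python
import numpy as np
import onnxruntime as ort
import torch

# `model` and `val_images` (a list of preprocessed CHW tensors) are assumed to exist.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

model.eval()
worst = 0.0
with torch.no_grad():
    for x in val_images[:100]:
        x = x.unsqueeze(0)                            # add batch dimension
        ref = model(x).numpy()                        # PyTorch reference output
        out = sess.run(None, {input_name: x.numpy()})[0]
        worst = max(worst, float(np.abs(ref - out).max()))

print(f"max |pytorch - onnx| = {worst:.2e}")
assert worst < 1e-5, "export mismatch - fix before quantizing"
```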
Step 2 — ONNX → optimized graph
ONNX Runtime, TensorRT, OpenVINO, Hailo DFC all do graph-level optimization (operator fusion, constant folding, layout transforms). Mostly transparent. Where it bites:
- Operator support gaps. TensorRT's coverage is excellent; Hailo's is more selective. Always run a "can this model compile" check (sketched after this list) before promising the customer a deployment date.
- Layout transforms. NCHW vs NHWC matters. Some runtimes pick the wrong default and you eat a 30 % perf penalty.
- Dynamic batch sizes. Most production deployments are batch=1; build the engine for batch=1 specifically. Generic engines are slower.
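One cheap way to run that "can this model compile" check before touching any vendor toolchain: walk the ONNX graph and diff its op types against the op list your target runtime documents. The SUPPORTED_OPS set below is a made-up placeholder, not any vendor's real list:

```python
import onnx

# Placeholder: fill this in from your target runtime's documented op support.
SUPPORTED_OPS = {
    "Conv", "Relu", "MaxPool", "Add", "GlobalAveragePool",
    "Gemm", "Flatten", "Concat", "Resize", "Sigmoid",
}

model = onnx.load("model.onnx")
used = {node.op_type for node in model.graph.node}
unsupported = sorted(used - SUPPORTED_OPS)

if unsupported:
    print("Ops the target runtime may reject:", unsupported)
else:
    print("Every op in the graph is on the supported list.")
```

It's not a substitute for actually running the vendor compiler, but it catches the obvious DeformableConv-style surprises in seconds.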
Step 3 — quantization to INT8
This is where the accuracy regression lives. Two flavors:

Post-training quantization (PTQ). Calibrate the activation distributions on a representative dataset, choose scale/zero-point per tensor, done. Fast (minutes). Often loses 0.5-2 % accuracy on common architectures.
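With ONNX Runtime's quantization tooling, PTQ is roughly this (file names are placeholders, and calib_reader is a CalibrationDataReader over representative images; a sketch of one appears in the calibration-set section below):

```python
from onnxruntime.quantization import QuantType, quantize_static

quantize_static(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=calib_reader,   # sketched in the calibration section below
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,                       # per-channel weight scales usually help conv nets
)
```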
Quantization-aware training (QAT). Inject fake-quantization ops into the training graph, fine-tune for a few epochs. Slow (hours, sometimes days). Recovers most of the PTQ accuracy regression.
When PTQ is enough vs when QAT is needed:
- ResNet-style backbones, large models, well-distributed activations → PTQ is fine
- Mobile-style architectures (MobileNet, EfficientNet), already-compressed models → QAT often needed
- Anything with attention, custom normalization, or cosine-similarity scoring → QAT, almost always
- Production target with fixed accuracy SLA → run QAT regardless, accept the time cost
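When QAT is the call, a minimal eager-mode sketch with torch.ao.quantization looks something like this. The training loop, dataloader, and epoch count are stand-ins, and a real eager-mode setup also needs QuantStub/DeQuantStub wrappers and fused Conv-BN-ReLU modules, omitted here:

```python
import torch
import torch.ao.quantization as tq

# `model`, `train_loader`, and `train_one_epoch` are assumed to exist.
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")   # x86 backend; "qnnpack" for ARM
tq.prepare_qat(model, inplace=True)                     # inserts fake-quantization observers

for epoch in range(3):                                  # a few fine-tuning epochs is usually enough
    train_one_epoch(model, train_loader)

model.eval()
int8_model = tq.convert(model)                          # folds observers into real INT8 ops
```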
Calibration dataset selection — the underrated step
The calibration set determines the activation ranges the quantizer sees. Use:
- At least 256 images, ideally 1024
- Captured from production, not from the original training set
- Covering the variation in lighting, products, and time-of-day you actually expect
- No augmentation. Quantize for what the model will see, not what it was trained on.
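If you quantize through ONNX Runtime, the calibration set plugs in as a CalibrationDataReader. A sketch assuming production frames sit in a directory of JPEGs and preprocess is your exact deployment-time preprocessing (augmentation off):

```python
from pathlib import Path

from onnxruntime.quantization import CalibrationDataReader


class ProductionCalibReader(CalibrationDataReader):
    """Feeds un-augmented production captures to the quantizer, one at a time."""

    def __init__(self, image_dir: str, input_name: str, preprocess):
        # preprocess: path -> np.float32 array of shape (1, 3, H, W)
        paths = sorted(Path(image_dir).glob("*.jpg"))[:1024]   # 256 minimum, 1024 preferred
        self._batches = iter({input_name: preprocess(p)} for p in paths)

    def get_next(self):
        return next(self._batches, None)   # None tells the quantizer we're done
```

Whatever the runtime, the principle is the same: the reader should see exactly the distribution the deployed model will.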
A bad calibration set is the difference between INT8 within 1 % of FP32 and INT8 with 5 % regression. We've seen this go the wrong way enough times that we now treat calibration set construction as a separate engineering task with its own review.
Validation, end-to-end
The final binary should be validated on a held-out test set, not on the calibration set. Compare:
- FP32 PyTorch accuracy (baseline)
- ONNX FP32 accuracy (should match PyTorch)
- ONNX INT8 accuracy (target: within 1 % of FP32)
- On-device INT8 accuracy (should match ONNX INT8)
If any step diverges by more than the noise floor, you have a bug. Don't ship.
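One way to keep that comparison honest is to push all four measurements through a single harness and print the deltas. The evaluate_* helpers below are placeholders for your own accuracy evaluation on the held-out set:

```python
# Each placeholder helper returns top-line accuracy on the same held-out test set.
results = {
    "fp32_pytorch":   evaluate_pytorch(model, test_loader),
    "fp32_onnx":      evaluate_onnx("model.onnx", test_loader),
    "int8_onnx":      evaluate_onnx("model_int8.onnx", test_loader),
    "int8_on_device": evaluate_on_device("model_int8.bin", test_set_dir),
}

baseline = results["fp32_pytorch"]
for stage, acc in results.items():
    print(f"{stage:15s} {acc:.4f}  (delta vs FP32: {acc - baseline:+.4f})")

# Rules of thumb from the list above.
assert results["int8_onnx"] >= baseline - 0.01                          # within 1 % of FP32
assert abs(results["int8_on_device"] - results["int8_onnx"]) < 0.005    # device matches ONNX INT8
```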
What's your conversion stack? Anyone using ONNX-Runtime + DirectML or sticking strictly to vendor toolchains?