The conversion pipeline, in three steps that always go wrong
Going from a research-trained PyTorch model to an INT8 binary running on edge hardware looks simple in the docs. In practice, every step has its own failure modes. Here's the version with the warnings included.

Step 1 — PyTorch → ONNX
ONNX export is mostly a solved problem for vanilla CNNs. Where it goes wrong:
- Dynamic shapes. If your model uses dynamic input dimensions, you have to export with explicit dynamic_axes (see the export sketch after this list). Skipping this and discovering it three weeks later in deployment is a rite of passage.
- Custom ops. Anything from torchvision that hasn't been upstreamed (DeformableConv, certain NMS variants) needs a custom symbolic function or a workaround at the model level.
- Control flow. If-statements that depend on tensor values become Loop / If nodes that not all backends support. The fix: refactor your forward pass to be statically traceable.
- Opset version. Many target runtimes lag the latest opset by 1-2 versions. Export targeting the lowest opset your runtime supports, not the latest your PyTorch supports.
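A minimal export sketch covering the dynamic-axes and opset points above. The model, input shape, and the names images/logits are placeholders for whatever your network actually uses:

```python
import torch

# Assumes `model` is your trained nn.Module taking a 3x640x640 input;
# adjust names and shapes to your network.
model.eval()
dummy = torch.randn(1, 3, 640, 640)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["images"],
    output_names=["logits"],
    # Mark only the axes that genuinely vary; everything else stays static.
    dynamic_axes={"images": {0: "batch"}, "logits": {0: "batch"}},
    # Target the lowest opset your runtime supports, not the newest available.
    opset_version=13,
)
```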
We always validate the ONNX export against the PyTorch model on 10-100 images before moving on. Outputs should match to within 1e-5. If they don't, the export is wrong — and quantization will magnify the error.
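A quick parity check along those lines with onnxruntime, assuming a single-tensor output and a list of preprocessed validation tensors (names and counts are illustrative):

```python
import numpy as np
import onnxruntime as ort
import torch

# `model` and `val_images` (a list of preprocessed CHW tensors) are assumed to exist.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

model.eval()
worst = 0.0
with torch.no_grad():
    for x in val_images[:100]:
        x = x.unsqueeze(0)                            # add batch dimension
        ref = model(x).numpy()                        # PyTorch reference output
        out = sess.run(None, {input_name: x.numpy()})[0]
        worst = max(worst, float(np.abs(ref - out).max()))

print(f"max |pytorch - onnx| = {worst:.2e}")
assert worst < 1e-5, "export mismatch - fix before quantizing"
```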
Step 2 — ONNX → optimized graph
ONNX Runtime, TensorRT, OpenVINO, Hailo DFC all do graph-level optimization (operator fusion, constant folding, layout transforms). Mostly transparent. Where it bites:
- Operator support gaps. TensorRT's coverage is excellent; Hailo's is more selective. Always run a "can this model compile" check (sketched after this list) before promising the customer a deployment date.
- Layout transforms. NCHW vs NHWC matters. Some runtimes pick the wrong default and you eat a 30 % perf penalty.
- Dynamic batch sizes. Most production deployments are batch=1; build the engine for batch=1 specifically. Generic engines are slower.
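One cheap way to run that "can this model compile" check before touching any vendor toolchain: walk the ONNX graph and diff its op types against the op list your target runtime documents. The SUPPORTED_OPS set below is a made-up placeholder, not any vendor's real list:

```python
import onnx

# Placeholder: fill this in from your target runtime's documented op support.
SUPPORTED_OPS = {
    "Conv", "Relu", "MaxPool", "Add", "GlobalAveragePool",
    "Gemm", "Flatten", "Concat", "Resize", "Sigmoid",
}

model = onnx.load("model.onnx")
used = {node.op_type for node in model.graph.node}
unsupported = sorted(used - SUPPORTED_OPS)

if unsupported:
    print("Ops the target runtime may reject:", unsupported)
else:
    print("Every op in the graph is on the supported list.")
```

It's not a substitute for actually running the vendor compiler, but it catches the obvious DeformableConv-style surprises in seconds.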
Step 3 — quantization to INT8
This is where the accuracy regression lives. Two flavors:

Post-training quantization (PTQ). Calibrate the activation distributions on a representative dataset, choose scale/zero-point per tensor, done. Fast (minutes). Often loses 0.5-2 % accuracy on common architectures.
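With ONNX Runtime's quantization tooling, PTQ is roughly this (file names are placeholders, and calib_reader is a CalibrationDataReader over representative images; a sketch of one appears in the calibration-set section below):

```python
from onnxruntime.quantization import QuantType, quantize_static

quantize_static(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=calib_reader,   # sketched in the calibration section below
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,                       # per-channel weight scales usually help conv nets
)
```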
Quantization-aware training (QAT). Inject fake-quantization ops into the training graph, fine-tune for a few epochs. Slow (hours, sometimes days). Recovers most of the PTQ accuracy regression.
When PTQ is enough vs when QAT is needed:
- ResNet-style backbones, large models, well-distributed activations → PTQ is fine
- Mobile-style architectures (MobileNet, EfficientNet), already-compressed models → QAT often needed
- Anything with attention, custom normalization, or cosine-similarity scoring → QAT, almost always
- Production target with fixed accuracy SLA → run QAT regardless, accept the time cost
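When QAT is the call, a minimal eager-mode sketch with torch.ao.quantization looks something like this. The training loop, dataloader, and epoch count are stand-ins, and a real eager-mode setup also needs QuantStub/DeQuantStub wrappers and fused Conv-BN-ReLU modules, omitted here:

```python
import torch
import torch.ao.quantization as tq

# `model`, `train_loader`, and `train_one_epoch` are assumed to exist.
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")   # x86 backend; "qnnpack" for ARM
tq.prepare_qat(model, inplace=True)                     # inserts fake-quantization observers

for epoch in range(3):                                  # a few fine-tuning epochs is usually enough
    train_one_epoch(model, train_loader)

model.eval()
int8_model = tq.convert(model)                          # folds observers into real INT8 ops
```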
Calibration dataset selection — the underrated step
The calibration set determines the activation ranges the quantizer sees. Use:
- At least 256 images, ideally 1024
- Captured from production, not from the original training set
- Covering the variation in lighting, products, and time-of-day you actually expect
- No augmentation. Quantize for what the model will see, not what it was trained on.
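If you quantize through ONNX Runtime, the calibration set plugs in as a CalibrationDataReader. A sketch assuming production frames sit in a directory of JPEGs and preprocess is your exact deployment-time preprocessing (augmentation off):

```python
from pathlib import Path

from onnxruntime.quantization import CalibrationDataReader


class ProductionCalibReader(CalibrationDataReader):
    """Feeds un-augmented production captures to the quantizer, one at a time."""

    def __init__(self, image_dir: str, input_name: str, preprocess):
        # preprocess: path -> np.float32 array of shape (1, 3, H, W)
        paths = sorted(Path(image_dir).glob("*.jpg"))[:1024]   # 256 minimum, 1024 preferred
        self._batches = iter({input_name: preprocess(p)} for p in paths)

    def get_next(self):
        return next(self._batches, None)   # None tells the quantizer we're done
```

Whatever the runtime, the principle is the same: the reader should see exactly the distribution the deployed model will.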
A bad calibration set is the difference between INT8 within 1 % of FP32 and INT8 with 5 % regression. We've seen this go the wrong way enough times that we now treat calibration set construction as a separate engineering task with its own review.
Validation, end-to-end
The final binary should be validated on a held-out test set, not on the calibration set. Compare:
- FP32 PyTorch accuracy (baseline)
- ONNX FP32 accuracy (should match PyTorch)
- ONNX INT8 accuracy (target: within 1 % of FP32)
- On-device INT8 accuracy (should match ONNX INT8)
If any step diverges by more than the noise floor, you have a bug. Don't ship.
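One way to keep that comparison honest is to push all four measurements through a single harness and print the deltas. The evaluate_* helpers below are placeholders for your own accuracy evaluation on the held-out set:

```python
# Each placeholder helper returns top-line accuracy on the same held-out test set.
results = {
    "fp32_pytorch":   evaluate_pytorch(model, test_loader),
    "fp32_onnx":      evaluate_onnx("model.onnx", test_loader),
    "int8_onnx":      evaluate_onnx("model_int8.onnx", test_loader),
    "int8_on_device": evaluate_on_device("model_int8.bin", test_set_dir),
}

baseline = results["fp32_pytorch"]
for stage, acc in results.items():
    print(f"{stage:15s} {acc:.4f}  (delta vs FP32: {acc - baseline:+.4f})")

# Rules of thumb from the list above.
assert results["int8_onnx"] >= baseline - 0.01                          # within 1 % of FP32
assert abs(results["int8_on_device"] - results["int8_onnx"]) < 0.005    # device matches ONNX INT8
```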
What's your conversion stack? Anyone using ONNX-Runtime + DirectML or sticking strictly to vendor toolchains?