The gap between a demo and a deployment
Every edge AI project we've inherited had the same problem: the demo on the engineer's bench worked beautifully. The deployment on the factory floor failed in interesting and expensive ways. The patterns below are what we apply to close that gap consistently.

Provisioning: don't build twelve devices by hand
At one or two units, manual provisioning is fine. At ten, it's a problem. At thirty, it's a project on its own.

What works:
- A golden image — fully configured OS + binaries + dependencies, built in CI, signed.
- Per-device bootstrap config — device ID, network config, certificate, encryption keys — written to a small per-device partition or pulled from a provisioning service on first boot (a sketch follows this list).
- No manual SSH — if you're SSHing into the device after deployment, your provisioning is incomplete.
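To make the bootstrap half concrete, here is a minimal Go sketch of a first-boot identity loader. The mount point, file name, and field names are all illustrative, not a real schema; the point is that only this small payload varies per device, while everything else comes from the signed golden image and is identical fleet-wide.

```go
// bootstrap.go: load the per-device identity written at provisioning time.
// Illustrative sketch; paths and field names are hypothetical.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// DeviceConfig is the small per-device payload. Everything not in here
// ships in the golden image.
type DeviceConfig struct {
	DeviceID   string `json:"device_id"`
	NetworkCfg string `json:"network_cfg"`
	CertPEM    string `json:"cert_pem"`
}

func loadBootstrap(path string) (*DeviceConfig, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("bootstrap partition unreadable: %w", err)
	}
	var cfg DeviceConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("bootstrap config malformed: %w", err)
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadBootstrap("/boot/device/config.json") // hypothetical mount point
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a failed bootstrap should halt loudly, not guess
	}
	fmt.Println("provisioned as", cfg.DeviceID)
}
```

If the file is absent, a first-boot hook can fetch the same payload from a provisioning service instead; either way the device never needs a human on a keyboard.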
Updates: OTA or you are not in control
Models drift, code has bugs, dependencies have CVEs. If you can't push an update from a central place to a device in the field without driving to the factory, you're not running the system — you're running each individual device.

The minimum viable update story:
- Versioned releases (model + code + config bundled)
- A device-side updater that pulls from a central repo, validates a signature, swaps atomically
- Rollback path on failure
- Staged rollout — push to one device, then 10% of the fleet, then everyone — not all at once
We ship this as a small Go binary on every device. It's 800 lines of code and it's the highest-leverage piece of infrastructure we own.
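For a sense of what those 800 lines do, the core fits in one function. A simplified sketch, not the real binary: it assumes an ed25519-signed release bundle that has already been downloaded and unpacked, and the paths and names are illustrative.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"os"
)

// applyRelease verifies a downloaded release and swaps it in atomically.
func applyRelease(bundle, sig []byte, pub ed25519.PublicKey, releaseDir, currentLink string) error {
	// Refuse anything unsigned or tampered with.
	if !ed25519.Verify(pub, bundle, sig) {
		return fmt.Errorf("release signature invalid, refusing to install")
	}
	// Point a scratch symlink at the new release directory...
	next := currentLink + ".next"
	_ = os.Remove(next) // clear any aborted previous attempt
	if err := os.Symlink(releaseDir, next); err != nil {
		return err
	}
	// ...then rename it over the "current" link. rename(2) is atomic, so the
	// device is never half-updated, and the previous release directory stays
	// on disk as the rollback target.
	return os.Rename(next, currentLink)
}

func main() {
	// Self-contained demo with a throwaway key pair.
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	bundle := []byte("release bytes")
	sig := ed25519.Sign(priv, bundle)
	fmt.Println(applyRelease(bundle, sig, pub, "/opt/app/releases/v2", "/opt/app/current"))
}
```

Rollback is the same flip in reverse: repoint the "current" link at the previous release directory and restart the service.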
Telemetry: what to actually log
Per-device:
- Inference latency (per-frame, p50 / p95 / p99 over a rolling 5-minute window)
- Inference throughput (frames / sec actually processed)
- Camera health (frames dropped, reconnects, exposure stability)
- Anomaly score distribution (mean, p95)
- CPU / GPU / accelerator utilization
- Disk usage, memory usage, thermal state
- Application uptime, last successful inference timestamp
Per-cell:
- Reject rate, override rate, throughput
- Operator interactions per shift
We push everything to a central Prometheus + Grafana stack. Alerting on the per-device telemetry catches problems hours to days before the cell-level metrics notice.
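The device side of that is not much code. Below is a sketch using the official Go client (github.com/prometheus/client_golang); whether the central stack scrapes the endpoint or a gateway relays it, the instrumentation looks the same. Metric names and the version label are illustrative, not our exact schema.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Histogram buckets let the server derive p50/p95/p99; the
	// model_version label is what splits A/B cohorts on dashboards.
	inferLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "inference_latency_seconds",
		Help:    "Per-frame inference latency.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1 ms to ~4 s
	}, []string{"model_version"})

	framesDropped = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "camera_frames_dropped_total",
		Help: "Frames lost before inference.",
	})

	lastInference = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "last_successful_inference_timestamp_seconds",
		Help: "Unix time of the most recent successful inference.",
	})
)

func main() {
	prometheus.MustRegister(inferLatency, framesDropped, lastInference)

	// In the real inference loop, each frame is wrapped like this:
	start := time.Now()
	// ... run the model on one frame ...
	inferLatency.WithLabelValues("v1.4.2").Observe(time.Since(start).Seconds())
	lastInference.SetToCurrentTime()

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9100", nil) // collected by the central stack
}
```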
A/B model deployment
Every model change should ship to one device first, then a fraction of the fleet, then the rest. The infrastructure:
- The deployment manifest specifies the model version per device (a sketch follows this list)
- The runtime can hot-swap models (see the deployment article in the Anomaly Detection forum)
- Telemetry includes the model version, so dashboards split metrics by version
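The manifest itself doesn't need to be clever. Below is a hypothetical shape and rollout rule in Go, not our exact schema: explicit pins cover the "one device first" stage, and a deterministic hash of the device ID covers the percentage stage.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Manifest maps the fleet to model versions. Explicit pins win; every
// other device is split between canary and stable by percentage.
type Manifest struct {
	Stable        string            // e.g. "model-v1.4.1"
	Canary        string            // e.g. "model-v1.4.2"
	CanaryPercent uint32            // 0..100
	Pins          map[string]string // deviceID -> version, the "one device first" stage
}

// VersionFor is deterministic in deviceID, so a device keeps its cohort
// across restarts and the rollout only moves when the manifest changes.
func (m Manifest) VersionFor(deviceID string) string {
	if v, ok := m.Pins[deviceID]; ok {
		return v
	}
	h := fnv.New32a()
	h.Write([]byte(deviceID))
	if h.Sum32()%100 < m.CanaryPercent {
		return m.Canary
	}
	return m.Stable
}

func main() {
	m := Manifest{
		Stable:        "model-v1.4.1",
		Canary:        "model-v1.4.2",
		CanaryPercent: 10,
		Pins:          map[string]string{"cell-07-cam-2": "model-v1.4.2"},
	}
	fmt.Println(m.VersionFor("cell-07-cam-2")) // pinned canary
	fmt.Println(m.VersionFor("cell-03-cam-1")) // stable or canary by hash
}
```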
This sounds heavy for an edge AI project. It is the single capability that has saved the most production hours in the last two years.
Security minimums
- No default passwords on any device. Per-device generated credentials.
- TLS for any control-plane traffic (a client sketch follows this list).
- Signed binaries and signed model artifacts. The runtime refuses unsigned.
- Firewall: outbound-only to known endpoints. No inbound from the factory network.
- Disk encryption if the device might walk away. (It does.)
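None of this requires exotic tooling. For the TLS point, here is a minimal Go sketch of an outbound control-plane client pinned to a private CA; the CA path and endpoint are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newControlPlaneClient trusts only our own CA, so a device refuses to
// talk to anything that merely holds a publicly valid certificate.
func newControlPlaneClient(caPath string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no usable CA certificate in %s", caPath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:    pool,
				MinVersion: tls.VersionTLS12,
			},
		},
	}, nil
}

func main() {
	client, err := newControlPlaneClient("/etc/fleet/ca.pem") // hypothetical path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	_, _ = client.Get("https://control.example.internal/v1/manifest") // hypothetical endpoint
}
```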
The handover everyone forgets
Six months in, the customer's IT team takes over operational responsibility. What they need to inherit:
- Device inventory — make/model/serial/location/version
- Update workflow — how to push a new release, how to roll back
- Telemetry dashboard — what's normal, what's an alert
- Runbook for the top 5 failure modes you've actually seen
- Escalation path for model-level changes (retrain, threshold change)
A handover that includes all of this is also the moment the project becomes maintainable. Without it, every device is a small project of its own forever.
One pattern we'd never repeat
Running the deployment from the engineer's laptop. The first time the engineer leaves the company, the customer calls. There's no path forward that doesn't involve a long re-platforming. Build it on infrastructure the customer can own from day one.

What does your deployment story look like? Curious to hear from anyone running 50+ edge nodes from a single repo.