Why the gateway is the critical layer
Sensors generate data. The cloud receives data. Between them sits the gateway, and the gateway is where 90% of the operational and security work happens. Get the gateway right and the rest of the IoT system is mostly tractable; get it wrong and you're firefighting per-device problems forever. The pattern below is what we've converged on.
What a gateway actually does
- Protocol bridging — Modbus RTU on the OT side, MQTT on the IT side. Or LoRa on the field side, HTTPS on the cloud side.
- Edge filtering & buffering — drop noisy samples, hold data when uplink is down, push when restored.
- Local logic — alarms that need to fire faster than the cloud round-trip allows.
- Security boundary — the place where OT-side trust ends and IT-side authentication begins.
- Firmware / model distribution — pulling updates from the cloud and pushing to local devices.
- Diagnostics — the place that has a complete view of "is the system actually working".
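Edge filtering and buffering are the two jobs that most often get hand-waved, so here is a minimal sketch of both together: a deadband filter that drops noisy samples, backed by a bounded drop-oldest buffer that holds data while the uplink is down and flushes it oldest-first when it returns. All names (`DEADBAND`, `filter_and_buffer`, the `publish` callback) are hypothetical, not from any particular gateway stack.

```python
import collections
import time

DEADBAND = 0.5        # hypothetical: ignore changes smaller than this
BUFFER_MAX = 10_000   # drop-oldest cap so a long outage can't exhaust RAM

buffer = collections.deque(maxlen=BUFFER_MAX)
last_sent = None

def filter_and_buffer(value, uplink_ok, publish):
    """Apply a deadband filter, then publish or buffer depending on uplink state."""
    global last_sent
    if last_sent is not None and abs(value - last_sent) < DEADBAND:
        return  # noisy sample: dropped at the edge, never hits the uplink
    last_sent = value
    sample = (time.time(), value)
    if uplink_ok:
        # Flush any backlog first, oldest-first, so cloud-side ordering holds
        while buffer:
            publish(buffer.popleft())
        publish(sample)
    else:
        buffer.append(sample)
```

The deadband threshold is per-signal in practice; the point is that filtering happens before buffering, so an outage fills the buffer with samples worth keeping.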
That's a lot of jobs for one device. Even so, don't split them across five different gateways at one site: consolidation is what keeps the system debuggable and updatable.
Hardware
We default to:
- CM4-based industrial carrier (Revolution Pi, Compulab, or an in-house carrier) — when the gateway is doing real work, including edge inference or rich analytics.
- Industrial x86 mini-PC (Logic Supply, AAEON) — when the gateway needs to run Windows-only software (rare) or when the customer's IT team specifies x86.
- ESP32 / STM32-based gateway — for the simple "convert RS-485 to MQTT" use case. Beware: simple turns into complex within a year.
The temptation to use a Raspberry Pi consumer board is real. Don't. Industrial enclosures, eMMC storage, proper power input, and a real RTC are the difference between a deployment and a maintenance burden.
Software architecture
The pattern we ship:
- Linux (Debian or Yocto-based) base
- Containerised application stack (Docker or Podman)
- A protocol-bridge service (e.g. Telegraf, in-house Go binary, Node-RED for prototypes that should not be left in prod)
- Local time-series buffer (TimescaleDB-on-edge or DuckDB) for offline operation
- An MQTT client (or HTTPS) for cloud uplink
- A local management agent for OTA updates and diagnostics
Service supervision via systemd. Logs forwarded via journald + Promtail to central Loki. Metrics via Prometheus node-exporter + Telegraf to central Prometheus. The boring observability stack pays off dramatically.
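As an illustration of the supervision piece, a unit file for the protocol-bridge service might look like this. Paths, the binary name, and the config location are placeholders, not a shipped artifact:

```ini
# /etc/systemd/system/protocol-bridge.service — illustrative only
[Unit]
Description=Protocol bridge (Modbus RTU -> MQTT)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/bridge --config /etc/bridge/config.toml
Restart=always
RestartSec=5
# stdout/stderr go to journald; Promtail ships the journal to Loki
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

`Restart=always` plus journald forwarding is most of what "supervision" means here: the service comes back on its own, and the crash is visible centrally.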
Security checklist (non-negotiable)
- No default passwords. Per-device generated.
- TLS for any uplink. mTLS where the cloud platform supports it.
- Disk encryption (LUKS) — the gateway might walk away.
- Network segmentation — gateway has interfaces on both OT and IT VLANs, but no IP forwarding between them. The gateway is a proxy, not a router.
- No SSH from anywhere except an explicit jumphost. Public-key only.
- Signed firmware updates with rollback capability.
- Vendor-supplied default services (FTP, telnet, embedded web servers) all disabled.
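The first item on that checklist, per-device generated credentials, is a one-liner to get right at provisioning time. A minimal sketch (function name and length are arbitrary choices, not a standard):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def per_device_password(length: int = 24) -> str:
    """Generate a unique credential at provisioning time.

    Uses the CSPRNG behind `secrets`, never `random`, and never a
    fleet-wide default baked into an image.
    """
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Generate it when the device is flashed, record it in the provisioning system, and never reuse it across units.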
Offline operation — the test you must do
Disconnect the gateway's uplink and observe behaviour for 24 hours, then 7 days. The questions:
- Does it keep collecting data? (It must.)
- Does the local buffer fill correctly?
- When uplink returns, does it backfill correctly without overwhelming the cloud?
- Does it surface "I am offline" status visibly?
- Does the local logic that needs to keep running, keep running?
A gateway that works on the bench with a permanent uplink and falls over when the customer's site has a 3-hour outage is not done.
One thing we'd warn about
Node-RED in production. It's a wonderful prototyping tool. It is not a production application platform. Every gateway we've inherited that ran a customer's "MVP" Node-RED flow grew into a tangled, undebuggable system. Use it to prove the architecture, then rebuild the production gateway in a real language.
The handover
At handover, the customer should be able to:
- See gateway health on a dashboard
- Push a firmware update to one or all gateways without engineering involvement
- Read the last 30 days of buffered data from any single gateway
- Trigger a remote restart from the dashboard
If those four things aren't possible, the gateway isn't deployable.
What does your gateway stack look like? Curious whether anyone is replacing Linux gateways with embedded RTOS for stricter determinism.