As the number of IoT devices has exploded, so has the compute power of each individual device. This post looks at a simple way to use that power to improve the availability of edge/IoT devices.
The Problem
IoT devices tend to have unreliable network connections, simply because they live at the “edge”. Whether it’s a flaky wired connection or WiFi that drops every two hours, these devices need a robust way to recover. It’s far too common for an IoT device to get “wedged” and never come back online.
Cloud servers have management APIs that allow an administrator to reboot a hung machine. Physical servers typically have serial consoles and remotely-managed BIOS and power supplies — or at least someone who can go kick the server.
Edge devices often have no remote management options. When their network connection fails, they need to handle the issue autonomously. Adding to the challenge, the exact steps these IoT devices must follow are unique to their environment.
Edge Network Watchdog
I created edge-netdog as a last-resort solution for IoT devices that experience network problems. It is simple, flexible, and intended to run on small Linux devices like the Raspberry Pi, Jetson Nano, or Wyze Cam.
Rather than testing individual network services like WiFi, DNS, and routing, edge-netdog checks for access to a website. This “end-to-end” testing ensures that all networking components are working in harmony. If edge-netdog detects a network outage, it will perform a set of mitigation actions, one at a time.
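To make the end-to-end idea concrete, here is a minimal Python sketch of such a check. It is not edge-netdog's actual code: the names `fetch_body` and `network_is_up` are hypothetical, and it assumes that "access to a website" means fetching a page and looking for an expected string in the body.

```python
import http.client
import urllib.request


def fetch_body(url, timeout=10.0):
    """Fetch url and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def network_is_up(url, match, fetch=fetch_body):
    """Return True only if the page loads AND contains the expected text.

    Hypothetical helper for illustration; `fetch` is injectable so the
    logic can be exercised without a real network.
    """
    try:
        return match in fetch(url)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers a malformed URL.
        return False
```

Matching on page content, rather than just a successful response, should also catch cases where something answers the request but it isn't the real site, such as a captive portal page.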
Here’s a simple edge-netdog configuration. After two minutes of network downtime (4 checks x 30 seconds), it runs the first mitigation action. If reconfiguring the WiFi hasn’t fixed networking, edge-netdog will try refreshing DHCP and restarting the networking service. Finally, if all else fails, it will reboot the device.
---
global:
  debug: true
  monitor_interval: 30s
  target_attempts: 4
  action_delay: 30s
  target_url: https://example.com
  target_match: Example Domain
actions:
  - sudo wpa_cli -i wlan0 reconfigure
  - sudo dhclient -
  - sudo service networking restart
  - sudo reboot
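The escalation behavior this configuration describes can be sketched in Python roughly as follows. This is an illustration under assumptions, not edge-netdog's implementation: the function names are hypothetical, and it assumes that `action_delay` is the pause after each action before the network is re-checked, and that the failure counter resets on any successful check.

```python
import shlex
import subprocess
import time


def run_shell(command):
    """Run one mitigation command, e.g. 'sudo wpa_cli -i wlan0 reconfigure'."""
    subprocess.run(shlex.split(command), check=False)


def escalate(actions, check, run_action, action_delay=30.0, sleep=time.sleep):
    """Run actions one at a time, stopping as soon as check() passes."""
    for action in actions:
        run_action(action)
        sleep(action_delay)  # give the action time to take effect
        if check():
            break


def monitor(check, actions, run_action=run_shell, target_attempts=4,
            monitor_interval=30.0, action_delay=30.0,
            sleep=time.sleep, max_checks=None):
    """Poll check(); after target_attempts consecutive failures, escalate.

    max_checks exists only so the loop can be bounded for testing;
    a real daemon would leave it as None and run forever.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        failures = 0 if check() else failures + 1
        if failures >= target_attempts:
            escalate(actions, check, run_action, action_delay, sleep)
            failures = 0
        sleep(monitor_interval)
```

Ordering the actions from least to most disruptive, as the configuration above does, means the reboot only happens when the gentler fixes have already been tried.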
This approach is effective for IoT devices, but also heavy-handed. It’s definitely inappropriate for any device that offers remote out-of-band management. Don’t use edge-netdog on your servers!
Using the Processor’s Watchdog
The edge-netdog tool works well for detecting network issues, but it’s worthless if the device’s operating system fails. We need a second watchdog!
Modern processors offer an internal watchdog for just this scenario. Once enabled, the processor will force a reboot if it isn’t “patted” by the OS every so often.
For example, the Raspberry Pi’s processor watchdog interacts with the Linux watchdogd service. The Linux service performs some health checks and “pats” the processor’s watchdog. If a health check fails, or if watchdogd itself stops working, the processor watchdog forces a reboot.
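The “patting” mechanism can be illustrated with a small Python sketch. This is not how watchdogd is implemented; it only assumes the standard Linux watchdog character device (usually `/dev/watchdog`), where opening the device arms the timer and each write resets it. The function name `pat_loop` and its parameters are hypothetical.

```python
import time


def pat_loop(health_check, device="/dev/watchdog", interval=5.0,
             sleep=time.sleep, max_pats=None):
    """Pat the watchdog while health_check() passes; return the pat count.

    `interval` must be shorter than the hardware timeout. max_pats exists
    only to bound the loop for testing.
    """
    pats = 0
    # Opening the device arms the watchdog timer.
    with open(device, "wb", buffering=0) as wd:
        while max_pats is None or pats < max_pats:
            if not health_check():
                # Stop patting. On drivers with "magic close" support,
                # closing without first writing b"V" leaves the timer
                # running, so the hardware reboots the machine.
                break
            wd.write(b"\0")  # any write counts as a pat
            pats += 1
            sleep(interval)
    return pats
```

If the process running this loop hangs or crashes, the pats stop as well, which is exactly the failure the processor watchdog is meant to catch.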
The watchdogd service has a bunch of configurable health checks, such as max-load-1, min-memory, max-temperature, and the ability to watch arbitrary text files. It’s even possible to configure rudimentary network health checks, but they are limited to pinging a hardcoded IP address. The edge-netdog tool supplements watchdogd with higher-level network checking and more customizable recovery actions.
Every edge/IoT device running Linux should have watchdogd configured. Here’s an excellent guide with more technical details.
General Tips for Stability
- Power supply: Flaky power and brownout conditions are responsible for many IoT problems. Ensure your power supply meets the device’s requirements, and deploy a battery backup in locations that have unreliable power.
- SD card/flash: SD cards are flaky and happy to corrupt themselves on power failure. Make sure to use branded, high-end SD cards for your production deployments. It’s also worth considering an industrially-hardened IoT device, like the Balena Fin. Instead of an SD card, the Fin has an on-board eMMC flash chip for storage.
- USB/SATA drive: SD cards are fine for light duty, but an external drive is essential for I/O-intensive applications. Flash wears out after a limited number of write cycles (on the order of 100k). Databases, monitoring systems, and similar apps will quickly run into problems with SD wear and deserve a real disk.
- Enclosure: Picking an appropriate enclosure for your IoT device is key. The case should support the device’s cooling needs, including room for heat sinks and possibly a fan. Weatherproof cases should be used where appropriate.
- Hardware Watchdog: There are situations where even the processor’s watchdog can’t help. Power surges or brownouts can leave the processor unable to perform a soft reboot. A hardware watchdog provides the ultimate insurance in these cases — it can perform a hard power cycle if the device becomes hopelessly wedged. There are many good options for Raspberry Pi and Arduino-compatible devices.
Let’s talk about IoT!
If edge-netdog is working for you (or not!), I’d love to hear about it below!
Looking for help with your IoT/edge-networking? I consult on strategy, architecture, security, deployment, and fleet management. Let’s chat! Follow or DM me on Twitter at @nedmcclain.