Monitoring ESPHome nodes with Nagios (using MQTT/Node-RED as the middleman)

09 Sep 2020 by chindemax

I have a Home Assistant (Hass.io, to be specific) instance running as a VM in my network, and there are some ESPHome devices (ESP8266s) scattered around the house doing various measurements and pushing data to Home Assistant. All my ESPHome nodes have MQTT configured and they have static IPs.

I also have a Nagios server (another VM – planning to move this into a docker container soon!) that monitors my home network and whether the network nodes are up and running.

How do I make sure all my ESPHome nodes are up and running, and generate an alert if any of them goes down or becomes unresponsive?

As far as I could tell, ESPHome devices do not respond to ping requests reliably, so a “platform: ping” binary_sensor in HA was not an option for me. Some of my ESPHome nodes also go into deep sleep, and pinging wouldn’t work for them at all. Besides, I wanted to get these monitored through Nagios since I was already using it as my network monitor.

Here’s how I did it; might not be the most efficient or direct way to do it, but I could get those to be monitored through Nagios.

For this to work:
1. We will use MQTT and node-red to make the connection from Home Assistant to Nagios.
2. The ESPHome nodes must have MQTT configured and should have static IPs.
3. We will use a node-red instance to process MQTT messages (it should be possible to do this without node-red, using automations in HA).
4. We will use a custom Nagios plugin to monitor MQTT and decide if the node is up.

Let’s get to it:

Find the MQTT topics that the ESPHome node uses to push data to HA: go to HA –> Configuration –> Integrations –> MQTT and select the node you want Nagios to monitor (for example, the Wemos D1 mini I have monitoring the temperatures in the kitchen fridge and freezer).

Then open MQTT Info and find the state topics. This ESPHome node pushes two measurements to HA every 10 minutes, and we will use both topics to determine if the node is up and running.

We will decide if this node is up based on the MQTT messages it pushes into HA. We’ll use a node-red flow to process these messages and publish the IP address of the node to an MQTT topic.
If the IP address of the ESPHome node is 192.168.1.63, the node-red flow for this works as follows. Whenever a new value is pushed by the ESPHome node to HA (via MQTT), node-red publishes the IP address of the node to another topic (nagios/node_check). Make sure the retain flag is not set in the MQTT out node; otherwise the broker would replay the last message to the listener every time it reconnects, and a dead node would keep looking alive.
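
For readers who prefer to see the flow's logic as code, here is a minimal sketch in Python of what it does. The topic names in TOPIC_TO_IP are hypothetical placeholders (use the state topics shown under MQTT Info), and heartbeat_for is a name made up for this illustration; it is not part of node-red or ESPHome.

```python
# Sketch of the flow's logic: any message on one of a node's state topics
# becomes a heartbeat message carrying that node's IP address.

HEARTBEAT_TOPIC = "nagios/node_check"

# Hypothetical state topics; replace with the ones listed under MQTT Info.
TOPIC_TO_IP = {
    "fridge_node/sensor/fridge_temperature/state": "192.168.1.63",
    "fridge_node/sensor/freezer_temperature/state": "192.168.1.63",
}


def heartbeat_for(topic, topic_to_ip=TOPIC_TO_IP):
    """Return the heartbeat to publish for an incoming state topic.

    Returns None for topics that do not belong to a monitored node.
    """
    ip = topic_to_ip.get(topic)
    if ip is None:
        return None
    # retain stays False so the broker never replays an old heartbeat.
    return {"topic": HEARTBEAT_TOPIC, "payload": ip, "retain": False}
```

Both state topics map to the same IP address, so either measurement arriving counts as a sign of life from the node.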

Now we will have a Python script running on a Linux host (this could be the Nagios server or a host that runs the NRPE client) that listens to the MQTT topic (nagios/node_check) and writes a file on the host based on the MQTT payload.

You need paho-mqtt installed for this to work : https://pypi.org/project/paho-mqtt/

A Bash script runs the Python MQTT client continuously. It must be scheduled as a cron job that starts at boot.

#!/bin/bash
# Restart the MQTT listener every time it exits.
while true
do
    /usr/bin/sudo python3 /home/demouser/nagios/node-check/check-nodes-mqtt/check-nodes-mqtt.py
done

Python script called by above Bash script: (See it on Github)

#!/usr/bin/env python3
# Written against the paho-mqtt 1.x API; paho-mqtt 2.x additionally expects a
# callback API version as the first argument to mqtt.Client().
import time
import paho.mqtt.client as mqtt

broker_url = "<IP_Address_of_MQTT_broker>"
broker_port = <MQTT_Broker_port>
log_dir = "/home/demouser/nagios/node-check/logs/"

def on_connect(client, userdata, flags, rc):
    print("Connected With Result Code: {}".format(rc))

def on_message(client, userdata, message):
    # The payload is the IP address published by the node-red flow.
    node_ip = message.payload.decode()
    print("Message Received: " + node_ip)
    # Rewriting the file refreshes its timestamp, which the Nagios plugin checks.
    file_path = log_dir + node_ip + ".ok"
    with open(file_path, 'w') as marker_file:
        marker_file.write(node_ip + " is up and running\n")

def on_disconnect(client, userdata, rc):
    print("Client Got Disconnected")

client = mqtt.Client("Nagios_NodeChecker")
client.on_connect = on_connect
client.on_disconnect = on_disconnect
client.on_message = on_message
client.username_pw_set(username="<mqtt_username>", password="<mqtt_password>")

client.connect(broker_url, broker_port)
client.subscribe(topic="nagios/node_check", qos=2)
client.message_callback_add("nagios/node_check", on_message)

# Process messages in a background thread for 300 seconds, then exit.
client.loop_start()
time.sleep(300)
client.loop_stop()

This script runs for 300 seconds and then exits, after which the Bash script starts it again.

Whenever it reads a message on the “nagios/node_check” topic, it writes a file named “<message>.ok” to the “/home/demouser/nagios/node-check/logs” folder. Since the MQTT message is an IP address pushed by the node-red flow above, the file is actually <IP_Address>.ok. In our case, for the above ESPHome node (IP Address – 192.168.1.63), whenever HA gets a message from the ESP8266, a file named “192.168.1.63.ok” is written or updated.
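
Because the payload ends up in a file name, it may be worth checking that it really is an IP address before writing anything. Here is a minimal sketch of that idea using only the standard library. The validation step is my suggestion and is not in the script above, and marker_path and record_heartbeat are hypothetical helper names.

```python
import ipaddress
import os


def marker_path(payload, log_dir):
    """Return the marker file path for a payload.

    Returns None if the payload is not a valid IPv4 address, so that
    something like "../../etc/passwd" can never become a file path.
    """
    try:
        ip = ipaddress.IPv4Address(payload.strip())
    except ValueError:
        return None
    return os.path.join(log_dir, "{}.ok".format(ip))


def record_heartbeat(payload, log_dir):
    """Write or refresh <IP_Address>.ok and return its path (None if rejected)."""
    path = marker_path(payload, log_dir)
    if path is None:
        return None
    with open(path, "w") as marker_file:
        marker_file.write("{} is up and running\n".format(payload.strip()))
    return path
```

Calling record_heartbeat("192.168.1.63", "/home/demouser/nagios/node-check/logs") would write or refresh 192.168.1.63.ok in that folder, while any payload that is not an IPv4 address is ignored.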

Now we get to the Nagios plugin that I have written (See on Github), which decides if the node is up and running based on this file.

#!/bin/bash
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4

usage1="Usage: $0 -L <log_location> -I <ip_address> -w <warn_if_no_ping_in_last_mins> -c <critical_if_no_ping_in_last_mins>"
exitstatus=$STATE_WARNING #default
while test -n "$1"; do
case "$1" in
    -c)
        crit=$2
        shift
        ;;
    -w)
        warn=$2
        shift
        ;;
    -I)
        ipaddr=$2
        shift
        ;;
    -L)
        logloc=$2
        shift
        ;;
    -h)
       echo $usage1;
       echo
       exit $STATE_UNKNOWN
       ;;
    *)
      echo "Unknown argument: $1"
      echo $usage1;
      echo
      exit $STATE_UNKNOWN
      ;;
esac
shift
done

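# Fail early if a required argument is missing.
if [ -z "$logloc" ] || [ -z "$ipaddr" ] || [ -z "$warn" ] || [ -z "$crit" ]; then
    echo "$usage1"
    exit $STATE_UNKNOWN
fi

# OK: the file was modified less than $warn minutes ago.
if [ "$(find "$logloc/$ipaddr.ok" -mmin "-$warn" 2>/dev/null | wc -l)" -gt 0 ]; then
    echo "$ipaddr - OK (signaled within $warn mins)"
    exit $STATE_OK
fi

# WARNING: older than $warn minutes but still newer than $crit minutes.
if [ "$(find "$logloc/$ipaddr.ok" -mmin "-$crit" 2>/dev/null | wc -l)" -gt 0 ]; then
    echo "$ipaddr - WARNING (signaled within $crit mins)"
    exit $STATE_WARNING
fi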

echo "$ipaddr - CRITICAL (no signal within $crit mins)"
exit $STATE_CRITICAL

From the Nagios server we call this plugin (through NRPE if it’s not on the Nagios server itself). It checks the timestamp of the file created above to work out when the ESPHome node last communicated with HA. We can pass parameters to define a warning or a critical alert based on that timestamp. In this case it gives a warning if the node 192.168.1.63 hasn’t updated its status in 90 minutes and a critical alert if it hasn’t done so in 180 minutes. We also pass the location to check for the files as a parameter here.
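
The decision the plugin makes can be summarised in a few lines of Python. This is only a sketch of the same logic for illustration, not a replacement for the Bash plugin. node_state is a hypothetical name, and the exact behaviour at the boundaries may differ slightly from what find -mmin does.

```python
import os
import time

# Nagios plugin exit codes, as used by the Bash plugin.
STATE_OK, STATE_WARNING, STATE_CRITICAL = 0, 1, 2


def node_state(marker_file, warn_mins, crit_mins, now=None):
    """Map the age of a node's .ok file to a Nagios state code."""
    now = time.time() if now is None else now
    try:
        age_mins = (now - os.path.getmtime(marker_file)) / 60.0
    except OSError:
        # No file at all: the node has never signaled, treat as critical.
        return STATE_CRITICAL
    if age_mins < warn_mins:
        return STATE_OK
    if age_mins < crit_mins:
        return STATE_WARNING
    return STATE_CRITICAL
```

With the node publishing every 10 minutes, -w 90 tolerates roughly nine missed updates before a warning, and -c 180 roughly eighteen before a critical alert.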

We will be monitoring this node (192.168.1.63) as a service, not as a host.

define service {
        use     1min-service
        check_interval     60
        max_check_attempts     2
        host_name     <host_which_the_script_runs_on>
        service_description     NODE CHECK
        check_command     check_node!-L /home/demouser/nagios/node-check/logs -I 192.168.1.63 -w 90 -c 180
        contacts     adminemail, admintext
}

Node status as shown in the aNag app (a Nagios client for Android) and in the Nagios web interface.

This is particularly helpful when these ESPHome nodes go into deep sleep, which would otherwise mess up availability monitoring based on ping or a binary_sensor in HA.

As always, thanks for your interest and thanks for reading!