I have used Nagios for years now, corporately and for the handful of systems with which I am now involved.  It works well, within some annoyance limits, and once the hurdle of the initial setup is climbed, it runs merrily without much additional administration.  It is unfortunate that what started off as a fully free (libre and gratis) product is now presented as a libre "core", with the implication that the real deal is the fully paid-for version.  So many free and open source projects seem to go to the wall once this type of business model is introduced, for reasons we don't need to go into here.  There are alternatives to Nagios, of course, like Icinga, which was an early fork, but the initial setup hurdle with Icinga, it seems to me, was even higher than with Nagios.  For small-scale, simple use, it seems overkill.

It occurred to me recently, though, that my use of Nagios is extremely simplistic. I just need to know whether a handful of services on a handful of systems are running, to be alerted if they are not, and to be alerted when they come back online.  In 99 cases out of 100, the problem is the network, usually my home ADSL connection, begrudgingly supplied by BT Openreach, which offers as little in our rural areas as it can get away with (yes, I'm still bitter, thanks for asking...). My Nagios instance takes care of this, using few resources and doing an adequate job.

As I mentioned, there are annoyances. I don't like the default installation's web screen touching youtube.com when you fire it up, for example. There are other annoyances in the web interface too. But the way Nagios alerts for my particular circumstances is a little sub-optimal.  I have it set up to send me a Jabber message and to email me on failures.  So when we have a big download or upload going on, Nagios alerts like mad.  Yes, it has flapping prevention built in, but it's unreasonable to expect it to understand that there is a local connectivity issue.  None of these annoyances are killer issues, and, as I say, Nagios works well.  But...

I got to wondering if this could be done differently. In particular, whether Munin and/or Monit were still relevant.  It's many years since I last looked at these monitoring solutions.  The advantage of Munin is that it can be distributed, but that's not necessarily a big deal for me, as it's the external services I want to monitor. I must admit as well that I had a lot of trouble trying to get non-core plugins to work.  Some seemed to work but returned no results. I may go back to it when I feel like a little self-flagellation, but I quickly concluded that Munin was not a more straightforward alternative to Nagios, unless it was just a subset of local services you wanted to monitor.

Monit goes way back.  The developers now make money by selling a Monit aggregation system called M/Monit, which is reasonable, and rather more nuanced than that free "core" business model.  I understand M/Monit is not too expensive, but my requirements don't justify any spending, to be honest.  Could I get Monit to run roughly as I currently run Nagios, checking a series of services on a series of hosts and alerting me via email and XMPP?

The answer is "Yes." (Well it would be, wouldn't it, or I'd not be writing this.)

It means using Monit slightly differently from how most of the guides present it.  These days, for example, I am not convinced of the need for Monit's ability to restart services which have failed.  It seems to me far more important to ensure services don't fail in the first place, but one can see that there may be fragile services needing this option.
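
For context, this is the shape of the check-and-restart stanza that most guides lead with, and which I am deliberately not using here; the service name, pidfile and systemctl paths are just placeholders:

check process sshd with pidfile /var/run/sshd.pid
  start program = "/usr/bin/systemctl start sshd"
  stop program = "/usr/bin/systemctl stop sshd"
  # restart the daemon if its port stops answering
  if failed port 22 protocol ssh then restart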

I already use the little sendxmpp utility to send Jabber messages from the server. It just needs a .sendxmpprc file with suitable entries in the sending user's home directory, and it then has a very mail-like command-line syntax. I'll not cover the details here.
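
For anyone who has not met it, my /root/.sendxmpprc is a single line in the classic "JID password" form (the account details below are placeholders; keep the file readable only by its owner, and check the man page, as newer sendxmpp versions also accept a keyword-style file):

monitor@jabber.example.org s3cretpassword

A quick test from the command line, using the same options as the scripts below:

echo "test message" | sendxmpp -f /root/.sendxmpprc -t -n recipient@example.org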

Monit's alerts are focussed on email. But it is also possible to execute a script when something happens, so instead of the configuration syntax being "alert", it becomes "exec <script>".  Good.  I also realised that, since I want both email and Jabber, it would be easier not to rely on the built-in email, but to send the mail and Jabber messages from within the script. I'll explain this later.
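
To make the difference concrete, these are the two shapes side by side, using the script path that appears in the full stanzas further down:

# the usual, built-in email route
if failed port 80 protocol http then alert

# the route taken here: hand the event to a script
if failed port 80 protocol http then exec "/usr/local/bin/monit-alert.sh"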

For comparison, here are screenshots of Nagios and Monit looking after pretty much the same services.

The important similarity is not the detail, but the fact that all services are green.

What I also wanted, ideally, was an occasional reminder while a service is down, but not a frequent one, and perhaps only a given number of reminders.  I don't think that's possible with Nagios without acknowledging the service problem in the web front end.

I ended up with two scripts, which I put in /usr/local/bin: one is the alert script, and the other is for service recovery.  Monit helpfully exposes a few pieces of information to help with this, but extracting that information seems to be a little clumsy, at least for my non-existent scriptwriting "skills". I have not been able to use the exposed variables directly, so I go through a little process of putting them into an array via a temporary file. In the examples given here, items in <> are my details; you need to change these (without the <>, of course).  Please contact me if you have a more elegant way of doing these two scripts because, even for me, I think this is clumsy. Here is the alert script, suitably sanitised:

#!/bin/bash

MAILSENDTO="<your_email_alert_recipient>"
MAILFROM="<user you want mail to appear to come from>"
JABBERTO="<your_jabber_alert_recipient>"

#############################################
FILE=/tmp/${RANDOM}.mon

echo "$MONIT_HOST" >> $FILE
echo "$MONIT_EVENT" >> $FILE
echo "$MONIT_SERVICE" >> $FILE
echo "$MONIT_DESCRIPTION" >> $FILE

declare -a array=()
i=0

# reading file in row mode, insert each line into array
while IFS= read -r line; do
    array[i]="$line"
    let "i++"
    # reading from file path
done < $FILE
## var0=HOST
## var1=Event
## var2=Service
## var3=Desciption

rm $FILE

JABBER="Monit Alert - ${array[1]} for ${array[2]}. \n Monit ${array[3]}"
SUBJECT="Monit Alert - ${array[2]} ${array[1]}"

### sendxmpp picks up the account details from the .sendxmpprc file
echo -e "$JABBER" | sendxmpp -f /root/.sendxmpprc -t -n "$JABBERTO"

### Now email too
echo -e "$JABBER" | mail -r "$MAILFROM" -s "$SUBJECT" "$MAILSENDTO"
##################################################
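
The script can be tested by hand before Monit ever calls it, by faking the environment variables Monit would normally set; the values here are made up:

MONIT_HOST="testhost" \
MONIT_EVENT="Connection failed" \
MONIT_SERVICE="TEST_HTTP" \
MONIT_DESCRIPTION="failed protocol test [HTTP] at [192.168.0.1]:80" \
/usr/local/bin/monit-alert.sh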

The recovery script is very similar. I have not tried to make the scripts parameter-driven, as that may complicate the monit config files.

#!/bin/bash
MAILSENDTO="<your_email_alert_recipient>"
MAILFROM="<user you want mail to appear to come from>"
JABBERTO="<your_jabber_alert_recipient>"

###############################
FILE=/tmp/${RANDOM}.mon

echo "$MONIT_HOST" >> $FILE
echo "$MONIT_EVENT" >> $FILE
echo "$MONIT_SERVICE" >> $FILE
echo "$MONIT_DESCRIPTION" >> $FILE

declare -a array=()
i=0

# reading file in row mode, insert each line into array
while IFS= read -r line; do
    array[i]="$line"
    let "i++"
    # reading from file path
done < $FILE
## var0=HOST
## var1=Event
## var2=Service
## var3=Desciption

#echo "Contents of file " >>  /root/test.txt
#cat $FILE >> /root/test.txt

rm $FILE

JABBER="System Recovery - ${array[1]} for ${array[2]}. \n Monit ${array[3]}"
SUBJECT="Monit Recovery - ${array[2]} ${array[1]}"
### sendxmpp picks up the account details from the .sendxmpprc file
echo -e "$JABBER" | sendxmpp -f /root/.sendxmpprc -t -n "$JABBERTO"

## Now email
echo -e "$JABBER" | mail -r "$MAILFROM" -s "$SUBJECT" "$MAILSENDTO"

Now we get on to monit's configuration. Settings can go in the /etc/monitrc file, but I find it easier to separate them out into individual files for each server or service I am monitoring.  I won't go through the monitrc file, because we need to fill in virtually nothing there; email isn't going to be used directly. Just set up the web interface, optionally with a username and password. This is conventionally a service running on port 2812. Don't do what I did and get yourself in knots by trying to access the service on port 2182 rather than 2812.
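
For reference, the web interface section of my monitrc amounts to little more than this; the address, network and credentials are placeholders you will want to change:

set httpd port 2812 and
    use address 0.0.0.0                 # listen on all interfaces
    allow 192.168.0.0/255.255.255.0     # only the local network may connect
    allow admin:monit                   # web interface username:password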

Under /etc/monit.d, we can set up the files.  Again, unlike monit's usual use, I'm not interested in starting or stopping services, just in being alerted about service availability.  So one system I monitor, which is accessible via openvpn, has this config file, sanitised, named <servername>.conf. Sorry about the long lines.

check host ACA_HTTP with address 192.168.xxx.xx
  if failed port 80 protocol http with timeout 30 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 5 cycles
  else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host ACA_SSH with address 192.168.xxx.xx
   if failed port 22 protocol ssh with timeout 30 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 5 cycles
   else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

Another service conf file looks after the services I run from my home server. Again, it's sanitised, but by now you get the idea of what it's doing: making sure each service is accessible, repeating the alerts, but not too frequently, and alerting on recovery.

check host homenet_IMAPS with address <FQDN>
    if failed port 993 protocol IMAPS with timeout 30 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 10 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host homenet_SUBMISSION with address <FQDN>
    if failed port 587 protocol SMTP with timeout 30 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 10 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host homenet_SMTP with address <FQDN>
    if failed port 25 protocol SMTP with timeout 30 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 10 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host homenet_SSH with address <FQDN>
    if failed port 1011 protocol ssh with timeout 20 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 10 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host homenet_Nextcloud with address <FQDN>
    if failed port 443 protocol https with timeout 20 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 10 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

One service is a sales web site. I want to be reminded more frequently about that:

check host RC-HTTPS with address <www.shop.site>
    if failed port 443 protocol https with timeout 20 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 5 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"

check host RC-HTTP with address <www.shop.site>
    if failed port 80 protocol http with timeout 20 seconds then exec "/usr/local/bin/monit-alert.sh" with repeat every 5 cycles
    else if succeeded 1 times within 1 cycles then exec "/usr/local/bin/monit-recover.sh"
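
With the files in place under /etc/monit.d, a syntax check and a reload pick up the new configuration:

# check the control file (and its includes) for syntax errors
monit -t
# tell the running daemon to re-read its configuration
monit reload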

The alerts, when they come through, look like this (the example is jabber, but email is formatted in the same way):

[15:01:52] <jabber_address> Monit Alert - Connection failed for ACA_HTTP.
Monit failed protocol test [HTTP] at [192.168.xxx.xx]:80 [TCP/IP] -- Connection timed out

Recovery messages look like this:

[15:02:55] <jabber_address> System Recovery - Connection succeeded for ACA_SSH.
Monit connection succeeded to [192.168.xxx.xx]:22 [TCP/IP]

Is all this, making Monit do a Nagios-like task, worth it? I'm not sure. Possibly, as using the above examples makes it a lot quicker and easier to set up.  Then again, Nagios gives uptime histories and graphs and deep detail about each particular connection, which is nice but hardly the type of thing you study. Is Monit a valid little way of alerting you to issues?  Most certainly yes.

Addendum:  Amazed at the success of hacking the cervlet.c file, I got carried away and wondered if it was possible to display the response times, shown on the detail pages of each service, on the main page.  After a great deal of trial and error, as I know nothing about C code, I managed to get it right.  The main monitoring page now looks like this, and gives all the details Nagios used to give, on an easily-scanned single page.

Later I also added a little snippet of JavaScript to include a timestamp on the refreshed page - you can see it just below "Manager" in the header. I found I was unsure whether the page had refreshed, so adding this timestamp was handy. I also found that my lack of C knowledge meant I had no chance of doing the job in C itself, but I realised that the C code was simply spitting out HTML.

The new monit service has been used in anger, when one of the remote services I monitor went down. This was a strange event, as the remote system flipped its root filesystem to read-only, so, while the web server was still running, there was, so to speak, no-one at home. The way monit is set up allowed me to troubleshoot the problem and speculate about the issue.

However, it did show that Nagios can still do one thing that monit can't. With monit, you can only disable monitoring while you work on the problem, whereas with Nagios you can acknowledge the problem, which stops the alert messages, and monitoring resumes when the service is restored. With monit you need to do those steps manually, as shown below. Considering how much easier monit is to set up, it's no deal breaker.
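
The manual equivalent is a pair of monit CLI calls, using one of the service names from the config files above:

# stop watching a service while you work on it
monit unmonitor homenet_IMAPS
# and put it back under monitoring afterwards
monit monitor homenet_IMAPS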

Attached is my version of cervlet.c, the file which spits out the HTML and which needs to be edited for the above to happen.  Monit compiled easily with these changes under openSUSE 15.2 on a Pi 4.