For the past few years I've had a bunch of virtual machines hosting websites, services, and servers. Of course I want them to be available - especially since I charge people money to access at some of them (for example my dns-hosting service) - and that means I want to know when they're not.
The way I've gone about this is to have a bunch of machines running stuff, and then dedicate an entirely separate machine solely for monitoring and alerting. Sure you can run local monitoring, testing that services are available, the root-disk isn't full, and that kind of thing. But only by testing externally can you see if the machine is actually available to end-users, customers, or friends.
A local-agent might decide "I'm fine", but if the hosting-company goes dark due to a fibre cut you're screwed.
I've been hosting my services with Hetzner (cloud) recently, and their service is generally pretty good. Unfortunately I've started to see an increasing number of false-alarms. I'd have a server in Germany, with the monitoring machine in Helsinki (coincidentally where I live!). For the past month I've started to get pinged with a failure every three/four days on average, "service down - dns failed", or "service down - timeout". When the notice would wake me up I'd go check and it would be fine, it was a very transient failure.
To be honest the reason for this is my monitoring is just too damn aggressive, I like to be alerted immediately in case something is wrong. That means if a single test fails I get an alert, as rather than only if a test failed for something more reasonable like three+ consecutive failures.
I'm experimenting with monitoring in a less aggressive fashion, from my home desktop. Since my monitoring tool is a single self-contained golang binary, and it is already packaged as a docker-based container deployment was trivial. I did a little work writing an agent to receive failure-notices, and ping me via telegram - instead of the previous approach where I had an online status-page which I could view via my mobile, and alerts via pushover.
So far it looks good. I've tweaked the monitoring to setup a timeout of 15 seconds, instead of 5, and I've configured it to only alert me if there is an outage which lasts for >= 2 consecutive failures. I guess the TLDR is I now do offsite monitoring .. from my house, rather than from a different region.
The only real reason to write this post was mostly to say that the process of writing a trivial "notify me" gateway to interface with telegram was nice and straightforward, and to remind myself that transient failures are way more common than we expect.
I'll leave things alone for a moment, but it was a fun experiment. I'll keep the two systems in parallel for a while, but I guess I can already predict the outcome:
- The desktop monitoring will report transient outages now and again, because home broadband isn't 100% available.
- The heztner-based monitoring, in a different region, will report transient problems, because even hosting companies are not 100% available.
- Especially at the cheap prices I'm paying.
- The way to avoid being woken up by transient outages/errors is to be less agressive.
- I think my paying users will be OK if I find out a services is offline after 5 minutes, rather than after 30 seconds.
- If they're not we'll have to talk about budgets ..