I've mostly been avoiding the computer this evening, but I did spend the last hour working on attempt #2 at distributed monitoring.
The more I plot, plan & ponder the more appealing the notion becomes.
Too many ideas to discuss, but in brief:
My previous idea of running tests every few minutes on each node scaled badly when the number of host+service pairs to be tested grew.
This led to the realisation that as long as some node tests each host+service pair you're OK: every node need not check every host on every run. That was something I knew, and had discussed, but I'd assumed it would be a nice optimisation for later rather than something which is almost mandatory.
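Concretely, with the numbers used later in this post (ten nodes and 150 monitored host+service pairs), having every node test everything costs 1,500 probes per cycle, when 150 probes shared across the nodes would give full coverage:

    nodes, pairs = 10, 150      # the figures from the delay example below
    print(nodes * pairs)        # 1500 probes per cycle if every node tests all
    print(pairs)                # 150 probes per cycle if coverage is shared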
My previous idea of testing for failures on other nodes after seeing a local failure was similarly flawed, because it introduces too many delays:
- Node 1 starts all tests - notices a failure. Records it.
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
- Node 2 starts all tests - notices a failure. Records it.
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
In short you have a synchronisation problem which, coupled with the delay of running a large number of tests, soon grows. Given a testing period of five minutes, ten testing nodes, and 150 monitored host+service pairs, you're looking at delays of 8-15 minutes on average. (It largely depends on where in the cycle the failing host is, and how many nodes must see a failure prior to alerting.)
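A toy simulation shows the shape of the problem. This isn't the dmonitor code, just a model: each node hits the failing pair at a random point in its five-minute cycle, and a node can only alert on a test made after enough neighbours have already recorded the failure, because earlier fetches still see them reporting "OK":

    import random

    PERIOD = 5.0       # minutes between a node's successive tests of one pair
    NODES = 10
    CONFIRM = 10       # nodes which must record the failure first (an assumption)
    TRIALS = 10_000

    def time_to_alert():
        # each node first hits the failing pair at a random point in its cycle
        phases = [random.uniform(0, PERIOD) for _ in range(NODES)]
        threshold = sorted(phases)[CONFIRM - 1]   # enough nodes have recorded it
        firsts = []
        for p in phases:            # each node's first test after that point
            while p <= threshold:
                p += PERIOD
            firsts.append(p)
        return min(firsts)

    mean = sum(time_to_alert() for _ in range(TRIALS)) / TRIALS
    print(f"mean time-to-alert: {mean:.1f} minutes")

The real delays quoted above come out higher still, because actually running 150 tests each cycle adds time of its own.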
So round two has each node picking tests at "random" (while making sure no host+service pair goes untested for more than five minutes), and the moment a failure is detected the neighbouring nodes are immediately instructed to run the same test and report their results (via XML::RPC).
The new code is simpler, more reliable, and scales better. Plus it doesn't need Apache/CGI.
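For flavour, here is roughly the shape of that loop, sketched in Python's standard xmlrpc module rather than the Perl XML::RPC mentioned above; the neighbour URLs, the test_now method name, and the pair list are all invented for illustration:

    import random
    import socket
    import time
    import xmlrpc.client

    NEIGHBOURS = ["http://node2:8000/", "http://node3:8000/"]   # invented URLs
    ALL_PAIRS = [("www.example.com", 80), ("mail.example.com", 25)]
    MAX_AGE = 5 * 60       # no pair may go untested longer than this
    last_tested = {}       # (host, port) -> unix time of the last local test

    def pick_pair():
        # pick at random, but always prefer anything that has gone stale
        stale = [p for p in ALL_PAIRS
                 if time.time() - last_tested.get(p, 0) > MAX_AGE]
        return random.choice(stale or ALL_PAIRS)

    def run_test(host, port):
        # stand-in for the real service checks: a plain TCP connect
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            return False

    def neighbours_agree(host, port):
        # ask every neighbour to test *right now* and report back
        for url in NEIGHBOURS:
            proxy = xmlrpc.client.ServerProxy(url)
            if proxy.test_now(host, port):      # assumed RPC method name
                return False                    # a neighbour can still reach it
        return True

    while True:
        host, port = pick_pair()
        last_tested[(host, port)] = time.time()
        if not run_test(host, port) and neighbours_agree(host, port):
            print(f"ALERT: {host}:{port} is down everywhere")
        time.sleep(1)

The nice property is that the expensive all-nodes synchronisation now only happens on the (rare) failure path, not on every cycle.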
Anyway bored now. Hellish day. MySQL blows goats.
ObFilm: Hellboy
Tags: dmonitor, mysql
If each machine drops its current status onto the next machine along (forward or back) as part of the test, along with the statuses provided to it by the two machines either side, then you should be able to get an up-to-date status for all machines unless you have a major failure across multiple machines. With a bit of simple version control you should be able to ensure each machine has a pretty up-to-date status for all the rest, so you can query a basic status from any one.
For resilience against false positives you may like to use two machines either side by default instead of one.
OK, end of quick brain dump. I'll probably realise it's not a good idea as soon as I hit 'post'!
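For what it's worth, the version-control half of that idea is simple to sketch: if each machine stamps its own status with a counter, then merging a bundle received around the ring just means keeping the freshest record seen for each machine. Toy Python, with invented names:

    def merge(local, incoming):
        # keep whichever record of each machine's status has the newer version
        for machine, (version, status) in incoming.items():
            if machine not in local or local[machine][0] < version:
                local[machine] = (version, status)

    # node A's current view, and the bundle handed to it around the ring by B
    view_a = {"A": (7, "up"), "B": (3, "up"), "C": (5, "up")}
    from_b = {"B": (4, "up"), "C": (6, "down"), "D": (2, "up")}

    merge(view_a, from_b)
    print(view_a)   # C's newer "down" record and D's status have propagated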