
I'm fireproof, you're not

7 May 2009 21:50

I've mostly been avoiding the computer this evening, but I did spend the last hour working on attempt #2 at distributed monitoring.

The more I plot, plan & ponder the more appealing the notion becomes.

Too many ideas to discuss, but in brief:

My previous idea of running tests every few minutes on each node scaled badly when the number of host+service pairs to be tested grew.

This led to the realisation that as long as some node tests each host+service pair you're OK. Every node need not check each host on every run - this was something I knew, and had discussed, but I'd assumed it would be a nice optimisation for later rather than something which is almost mandatory.

My previous idea of testing for failures on other nodes after seeing a local failure was similarly flawed. It introduces too many delays:

  • Node 1 starts all tests - notices a failure. Records it.
    • Fetches current results from all neighbour nodes.
    • Sees they are OK - the remote server only just crashed. Oops.
  • Node 2 starts all tests - notices a failure. Records it.
    • Fetches current results from all neighbour nodes.
    • Sees they are OK - the remote server only just crashed. Oops.

In short you have a synchronisation problem which, coupled with the delay of making a large number of tests, soon grows. Given a testing period of five minutes, ten testing nodes, and 150 monitored hosts+services, you're looking at delays of 8-15 minutes. On average. (Largely depends on where in the cycle the failing host is, and how many nodes must see a failure prior to alerting.)
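As a very rough back-of-envelope, assuming the first node reaches the failing host part-way through its five-minute run and each additional node which must confirm the failure can lag by up to another full period:

    # Back-of-envelope estimate of the old scheme's alerting delay.
    # Assumption: the first node reaches the failing host part-way through
    # its 5-minute run, and each *additional* confirming node can lag by up
    # to another full testing period behind it.
    PERIOD = 5.0  # minutes between tests of any given host+service pair

    def rough_delay(confirmations_needed, offset_into_run):
        """offset_into_run: how far into a run (0..PERIOD) the failing
        host gets tested by the first node to notice."""
        return offset_into_run + (confirmations_needed - 1) * PERIOD

    # Needing 2 or 3 nodes to agree, hitting the host mid-run on average:
    for k in (2, 3):
        print(k, "confirming nodes ->", rough_delay(k, PERIOD / 2), "minutes")

Which lands in that same sort of range, and makes it obvious that both the position in the cycle and the confirmation threshold matter.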

So round two has each node picking tests at "random" (making sure no host+service pair goes more than five minutes without being tested), and at the point a failure is detected the neighbour nodes are immediately instructed to run the same test and report their results (via XML::RPC).
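Per node the logic boils down to something like the sketch below - Python rather than the real code, and every name in it (run_check(), the "test_now" XML-RPC method, the neighbour URLs) is a placeholder, not part of the actual implementation:

    # Minimal sketch of the round-two loop: pick a random overdue host+service
    # pair, test it, and on failure ask every neighbour to re-test right now
    # before deciding whether to alert.
    import random
    import time
    import xmlrpc.client

    MAX_AGE = 5 * 60          # no pair should go untested for longer than this
    CONFIRMATIONS_NEEDED = 2  # nodes which must see the failure before alerting
    NEIGHBOURS = ["http://node2:8000/", "http://node3:8000/"]  # e.g. from nodes.txt

    last_tested = {}          # (host, service) -> unix time of the last local test

    def run_check(host, service):
        """Placeholder for the real TCP/HTTP/ping/whatever check. True == OK."""
        return True

    def pick_test(pairs):
        """Pick a random pair among those not tested within MAX_AGE, if any."""
        now = time.time()
        overdue = [p for p in pairs if now - last_tested.get(p, 0) > MAX_AGE]
        return random.choice(overdue) if overdue else None

    def confirm_with_neighbours(host, service):
        """Ask each neighbour to test immediately; count nodes seeing a failure."""
        failures = 1          # our own observation counts
        for url in NEIGHBOURS:
            try:
                if not xmlrpc.client.ServerProxy(url).test_now(host, service):
                    failures += 1
            except Exception:
                pass          # an unreachable neighbour neither confirms nor denies
        return failures

    def monitor_loop(pairs):
        while True:
            pair = pick_test(pairs)
            if pair is None:
                time.sleep(2)     # everything is fresh enough; wait a little
                continue
            host, service = pair
            ok = run_check(host, service)
            last_tested[pair] = time.time()
            if not ok and confirm_with_neighbours(host, service) >= CONFIRMATIONS_NEEDED:
                print("ALERT: %s/%s down on multiple nodes" % (host, service))

The point being that the expensive part - asking every neighbour to test - only happens in the failure path, so the steady-state cost per node stays small.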

The new code is simpler, more reliable, and scales better. Plus it doesn't need Apache/CGI.

Anyway bored now. Hellish day. MySQL blows goats.

ObFilm: Hellboy


Comments on this entry

Paul Tansom at 12:04 on 8 May 2009
Assuming that each of the machines has access to the rest, what about working on a monitoring ring approach? If each machine has a copy of the same list of all machines to be tested, then each can test the one before and after it in the list. If either fails, check the next one down (or up) the list, and if all 4 fail, flag it as a local failure.
If each machine drops its current status onto the next machine on or back as part of the test, along with the status provided to it from the two machines either side, then you should be able to find an up-to-date status of all machines unless you have a major failure across multiple machines. With a bit of simple version control you should be able to ensure each machine has a pretty up-to-date status on all the rest, so you can query a basic status from any one.
For resilience against false positives you may like to use 2 either side as default instead of 1.
OK, end of quick brain dump. I'll probably realise it's not a good idea as soon as I hit 'post'!
Steve Kemp at 10:09 on 11 May 2009

A ring approach seems like it would work - and we could probably automate working out the ordering pretty easily.

(Right now I just have a static text file called nodes.txt - and the ordering there could be used for determining which is next/prev)

It seems like a good idea. I guess technically there isn't a lot of difference between that and my current idea, except in the case of failures. If you were to see a failure with a ring arrangement you'd confirm by testing the "next" and "prev" nodes.

What I do is test on all known nodes, regardless of placement. So it becomes a star-like arrangement.

If the threshold of failures is 1, 2 or 3, then just testing itself, the next & previous would be enough and you'd save some overhead.

Right now that overhead doesn't seem too significant since it only comes into play in the alarm condition - but if failures are common it would be a useful change.
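A minimal sketch of the difference in who gets asked, assuming nodes.txt is simply one node name per line (Python here, purely for illustration):

    # Choosing confirmation peers from the nodes.txt ordering.
    # "ring" asks only the previous/next entries; "star" asks everybody else.
    def load_nodes(path="nodes.txt"):
        with open(path) as fh:
            return [line.strip() for line in fh if line.strip()]

    def confirmation_peers(me, nodes, mode="star"):
        i = nodes.index(me)
        if mode == "ring":
            return [nodes[(i - 1) % len(nodes)], nodes[(i + 1) % len(nodes)]]
        return [n for n in nodes if n != me]

    # With nodes = ["node1", "node2", "node3", "node4"]:
    #   confirmation_peers("node2", nodes, "ring") -> ["node1", "node3"]
    #   confirmation_peers("node2", nodes, "star") -> ["node1", "node3", "node4"]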