

We can't just let you walk away!

3 May 2009 21:50

My Desktop

I use a number of computers in my daily life, but the machine I use the most often is my "desktop box". This is one of a pair of machines sat side by side on my desk.

One machine is the desktop (Sid) and the other is the backup host (Lenny). The backup machine is used by random visitors to my flat, and otherwise just runs backups for my remote machines (www.steve.org.uk, www.debian-administration.org, etc) every 4/6/24 hours.

I apply updates to both boxes regularly, but my desktop machine tends to have lots of browsers and terminals open. I rarely restart it, or log out. So recent updates to X, hal, and udev mostly pass me by - I can go months without logging out and restarting the system.

On Saturday the desktop machine died with an OOM condition when I wrote some bad recursive code for indexing a collection of mailboxes. Oops.

When it came back I was greeted with a new login window, and all the fonts looked great. In the past the fonts looked OK, but now? They look great.

I cannot pin down what has changed precisely, but everything looks so smooth and sexy.

So, to summarise a long entry: "I restarted my machine after a few months of being logged in, and now it looks better".

Distributed Monitoring?

A random conversation with Alex about monitoring yesterday made me curious: has anybody put together a useful distributed monitoring system?

Assume you have a network with Nagios, or similar, monitoring it. If your link between the monitoring box and the hosts being checked is flaky, unreliable, or has blips you will see false positives. We've all been there and seen that.

So, what is the solution? There are two "obvious" ones:

  • Move the monitoring as close to the services as possible.
  • Monitor from multiple points.

Moving the monitoring closer to the services does reduce the risk of false positives, but introduces its own problems. (e.g. You could be monitoring your cluster, and it could be saying "MySQL up", "Web up", but your ISP could have disconnected you - and you're not connected to the outside world. Oops. The solution there is to test external connectivity too, but that re-introduces the flakiness problem if your link is lossy.)

Distributed monitoring brings up its own issues, but seems like a sane way to go.

I wrote a simple prototype which can run either as a standalone tester or as a CGI script under Apache. The intention is that you run it upon >3 nodes. If the monitoring detects a service is unavailable it queries the other monitoring nodes to see if they also report a failure - if they do it alerts, if not it assumes the failure is due to a "local" issue.
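The cross-checking logic might look something like the following sketch (in Python rather than whatever the prototype actually uses; the peer URLs, response format, and function names are all made up for illustration):

```python
import urllib.request

# Hypothetical peer URLs - stand-ins for the other monitoring nodes.
PEER_NODES = [
    "http://node1.example.com/cgi-bin/monitor",
    "http://node2.example.com/cgi-bin/monitor",
]

def peer_sees_failure(url: str) -> bool:
    """Ask one peer monitoring node for its view of the service."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().strip() == b"DOWN"
    except OSError:
        # An unreachable peer tells us nothing; don't count its vote.
        return False

def should_alert(local_down: bool, peer_reports: list, quorum: int = 2) -> bool:
    """Alert only if at least `quorum` nodes (including this one)
    agree the service is down; otherwise assume a local blip."""
    if not local_down:
        return False
    return 1 + sum(peer_reports) >= quorum

# Tying it together: check locally, then poll the peers via wget-style HTTP.
# alert = should_alert(local_check_failed,
#                      [peer_sees_failure(u) for u in PEER_NODES])
```

With a quorum of two, a node that sees a failure alone stays silent; it only alerts once at least one other node confirms the outage.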

There is a lot of scope for doing it properly, which seems to be Alex's plan: having the nodes run in a mesh and communicate amongst each other ("Hey, I'm node #1 - I cannot see service X on host Y - is it down for you too?"). But the simple version, where the script just does a wget on the CGI-version running on the other nodes, is probably "good enough".

I really don't track the state of the art in this field; I just struggle to battle Nagios into submission. Do systems like this already exist?

(sample code is here, sample remote-status checks are here and here. Each node will alert if >=2 nodes see a failure. Otherwise silence is golden.)

ObFilm: X-Men Origins: Wolverine



Comments on this entry

Jonathan at 12:56 on 3 May 2009
You could take a look at Lemon - it's a monitoring system developed at CERN. Among many other 'sensors' (monitoring modules), it contains a sensor to check the availability of a certain service. If properly configured it can also do things like automated repair actions when sensors raise an 'alarm' (i.e. sensor value reaches a specific value) or a combination of sensor values occur (like 'subversion server X is not available and local copy is more than Y hours old'). It takes a while to get used to Lemon, however :).
Steve Kemp at 13:49 on 3 May 2009

Thanks for the pointer Jonathan, but I'm not sure what makes that different to other products like Nagios.

Many monitoring systems exist which will let you run checks on things like pending security updates, or similar; the problem that I'm specifically interested in is the idea of a system that allows monitoring to be conducted from multiple locations.

That solves, or at least mitigates, the issues involved when the off-site monitoring system is at the mercy of poor connectivity, or high-timeouts.

Having Lemon run tests, from a single location, is not really any different than having Nagios, or similar, do so. Unless it allows coordination between multiple reporting/status-checking nodes ..?

Alex Howells at 08:52 on 5 May 2009

Perhaps I should clarify that the specific gist of my "distributed" idea is that multiple nodes communicate with each other in a "Hai, SQL down for joo too?!" manner and then leverage this to reduce false positives.
In the event you have 8-9 nodes, the following is true:
* If any one node cannot communicate with 25-30% or more of your other nodes, it sleeps
* If any one node shows a failure of a service, it communicates this to the mesh, actioning immediate checks of the service on all nodes.
* Alerts are only sent if a defined number of checks fail, as with Nagios, but you may also specify what percentage of the mesh must agree about the failure before alarms are sounded.
* There are elections held to decide a 'master' and also a 'backup master' for alerting purposes, avoiding duplicate SMS messages about the same issue.
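[Ed: the first two rules above could be sketched roughly as follows; the thresholds shown (30% unreachable, 50% agreement) and the function names are purely illustrative.]

```python
def node_should_sleep(reachable_peers: int, total_peers: int,
                      threshold: float = 0.30) -> bool:
    """A node that cannot see much of the mesh stays quiet: it is
    more likely isolated itself than witnessing a real outage."""
    unreachable = total_peers - reachable_peers
    return unreachable / total_peers >= threshold

def mesh_agrees(failures_seen: int, mesh_size: int,
                required_fraction: float = 0.5) -> bool:
    """Sound the alarm only once an agreed fraction of the mesh
    confirms the failure, reducing false positives."""
    return failures_seen / mesh_size >= required_fraction
```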

What triggered this annoyance on my part is offsite network monitoring where you poll core switches and routers every sixty seconds for failure; in the event your offsite *provider* has issues you'll get 3am text messages screaming "zOMG your Internetz is down d00d!" which sucks. What sucks more? Getting up to find out it's fiiiiiine!