Entries posted in May 2009
3 May 2009 21:50
I use a number of computers in my daily life, but the machine I use the most often is my "desktop box". This is one of a pair of machines sat side by side on my desk.
One machine is desktop (Sid) and one is the backup host (Lenny). The backup machine is used by random visitors to my flat, and otherwise just runs backups for my remote machines (www.steve.org.uk, www.debian-administration.org, etc) every 4/6/24 hours.
I apply updates to both boxes regularly but my desktop machine tends to have lots of browsers open, and terminals. I rarely restart it, or logout. So the recent updates to X, hal, udev, mostly pass me by - I can go months without logging out and restarting the system.
On Saturday the desktop machine died with a OOM condition when I wrote some bad recursive code for indexing a collection of mailboxes. Oops.
When it came back I was greeted with a new login window, and all the fonts look great. Now in the past the fonts looked OK, but now? They look great.
I cannot pin down what has changed precisely, but everything looks so smooth and sexy.
So, a long entry, the summary is "I restarted my machine after a few months of being logged in, and now it looks better".
Random conversation with Alex about monitoring yesterday made me curious to see if anybody has put together a useful distributed monitoring system?
Assume you have a network with Nagios, or similar, monitoring it. If your link between the monitoring box and the hosts being checked is flaky, unreliable, or has blips you will see false positives. We've all been there and seen that.
So, what is the solution? There are two "obvious" ones:
- Move the monitoring as close to the services as possible.
- Monitor from multiple points.
Moving the monitoring closer to the services does reduce the risk of false positives, but introduces its own problems. (i.e. You could be monitoring your cluster, and it could be saying "MySQL up", "Web up", but your ISP could have disconnected you - and you're not connected to the outside world. Oops. The solution there is to test external connectivity too, but that re-introduces the flakyness problem if your link is lossy.)
Distributed monitoring brings up its own issues, but seems like a sane way to go.
I wrote a simple prototype which has the ability to run as a standalone tester, or a CGI script under Apache. The intention is that you run it upon >3 nodes. If the monitoring detects a service is unavailable it queries the other monitoring nodes to see if they also report a failure - if they do it alerts, if not it assumes the failure is due to a "local" issue.
There is a lot of scope for doing it properly, which seems to be Alex's plan, having the nodes run in a mesh and communicate amongst each other "Hey I'm node #1 - I cannot see service X on host Y - Is down for you too?" - but the simple version of having the script just do a wget on the CGI-version on the other nodes is probably "good enough".
I really don't track the state of the art in this field, just struggle to battle nagios into submission. Do there exist systems like this already?
(sample code is here, sample remote-status checks are here and here. Each node will alert if >=2 nodes see a failure. Otherwise silence is golden.)
ObFilm: X-Men Origins: Wolverine
Tags: desktop, dmonitor, fonts, sid
4 May 2009 21:50
dmonitor now has a webpage.
I've been running it for a night now, watching alerts come and go via manual firewall rules so I'm pretty confident it works reliably and in a way that avoids transient failures. The only obvious failure case is if each monitoring node loses the links to each of the others. (Solution there is to have a sufficiently large number of them! Hence the reason the configuration is file/directory based. rsync for the win!)
Still I will leave it there for now. The only things missing are better instructions, and more service checking plugins..
Now to go paint some more ... adding new wall decorations has inspired me a little!
Tags: dmonitor, flat, hacks, space invaders
7 May 2009 21:50
I've mostly been avoiding the computer this evening, but I did spend the last hour working on attempt #2 at distributed monitoring.
The more I plot, plan & ponder the more appealing the notion becomes.
Too many ideas to discuss, but in brief:
My previous idea of running tests every few minutes on each node scaled badly when the number of host+service pairs to be tested grew.
This lead to the realisation that as long as some node tests each host+service pair you're OK. Every node need not check each host on every run - this was something I knew, and had discussed, but I assumed that would be a nice optimisation later rather than something which is almost mandatory.
My previous idea of testing for failures on other nodes after seeing a local failure was similarly flawed. It introduces too many delays:
- Node 1 starts all tests - notices a failure. Records it
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
- Node 2 starts all tests - notices a failure. Records it.
- Fetches current results from all neighbour nodes.
- Sees they are OK - the remote server only just crashed. Oops.
In short you have a synchronisation problem which coupled with the delay of making a large number of tests soon grows. Given a testing period of five minutes, ten testing nodes, and 150 monitored hosts+services, you're looking at delays of 8-15 minutes. On average. (Largely depends on where in the cycle the failing host is, and how many nodes must see a failure prior to alerting.)
So round two has each node picking tests at "random" (making sure no host+service was tested more than 5 minutes ago) and at the point a failure is detected the neighbour nodes are immediately instructed to test and report their results (via XML::RPC).
The new code is simpler, more reliable, and scales better. Plus it doesn't need Apache/CGI.
Anyway bored now. Hellish day. MySQL blows goats.
Tags: dmonitor, mysql
10 May 2009 21:50
This post is being made from my EEEEEeeee PC, using a 3G modem plugged into the USB port. The fact that I'm sat on my sofa, within easy reach of both a network cable and multiple WiFi access points is irrelevant!
I started my adventure yesterday evening, getting pretty annoyed along the way that that it wasn't just plug and go. It turns out I was suffering from two problems:
- The USB device itself alternates between being a modem and being a dumb USB storage device (full of Windows software).
- My copy of Network-Manager was too old.
In short from my Lenny installation I had to upgrade to Sid to get a copy of Network Manager with a "Mobile Broadband" section in its preferences. (I looked for backports, to no avail, and I didn't have the patience to make one mysefl). The new connection looks like this:
Once I added the connection discovered the USB modem device (/dev/ttyUSB0) just didn't work - and I learned about the dual-nature of the device. Thankfully switching is nice and easy "apt-get install usb-modeswitch" then:
# disable storage
usb_modeswitch -v 12d1 -p 1003 -d 1
# enable modem
usb_modeswitch -v 12d1 -p 1003 -H 1
Once that was done the connection worked almost immediately. (I just had to upgrade to a 2.6.29 kernel because I got panics on the Lenny kernel; something the upgrade installed but I'd previously ignored. Kernels: Bane of my life.)
Update: I do see some kernel weirdness talking about timeouts talking to the USB-serial device. Perhaps something to investigate in the future.
Anyway running a 3G O2 PAYG (pay as you go) modem on Debian, on an EEEPC is possible, it is justfiddlier than I had expected, and it required an upgrade to Sid - since Lenny didn't have a network manager with mobile broadband support.
For google's benefit the modem is described by O2 as a "mobile broadband USB modem - E160". This appears under lsusb as :
Bus 001 Device 005: ID 12d1:1003 Huawei Technologies Co., Ltd.
E220 HSDPA Modem / E270 HSDPA/HSUPA Modem
Hope that helps somebody else spend less than 5 hours getting it working. I guess the friends who said "It just worked" were running Ubuntu and so had a slightly newer network manager by default - and possibly their modems didn't need to toggle between "dumb storage" and "actual modem" modes.
Anyway it works now, and even though it was fiddly the issue wasn't insurmountable. I'm just a little grumpy because I've gotten used to a world in which Debian just works - the last time I struggled to get new hardware toys playing nice was .. a long time ago.
ObFilm: Pretty In Pink
Tags: 3g, modems, network-manager, o2, usb
16 May 2009 21:50
I've said it multiple times, but all mailing list managers suck. Especially mailman. (On that topic SELinux is nasty, Emacs is the one true editor, and people who wear furry boots are silly.)
Having setup some new domains I needed a mailing list manager, and had to re-evaluate the available options. Mostly I want something nice and simple, that doesn't mess around with character sets, that requires no fancy deamons, and has no built in archiving solution.
Really all we need is three basic operations:
- Confirmed opt-in subscription.
- Confirmed opt-out unsubscription.
- Post to list.
Using pipes we can easily arrange for a script to be invoked for different options:
# Aliases for 'list-blah' mailing list.
^list-blah-subscribe$: "|/sbin/skxlist --list=list-blah --domain=example.com --subscribe"
^list-blah-unsubscribe$: "|/sbin/skxlist --list=list-blah --domain=example.com --unsubscribe"
^list-blah$: "|/sbin/skxlist --list=list-blah --domain=example.com --post"
The only remaining concerns are security related. The most obvious concern is that the script named will be launched by the mailserver user (Debian-exim in my case). That suggests that any files it creates (such as a list of email addresses - i.e. list members) will be owned by that user.
That can be avoided with setuid-fu and having the mailing list manager be compiled. But compiled code? When there are so many lovely Perl modules out there? No chance!
In conclusion, if you're happy for the exim user to own and be able to read the list data then you can use skxlist.
It is in live use, allows open lists, member-only lists, and admin-only lists. It will archive messages in a maildir, but they are ignored for you to use if you see fit.
List options are pretty minimal, but I do a fair amount of sanity checking and I see no weaknesses except for the use of the Debian-exim UID.
Tags: skxlist, skxmail
20 May 2009 21:50
Etch -> Lenny
This Saturday I'll be upgrading my main box to lenny.
Mostly this should be painless, as the primary services aren't going to change too much.
I've tested the upgrade of the virtual hosting configuration which I use for exim4 on lenny and that works, as-is. I also have a local version of qpsmtpd which I'll be deploying and that works on lenny with my custom plugins.
A new version of Apache 2.x shouldn't cause any problem, although I will need to test each site I have to make sure that Perl module upgrades don't cause any breakage.
I expect random readers will neither notice nor care if my sites go down for an hour or two, but for local people consider this notice ;)
This allows dl/dt/dd/definition lists to have their contents collapsed easily.
Currently I use some custom code to do that (e.g. as used here) but this jquery plugin is far neater, even if the plugin code isn't perhaps the best.
This plugin converts plain links to things that make AJAX requests. In theory this allows graceful enhancements.
e.g. <a href="foo.html#bar">link</a> becomes an AJAX request that loads the contents of "foo.html" into the div with ID bar.
It seems this is a cheap clone of ajaxify, but I didn't know that when I put it together.
ObFilm: The Breakfast Club
Tags: etch, jquery, lenny
24 May 2009 21:50
I've successfully upgraded my primary web/mail/misc host from Debian Etch to Debian Lenny. There were a few minor problems, but on the whole the upgrade was as painless as I've come to expect.
In the past I'd edited my Exim4 configuration to add quite a few ACL checks, for example rejecting mails based upon spoofed/bogus HELO identifiers, and rejecting messages that didn't contain "Subject" or "Date" headers.
The Debian Exim4 configuration may be split into multiple files (which is how I prefer it on the whole). The idea that you just add new files into the existing hierarchy and they'll magically appear in the correct location when a real configuration file is generated. On the whole this works well, but sometimes editing files in-place is required, and it was these local edits that caused me pain.
Fixing things up was mostly not a challenge, primarily it was a matter of removing ACLs until exim4 started without errors - all my spam checking is handled ahead of exim4 these days, except for the last-ditch spam filtering with a combination of procmail-fu and the crm114 classifier package.
Taking a hint from Bubulle's weblog I decided to nuke my CRM114 spam database to avoid any possible version-mismatch issues so now I'm having to classify a lot of "unsure" messages. Happily my memory of doing this last time round is that the initial training of spam/ham takes a day or so to complete.
Anyway now I can start looking to advantage of the things new to Lenny. But probably not until I'm sure things have calmed down and upgraded correctly.
05:00:31 up 260 days, 14:23, 2 users, load average: 0.95, 0.51, 0.31
steve@skx:~$ cat /etc/issue
Debian GNU/Linux 5.0 \n \l
ObFilm: Bill & Ted's Excellent Adventure
Tags: crm114, etch, exim4, lenny