Sanity testing drives

Thursday, 12 August 2010

Recently I came across a situation where moving a lot of data around on a machine with a 3Ware RAID card ultimately killed the machine.

Testing the hardware in advance for this kind of failure requires a test of both:

  • The individual drives which make up the RAID array.
  • The filesystem which is layered on top of it.

The former can be done with badblocks, etc. The latter requires a simple tool to create a bunch of huge files with "random" contents, then later verify they have the contents you expected.

With that in mind:

dt --files=1000  --size=100M [--no-delete|--delete]


  • Creates, in turn, 1000 files.
  • Each created file will be 100MB long.
  • Each created file will have random contents written to it, and be closed.
  • Once closed the file will be re-opened and the MD5sum computed
    • Both in my code and by calling /usr/bin/md5sum.
    • If these sums mis-match, indicating a data-error, we abort.
  • Otherwise we delete the file and move on.
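
The loop above can be sketched roughly as follows. This is not the dt source, just a minimal Python illustration under some assumptions: the function name `write_and_verify` and the `dt-NNNNNN.bin` file names are made up, and instead of hashing the file twice (in code and via /usr/bin/md5sum) it hashes the data as it is written and compares that against a hash of what is read back.

```python
import hashlib
import os


def write_and_verify(directory, files=3, size=1024 * 1024, delete=True):
    """Write `files` random files of `size` bytes, re-read each, compare MD5 sums."""
    chunk = 64 * 1024
    for n in range(files):
        # Hypothetical naming scheme, chosen only for this sketch.
        path = os.path.join(directory, "dt-%06d.bin" % n)

        # Write random data, hashing it on the way out.
        expected = hashlib.md5()
        remaining = size
        with open(path, "wb") as fh:
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                expected.update(block)
                fh.write(block)
                remaining -= len(block)
            fh.flush()
            os.fsync(fh.fileno())

        # Re-open the closed file and hash what actually comes back.
        actual = hashlib.md5()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                actual.update(block)

        # A mismatch indicates a data error, so abort.
        if actual.digest() != expected.digest():
            raise IOError("checksum mismatch in %s" % path)
        if delete:
            os.remove(path)
    return files
```

Calling `write_and_verify("/mnt/test", files=1000, size=100 * 1024 * 1024)` would roughly correspond to the command line above, where /mnt/test is a hypothetical mount point. One caveat: a read straight after a write may be served from the page cache rather than the disk, so a mismatch proves a problem but a match is weaker evidence.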

Adding "--no-delete" and "--files=100000" allows you to continue testing until your drive is full and you've tested every part of the filesystem.

Trivial toy, or possibly useful to sanity-check a filesystem? You decide. Or just:

hg clone http://dt.repository.steve.org.uk/

(dt == disk test)

ObQuote: "Stand back boy! This calls for divine intervention! " - "Brain Dead"



I've accidentally written a replication-friendly filesystem

Thursday, 29 July 2010

This evening I was mulling over a recurring desire for a simple, scalable, and robust replication filesystem. These days there are several out there, including Gluster.

For the past year I've personally been using chironfs for my replication needs - I have /shared mounted upon a number of machines and a write to it on any will be almost immediately reflected in the others.

This evening, when mulling over a performance problem with Gluster I was suddenly struck by the idea "Hey, Redis is fast. Right? How hard could it be?".

Although Redis is another one of those new-fangled key/value stores it has several other useful primitives, such as "SETS" and "LISTS". Imagine a filesystem which looks like this:


Couldn't we store those entries as members of a set? So we'd have:

  SET ENTRIES:/              -> srv, tmp, var
  SET ENTRIES:/var/spool     -> tmp
  SET ENTRIES:/var/spool/tmp -> (nil)

If you do that "readdir(path)" becomes merely "SMEMBERS ENTRIES:$path" ("SMEMBERS foo" being "members of the set named foo"; key names are case-sensitive, so the prefix must match the one used above). At this point you can add and remove directories with ease.

The next step, given an entry in a directory "/tmp", called "bob", is working out the most important things:

  • Is /tmp/bob a directory?
    • Read the key DIRECTORIES:/tmp/bob - if that contains a value it is.
  • What is the owner of /tmp/bob?
    • Read the key FILES:/tmp/bob:UID.
  • If this is a file what is the size? What are the contents?
    • Read the key FILES:/tmp/bob:size for the size.
    • Read the key FILES:/tmp/bob:data for the contents.
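
To make the key layout concrete, here is a minimal Python sketch of it. It is not the FUSE/hiredis code described in this post: `FakeRedis` is a hypothetical in-memory stand-in that mimics only the handful of Redis-style commands needed (set membership plus plain get/set), and the helper names are invented for illustration. Only the key names (ENTRIES:, DIRECTORIES:, FILES:...:UID, :size, :data) come from the text.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in for the few Redis commands used here."""

    def __init__(self):
        self.strings = {}
        self.sets = {}

    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)

    def smembers(self, key):
        return set(self.sets.get(key, set()))

    def set(self, key, value):
        self.strings[key] = value

    def get(self, key):
        return self.strings.get(key)


def _split(path):
    """Split '/tmp/bob' into ('/tmp', 'bob'); the parent of '/tmp' is '/'."""
    parent, _, name = path.rpartition("/")
    return (parent or "/"), name


def mkdir(store, path):
    parent, name = _split(path)
    store.sadd("ENTRIES:" + parent, name)      # list it in the parent
    store.set("DIRECTORIES:" + path, "1")      # any value marks a directory


def write_file(store, path, data, uid=0):
    parent, name = _split(path)
    store.sadd("ENTRIES:" + parent, name)
    store.set("FILES:%s:UID" % path, str(uid))
    store.set("FILES:%s:size" % path, str(len(data)))
    store.set("FILES:%s:data" % path, data)


def readdir(store, path):
    # Equivalent in spirit to: SMEMBERS ENTRIES:$path
    return sorted(store.smembers("ENTRIES:" + path))


def is_directory(store, path):
    return store.get("DIRECTORIES:" + path) is not None
```

With this, `mkdir(store, "/tmp")` followed by `write_file(store, "/tmp/bob", b"hello", uid=1000)` leaves "bob" in the set ENTRIES:/tmp, and `is_directory(store, "/tmp/bob")` is false because no DIRECTORIES:/tmp/bob key exists.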

So with a little creative thought you end up with a filesystem which is entirely stored in Redis. At this point you're thinking "Oooh shiny. Fast and shiny". But then you think "Redis has built in replication support..."

Not bad.

My code is a little rough and ready, using libfuse2 & the hiredis C API for Redis. If there's interest I'll share it somewhere.

It should be noted that currently there are two limitations with Redis:

  • All data must fit inside RAM.
  • Master<->Slave replication is trivial, and is the only kind of replication you get.

In real terms the second limitation is the killer. You could connect to the Redis server on HostA from many locations - so you get a true replicated server. Given that the binary protocol is simple this might actually be practical in the real-world. My testing so far seems "fine", but I'll need to stress it some more to be sure.

Alternatively you could bind the filesystem to the redis server running upon localhost on multiple machines - one redis server would be the master, and the rest would be slaves. That gives you a filesystem which is read-only on all but one host, but if that master host is updated the slaves see it "immediately". (Does that setup even have a name? I'm thinking of master-write, slave-read, and that gets cumbersome.)

ObQuote: Please, please, please drive faster! -- Wanted



I'm begging you, take a shot. Just one hit.

Friday, 31 July 2009

A while back I mentioned my interest in clustered storage solutions. I only mentioned this in passing because I was wary of getting bogged down in details.

In short though I'd like the ability to seamlessly replicate content across machines which I own, maintain, and operate (I will refer to "storage machines" as "client nodes" to keep in the right theme!)

In clustering terms there are a few ways you can do this. One system which I've used in anger has been drbd. drbd basically allows you to share block devices across a network. This approach has some pros and some cons.

The biggest problem with drbd is that it allows the block device to be mounted at only one location - so you cannot replicate in a live sense. The fact that it is block-based also has implications for the size of your available storage.

Moving back to a higher level I'm pretty much decided that when I say that I want "a replicated/distributed filesystem" that is precisely what I mean: The filesystem should be replicated, rather than having the contents split up and shared about in pieces. The distinction is pretty fine but considering the case of backups this is how you want it to work: Each client node has a complete backup of the data. Of course the downside to this is that the maximum storage available is that of the smallest of the available nodes.

(e.g. 4 machines with 200GB of space, and 1 machine with 100GB - you cannot write more than 100GB to the clustered filesystem in total or the replicated contents will be too large to fit on the smaller host.)

As a concrete example, given the tree:


I'd expect that exact tree to be mirrored and readable as a filesystem on the client nodes. In practice I want a storage node to either be complete, or broken. I don't want to worry about failure cases when dealing with things like "A file must be present on 2+ nodes out of the 5 available".

Now the cheaty way of doing this is to work like this:

# modify files
# for i in client1 client2 client3 ; do \
    rsync -vazr --inplace --delete /shared $i:/backing-store ; \
  done

i.e. Just rsync after every change. The latency would be a killer, and the level of dedication would be super-human, but if you could wire that up in a filesystem you'd be golden. And this is why I've struggled with lustre, mogilefs, and tahoe. They provide both real and complicated clustering - but they do not give you a simple filesystem you can pick up and run with from any client node in an emergency. They're more low-level. (Probably more reliable, sexy and cool. I bet they get invited to parties and stuff ;)

Thankfully somebody has written exactly what I need: chironfs. This allows you to spread writes to a single directory over multiple storage directories - each of which gets a complete copy.

This is the first example from the documentation:

# make client directories and virtual one
mkdir /client1 /client2 /real

# now do the magic
chironfs --fuseoptions allow_other /client1=/client2 /real

Once you've done this you can experiment:

skx@gold:/real$ touch x
skx@gold:/real$ ls /client1/
skx@gold:/real$ ls /client2/
skx@gold:/real$ touch fdfsf

skx@gold:/real$ ls /client1
fdfsf  x
skx@gold:/real$ ls /client2
fdfsf  x

What this demonstrates is that running "touch foo" in the "/real" directory resulted in that same operation being carried out in both "/client1" and "/client2".

This is looking perfect, but wait it isn't distributed?!

That's the beauty of this system: if /client1 & /client2 are NFS-mounted it works! Just the way you imagine it should!

Because I'm somewhat wary of fuse at the moment I've set this up at home - with a "/backups" directory replicated to an LVM filesystem locally and also to a remote host which is mounted via SSH.

Assuming it doesn't go mad, lose data, or fail badly if the ssh-based mount disappears I'm going to be very very happy.

So far it is looking good though, I can see entries logged like this when I unmount behind its back:

2009/07/30 23:28 mknod failed accessing /client2 No such file or directory
2009/07/30 23:28 disabling replica failed accessing /client2

So a decent alarm on that would make it obvious when replication was failing - just like the emails that mdadm sends out. A semi-regular cron check would probably be sufficient.
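
Such a cron check could be as small as the following Python sketch. It assumes the log lines look exactly like the two quoted above (date, time, then the message); the function name is hypothetical, and where chironfs writes its log is deliberately left out because that depends on how it was started.

```python
import re

# Matches lines such as:
#   2009/07/30 23:28 disabling replica failed accessing /client2
DISABLED = re.compile(r"^(\S+ \S+) disabling replica failed accessing (\S+)")


def disabled_replicas(lines):
    """Return {replica_path: timestamp of its first 'disabling replica' entry}."""
    found = {}
    for line in lines:
        match = DISABLED.match(line.strip())
        if match:
            timestamp, replica = match.groups()
            found.setdefault(replica, timestamp)
    return found
```

Run from cron against the log file, a non-empty result is the signal to send a mail, much as mdadm does for a degraded array.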

Anyway enough advertising! So far I love it and I think that if it works out for me in anger chironfs should be part of Debian. Partly because I'll want to use it more, and partly because it just plain rocks!

ObFilm: The Breakfast Club



I'm getting married, I'm not joining a convent!

Tuesday, 21 July 2009

(This post was accidentally made live before it was completed; it is now complete.)

I'll keep this brief and to the point.

syslog indexing and searching

Jason Hedden suggested using swish-e to index and then search syslog files which are stored on disk - rather than inserting the log entries into mysql.

I have 120+ machines writing to a central server, and running a search of 'sshd.*refused' takes less than a second to complete now.

(To be fair using php-syslog-ng was fast, it was just ugly, hard to manage, and the mysql database got overloaded)

Cloud Storage .. but on my machines

I've become increasingly interested in both centralised hosting, and reliable backups.

Cloud storage, where I control all the nodes, allows good backups.

So far I've experimented with both mogilefs and peerfuse. Neither setup is entirely appropriate for me, but I love the idea of seamless replication.


Many ice-creams bought in supermarkets come in packs of three. Annoying:

  • One for madam x.
  • One for me.
  • Who gets the spare? (Me, when she's gone ;)

It happens too often to be a coincidence: my cynicism wonders if it is designed to ensure people buy two boxes..?

new software releases

skxlist, the simple mailing list manager, got a couple of new options after user-submitted suggestions.

asql got a bugfix.

My todolist code is now running on at least one other site!

Nothing else much to say mostly because I'm suffering from poor sleep at the moment. In part because I've got a new clock on my bedroom windowsill and the ticking is distracting me (not to mention the on-the-hour chime!)

Still I'm sure it will pass. I grew up in York in a house that had the back yard abutting the local convent. Every hour, on the hour, they'd have bells ring! We moved house when I was about 11, but for months after the move I'd still wake up at midnight confused that the bells hadn't rung...

ObFilm: Mamma Mia!



I saw green fields and flowers. I could smell the grass.

Tuesday, 20 January 2009

Fabio Tranchitella recently posted about his new filesystem which really reminded me of an outstanding problem I have.

I do some email filtering, and that is set up in a nice distributed fashion. I have a web/db machine, and then I have a number of MX machines which process incoming mail, rejecting spam and queuing good mail for delivery.

I try not to talk about it very often, because that just smells of marketing. More users would be good, but I find explicit promotion & advertising distasteful. (It helps to genuinely consider users as users, and not customers even though money changes hands.)

Anyway I handle mail for just over 150 domains (some domains will receive 40,000 emails a day others will receive 10 emails a week) and each of these domains has different settings, such as "is virus scanning enabled?" and "which are the valid localparts at this domain?", then there are whitelists, blacklists, all that good stuff.

The user is encouraged to fiddle with their settings via the web/db/master machine - but ultimately any settings are actually applied and used upon the MX boxes. This was initially achieved by having MySQL database slaves, but eventually I settled upon a simpler and more robust scheme: Using the filesystem. (Many reasons why, but perhaps the simplest justification is that this way things continue to work even if the master machine goes offline, or there are network routing issues. Each MX machine is essentially standalone and doesn't need to be always talking to the master host. This is good.)

On the master each domain has settings beneath /srv. Changes are applied to the files there, and to make the settings live on the slave MX boxes I can merely rsync the contents over.

Here's an anonymized example of a settings hierarchy:

|-- basics
|   `-- enabled
|-- dnsbl
|   |-- action
|   `-- zones
|       |-- foo.example.com
|       `-- bar.spam-house.com
|-- language
|   `-- english-only
|-- mx
|-- quarantine
|   `-- admin_._admin
|-- spam
|   |-- action
|   |-- enabled
|   `-- text
|-- spamtraps
|   |-- anonymous
|   `-- bobby
|-- uribl
|   |-- action
|   |-- enabled
|   `-- text
|-- users
|   |-- bob
|   |-- root
|   |-- simon
|   |-- smith
|   |-- steve
|   `-- wildcard
|-- virus
|   |-- action
|   |-- enabled
|   `-- text
`-- whitelisted
    |-- enabled
    |-- hosts
    |-- subjects
    |   `-- [blah]
    |-- recipients
    |   `-- simon
    `-- senders
        |-- root@steve.orgy
        |-- @someisp.com
        `-- foo@bar.com

So a user makes a change on the web machine. That updates /srv on the master machine immediately - and then every fifteen minutes, or so, the settings are pushed across to the MX boxes where the incoming mail is actually processed.

Now ideally I want the updates to be applied immediately. That means I should look at using sshfs or similar. But also as a matter of policy I want to keep things reliable. If the main box dies I don't want the machines to suddenly cease working. So that rules out remotely mounting via sshfs, nfs or similar.

Thus far I've not really looked at the possibilities, but I'm leaning towards having each MX machine look for settings in two places:

  • Look for "live" copies in /srv/
  • If that isn't available then fall back to reading settings from /backup/

That way I can rsync to /backup on a fixed schedule, but expect that in everyday operation I'll get current/live settings from /srv via NFS, sshfs, or something similar.
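
The two-place lookup might be sketched like this in Python. Several things here are assumptions rather than a description of the real setup: the function names are invented, "available" is taken to mean "the directory exists and is not empty" (an unmounted mount point is usually just an empty directory), settings are assumed to live under a per-domain directory, and a feature is assumed to be switched on simply by the presence of its "enabled" file.

```python
import os


def settings_root(live="/srv", backup="/backup"):
    """Prefer the live tree; fall back to the rsync'd copy if it is unusable."""
    for root in (live, backup):
        try:
            # Assumption: an empty or missing directory means "not available".
            if os.path.isdir(root) and os.listdir(root):
                return root
        except OSError:
            continue
    raise RuntimeError("no settings tree available")


def is_enabled(root, domain, feature):
    """Assumed layout: <root>/<domain>/<feature>/enabled exists when switched on."""
    return os.path.exists(os.path.join(root, domain, feature, "enabled"))
```

For example `is_enabled(settings_root(), "example.com", "virus")` would consult /srv when it is populated and /backup otherwise. Note that a network mount which hangs rather than fails would block this check instead of triggering the fallback, so a timeout around it may be needed.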

My job for the weekend is to look around and see what filesystems are available and look at testing them.




