I'm begging you, take a shot. Just one hit.

Friday, 31 July 2009

A while back I mentioned my interest in clustered storage solutions. I only mentioned this in passing because I was wary of getting bogged down in details.

In short though I'd like the ability to seamlessly replicate content across machines which I own, maintain, and operate (I will refer to "storage machines" as "client nodes" to keep in the right theme!)

In clustering terms there are a few ways you can do this. One system which I've used in anger has been drbd. drbd basically allows you to share block devices across a network. This approach has some pros and some cons.

The biggest problem with drbd is that it allows the block device to be mounted at only one location - so you cannot replicate in a live sense. The fact that it is block-based also has implications for the size of your available storage.

Moving back to a higher level I'm pretty much decided that when I say that I want "a replicated/distributed filesystem" that is precisely what I mean: The filesystem should be replicated, rather than having the contents split up and shared about in pieces. The distinction is pretty fine but considering the case of backups this is how you want it to work: Each client node has a complete backup of the data. Of course the downside to this is that the maximum storage available is that of the smallest of the available nodes.

(e.g. 4 machines with 200Gb of space, and 1 machine with 100Gb - you cannot write more than 100Gb to the clustered filesystem in total or the replicated contents will be too large to fit on the smaller host.)
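To make that arithmetic explicit: with full replication the usable capacity is the minimum across the nodes, not the sum. A minimal sketch, where the function name and the plain-integer sizes are invented purely for illustration:

```shell
#!/bin/sh
# usable_capacity: print the smallest of the sizes given as arguments.
# With full replication every node holds everything, so the smallest
# node sets the limit. (Hypothetical helper; sizes are plain integers in Gb.)
usable_capacity() {
    min=$1
    for size in "$@"; do
        if [ "$size" -lt "$min" ]; then
            min=$size
        fi
    done
    echo "$min"
}

# Four 200Gb machines and one 100Gb machine.
usable_capacity 200 200 200 200 100
```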

As a concrete example, given the tree:


I'd expect that exact tree to be mirrored and readable as a filesystem on the client nodes. In practice I want a storage node to either be complete, or broken. I don't want to worry about failure cases when dealing with things like "A file must be present on 2+ nodes out of the 5 available".

Now the cheaty way of doing this is to work like this:

# modify files
# for i in client1 client2 client3 ; do \
    rsync -vazr --inplace --delete /shared $i:/backing-store ; \
  done

i.e. Just rsync after every change. The latency would be a killer, and the level of dedication would be super-human, but if you could wire that up in a filesystem you'd be golden. And this is why I've struggled with lustre, mogilefs, and tahoe. They provide both real and complicated clustering - but they do not give you a simple filesystem you can pick up and run with from any client node in an emergency. They're more low-level. (Probably more reliable, sexy and cool. I bet they get invited to parties and stuff ;)

Thankfully somebody has written exactly what I need: chironfs. This allows you to spread writes to a single directory over multiple storage directories - each of which gets a complete copy.

This is the first example from the documentation :

# make client directories and virtual one
mkdir /client1 /client2 /real

# now do the magic
chironfs --fuseoptions allow_other /client1=/client2 /real

Once you've done this you can experiment:

skx@gold:/real$ touch x
skx@gold:/real$ ls /client1/
skx@gold:/real$ ls /client2/
skx@gold:/real$ touch fdfsf

skx@gold:/real$ ls /client1
fdfsf  x
skx@gold:/real$ ls /client2
fdfsf  x

What this demonstrates is that running "touch foo" in the "/real" directory results in that same operation being carried out in both "/client1" and "/client2".
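Eyeballing "ls" output stops being practical once there are more than a couple of files. A rough way to confirm that two replica directories really do hold the same tree is a recursive diff. This is only a sketch: the helper name is made up, and it uses plain diff rather than anything chironfs provides:

```shell
#!/bin/sh
# replicas_in_sync: succeed only if two replica directories hold
# identical trees. (Hypothetical helper; uses plain diff, nothing
# chironfs-specific.)
replicas_in_sync() {
    diff -r -q "$1" "$2" >/dev/null 2>&1
}

# Example with the directories used above:
if replicas_in_sync /client1 /client2; then
    echo "replicas match"
else
    echo "replicas differ"
fi
```

A missing directory counts as "differ" here, which is probably what you want from a check like this.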

This is looking perfect, but wait, it isn't distributed?!

That's the beauty of this system: if /client1 & /client2 are NFS-mounted it works! Just the way you imagine it should!

Because I'm somewhat wary of fuse at the moment I've set this up at home - with a "/backups" directory replicated to an LVM filesystem locally and also to a remote host which is mounted via SSH.

Assuming it doesn't go mad, lose data, or fail badly if the ssh-based mount disappears I'm going to be very very happy.

So far it is looking good though, I can see entries logged like this when I unmount behind its back:

2009/07/30 23:28 mknod failed accessing /client2 No such file or directory
2009/07/30 23:28 disabling replica failed accessing /client2

So a decent alarm on that would make it obvious when replication was failing. Just like the emails that mdadm sends out, a semi-regular cron-check would probably be sufficient.
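A cron-check along those lines could be as simple as grepping the log for the "disabling replica" message shown above. This is a sketch under assumptions: the helper name is made up, and the log path is a placeholder for wherever chironfs has been told to write its log:

```shell
#!/bin/sh
# check_replica_log: print any "disabling replica" lines from a log
# file and return 1 if there were some. Returns 0 otherwise, including
# when the log file is missing or unreadable.
check_replica_log() {
    logfile=$1
    if grep "disabling replica" "$logfile" 2>/dev/null; then
        return 1
    fi
    return 0
}

# /var/log/chironfs.log is a placeholder path, not a known default.
check_replica_log /var/log/chironfs.log || echo "chironfs has dropped a replica"
```

Run from cron, any output would normally be mailed to the owner of the crontab, which gives much the same effect as the mdadm emails.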

Anyway enough advertising! So far I love it and I think that if it works out for me in anger that chironfs should be part of Debian. Partly because I'll want to use it more, and partly because it just plain rocks!

ObFilm: The Breakfast Club



Comments On This Entry

[gravitar] Ewoud Kohl van Wijngaarden

Submitted at 00:38:37 on 31 july 2009

What's the performance hit and how well does this perform when a client is disconnected? Will it block, just ignore the updates or sync afterward?

Anyway, this does sound promising and something I will look into.

[gravitar] Warren

Submitted at 02:38:13 on 31 july 2009

So what happens after the replication fails, how do you resync?

Reading how a bunch of DDs are working, they have a travel laptop, a home workstation and a home backup server (for example). The laptop regularly gets unplugged and wanders with limited/intermittent network connectivity for the day. Or the battery may die on the laptop for a few days. What happens to bring the filesystems back in sync after an event occurs?

[gravitar] nona

Submitted at 03:23:02 on 31 july 2009

Thanks for the link, I hadn't looked at that one yet. I'm curious what kind of overhead this gives.

I had been looking into different replication strategies for a while too, and I haven't found the perfect system yet. I'd like async replication (because the performance cost of synced operation for everything is prohibitive, and I would like disconnected operation/automatic recovery from failure). I'd also like multi-master writes.

Options I've looked at:
- Coda/Intermezzo - looked interesting, but seems pretty dead these days.
- unison - very useful, does pretty much what I want - but the performance just isn't there to deal with huge filesystems. Scanning the entire filesystem and keeping state on it on every run is just inefficient use of local IO/CPU.
- rsync - it's faster than unison (even though it's also scanning the entire filesystem) but only replicates in one direction.
- syrep - an interesting tool, pretty fast, bidirectional and agnostic when it comes to how to actually transfer the changes over the network. However, it doesn't deal with metadata well (or maybe I didn't find out properly yet how to do it). rdiff and rdup are a bit similar. I suspect git would be better for this these days.
- GFS or OCFS2 on DRBD - apparently a bit brittle and not really recommended.
- Zumastor - a pretty nice and efficient solution. However, the devs seem to have paused the project to work on Tux3, and it's a bit in limbo at the moment.
- ... (lots more)

... all the options suck. Semantics are often different from what apps expect (replicate-on-close doesn't work with databases, for example), or performance is bad (unison/rsync scan everything on every run), or it's replicating at the wrong level (block device like drbd), or ...

Right now I'm using a combination of rsync, unison and git.

Well. I'm off to try chironfs.

[gravitar] twisla

Submitted at 08:07:59 on 31 july 2009

You should take a look at HAMMER ( http://www.dragonflybsd.org/hammer/ ) in DragonFlyBSD, it has a local / network (through ssh) replication feature, and a lot more. Currently replication is only master/slave but there are plans to implement a multi-master mode.

[gravitar] Charles Darke

Submitted at 08:19:30 on 31 july 2009

Hmm. I'm looking to do a similar thing and an opposite thing:

1. Get a replication across machines - mainly so that my files are on desktop, work and laptop. But this should work so that laptop can work offline and files merge correctly afterwards. Not sure how 3way merge would handle.

At first I thought rsync, but can't think of a good way to have that work seamlessly and transparently (e.g. work with unplug of network).

2. Second thing is opposite - centralise storage into one box and remove local storage in other boxes (all will boot via FC). It's easy to slice the disk into separate bits but it would be nice to be able to share common files since all machines run debian. Also laptop does not have a FC HBA so would either have to have local storage or maybe iSCSI.

[gravitar] Ronny Adsetts

Submitted at 08:40:56 on 31 july 2009


Have you taken a look at glusterfs? Probably not any better than chironfs but an alternative.

In theory coda will do the job too, though in practice coda may not be an option - it's been in development forever and releases seem to have become fewer over the last couple of years.


[gravitar] Thomas

Submitted at 08:52:04 on 31 july 2009

with drbd8 you can have two primaries with for example ocfs2 on top of it.

- thomas

[author] Steve Kemp

Submitted at 10:16:39 on 31 july 2009

Ewoud & Warren : When it notices that a location goes away it stops syncing.

At that point you need to restore the location, use rsync to bring it up to date, then restart the chironfs to get it re-syncing.

Still I'm happy to discount laptops for this - I'm thinking of only using it on machines which are constantly (or almost constantly) available. Doing that does sidestep the need to handle offline machines and make "catchup" easy.

If a client node fails I don't mind bringing it up to speed manually - as the expectation is that this will be a rare event, and that in every day operation my master and slaves will be working correctly.
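The manual recovery described above (restore the location, rsync, restart chironfs) might look something like the following. It is a sketch, not a tested procedure: it assumes /client1 is the good replica, /client2 is the one that dropped out and has since been remounted, and /real is the chironfs mount point, as in the example from the post. Stopping chironfs before copying is an assumption made here so that nothing changes mid-copy. The "run" and "resync_replica" helpers are made up, and by default the commands are only printed:

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# resync_replica GOOD STALE MOUNTPOINT (hypothetical helper)
resync_replica() {
    good=$1
    stale=$2
    mountpoint=$3
    # Stop chironfs so nothing changes while copying.
    run fusermount -u "$mountpoint"
    # Bring the stale replica up to date, deleting leftovers.
    run rsync -a --delete "$good/" "$stale/"
    # Restart chironfs with both replicas.
    run chironfs --fuseoptions allow_other "$good=$stale" "$mountpoint"
}

resync_replica /client1 /client2 /real
```

Setting DRY_RUN=0 would actually execute the three commands, so read the printed output first.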

[author] Steve Kemp

Submitted at 10:18:47 on 31 july 2009

Charles - If you want to replicate files across desktop/laptop machines have you just considered pulling from a remote git/svn/mercurial repository?

I have ~/bin/ and similar files pulled from a central site automatically once a week via a ~/.bashrc file.

The attraction of using a revision control system is that you can handle three-way merges gracefully ..

[author] Steve Kemp

Submitted at 10:22:31 on 31 july 2009

Twisla : Hammer looks great, but I'm only interested in Linux solutions so I cannot use it.

Ronny : I did look at glusterfs, but I didn't make it work to my satisfaction ..

[gravitar] Charles Darke

Submitted at 11:37:27 on 31 july 2009

The main thing I'm stuck on is how to make it user friendly. I don't want to have to remember to 'sync' each time I log on or log off.

I could put something in start-up and shutdown scripts, but don't want a delay at start up or shutdown (and want to make it possible to unplug network at random and have everything work).

I guess I could hook on network up-and-down and then maybe watch a directory for changes and sync once this is detected. Hmm...

[author] Steve Kemp

Submitted at 11:54:36 on 31 july 2009

Using incron to watch for changes could be a way to go then?

There are a few systems I maintain that need to sync state bidirectionally and I do that via a cronjob that runs once a minute - which looks something like this:

if [ ! -e /srv/dirty ]; then
    commit-changes /srv
    cd /srv && hg push
    cd /srv && hg pull --update
    rm -f /srv/dirty
fi

The tricky part is writing a wee wrapper so that you "hg add" new files, and "hg remove" files which were previously present but no longer locally available.

But in practice this actually works in a reliable fashion without being too open to race conditions. (Changes pushed to the repository generate an email alert so that a) I have history b) I can spot breakage quickly.)
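The real "commit-changes" wrapper isn't shown, so the following is only a guess at its shape. Mercurial's "hg addremove" already does the add-new/remove-missing step in one go, which covers most of the tricky part. The function name, the commit message and the HG override are all assumptions made for illustration:

```shell
#!/bin/sh
# Hypothetical stand-in for the "commit-changes" wrapper; the real one
# is not shown. HG can be overridden (e.g. HG=echo) to preview commands.
HG=${HG:-hg}

commit_changes() {
    dir=$1
    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$dir" || exit 1
        # addremove: mark new files as added and missing files as removed.
        $HG addremove --quiet || exit 1
        # Commit the result; hg exits non-zero if nothing changed.
        $HG commit --quiet -m "Automatic commit"
    )
}
```

Calling "commit_changes /srv" with HG=echo set prints the two hg invocations without touching the repository, which is a cheap way to check the wrapper before letting cron loose on it.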

[gravitar] Charles Darke

Submitted at 18:02:12 on 31 july 2009

Thanks for the incron link. I was preparing to write something to do what I wanted and didn't realise there was a nice tool that would work.

