Entries Tagged clustering

A while back I mentioned my interest in clustered storage solutions. I only mentioned this in passing because I was wary of getting bogged down in details.

In short though I'd like the ability to seamlessly replicate content across machines which I own, maintain, and operate (I will refer to "storage machines" as "client nodes" to keep in the right theme!)

In clustering terms there are a few ways you can do this. One system which I've used in anger has been drbd. drbd basically allows you to shares block devices across a network. This approach has some pros and some cons.

The biggest problem with drbd is that it allows the block device to be mounted at only one location - so you cannot replicate in a live sense. The fact that it is block-based also has implications on the size of your available storage.

Moving back to a higher level I'm pretty much decided that when I say that I want "a replicated/distributed filesystem" that is precisely what I mean: The filesystem should be replicated, rather than having the contents split up and shared about in pieces. The distinction is pretty fine but considering the case of backups this is how you want it to work: Each client node has a complete backup of the data. Of course the downside to this is that that maximum storage available is the smallest of the available nodes.

(e.g. 4 machines with 200Gb of space, and 1 machine of 100gb - you cannot write more than 100Gb to the clustered filesystem in total or the replicated contents will be too large to fit on the smaller host.)

As a concrete example, given the tree:

/shared
/shared/Text/
/shared/Text/Articles/
/shared/Text/Crypted/

I'd expect that exact tree to be mirrored and readable as a filesystem on the client nodes. In practise I want a storage node to either be complete, or broken. I don't want to worry about failure cases when dealing with things like "A file must be present on 2+ nodes out of the 5 available".

Now the cheaty way of doing this is to work like this:

# modify files
# for i in client1 client2 client3 ; do \
    rsync -vazr --in-place --delete /shared $i:/backing-store ; \
  done

i.e. Just rsync after every change. The latency would be a killer, and the level of dedication would be super-human, but if you could wire that up in a filesystem you'd be golden. And this is why I've struggled with lustre, mogilefs, and tahoe. They provide both real and complicated clustering - but they do not give you a simple filesystem you can pickup and run with from any client node in an emergency. They're more low-level. (Probably more reliable, sexy and cool. I bet they get invited to parties and stuff ;)

Thankfully somebody has written exactly what I need: chironfs. This allows you to spread writes to a single directory over multiple storage directories - each of which gets a complete copy.

This is the first example from the documentation :

# make client directories and virtual one
mkdir /client1 /client2 /real

# now do the magic
chironfs --fuseoptions allow_other /client1=/client2 /real

Once you've done this you can experiment:

skx@gold:/real$ touch x
skx@gold:/real$ ls /client1/
x
skx@gold:/real$ ls /client2/
x
skx@gold:/real$ touch fdfsf

skx@gold:/real$ ls /client1
fdfsf  x
skx@gold:/real$ ls /client2
fdfsf  x

What this demonstrates is that running "touch foo" in the "/real" directory resulted in that same operation being carried out in both the directory /client1 and "/client2".

This is looking perfect, but wait it isn't distributed?!

Thats the beauty of this system if /client1 & /client2 are NFS-mounted it works! Just the way you imagine it should!

Because I'm somewhat wary of fuse at the moment I've set this up at home - with a "/backups" directory replicated to an LVM filesystem locally and also to a remote host which is mounted via SSH.

Assuming it doesn't go mad, lose data, or fail badly if the ssh-based mount disappears I'm going to be very very happy.

So far it is looking good though, I can see entrie logged like this when I unmount behind its back:

2009/07/30 23:28 mknod failed accessing /client2 No such file or directory
2009/07/30 23:28 disabling replica failed accessing /client2

So a decent alarm on that would make it obvious when replication was failing - just like the emails that mdadm sends out a semi-regular cron-check would probably be sufficient.

Anyway enough advertising! So far I love it and I think that if it works out for me in anger that chironfs should be part of Debian. Partly because I'll want to use it more, and partly because it just plain rocks!

ObFilm: The Breakfast Club

Tags: chiron, clustering, filesystems, fuse | 13 comments

Until this time next month I'll be posting code-based discussions only.

Recently I've been wanting to explore creating clustered services, because clusters are definitely things I use professionally.

My initial attempt was to write an auto-clustering version of memcached, because that's a useful tool. Writing the core of the service took an hour or so:

Simple KeyVal.pm implementation.
Give it the obvious methods get, set, delete.
Make it more interesting by creating a read-only append-log.
The logfile will be replayed for clustering.

At the point I was done the following code worked:

use KeyVal;

# Create an object, and set some values
my $obj = KeyVal->new( logfile => "/tmp/foo.log" );
$obj->incr( "steve" );
$obj->incr( "steve" );

print $obj->get( "steve" ) # prints 2.

# Now replay the append-only log
my $replay = KeyVal->new( logfile => "/tmp/foo.log" );
$replay->replay();

print $replay->get( "steve" ) # prints 2.

In the first case we used the primitives to increment a value twice, and then fetch it. In the second case we used the logfile the first object created to replay all prior transactions, then output the value.

Neat. The next step was to make it work over a network. Trivial.

Finally I wanted to autodetect peers, and deploy replication. Each host would send out regular messages along the lines of "Do you have updates made since $time?". Any that did would replay the logfile from the given unixtime offset.

However here I ran into problems. Peer discovery was supposed to be basic, and I figured I'd write something that did leader election by magic. Unfortunately Perls threading code is .. unpleasant:

I wanted to store all known-peers in a singleton.
Then I wanted to create threads that would announce and receive updates.

This failed. Majorly. Because you cannot launch the implementation of a class-method as a thread. Equally you cannot make a variable which is "complex" shared across threads.

I wrote some demo code which works without packages and a shared singleton:

https://github.com/skx/peer-discovery

The Ruby version, by contrast, is much more OO and neater. Meh.

I've now shelved the project.

My next, big, task was to make the network service utterly memcached compatible. That would have been fiddly, but not impossible. Right now I just use a simple line-based network protocol.

I suspect I could have got what I wanted using EventMachine, or similar, but that's a path I've not yet explored, and I'm happy enough with that decision.

Tags: clustering, perl, ruby, software | 2 comments

Entries tagged `clustering`

I'm begging you, take a shot. Just one hit.

So I failed at writing some clustered code in Perl

Recent Posts

Entries tagged clustering

I'm begging you, take a shot. Just one hit.

So I failed at writing some clustered code in Perl

Recent Posts

Entries tagged `clustering`