I've accidentally written a replication-friendly filesystem

This evening I was mulling over a recurring desire for a simple, scalable, and robust replication filesystem. These days there are several out there, including Gluster.

For the past year I've personally been using chironfs for my replication needs - I have /shared mounted upon a number of machines and a write to it on any will be almost immediately reflected in the others.

This evening, when mulling over a performance problem with Gluster I was suddenly struck by the idea "Hey, Redis is fast. Right? How hard could it be?".

Although Redis is another one of those new-fangled key/value stores it has several other useful primitives, such as "SETS" and "LISTS". Imagine a filesystem which looks like this:

 /
 /srv
 /tmp
 /var/spool/tmp/

Couldn't we store those entries as members of a set? So we'd have:

  SET ENTRIES:/              -> srv, tmp, var
  SET ENTRIES:/var/spool     -> tmp
  SET ENTRIES:/var/spool/tmp -> (nil)

If you do that "readdir(path):" becomes merely "SMEMBERS entries:$path" ("SMEMBERS foo" being "members of the set named foo"). At this point you can add and remove directories with ease.

The next step, given an entry in a directory "/tmp", called "bob", is working out the most important things:

Is /tmp/bob a directory?
- Read the key DIRECTORIES:/tmp/bob - if that contains a value it is.
What is the owner of /tmp/bob?
- Read the key FILES:/tmp/bob:UID.
If this is a file what is the size? What are the contents?
- Read the key FILES:/tmp/bob:size for the size.
- Read the key FILES:/tmp/bob:data for the contents.

So with a little creative thought you end up with a filesystem which is entirely stored in Redis. At this point you're thinking "Oooh shiny. Fast and shiny". But then you think "Redis has built in replication support..."

Not bad.

My code is a little rough and ready, using libfus2 & the hiredis C API for Redis. If there's interest I'll share it somewhere.

It should be noted that currently there are two limitations with Redis:

All data must fit inside RAM.
Master<->Slave replication is trivial, and is the only kind of replication you get.

In real terms the second limitation is the killer. You could connect to the Redis server on HostA from many locations - so you get a true replicated server. Given that the binary protocol is simple this might actually be practical in the real-world. My testing so far seems "fine", but I'll need to stress it some more to be sure.

Alternatively you could bind the filesystem to the redis server running upon localhost on multiple machines - one redis server would be the master, and the rest would be slaves. That gives you a filesystem which is read-only on all but one host, but if that master host is updated the slaves see it "immediately". (Does that setup even have a name? I'm thinking of master-write, slave-read, and that gets cumbersome.)

ObQuote: Please, please, please drive faster! -- Wanted

Tags: filesystems, fuse, redis | 6 comments

I've accidentally written a replication-friendly filesystem

Comments on this entry

Recent Posts