

I've accidentally written a replication-friendly filesystem

29 July 2010 21:50

This evening I was mulling over a recurring desire for a simple, scalable, and robust replicated filesystem. These days there are several out there, including Gluster.

For the past year I've personally been using chironfs for my replication needs - I have /shared mounted upon a number of machines, and a write to it on any of them is almost immediately reflected on the others.

This evening, while mulling over a performance problem with Gluster, I was suddenly struck by the idea "Hey, Redis is fast, right? How hard could it be?".

Although Redis is another one of those new-fangled key/value stores it has several other useful primitives, such as "SETS" and "LISTS". Imagine a filesystem which looks like this:

  /srv
  /tmp
  /var
  /var/spool
  /var/spool/tmp

Couldn't we store those entries as members of a set? So we'd have:

  SET ENTRIES:/              -> srv, tmp, var
  SET ENTRIES:/var/spool     -> tmp
  SET ENTRIES:/var/spool/tmp -> (nil)

If you do that then "readdir(path)" becomes merely "SMEMBERS entries:$path" ("SMEMBERS foo" being "members of the set named foo"). At this point you can add and remove directories with ease.
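As a minimal sketch of that mapping - here a plain dict of Python sets stands in for the Redis sets, so no server is needed, and the key names follow the scheme above:

```python
# Dict-of-sets model of the ENTRIES:<path> schema (a sketch, not the
# real implementation, which talks to Redis via hiredis).
entries = {
    "ENTRIES:/": {"srv", "tmp", "var"},
    "ENTRIES:/var/spool": {"tmp"},
    "ENTRIES:/var/spool/tmp": set(),
}

def readdir(path):
    """readdir(path) is just SMEMBERS ENTRIES:<path>."""
    return sorted(entries.get("ENTRIES:" + path, set()))

def mkdir(parent, name):
    """Adding a directory is an SADD on the parent's set."""
    entries.setdefault("ENTRIES:" + parent, set()).add(name)

print(readdir("/"))  # ['srv', 'tmp', 'var']
```

Removing a directory is the symmetric SREM on the parent's set.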

The next step, given an entry in a directory "/tmp", called "bob", is working out the most important things:

  • Is /tmp/bob a directory?
    • Read the key DIRECTORIES:/tmp/bob - if that contains a value it is.
  • What is the owner of /tmp/bob?
    • Read the key FILES:/tmp/bob:UID.
  • If this is a file what is the size? What are the contents?
    • Read the key FILES:/tmp/bob:size for the size.
    • Read the key FILES:/tmp/bob:data for the contents.
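Those three lookups can be sketched the same way, with a dict standing in for Redis string keys (key names follow the post's schema; the values are made-up examples):

```python
# Flat string keys model the DIRECTORIES:<path> and FILES:<path>:* schema.
store = {
    "DIRECTORIES:/tmp": "1",          # any value here marks /tmp as a directory
    "FILES:/tmp/bob:UID": "1000",
    "FILES:/tmp/bob:size": "11",
    "FILES:/tmp/bob:data": "hello world",
}

def is_directory(path):
    # GET DIRECTORIES:<path> - if it contains a value, it's a directory.
    return store.get("DIRECTORIES:" + path) is not None

def stat(path):
    # GET FILES:<path>:UID and FILES:<path>:size for the metadata.
    return {
        "uid": int(store["FILES:" + path + ":UID"]),
        "size": int(store["FILES:" + path + ":size"]),
    }

def read(path):
    # GET FILES:<path>:data returns the whole file body in one reply.
    return store["FILES:" + path + ":data"]
```

Other attributes (GID, mode, mtime) hang off the same FILES:<path>: prefix in the same way.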

So with a little creative thought you end up with a filesystem which is entirely stored in Redis. At this point you're thinking "Oooh shiny. Fast and shiny". But then you think "Redis has built in replication support..."

Not bad.

My code is a little rough and ready, using libfuse & the hiredis C API for Redis. If there's interest I'll share it somewhere.

It should be noted that currently there are two limitations with Redis:

  • All data must fit inside RAM.
  • Master->Slave replication is trivial to set up, and is the only kind of replication you get.

In real terms the second limitation is the killer. You could connect to the Redis server on HostA from many locations - so you get a true replicated server. Given that the binary protocol is simple this might actually be practical in the real-world. My testing so far seems "fine", but I'll need to stress it some more to be sure.

Alternatively you could bind the filesystem to the redis server running upon localhost on multiple machines - one redis server would be the master, and the rest would be slaves. That gives you a filesystem which is read-only on all but one host, but if that master host is updated the slaves see it "immediately". (Does that setup even have a name? I'm thinking of master-write, slave-read, and that gets cumbersome.)
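For that second setup the Redis side is a single configuration directive on each slave (hostname and port here are just examples):

```
# /etc/redis/redis.conf on each read-only slave host:
slaveof hosta 6379
```

The master's redis.conf needs nothing special; slaves resynchronize from it automatically when they start.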

ObQuote: Please, please, please drive faster! -- Wanted



Comments on this entry

Jason Cook at 22:42 on 29 July 2010

We had a similar idea at Wikia that we are working on, except we are using riak as the underlying store. Our interest is primarily oriented towards medium to small blob storage, but it looks promising so far.


Matt Secoske at 02:44 on 30 July 2010

With virtual memory support you can have data stored to disk as well. Also note that Redis blocks on requests, so if you are processing a large file, it will essentially lock the process until the request is returned. (There may be non-blocking support in redis now... I haven't looked in a few months and Antirez is crazy productive).

Pieter Noordhuis at 13:03 on 30 July 2010

@Matt: Redis doesn't block when sending the reply back to the client (and never has, as far as I know). I think you're confusing it with atomic execution of commands. Redis writes the response back to the client whenever the client is ready for the response. In the meantime, other clients are served. In the light of this project, Redis will be perfectly able to serve large files to multiple clients simultaneously.

Aaron Ucko at 18:36 on 30 July 2010

Cute. I presume your actual schema uses inode numbers, though?

Steve Kemp at 08:41 on 31 July 2010

Riak looks cool, and "properly" distributed, but it's not something I've gotten round to investigating yet.

@Aaron: I've not got any notion of an inode yet, although I could certainly simplify the schema by using one, and will move to that kind of scheme shortly:

e.g. Rather than having FILE:MODE:/etc/passwd or DIR:MODE:/etc I'd store "FILE:MODE:1".

That will be particularly useful when I add symlink support.
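A hypothetical sketch of that indirection (key names here are illustrative, not the real schema): paths resolve to inode numbers, and the metadata keys hang off the inode rather than the path.

```python
# Path -> inode indirection: renames touch only the INODE:<path> key,
# while FILE:MODE:<inode> and friends stay put.
store = {
    "INODE:/etc": "1",
    "INODE:/etc/passwd": "2",
    "DIR:MODE:1": "0755",
    "FILE:MODE:2": "0644",
}

def inode(path):
    return store["INODE:" + path]

def file_mode(path):
    # Two GETs instead of one, but the metadata is path-independent,
    # which is what makes symlinks (and hard links) easy to add.
    return store["FILE:MODE:" + inode(path)]
```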

sabueso at 00:38 on 2 August 2010

Gluster doesn't perform well on small files over TCP/IP, but you could use the InfiniBand transport ;-)
One thing to remember: Gluster is a clustered filesystem with replication support, but IMHO replication isn't the big point of a clustered filesystem; scalability is :-)!