
 

Entries tagged fuse

It is an army bred for a single purpose

9 February 2009 21:50

It is funny the way things work out when you're looking for help.

Recently I was working on a Ruby + FUSE based filesystem and as part of the development I added simple diagnostic output via trivial code such as this:

@debug && puts("called foo(#{param});")

That was adequate for minimal interactive use, but not so good for real live use. There I started outputting messages to a dedicated logfile, but in practice I became overwhelmed by thousands of lines of output describing everything ever applied to the filesystem.

I figured the natural solution was to have a ring-buffer. (Everybody knows what a ring-buffer is, right?) It could keep the last 500 messages, and newer debug information would just replace older entries. That'd be just enough to be useful if I had a problem, but not so overwhelming that it would get ignored.
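
Something like this trivial sketch is all I mean - the class is purely illustrative, not a real library:

  # A tiny ring-buffer logger: keep only the most recent N messages,
  # silently discarding older ones.
  class RingLogger
    def initialize(max = 500)
      @max      = max
      @messages = []
    end

    def log(msg)
      @messages << "#{Time.now} #{msg}"
      @messages.shift if @messages.size > @max
    end

    def dump
      @messages.each { |line| puts line }
    end
  end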

In Perl I found a nice ring-buffer library, but for Ruby nothing. Locking a region of shared memory via shmget & shmat and keeping an array of a few hundred strings would be simple, but it seems odd that I have to code this myself.

I started searching around and I accidentally stumbled upon the unrelated IPC::DirQueue Perl module. Not useful for my ring-buffer logging problem, but beautifully useful nonetheless.

There is no Debian package, but one was easily created:

dh-make-perl --build --cpan IPC::DirQueue

Already I have a million and one uses for it - not least to solve my problem of maintaining a centralised quarantine for all the spam mail rejected by N MX machines. (Which currently uses a combination of rsync and lockfiles.)

This is the reason why sites like the Perl Advent Calendar are useful - they introduce a useful module every day or two, and expose you to things that you can use in the future.

Of course keeping a sustainable site like that up and running is hard, which is why sites like debaday struggle to attract contributors.

Anyway, random happiness.

ObFilm: The Lord of the Rings: The Two Towers

| 3 comments

 

I'm begging you, take a shot. Just one hit.

31 July 2009 21:50

A while back I mentioned my interest in clustered storage solutions. I only mentioned this in passing because I was wary of getting bogged down in details.

In short, though, I'd like the ability to seamlessly replicate content across machines which I own, maintain, and operate. (I will refer to "storage machines" as "client nodes" to keep in the right theme!)

In clustering terms there are a few ways you can do this. One system which I've used in anger is drbd. drbd basically allows you to share block devices across a network. This approach has some pros and some cons.

The biggest problem with drbd is that it allows the block device to be mounted at only one location - so you cannot replicate in a live sense. The fact that it is block-based also has implications for the size of your available storage.

Moving back to a higher level, I'm pretty much decided that when I say I want "a replicated/distributed filesystem" that is precisely what I mean: the filesystem should be replicated, rather than having the contents split up and shared about in pieces. The distinction is pretty fine, but considering the case of backups this is how you want it to work: each client node has a complete backup of the data. Of course the downside to this is that the maximum storage available is the smallest of the available nodes.

(e.g. 4 machines with 200Gb of space, and 1 machine with 100Gb - you cannot write more than 100Gb to the clustered filesystem in total, or the replicated contents will be too large to fit on the smaller host.)

As a concrete example, given the tree:

/shared
/shared/Text/
/shared/Text/Articles/
/shared/Text/Crypted/

I'd expect that exact tree to be mirrored and readable as a filesystem on the client nodes. In practice I want a storage node to either be complete, or broken. I don't want to worry about failure cases when dealing with things like "a file must be present on 2+ nodes out of the 5 available".

Now the cheaty way of doing this is to work like this:

# modify files
# for i in client1 client2 client3 ; do \
    rsync -vazr --in-place --delete /shared $i:/backing-store ; \
  done

i.e. Just rsync after every change. The latency would be a killer, and the level of dedication would be super-human, but if you could wire that up in a filesystem you'd be golden. And this is why I've struggled with Lustre, MogileFS, and Tahoe. They provide both real and complicated clustering - but they do not give you a simple filesystem you can pick up and run with from any client node in an emergency. They're more low-level. (Probably more reliable, sexy and cool. I bet they get invited to parties and stuff ;)

Thankfully somebody has written exactly what I need: chironfs. This allows you to spread writes to a single directory over multiple storage directories - each of which gets a complete copy.

This is the first example from the documentation:

# make client directories and virtual one
mkdir /client1 /client2 /real

# now do the magic
chironfs --fuseoptions allow_other /client1=/client2 /real

Once you've done this you can experiment:

skx@gold:/real$ touch x
skx@gold:/real$ ls /client1/
x
skx@gold:/real$ ls /client2/
x
skx@gold:/real$ touch fdfsf

skx@gold:/real$ ls /client1
fdfsf  x
skx@gold:/real$ ls /client2
fdfsf  x

What this demonstrates is that running "touch x" in the "/real" directory resulted in the same operation being carried out in both /client1 and /client2.

This is looking perfect - but wait, it isn't distributed?!

That's the beauty of this system: if /client1 & /client2 are NFS-mounted it works! Just the way you imagine it should!

Because I'm somewhat wary of FUSE at the moment I've set this up at home - with a "/backups" directory replicated to an LVM filesystem locally, and also to a remote host which is mounted via SSH.

Assuming it doesn't go mad, lose data, or fail badly if the ssh-based mount disappears I'm going to be very very happy.

So far it is looking good though; I can see entries logged like this when I unmount behind its back:

2009/07/30 23:28 mknod failed accessing /client2 No such file or directory
2009/07/30 23:28 disabling replica failed accessing /client2

So a decent alarm on that would make it obvious when replication was failing - just like the emails that mdadm sends out, a semi-regular cron-check would probably be sufficient.
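
Something as simple as this sketch, run from cron, would probably do - the logfile path is just a guess, use whatever you've configured:

  # Hypothetical cron-check: shout if chironfs has disabled a replica.
  # The logfile location is an assumption - adjust to taste.
  LOG = "/var/log/chironfs.log"

  failures = File.readlines(LOG).grep(/disabling replica/)
  unless failures.empty?
    # Any output here gets mailed to us by cron - mdadm-style.
    puts "chironfs replication failure:"
    puts failures.last(5)
    exit 1
  end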

Anyway, enough advertising! So far I love it, and I think that if it works out for me in anger chironfs should be part of Debian. Partly because I'll want to use it more, and partly because it just plain rocks!

ObFilm: The Breakfast Club

| 13 comments

 

I've accidentally written a replication-friendly filesystem

29 July 2010 21:50

This evening I was mulling over a recurring desire for a simple, scalable, and robust replication filesystem. These days there are several out there, including Gluster.

For the past year I've personally been using chironfs for my replication needs - I have /shared mounted upon a number of machines, and a write to it on any of them will be almost immediately reflected in the others.

This evening, when mulling over a performance problem with Gluster I was suddenly struck by the idea "Hey, Redis is fast. Right? How hard could it be?".

Although Redis is another one of those new-fangled key/value stores it has several other useful primitives, such as "SETS" and "LISTS". Imagine a filesystem which looks like this:

 /
 /srv
 /tmp
 /var/spool/tmp/

Couldn't we store those entries as members of a set? So we'd have:

  SET ENTRIES:/              -> srv, tmp, var
  SET ENTRIES:/var/spool     -> tmp
  SET ENTRIES:/var/spool/tmp -> (nil)

If you do that, "readdir(path)" becomes merely "SMEMBERS ENTRIES:$path" ("SMEMBERS foo" being "members of the set named foo"). At this point you can add and remove directories with ease.
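
As a quick sketch with the Ruby redis gem (not my actual FUSE code - just the idea):

  require "redis"

  redis = Redis.new                # assumes a Redis server on localhost

  # Populate the example tree from above.
  redis.sadd("ENTRIES:/",          ["srv", "tmp", "var"])
  redis.sadd("ENTRIES:/var/spool", "tmp")

  # readdir is nothing more than set-membership.
  def readdir(redis, path)
    redis.smembers("ENTRIES:#{path}")
  end

  p readdir(redis, "/")            # ["srv", "tmp", "var"] - order not guaranteed
  p readdir(redis, "/var/spool")   # ["tmp"]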

The next step, given an entry in a directory "/tmp", called "bob", is working out the most important things:

  • Is /tmp/bob a directory?
    • Read the key DIRECTORIES:/tmp/bob - if that contains a value it is.
  • What is the owner of /tmp/bob?
    • Read the key FILES:/tmp/bob:UID.
  • If this is a file what is the size? What are the contents?
    • Read the key FILES:/tmp/bob:size for the size.
    • Read the key FILES:/tmp/bob:data for the contents.
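
Again as a Ruby sketch rather than the real thing, those lookups are just GETs against the obvious keys:

  require "redis"

  redis = Redis.new

  # Fake up a file so the lookups below return something.
  redis.set("FILES:/tmp/bob:UID",  "1000")
  redis.set("FILES:/tmp/bob:size", "11")
  redis.set("FILES:/tmp/bob:data", "hello world")

  # A directory? Only if the DIRECTORIES: key holds a value.
  def directory?(redis, path)
    !redis.get("DIRECTORIES:#{path}").nil?
  end

  # Minimal stat-like lookup, following the key-naming above.
  def stat(redis, path)
    { :uid  => redis.get("FILES:#{path}:UID"),
      :size => redis.get("FILES:#{path}:size"),
      :data => redis.get("FILES:#{path}:data") }
  end

  p directory?(redis, "/tmp/bob")   # false - it's a file
  p stat(redis, "/tmp/bob")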

So with a little creative thought you end up with a filesystem which is entirely stored in Redis. At this point you're thinking "Oooh shiny. Fast and shiny". But then you think "Redis has built-in replication support..."

Not bad.

My code is a little rough and ready, using libfuse2 & the hiredis C API for Redis. If there's interest I'll share it somewhere.

It should be noted that currently there are two limitations with Redis:

  • All data must fit inside RAM.
  • Master<->Slave replication is trivial, and is the only kind of replication you get.

In real terms the second limitation is the killer. You could connect to the Redis server on HostA from many locations - so you get a true replicated server. Given that the binary protocol is simple this might actually be practical in the real world. My testing so far seems "fine", but I'll need to stress it some more to be sure.

Alternatively you could bind the filesystem to the redis server running upon localhost on multiple machines - one redis server would be the master, and the rest would be slaves. That gives you a filesystem which is read-only on all but one host, but if that master host is updated the slaves see it "immediately". (Does that setup even have a name? I'm thinking of master-write, slave-read, and that gets cumbersome.)

ObQuote: Please, please, please drive faster! -- Wanted

| 6 comments

 

I updated my redis-based filesystem

2 March 2011 21:50

In July last year I made a brief post about a simple filesystem I'd put together which used Redis for the storage.

At that time I thought it was a cute hack, and didn't spend too much time with it. But recently I found a use for it so I cleaned it up, synced up the C client for Redis which I used and generally started to care again.

If it is useful you can now find it online:

The basic idea is the same as it was before, except I did eventually move to an INODE-like system. Each file/directory entry receives a unique identifier (an integer) - and then I store the meta-data in keys based on that identifier.

This means for a file I might have keys, and values, like this:

Key             Value
INODE:1:NAME    The name of the file (e.g. "passwd").
INODE:1:SIZE    The size of the file (e.g. "1661").
INODE:1:GID     The group ID of the file's owner (e.g. "0").
INODE:1:UID     The user ID of the file's owner (e.g. "0").
INODE:1:MODE    The mode of the file (e.g. "0755").

To store the entries in each directory I use a Redis "SET", which allows me to easily iterate over all the entries in a directory.
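
As a rough Ruby sketch - the real code uses the hiredis C API, and the "DIR:" set-name here is just illustrative - listing a directory becomes a walk over the set followed by a GET per inode:

  require "redis"

  redis = Redis.new

  # Each directory keeps a SET of inode identifiers (set-name illustrative).
  redis.sadd("DIR:/etc", "1")

  # Per-inode meta-data, keyed off the integer identifier.
  redis.set("INODE:1:NAME", "passwd")
  redis.set("INODE:1:SIZE", "1661")
  redis.set("INODE:1:UID",  "0")
  redis.set("INODE:1:GID",  "0")
  redis.set("INODE:1:MODE", "0755")

  # Listing a directory: walk the set, then read each entry's name.
  redis.smembers("DIR:/etc").each do |inode|
    puts redis.get("INODE:#{inode}:NAME")
  end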

ObQuote: "They fuck up, they get beat. We fuck up, they give us pensions. " - The Wire

| 3 comments

 

A filesystem for known_hosts

19 April 2018 12:01

The other day I had an idea that wouldn't go away: a filesystem that exports the contents of ~/.ssh/known_hosts.

I can't think of a single practical use for it, beyond simple shell-scripting, and yet I couldn't resist.

 $ go get -u github.com/skx/knownfs
 $ go install github.com/skx/knownfs

Now make it work:

 $ mkdir ~/knownfs
 $ knownfs ~/knownfs

Beneath our mount-point we can expect one directory for each known host. So we'll see entries like these:

 ~/knownfs $ ls | grep \.vpn
 builder.vpn
 deagol.vpn
 master.vpn
 www.vpn

 ~/knownfs $ ls | grep steve
 blog.steve.fi
 builder.steve.org.uk
 git.steve.org.uk
 mail.steve.org.uk
 master.steve.org.uk
 scatha.steve.fi
 www.steve.fi
 www.steve.org.uk

The host-specific entries will each contain a single file, fingerprint, containing the fingerprint of the remote host:

 ~/knownfs $ cd www.steve.fi
 ~/knownfs/www.steve.fi $ ls
 fingerprint
 frodo ~/knownfs/www.steve.fi $ cat fingerprint
 98:85:30:f9:f4:39:09:f7:06:e6:73:24:88:4a:2c:01

I've used it in a few shell-loops to run commands against hosts matching a pattern, but beyond that I'm struggling to think of a use for it.
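
In Ruby, rather than a shell-loop, that kind of thing looks like this - the pattern and the command are just examples:

  # Run a command against every known-host matching a pattern,
  # simply by listing the mount-point.
  knownfs = File.expand_path("~/knownfs")

  Dir.children(knownfs).grep(/\.vpn$/).sort.each do |host|
    system("ssh", host, "uptime")
  end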

If you like the idea I guess have a play:

It was perhaps more useful and productive than my other recent work - which involves porting an existing network-testing program from Ruby to golang, and in the process making it much more uniform and self-consistent.

The resulting network tester is pretty good, and can now notify via MQ to provide better decoupling too. The downside is of course that nobody changes network-testing solutions on a whim, and so these things are basically always in-house only.

| 3 comments