Continuing the theme from the last post I made, I've recently started working my way down the list of existing object-storage implementations.
tahoe-LAFS is a well-established project which looked like a good fit for my needs:
- Simple API.
- Good handling of redundancy.
Getting the system up and running, on four nodes, was very simple. Setup a single/simple "introducer" which is a well-known node that all hosts can use to find each other, and then setup four deamons for storage.
When files are uploaded they are split into chunks, and these chunks are then distributed amongst the various nodes. There are some configuration settings which determine how many chunks files are split into (10 by default), how many chunks are required to rebuild the file (3 by default) and how many copies of the chunks will be created.
The biggest problem I have with tahoe is that there is no rebalancing support: Setup four nodes, and the space becomes full? You can add more nodes, new uploads go to the new nodes, while old ones stay on the old. Similarly if you change your replication-counts because you're suddenly more/less paranoid this doesn't affect existing nodes.
In my perfect world you'd distribute blocks around pretty optimistically, and I'd probably run more services:
- An introducer - To allow adding/removing storage-nodes on the fly.
- An indexer - to store the list of "uploads", meta-data, and the corresponding block-pointers.
- The storage-nodes - to actually store the damn data.
The storage nodes would have the primitives "List all blocks", "Get block", "Put block", and using that you could ensure that each node had sent its data to at least N other nodes. This could be done in the background.
The indexer would be responsible for keeping track of which blocks live where, and which blocks are needed to reassemble upload N. There's probably more that it could do.