This evening I've been mostly playing with removing duplicate content. I've had this idea for the past few days about object-storage, and obviously in that context if you can handle duplicate content cleanly that's a big win.
The naive implementation of object-storage involves splitting uploaded files into chunks, storing them separately, and writing database-entries such that you can reassemble the appropriate chunks when the object is retrieved.
If you store chunks on-disk, by the hash of their contents, then things are nice and simple.
The end result is that you might upload the file /etc/passwd, split that into four-byte chunks, and then hash each chunk using SHA256.
This leaves you with some database-entries, and a bunch of files on-disk - each chunk written out once, named by its SHA256 hash, with the database recording which chunks, in which order, make up the object.
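Something like this sketch shows the whole flow - to be clear this is a reconstruction rather than my actual toy-code, and the chunks/ directory, the parts table, and the use of the sqlite3 gem are all just illustrative choices:

```ruby
require "digest"
require "fileutils"
require "sqlite3"

CHUNK_SIZE = 4          # four-byte chunks, as in the example above
STORE      = "chunks"

FileUtils.mkdir_p(STORE)

db = SQLite3::Database.new("objects.db")
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS parts (
    object TEXT,     -- which upload the chunk belongs to
    seq    INTEGER,  -- its position within that upload
    hash   TEXT      -- SHA256 of the chunk contents
  )
SQL

def store(db, name, path)
  seq = 0
  File.open(path, "rb") do |f|
    while (chunk = f.read(CHUNK_SIZE))
      hash = Digest::SHA256.hexdigest(chunk)
      dest = File.join(STORE, hash)
      # Identical chunks hash identically, so duplicates cost nothing extra.
      File.binwrite(dest, chunk) unless File.exist?(dest)
      db.execute("INSERT INTO parts VALUES (?,?,?)", [name, seq, hash])
      seq += 1
    end
  end
end

def retrieve(db, name)
  db.execute("SELECT hash FROM parts WHERE object = ? ORDER BY seq", [name])
    .map { |(hash)| File.binread(File.join(STORE, hash)) }
    .join
end

store(db, "/etc/passwd", "/etc/passwd")
```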
In my toy-code I wrote out the data in 4-byte chunks, which is grossly inefficient. But the value of using such small pieces is that there are liable to be a lot of collisions, and collisions mean we save space. It is a trade-off.
So the main thing I was experimenting with was the size of the chunks. If you make them too small you lose on I/O, due to the overhead of writing out so many small files, but you gain because collisions are common.
The rough testing I did involved using chunks of 16, 32, 128, 255, 512, 1024, 2048, and 4096 bytes. As the sizes went up the overhead shrank, but so did the collisions.
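Measuring that is easy enough; here is a rough sketch of the kind of measurement involved - not my actual test harness, it only counts duplicate chunks within a single file, and the output format is invented:

```ruby
require "digest"

path = ARGV[0] || "/bin/ls"
data = File.binread(path)

# For each chunk-size: how many chunks dedupe away? A crude proxy
# for the space-saving half of the trade-off.
[16, 32, 128, 255, 512, 1024, 2048, 4096].each do |size|
  chunks = (0...data.bytesize).step(size).map { |i| data.byteslice(i, size) }
  unique = chunks.map { |c| Digest::SHA256.hexdigest(c) }.uniq.size
  saved  = 100.0 * (chunks.size - unique) / chunks.size
  printf("%5d bytes: %7d chunks, %7d unique (%.1f%% deduped)\n",
         size, chunks.size, unique, saved)
end
```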
Unless you can count on users uploading a lot of files like /bin/ls, which are going to collide 100% of the time with prior uploads, larger chunks just didn't win as much as I thought they would.
I wrote a toy server using Sinatra & Ruby, which handles the splitting and hashing, and stores the block-IDs in SQLite. It's nothing novel, given that it took only an hour or so to write.
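A stripped-down sketch of that kind of server might look like the following - this isn't the real code, and the /object/:name routes and table layout are names I've made up for illustration:

```ruby
require "sinatra"
require "digest"
require "fileutils"
require "sqlite3"

CHUNK = 4096
FileUtils.mkdir_p("blocks")
DB = SQLite3::Database.new("store.db")
DB.execute("CREATE TABLE IF NOT EXISTS parts (obj TEXT, seq INT, hash TEXT)")

# Upload: split the body into chunks, store each block under its hash.
put "/object/:name" do
  seq = 0
  request.body.read.bytes.each_slice(CHUNK) do |slice|
    chunk = slice.pack("C*")
    hash  = Digest::SHA256.hexdigest(chunk)
    path  = File.join("blocks", hash)
    File.binwrite(path, chunk) unless File.exist?(path)
    DB.execute("INSERT INTO parts VALUES (?,?,?)", [params[:name], seq, hash])
    seq += 1
  end
  "OK\n"
end

# Download: look up the ordered block-IDs and reassemble.
get "/object/:name" do
  rows = DB.execute("SELECT hash FROM parts WHERE obj = ? ORDER BY seq",
                    [params[:name]])
  halt 404 if rows.empty?
  rows.map { |(h)| File.binread(File.join("blocks", h)) }.join
end
```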
The downside of my approach is also immediately apparent: all the data must live on a single machine, so that reassembly works in the simple fashion. That's possible, even with lots of content, if you use GlusterFS or similar, but it's probably not a great approach in general. If you have large-capacity storage available locally then this might work well enough for storing backups, etc, but .. yeah.
28 May 2015 21:50
Continuing the theme from the last post I made, I've recently started working my way down the list of existing object-storage implementations.
tahoe-LAFS is a well-established project which looked like a good fit for my needs:
- Simple API.
- Good handling of redundancy.
Getting the system up and running, on four nodes, was very simple. Set up a single, simple "introducer", which is a well-known node that all hosts can use to find each other, and then set up four daemons for storage.
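From memory the setup looked roughly like this - the exact commands differ between Tahoe releases, so treat it as a sketch rather than a recipe:

```
tahoe create-introducer ~/introducer
tahoe start ~/introducer

# Then for each of the four storage nodes:
tahoe create-node ~/node1
$EDITOR ~/node1/tahoe.cfg    # point introducer.furl at the introducer
tahoe start ~/node1
```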
When files are uploaded they are split into chunks, and these chunks are then distributed amongst the various nodes. There are some configuration settings which determine how many chunks a file is split into (10 by default), how many chunks are required to rebuild the file (3 by default), and how many copies of the chunks will be created.
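If I'm remembering the configuration correctly those knobs live in each node's tahoe.cfg, in the [client] section; the values below are the defaults (shares.happy being the minimum number of distinct servers the shares must land on):

```
[client]
shares.needed = 3
shares.happy = 7
shares.total = 10
```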
The biggest problem I have with tahoe is that there is no rebalancing support: set up four nodes, and the space becomes full? You can add more nodes, but new uploads go to the new nodes while old ones stay on the old. Similarly, if you change your replication-counts because you're suddenly more/less paranoid, this doesn't affect existing files.
In my perfect world you'd distribute blocks around pretty optimistically, and I'd probably run more services:
- An introducer - to allow adding/removing storage-nodes on the fly.
- An indexer - to store the list of "uploads", meta-data, and the corresponding block-pointers.
- The storage-nodes - to actually store the damn data.
The storage nodes would have the primitives "List all blocks", "Get block", and "Put block", and using those you could ensure that each node had sent its data to at least N other nodes. This could be done in the background.
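In Ruby that interface would be tiny. A purely hypothetical sketch - nothing here exists - of the primitives and the background replication they'd enable:

```ruby
class StorageNode
  def initialize(dir)
    @dir = dir
  end

  def list_blocks
    Dir.children(@dir)                          # "List all blocks"
  end

  def get_block(hash)
    File.binread(File.join(@dir, hash))         # "Get block"
  end

  def put_block(hash, data)
    File.binwrite(File.join(@dir, hash), data)  # "Put block"
  end
end

# Background task: ensure every local block is held by at least n peers.
def replicate(local, peers, n)
  local.list_blocks.each do |hash|
    holders = peers.select { |p| p.list_blocks.include?(hash) }
    missing = [n - holders.size, 0].max
    (peers - holders).take(missing).each do |p|
      p.put_block(hash, local.get_block(hash))
    end
  end
end
```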
The indexer would be responsible for keeping track of which blocks live where, and which blocks are needed to reassemble upload N. There's probably more that it could do.
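The corresponding indexer schema could start as small as this - again hypothetical, with table and column names invented for illustration:

```ruby
require "sqlite3"

# Which blocks make up which upload, and which storage-nodes
# currently hold each block.
db = SQLite3::Database.new("indexer.db")
db.execute_batch <<~SQL
  CREATE TABLE uploads (
    id   INTEGER PRIMARY KEY,
    name TEXT,            -- the user-visible object name
    meta TEXT             -- free-form meta-data
  );
  CREATE TABLE upload_blocks (
    upload_id INTEGER,    -- which upload
    seq       INTEGER,    -- position within that upload
    hash      TEXT        -- block identifier
  );
  CREATE TABLE placements (
    hash TEXT,            -- block identifier
    node TEXT             -- storage-node currently holding it
  );
SQL
```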