About Archive Tags RSS Feed

 

Entries tagged object-storage

On de-duplicating uploaded file-content.

7 May 2015 21:50

This evening I've been mostly playing with removing duplicate content. I've had this idea for the past few days about object-storage, and obviously in that context if you can handle duplicate content cleanly that's a big win.

The naive implementation of object-storage involves splitting uploaded files into chunks, storing them separately, and writing database-entries such that you can reassemble the appropriate chunks when the object is retrieved.

If you store chunks on-disk, by the hash of their contents, then things are nice and simple.

The end result is that you might upload the file /etc/passwd, split that into four-byte chunks, and then hash each chunk using SHA256.

This leaves you with some database-entries, and a bunch of files on-disk:

/tmp/hashed/ef267892ee080862c96a8d2d05de62f48e20f0875f27379e7d58c73ea4455bf1
/tmp/hashed/a378977155fb42bb006496321cbe31f74cbda803c3f6ca590f30e76d1afad921
..
/tmp/hashed/3805b0245bc8375be7125ae228eef711552ac082ffb9bf8756e2964a2393a9de

In my toy-code I wrote out the data in 4-byte chunks, which is grossly ineffeciant. But the value of using such small pieces is that there is liable to be a lot of collisions, and that means we save-space. It is a trade-off.

So the main thing I was experimenting with was the size of the chunks. If you make them too small you lose I/O due to the overhead of writing out so many small files, but you gain because collisions are common.

The rough testing I did involved using chunks of 16, 32, 128, 255, 512, 1024, 2048, and 4096 bytes. As sizes went up the overhead shrank, but also so did the collisions.

Unless you could handle the case of users uploading a lot of files like /bin/ls which are going to collide 100% of the time with prior uploads using larger chunks just didn't win as much as I thought they would.

I wrote a toy server using Sinatra & Ruby, which handles the splitting/hashing/and stored block-IDs in SQLite. It's not so novel given that it took only an hour or so to write.

The downside of my approach is also immediately apparent. All the data must live on a single machine - so that reassmbly works in the simple fashion. That's possible, even with lots of content if you use GlusterFS, or similar, but it's probably not a great approach in general. If you have large capacity storage avilable locally then this might would well enough for storing backups, etc, but .. yeah.

| 7 comments

 

A brief examination of tahoe-lafs

28 May 2015 21:50

Continuing the theme from the last post I made, I've recently started working my way down the list of existing object-storage implementations.

tahoe-LAFS is a well-established project which looked like a good fit for my needs:

  • Simple API.
  • Good handling of redundancy.

Getting the system up and running, on four nodes, was very simple. Setup a single/simple "introducer" which is a well-known node that all hosts can use to find each other, and then setup four deamons for storage.

When files are uploaded they are split into chunks, and these chunks are then distributed amongst the various nodes. There are some configuration settings which determine how many chunks files are split into (10 by default), how many chunks are required to rebuild the file (3 by default) and how many copies of the chunks will be created.

The biggest problem I have with tahoe is that there is no rebalancing support: Setup four nodes, and the space becomes full? You can add more nodes, new uploads go to the new nodes, while old ones stay on the old. Similarly if you change your replication-counts because you're suddenly more/less paranoid this doesn't affect existing nodes.

In my perfect world you'd distribute blocks around pretty optimistically, and I'd probably run more services:

  • An introducer - To allow adding/removing storage-nodes on the fly.
  • An indexer - to store the list of "uploads", meta-data, and the corresponding block-pointers.
  • The storage-nodes - to actually store the damn data.

The storage nodes would have the primitives "List all blocks", "Get block", "Put block", and using that you could ensure that each node had sent its data to at least N other nodes. This could be done in the background.

The indexer would be responsible for keeping track of which blocks live where, and which blocks are needed to reassemble upload N. There's probably more that it could do.

| 2 comments

 

We're all about storing objects

21 June 2015 21:50

Recently I've been experimenting with camlistore, which is yet another object storage system.

Camlistore gains immediate points because it is written in Go, and is a project initiated by Brad Fitzpatrick, the creator of Perlbal, memcached, and Livejournal of course.

Camlistore is designed exactly how I'd like to see an object storage-system - each server allows you to:

  • Upload a chunk of data, getting an ID in return.
  • Download a chunk of data, by ID.
  • Iterate over all available IDs.

It should be noted more is possible, there's a pretty web UI for example, but I'm simplifying. Do your own homework :)

With those primitives you can allow a client-library to upload a file once, then in the background a bunch of dumb servers can decide amongst themselves "Hey I have data with ID:33333 - Do you?". If nobody else does they can upload a second copy.

In short this kind of system allows the replication to be decoupled from the storage. The obvious risk is obvious though: if you upload a file the chunks might live on a host that dies 20 minutes later, just before the content was replicated. That risk is minimal, but valid.

There is also the risk that sudden rashes of uploads leave the system consuming all the internal-bandwith constantly comparing chunk-IDs, trying to see if data is replaced that has been copied numerous times in the past, or trying to play "catch-up" if the new-content is larger than the replica-bandwidth. I guess it should possible to detect those conditions, but they're things to be concerned about.

Anyway the biggest downside with camlistore is documentation about rebalancing, replication, or anything other than simple single-server setups. Some people have blogged about it, and I got it working between two nodes, but I didn't feel confident it was as robust as I wanted it to be.

I have a strong belief that Camlistore will become a project of joy and wonder, but it isn't quite there yet. I certainly don't want to stop watching it :)

On to the more personal .. I'm all about the object storage these days. Right now most of my objects are packed in a collection of boxes. On the 6th of next month a shipping container will come pick them up and take them to Finland.

For pretty much 20 days in a row we've been taking things to the skip, or the local charity-shops. I expect that by the time we've relocated the amount of possesions we'll maintain will be at least a fifth of our current levels.

We're working on the general rule of thumb: "If it is possible to replace an item we will not take it". That means chess-sets, mirrors, etc, will not be carried. DVDs, for example, have been slashed brutally such that we're only transferring 40 out of a starting collection of 500+.

Only personal, one-off, unique, or "significant" items will be transported. This includes things like personal photographs, family items, and similar. Clothes? Well I need to take one jacket, but more can be bought. The only place I put my foot down was books. Yes I'm a kindle-user these days, but I spent many years tracking down some rare volumes, and though it would be possible to repeat that effort I just don't want to.

I've also decided that I'm carrying my complete toolbox. Some of the tools I took with me when I left home at 18 have stayed with me for the past 20+ years. I don't need this specific crowbar, or axe, but I'm damned if I'm going to lose them now. So they stay. Object storage - some objects are more important than they should be!

| No comments

 

A brief look at the weed file store

10 August 2015 21:50

Now that I've got a citizen-ID, a pair of Finnish bank accounts, and have enrolled in a Finnish language-course (due to start next month) I guess I can go back to looking at object stores, and replicated filesystems.

To recap my current favourite, despite the lack of documentation, is the Camlistore project which is written in Go.

Looking around there are lots of interesting projects being written in Go, and so is my next one the seaweedfs, which despite its name is not a filesystem at all, but a store which is accessed via HTTP.

Installation is simple, if you have a working go-lang environment:

go get github.com/chrislusf/seaweedfs/go/weed

Once that completes you'll find you have the executable bin/weed placed beneath your $GOPATH. This single binary is used for everything though it is worth noting that there are distinct roles:

  • A key concept in weed is "volumes". Volumes are areas to which files are written. Volumes may be replicated, and this replication is decided on a per-volume basis, rather than a per-upload one.
  • Clients talk to a master. The master notices when volumes spring into existance, or go away. For high-availability you can run multiple masters, and they elect the real master (via RAFT).

In our demo we'll have three hosts one, the master, two and three which are storage nodes. First of all we start the master:

root@one:~# mkdir /node.info
root@one:~# weed master -mdir /node.info -defaultReplication=001

Then on the storage nodes we start them up:

root@two:~# mkdir /data;
root@two:~# weed volume -dir=/data -max=1  -mserver=one.our.domain:9333

Then the second storage-node:

root@three:~# mkdir /data;
root@three:~# weed volume -dir=/data -max=1 -mserver=one.our.domain:9333

At this point we have a master to which we'll talk (on port :9333), and a pair of storage-nodes which will accept commands over :8080. We've configured replication such that all uploads will go to both volumes. (The -max=1 configuration ensures that each volume-store will only create one volume each. This is in the interest of simplicity.)

Uploading content works in two phases:

  • First tell the master you wish to upload something, to gain an ID in response.
  • Then using the upload-ID actually upload the object.

We'll do that like so:

laptop ~ $ curl -X POST http://one.our.domain:9333/dir/assign
{"fid":"1,06c3add5c3","url":"192.168.1.100:8080","publicUrl":"192.168.1.101:8080","count":1}

client ~ $ curl -X PUT -F file=@/etc/passwd  http://192.168.1.101:8080/1,06c3add5c3
{"name":"passwd","size":2137}

In the first command we call /dir/assign, and receive a JSON response which contains the IPs/ports of the storage-nodes, along with a "file ID", or fid. In the second command we pick one of the hosts at random (which are the IPs of our storage nodes) and make the upload using the given ID.

If the upload succeeds it will be written to both volumes, which we can see directly by running strings on the files beneath /data on the two nodes.

The next part is retrieving a file by ID, and we can do that by asking the master server where that ID lives:

client ~ $ curl http://one.our.domain:9333/dir/lookup?volumeId=1,06c3add5c3
{"volumeId":"1","locations":[
 {"url":"192.168.1.100:8080","publicUrl":"192.168.1.100:8080"},
 {"url":"192.168.1.101:8080","publicUrl":"192.168.1.101:8080"}
]}

Or, if we prefer we could just fetch via the master - it will issue a redirect to one of the volumes that contains the file:

client ~$ curl http://one.our.domain:9333/1,06c3add5c3
<a href="http://192.168.1.100:8080/1,06c3add5c3">Moved Permanently</a>

If you follow redirections then it'll download, as you'd expect:

client ~ $ curl -L http://one.our.domain:9333/1,06c3add5c3
root:x:0:0:root:/root:/bin/bash
..

That's about all you need to know to decide if this is for you - in short uploads require two requests, one to claim an identifier, and one to use it. Downloads require that your storage-volumes be publicly accessible, and will probably require a proxy of some kind to make them visible on :80, or :443.

A single "weed volume .." process, which runs as a volume-server can support multiple volumes, which are created on-demand, but I've explicitly preferred to limit them here. I'm not 100% sure yet whether it's a good idea to allow creation of multiple volumes or not. There are space implications, and you need to read about replication before you go too far down the rabbit-hole. There is the notion of "data centres", and "racks", such that you can pretend different IPs are different locations and ensure that data is replicated across them, or only within-them, but these choices will depend on your needs.

Writing a thin middleware/shim to allow uploads to be atomic seems simple enough, and there are options to allow exporting the data from the volumes as .tar files, so I have no undue worries about data-storage.

This system seems reliable, and it seems well designed, but people keep saying "I'm not using it in production because .. nobody else is" which is an unfortunate problem to have.

Anyway, I like it. The biggest omission is really authentication. All files are public if you know their IDs, but at least they're not sequential ..

| No comments

 

Accidental data-store ..

18 May 2016 21:50

A few months back I was looking over a lot of different object-storage systems, giving them mini-reviews, and trying them out in turn.

While many were overly complex, some were simple. Simplicity is always appealing, providing it works.

My review of camlistore was generally positive, because I like the design. Unfortunately it also highlighted a lack of documentation about how to use it to scale, replicate, and rebalance.

How hard could it be to write something similar, but also paying attention to keep it as simple as possible? Well perhaps it was too easy.

Blob-Storage

First of all we write a blob-storage system. We allow three operations to be carried out:

  • Retrieve a chunk of data, given an ID.
  • Store the given chunk of data, with the specified ID.
  • Return a list of all known IDs.

 

API Server

We write a second server that consumers actually use, though it is implemented in terms of the blob-storage server listed previously.

The public API is trivial:

  • Upload a new file, returning the ID which it was stored under.
  • Retrieve a previous upload, by ID.

 

Replication Support

The previous two services are sufficient to write an object storage system, but they don't necessarily provide replication. You could add immediate replication; an upload of a file could involve writing that data to N blob-servers, but in a perfect world servers don't crash, so why not replicate in the background? You save time if you only save uploaded-content to one blob-server.

Replication can be implemented purely in terms of the blob-servers:

  • For each blob server, get the list of objects stored on it.
  • Look for that object on each of the other servers. If it is found on N of them we're good.
  • If there are fewer copies than we like, then download the data, and upload to another server.
  • Repeat until each object is stored on sufficient number of blob-servers.

 

My code is reliable, the implementation is almost painfully simple, and the only difference in my design is that rather than having an API-server which allows both "uploads" and "downloads" I split it into two - that means you can leave your "download" server open to the world, so that it can be useful, and your upload-server can be firewalled to only allow a few hosts to access it.

The code is perl-based, because Perl is good, and available here on github:

TODO: Rewrite the thing in #golang to be cool.

| 1 comment

 

Accidental data-store .. is go!

19 May 2016 21:50

A couple of days ago I wrote::

The code is perl-based, because Perl is good, and available here on github:

..

TODO: Rewrite the thing in #golang to be cool.

I might not be cool, but I did indeed rewrite it in golang. It was quite simple, and a simple benchmark of uploading two million files, balanced across 4 nodes worked perfectly.

https://github.com/skx/sos/

| 2 comments

 

A mixed weekend

30 May 2016 21:50

This past seven days have been a little mixed:

  • I updated documentation on my simple object store.
  • I created a simplified alerting system.
    • Heavily inspired by something we use at work.
    • My version is much much simpler, but still useful enough to alert me of outages (via hearbeats) and unread email. (Both of which are sent via pushover notifications.)
  • I bought a pair of cheap USB "game controllers"
    • And have spend several hours playing SNES games such as Bomberman 2, and Super Mario Brothers 3.
    • I'm using mednafan, as it supports cheats, fullscreen, sound, and is pretty easy to drive.

Finally I spent the tail end of the weekend being a little red, sore, and itchy. . I figured this was a surprising outbreak of Dyshidrosis on my hands, and eczema on my body. Instead I received a diagnosis of Scarlet Fever. So now I feel somewhat Dickensian!

Apparently this infection is on the rise!

| 2 comments