A diversion on off-site storage

26 March 2014 21:50

Yesterday I took a diversion from thinking about my upcoming cache project, largely because I took some pictures inside my house, and realized my offsite backup was getting full.

I have three levels of backups:

  • Home stuff on my desktop is replicated to my wife's desktop, and vice-versa.
  • A simple server running rsync (content-free http://rsync.io/).
  • A "peering" arrangement among a small group of friends. Each of us makes available a small amount of space and we copy to and from each other's shares, via rsync / scp as appropriate.
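
As a rough sketch of what the peering copies could look like, the snippet below builds an rsync-over-ssh command for each peer. The hostnames, the paths, and the `rsync_command` helper are all hypothetical, made up for illustration; only the rsync flags themselves are real.

```python
import shlex


def rsync_command(source_dir, peer_host, peer_path, dry_run=False):
    """Build an rsync-over-ssh command that mirrors source_dir to a peer's share."""
    cmd = ["rsync", "--archive", "--compress", "--delete", "-e", "ssh"]
    if dry_run:
        # Show what would be transferred without copying anything.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory's contents,
    # rather than creating the directory itself on the far side.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(f"{peer_host}:{peer_path}")
    return cmd


# Hypothetical peers: each friend exposes one share for my data.
PEERS = {
    "friend-a.example.org": "/srv/shares/steve",
    "friend-b.example.org": "/backups/steve",
}

for host, path in PEERS.items():
    # Only print the commands; pass the list to subprocess.run() to execute.
    print(shlex.join(rsync_command("/home/steve/Images", host, path, dry_run=True)))
```

Note that `--delete` removes files on the peer that no longer exist locally, which is right for a mirror but not for an archive, so it may be worth dropping.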

Unfortunately my rsync-based personal server is getting a little too full, and will certainly be full by next year. S3 is pricey, and I don't trust the "unlimited" storage people (Backblaze, etc.) to be sustainable and reliable long-term.

The pricing on Google Drive seems appealing, but I guess I'm loath to share more data with Google. Perhaps I could dedicate a single "backup.account@gmail.com" login to that, separate from all else.

So the diversion came along when I looked for Amazon S3-compatible, self-hosted servers. There are a few, but most of them are PHP-based, or similarly icky.

So far cloudfoundry's vblob looks the most interesting, but the main project seems stalled/dead. Sadly using s3cmd to upload files failed, but the `curl`-based API certainly works as expected.
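
For reference, an S3-style upload is at heart an HTTP PUT to `/bucket/key`. The sketch below only builds such a request, using path-style addressing. The endpoint, bucket, and key are hypothetical, and it assumes a server configured to accept unauthenticated requests; real S3 requires signed requests, which this does not attempt.

```python
import urllib.parse
import urllib.request


def build_put_request(endpoint, bucket, key, data,
                      content_type="application/octet-stream"):
    """Build (but do not send) a path-style PUT for an S3-compatible server."""
    # Percent-encode the key, keeping "/" so nested keys stay readable.
    url = f"{endpoint.rstrip('/')}/{bucket}/{urllib.parse.quote(key)}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# Hypothetical endpoint and names, purely for illustration.
req = build_put_request("http://localhost:9981", "backups",
                        "photos/img 1.jpg", b"not really a jpeg")
print(req.get_method(), req.full_url)

# To actually send it, assuming such a server is listening:
#   with urllib.request.urlopen(req) as response:
#       print(response.status)
```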

I looked at Gluster, Ceph, and similar, but haven't yet come up with a decent plan for handling offsite storage, and I know I have only six months or so before the need becomes pressing. I imagine the plan has to be using N small servers with local storage, rather than one large server, purely because pricing is going to be better that way.
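
One way the N-small-servers idea might work in practice is to have the client decide which servers hold each blob. The sketch below uses rendezvous (highest-random-weight) hashing for that. This is my own illustration rather than anything the tools above do, and the server names are hypothetical.

```python
import hashlib


def pick_servers(blob_name, servers, replicas=2):
    """Choose which servers should hold a blob, via rendezvous hashing."""
    def score(server):
        # Each (server, blob) pair gets a stable pseudo-random score.
        return hashlib.sha256(f"{server}:{blob_name}".encode()).hexdigest()

    # The highest-scoring servers win; every client computes the same answer.
    return sorted(servers, key=score, reverse=True)[:replicas]


# Hypothetical small servers, each with local storage.
SERVERS = ["store1.example.org", "store2.example.org",
           "store3.example.org", "store4.example.org"]

print(pick_servers("photos/2014/03/house.tar.gpg", SERVERS))
```

The appeal of this scheme is that losing a server only affects the blobs that lived on it; everything else stays where it was, so there is no mass reshuffle.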

Decisions decisions.



Comments on this entry

Andy Cater at 18:44 on 26 March 2014

HP Microserver and 8TB worth of disks might be suitable and small enough to carry around.

Steve Kemp at 18:47 on 26 March 2014

When I think off-site I think "remote", rather than remembering to juggle hardware around manually, though as you suggest external drives and similar will work for that.

(+/- pi, cube, microserver to drive them.)

Charles Darke at 22:32 on 26 March 2014

What about using Google and just encrypting all the data?

Steve Kemp at 22:42 on 26 March 2014

Encryption is pretty much a given anyway, regardless of who hosts things.

The trade-off is control vs. privacy and reliability.

yuval at 04:10 on 27 March 2014

Have you seen camlistore.org?

Uli Martens at 06:39 on 27 March 2014

Depending on your data, you might take a look at git-annex (https://git-annex.branchable.com/walkthrough/ for a good start).

It will probably work better with larger files than millions of small ones, but that's not an issue of whether it works, but rather of how fast. :)

I'm using a mix of local storage, USB storage and remote storage to keep backups and media files both directly available and redundantly offsite/offline. This works great, as each repository knows which data content (it's hashed) should be there, and I can configure how often (and where) the actual data is stored.

Steve Kemp at 08:55 on 27 March 2014

I spent a while looking at camlistore and xtreemfs. Beyond that, I've not been thinking of git-annex, because I'm thinking more about the storage side than the transport.

I figure a filesystem would be ideal, but equally a naive "blob-server" with good client support and the ability to retrieve will work. (Which is why I was looking at S3-compatible servers as well as replicating filesystems.)

My only immediate comment is that Ceph feels fragile, camlistore seems nice, and Gluster is best avoided.

Paulo Almeida at 10:42 on 27 March 2014

Any particular reason why Ceph feels fragile? I only ask because I'm looking into deploying it myself.

Steve Kemp at 10:49 on 27 March 2014

Perhaps fragile isn't the right word. There are lots of components to the system, such as the object store and the index store, and I don't have a good feeling for which ones you can kill and restart and which you can't.

Instead I should say that it feels like the software is broken down into distinct parts that seem to make sense, from the outside looking in, but I'm not yet sure how failures at different levels are handled, and I can't see a decent discussion of that.

(i.e. If you're storing data on, say, cheap virtual machines, it seems obvious that one or two will die every now and again, so you need to be failure tolerant, and I'm not sure how Ceph handles that.)

Nux at 19:26 on 28 March 2014

Also check out https://tahoe-lafs.org/

Tobe at 09:50 on 29 March 2014

Backupsy works out well for me.

Paul Walker at 15:02 on 1 April 2014


Just wondering if you'd already considered/discarded rsync.net for some reason...? (Cost is a valid reason.)


Steve Kemp at 14:25 on 2 April 2014

I think the reason I avoided rsync.net was price, but I know I haven't looked at it recently. I'm not sure why, actually; I should go look.