

On de-duplicating uploaded file-content.

7 May 2015 21:50

This evening I've been mostly playing with removing duplicate content. I've had this idea for the past few days about object-storage, and obviously in that context if you can handle duplicate content cleanly that's a big win.

The naive implementation of object-storage involves splitting uploaded files into chunks, storing them separately, and writing database-entries such that you can reassemble the appropriate chunks when the object is retrieved.

If you store chunks on-disk, by the hash of their contents, then things are nice and simple.

The end result is that you might upload the file /etc/passwd, split that into four-byte chunks, and then hash each chunk using SHA256.

This leaves you with some database-entries recording the chunk order, and a bunch of hash-named files on-disk.
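That naive scheme might look something like this in Ruby — a sketch only; the class and directory names here are illustrative, not the actual toy code:

```ruby
require "digest"
require "fileutils"

# Sketch of the naive approach: split a blob into fixed-size chunks,
# store each chunk on-disk under the SHA256 of its contents, and keep
# an ordered list of hashes so the object can be reassembled.
class ChunkStore
  def initialize(dir, chunk_size)
    @dir = dir
    @chunk_size = chunk_size
    FileUtils.mkdir_p(dir)
  end

  # Store data; returns the ordered list of chunk hashes ("the recipe").
  def put(data)
    data.bytes.each_slice(@chunk_size).map do |slice|
      chunk = slice.pack("C*")
      digest = Digest::SHA256.hexdigest(chunk)
      path = File.join(@dir, digest)
      # Identical chunks land on the same filename - de-duplication for free.
      File.binwrite(path, chunk) unless File.exist?(path)
      digest
    end
  end

  # Reassemble an object from its ordered list of chunk hashes.
  def get(digests)
    digests.map { |d| File.binread(File.join(@dir, d)) }.join
  end
end
```

The recipe returned by `put` is what ends up in the database; the chunks themselves only ever exist once on disk, no matter how many objects reference them.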


In my toy-code I wrote out the data in 4-byte chunks, which is grossly inefficient. But the value of using such small pieces is that there are liable to be a lot of collisions, and that means we save space. It is a trade-off.

So the main thing I was experimenting with was the size of the chunks. If you make them too small you lose on I/O, due to the overhead of writing out so many small files, but you gain because collisions are common.

The rough testing I did involved using chunks of 16, 32, 128, 255, 512, 1024, 2048, and 4096 bytes. As the sizes went up the overhead shrank, but so did the collisions.

Unless you can count on users uploading a lot of identical files, like /bin/ls, which will collide 100% of the time with prior uploads, larger chunks just didn't win as much as I thought they would.
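The shape of that experiment, as a sketch (the helper name is mine, not from the real code): for each candidate chunk size, measure what fraction of a blob's chunks are duplicates of one another.

```ruby
require "digest"

# For a given chunk size, return the fraction of a blob's chunks that
# are duplicates of an earlier chunk - i.e. the space you'd save.
def dedup_ratio(data, chunk_size)
  chunks = data.bytes.each_slice(chunk_size).map { |s| s.pack("C*") }
  unique = chunks.map { |c| Digest::SHA256.hexdigest(c) }.uniq.length
  1.0 - unique.to_f / chunks.length
end

if File.readable?("/etc/passwd")
  sample = File.binread("/etc/passwd")
  [16, 32, 128, 255, 512, 1024, 2048, 4096].each do |size|
    printf("%4d-byte chunks: %5.1f%% duplicates\n",
           size, 100 * dedup_ratio(sample, size))
  end
end
```

Run against typical files, the duplicate percentage drops off quickly as the chunk size grows, which is the trade-off described above.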

I wrote a toy server using Sinatra & Ruby, which handles the splitting and hashing, and stores the block-IDs in SQLite. It's not so novel, given that it took only an hour or so to write.

The downside of my approach is also immediately apparent: all the data must live on a single machine, so that reassembly works in the simple fashion. That's possible, even with lots of content, if you use GlusterFS or similar, but it's probably not a great approach in general. If you have large-capacity storage available locally then this might work well enough for storing backups, etc., but .. yeah.



Comments on this entry

Nicolas at 21:20 on 7 May 2015

I may be stating the obvious there, but using 4-byte chunks (and for that matter anything lower than 64-byte chunks) is not only inefficient but also not space-saving at all.

You're also storing the names of the chunks needed to reassemble the objects, each name using 256 bits or 32 bytes of space (or, as you're probably storing hex strings, 64 bytes). So with any chunk size lower than or equal to 64 bytes, the space needed to store the chunk names for an object is bigger than or equal to the object itself.
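The arithmetic is easy to check (a hypothetical helper, assuming 64-byte hex names as you describe):

```ruby
# Per chunk, the database stores a 64-character hex SHA256 name, so
# with chunk size c the name storage alone is 64/c times the object.
def name_overhead(chunk_size, name_bytes = 64)
  name_bytes.to_f / chunk_size
end

name_overhead(4)    # 4-byte chunks: names cost 16x the data itself
name_overhead(64)   # break-even point
name_overhead(4096) # roughly 1.6% overhead
```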

Steve McIntyre at 22:50 on 7 May 2015

Hey Steve,

You're describing something quite similar to Venti - see http://en.wikipedia.org/wiki/Venti

Steve Kemp at 06:56 on 8 May 2015

Yes, using 4-byte chunks was really a test of the worst-possible setup, rather than a serious suggestion.

As you note, the overhead makes it ludicrous, since the filename of each of my chunks is 64 bytes, meaning the space goes up sixteen-fold even before we consider the database.

Venti I'd never heard of before, so thanks for the link!

Anyway, my conclusion was that this approach has merit, but if you're uploading random files then you're lucky to get any useful collisions and space-savings using chunk sizes that seem sane (512-2048 bytes) - and that actually surprised me.

Bruno at 12:59 on 8 May 2015

I'm not sure what you're trying to build, and you may already know all this, but when doing de-duplication using fixed-size chunks, it is usually a problem that files that only differ by some additions/removals still lead to a lot of duplication because the chunk boundaries do not align.

The usual remedy for this problem is to use variable-size chunks; the key idea is that when splitting up the file in chunks you look at the actual data for some kind of marker pattern (say the sequence "000") where you end the current chunk and start a new chunk. By aligning the chunk boundaries on patterns in the data, the de-duplication becomes much more robust against small additions and removals.

When implementing this idea, a marker pattern is usually detected using a rolling checksum, which is a checksum of the last window of N bytes; this can be calculated very efficiently (just a few operations for each new byte that you read). When the checksum matches a specific bitmask M, the last N bytes amount to a marker, and it's time to start a new chunk.

By configuring N and M you can tune the statistical frequency of markers in the data, and thus the average chunk size. There, you still have the same tradeoff (smaller chunks have better deduplication but more overhead).
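A minimal sketch of the idea in Ruby (the constants are illustrative; real tools use a Rabin fingerprint or buzhash with tuned parameters):

```ruby
WINDOW = 16                         # N: rolling-checksum window in bytes
MASK   = 0xFF                       # M: a boundary every ~256 bytes on average
BASE   = 257
MOD    = 2**32
POW    = BASE.pow(WINDOW - 1, MOD)  # weight of the byte leaving the window

# Returns the byte offsets at which chunks end.
def chunk_boundaries(bytes)
  hash, window, cuts = 0, [], []
  bytes.each_with_index do |b, i|
    if window.length == WINDOW
      hash = (hash - window.shift * POW) % MOD  # drop the oldest byte: O(1)
    end
    window << b
    hash = (hash * BASE + b) % MOD              # fold in the newest byte
    # Cut a chunk when the low bits of the rolling hash are all zero.
    cuts << i + 1 if window.length == WINDOW && (hash & MASK).zero?
  end
  cuts
end
```

Because a boundary depends only on the last WINDOW bytes, inserting data near the start of a file shifts every later boundary by the same amount, so the chunks after the insertion hash to the same values as before - which is exactly the robustness described above.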

Some modern backup solutions, such as Attic or zbackup, use variable-size chunk deduplication.

Steve Kemp at 16:21 on 8 May 2015

I've done some reading recently, Bruno, because all of that was new to me. I always assumed that chunking was done on a fixed-block basis, but having seen how poorly that performs I guess rolling makes more sense.

I will have to experiment with that approach and see how it changes things.

Jonathan at 16:42 on 14 May 2015

@Bruno, I've often wondered how a rolling-checksum scheme would solve the offset problem. Thanks for the info! Have you got any references?

Jeffrey Plum at 21:25 on 15 May 2015

This problem is why de-duplication is often not recommended under ZFS. It is a feature of ZFS, but Allan Jude, of the TechSNAP podcast, often states that it doesn't help in most cases. ZFS now supports transparent compression of data as a more useful space-saving measure. Deduplication seems to be an early, but not generally useful, feature of ZFS. It may have been used by certain customers, but it reads more like a neat feature that fell naturally out of ZFS development.

It seems better suited as a following stage to another encoding which handles the indexing of variable chunks. In big data situations, the index to an encoding dictionary could be very large. Spectral data, a full Unicode aware dictionary or molecular analysis could generate huge indexes. Yet data sets of the same kind would have similar huge indexes. It seems this method could deduplicate the raw indexes of wide spectrum data sets. Experience would guide scaling the first and second stage encodings.