I feel a hate crime coming on.

Sunday, 6 December 2009

I've been spidering the internet for the past few days, merrily downloading content.

The intention behind the spidering is to record, in a database, the following pieces of information for each image it stumbles across:

  • The page that contained the link to this image. (i.e. the image parent)
  • The image URL.
  • The MD5sum of the image itself.
  • The dimensions of the image.

I was motivated by seeing an image upon a website and thinking "Hang on I've seen that before - but where?".

Thus far I've got details of about 30,000 images and I can now find duplicates or answer the question "Does this image appear on the internet and if so where?".
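For the curious, the bookkeeping described above could look roughly like the following. This is only a sketch in Python with SQLite: the table name, column names, and function names are all made up for illustration, the post does not say which database or language the spider really uses, and the image dimensions are assumed to have been worked out elsewhere (for example by an image library).

```python
import hashlib
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) images table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS images (
               parent TEXT NOT NULL,
               url    TEXT NOT NULL,
               md5    TEXT NOT NULL,
               width  INTEGER,
               height INTEGER
           )"""
    )
    # Lookups are always by digest, so index that column.
    conn.execute("CREATE INDEX IF NOT EXISTS images_md5 ON images (md5)")
    return conn


def record_image(conn, parent, url, data, width, height):
    """Store one row per image; `data` is the raw downloaded bytes."""
    digest = hashlib.md5(data).hexdigest()
    conn.execute(
        "INSERT INTO images (parent, url, md5, width, height) "
        "VALUES (?, ?, ?, ?, ?)",
        (parent, url, digest, width, height),
    )
    return digest


def where_seen(conn, data):
    """Answer "does this image appear, and if so where?" for raw bytes."""
    digest = hashlib.md5(data).hexdigest()
    rows = conn.execute(
        "SELECT parent, url FROM images WHERE md5 = ? ORDER BY url",
        (digest,),
    )
    return rows.fetchall()


def duplicates(conn):
    """Return (md5, count) for every digest recorded more than once."""
    rows = conn.execute(
        "SELECT md5, COUNT(*) FROM images "
        "GROUP BY md5 HAVING COUNT(*) > 1 ORDER BY md5"
    )
    return rows.fetchall()
```

Calling `where_seen(conn, data)` with the bytes of a downloaded image returns the (parent page, image URL) pairs already recorded for it, and `duplicates(conn)` lists the digests that turn up more than once.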

Obviously this is going to be foiled trivially via rotations, cropping, or even resizing. But I'm going to let the spider run for the next few days at least to see what interesting things the data can be used for.
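To make the weakness concrete: an MD5 digest only matches when the bytes are identical, so anything that re-encodes the image produces an unrelated digest. A tiny Python sketch, using made-up stand-in bytes rather than a real image, and hypothetical helper names:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the raw bytes."""
    return hashlib.md5(data).hexdigest()


def same_image_by_md5(a: bytes, b: bytes) -> bool:
    """True only when the two byte strings hash identically."""
    return md5_hex(a) == md5_hex(b)


# Stand-in for image data; a real spider would use the downloaded bytes.
original = bytes(range(256)) * 4
# The same data with only the final byte changed (255 becomes 0).
tweaked = original[:-1] + b"\x00"
```

Here `same_image_by_md5(original, original)` is true, while `same_image_by_md5(original, tweaked)` is false even though only one byte differs. A resize or crop changes far more than one byte.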

In other news I'm a little behind schedule but I'm going to be moving from Xen to KVM over the next week or ten days.

My current plan is to set up the new host on Monday and move myself there that same day. Once that's been demonstrated to work I can move the other users over one by one, probably one a day. That will allow a little bit of freedom for people to choose their downtime window, and will ensure that it's not an all-or-nothing thing.

The new management system is pretty good, but I have the advantage here in that I've worked upon about four systems for driving KVM hosting. The system allows people to enable/disable VNC access, use the serial console, and either use one of a number of pre-cooked kernels or upload their own. (Hmmm security you say?)

ObFilm: Chasing Amy



Comments On This Entry

[gravitar] Xenu

Submitted at 09:24:55 on 6 december 2009

Sounds a bit like what http://www.tineye.com/ is doing except they calculate a more complex hash so it can deal with resizing, etc. It's a good idea. Let us know what happens.

[gravitar] Tollef Fog Heen

Submitted at 09:31:50 on 6 december 2009

Presumably you already know about tineye.com which does this for you? (Not affiliated in any way, just stumbled across them some weeks back.)

[gravitar] Greg Auger

Submitted at 09:58:43 on 6 december 2009

Seen TinEye?


[gravitar] Obey Arthur Liu

Submitted at 10:14:38 on 6 december 2009

Certainly you know about reverse image search services such as TinEye (http://www.tineye.com) ?

[gravitar] Jan Larres

Submitted at 10:40:01 on 6 december 2009

TinEye does something similar and even handles resized images:

[gravitar] fumble

Submitted at 10:53:34 on 6 december 2009

Try this instead: http://www.tineye.com/

[gravitar] Robert Norris

Submitted at 10:59:36 on 6 december 2009

> I was motivated by seeing an image upon a website and thinking "Hang on I've seen that before - but where?".

Have you seen http://www.tineye.com/ ?

[gravitar] David Newgas

Submitted at 11:11:25 on 6 december 2009

Have you ever seen http://www.tineye.com/ ? It does a pretty good job of finding image duplicates.

[gravitar] DanielCheng

Submitted at 11:28:21 on 6 december 2009

You may want to use an image signature/hash (install imagemagick, man identify; or perl use Image::Signature).

[author] Steve Kemp

Submitted at 11:45:24 on 6 december 2009

Hrm .. I guess I'm slow, stupid and have wasted my time. I did not know about tineye.com - which looks like it does even more than I could have hoped for.

Thanks Daniel - the idea of using Image::Signature in the future is one I will certainly remember.

[gravitar] David T.

Submitted at 16:43:58 on 6 december 2009

There's also pHash(.org), which could benefit your spider.

[gravitar] Tiago Bortoletto Vaz

Submitted at 22:46:16 on 6 december 2009

BTW, I'm packaging pHash for Debian. I'm just waiting for the next release, which won't depend on a specific CImg version.

[gravitar] Jon

Submitted at 09:17:19 on 7 december 2009


[author] Steve Kemp

Submitted at 12:34:26 on 7 december 2009

Tiago, I'll look forward to the Debian package(s) when they are deemed stable then, thanks a lot!

[gravitar] Jason Spears

Submitted at 17:34:57 on 7 december 2009

The Google can help as well: http://similar-images.googlelabs.com/


