
 

A brief twitter experiment

13 July 2014 21:50

So I've recently posted a few links on Twitter, and I can see followers clicking them. But I also see random hits.

Tonight I posted a link to http://transient.email/, a domain I use for "anonymous" emailing, specifically to see which bots hit the URL.

Within two minutes I had 15 visitors, the first few of which were:

IP              | User-Agent | Request
199.16.156.124  | Twitterbot/1.0; | GET /robots.txt
199.16.156.126  | Twitterbot/1.0; | GET /robots.txt
54.246.137.243  | python-requests/1.2.3 CPython/2.7.2+ Linux/3.0.0-16-virtual | HEAD /
74.112.131.243  | Mozilla/5.0 (); | GET /
50.18.102.132   | Google-HTTP-Java-Client/1.17.0-rc (gzip) | HEAD /
50.18.102.132   | Google-HTTP-Java-Client/1.17.0-rc (gzip) | HEAD /
199.16.156.125  | Twitterbot/1.0; | GET /robots.txt
185.20.4.143    | Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/) | GET /
23.227.176.34   | MetaURI API/2.0 +metauri.com | GET /
74.6.254.127    | Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp); | GET /robots.txt
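
A listing like the one above can be pulled out of a web server access log with a few lines of Python. This is only a sketch: it assumes the server writes the common "combined" log format, and the sample lines are hypothetical (the timestamps, status codes, and sizes are made up, only the IPs, user agents, and requests come from the table). The names COMBINED, visitors, and SAMPLE are illustrative.

```python
import re

# Assumption: the server logs in the "combined" format used by Apache and nginx.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def visitors(lines):
    """Yield (ip, user_agent, request) for each log line that parses."""
    for line in lines:
        m = COMBINED.match(line)
        if m:
            yield m["ip"], m["agent"], f'{m["method"]} {m["path"]}'


# Hypothetical sample lines; everything except IP, agent, and request is invented.
SAMPLE = [
    '199.16.156.124 - - [13/Jul/2014:21:40:01 +0000] '
    '"GET /robots.txt HTTP/1.1" 200 26 "-" "Twitterbot/1.0"',
    '50.18.102.132 - - [13/Jul/2014:21:40:05 +0000] '
    '"HEAD / HTTP/1.1" 200 0 "-" "Google-HTTP-Java-Client/1.17.0-rc (gzip)"',
]

for ip, agent, request in visitors(SAMPLE):
    print(ip, agent, request, sep=" | ")
```

Lines that do not match the pattern are skipped silently, which is good enough for a quick look but would hide malformed entries in anything more serious.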

So what jumps out? The Twitterbot makes several requests for /robots.txt, but never actually fetches the page itself. That is interesting, because there is indeed a prohibition in the /robots.txt file I serve, so the bot appears to be honouring it.
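
To see why a well-behaved crawler would stop after /robots.txt, here is a rough sketch using Python's standard urllib.robotparser. The actual contents of the site's robots.txt are not shown in this post, so ASSUMED_ROBOTS_TXT below is a guess at a blanket prohibition, and may_fetch is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a blanket prohibition. The real file on the site is not shown here.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Under the assumed rules the page itself is off limits, so this prints False.
print(may_fetch(ASSUMED_ROBOTS_TXT, "Twitterbot/1.0", "http://transient.email/"))
```

A crawler that runs a check like this before every request would fetch /robots.txt, get a "no", and never ask for the page, which matches what the Twitterbot entries show.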

A surprise was that both Google and Yahoo seem to follow Twitter links in almost real time. Yahoo parsed and honoured /robots.txt, whereas the Google spider seemed to make only HEAD requests, never actually looking for the content or the robots file.

In addition to this, a bunch of hosts from the Amazon EC2 space made requests, which was perhaps not a surprise. Some automated processing and classification, no doubt.
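
One way to check whether an address sits in a given provider's space is to test it against a list of network prefixes with Python's ipaddress module. The sketch below is illustrative only: ASSUMED_EC2_PREFIXES holds a single prefix I am assuming for the example, not Amazon's published list, and in_any_prefix is a hypothetical helper name.

```python
import ipaddress

# Assumption: one illustrative prefix, not an authoritative list of EC2 ranges.
ASSUMED_EC2_PREFIXES = ["50.18.0.0/16"]


def in_any_prefix(ip, prefixes):
    """Return True if ip falls inside at least one of the CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in prefixes)


# Two addresses taken from the table of visitors.
for ip in ("50.18.102.132", "199.16.156.124"):
    print(ip, in_any_prefix(ip, ASSUMED_EC2_PREFIXES))
```

For real use the prefix list would need to come from the provider itself rather than being typed in by hand.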

Anyway, beer. It's been a rough weekend.


 

Comments on this entry

Wichert Akkerman at 20:48 on 13 July 2014
http://www.wiggy.net

The HEAD requests are not from Google, but something running on AWS using a Google Java library.

Steve Kemp at 20:58 on 13 July 2014
http://steve.org.uk/.

I should know better than to trust User-Agent strings!