So I've recently posted a few links on Twitter, and I see followers clicking them. But I also see random hits.
Tonight I posted a link to http://transient.email/, a domain I use for "anonymous" emailing, specifically to see which bots hit the URL.
Within two minutes I had 15 visitors, the first few of which were:
IP | User-Agent | Request
---|---|---
199.16.156.124 | Twitterbot/1.0; | GET /robots.txt
199.16.156.126 | Twitterbot/1.0; | GET /robots.txt
54.246.137.243 | python-requests/1.2.3 CPython/2.7.2+ Linux/3.0.0-16-virtual | HEAD /
74.112.131.243 | Mozilla/5.0 (); | GET /
50.18.102.132 | Google-HTTP-Java-Client/1.17.0-rc (gzip) | HEAD /
50.18.102.132 | Google-HTTP-Java-Client/1.17.0-rc (gzip) | HEAD /
199.16.156.125 | Twitterbot/1.0; | GET /robots.txt
185.20.4.143 | Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/) | GET /
23.227.176.34 | MetaURI API/2.0 +metauri.com | GET /
74.6.254.127 | Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp); | GET /robots.txt
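For anyone wanting to repeat the experiment, here is a rough sketch of how a table like this could be pulled out of a web server log. It assumes the server writes the common "combined" log format, which is an assumption (the format actually used here isn't stated). The names `LOG_LINE`, `parse_hits`, `count_agents` and `SAMPLE` are made up for illustration, and the timestamps, status codes and sizes in the sample lines are placeholders.

```python
import re
from collections import Counter

# Regex for the "combined" access-log format (assumed, not confirmed by the post).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_hits(lines):
    """Yield (ip, user_agent, request) for every line that matches the format."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            request = f'{match["method"]} {match["path"]}'
            yield match["ip"], match["agent"], request


def count_agents(lines):
    """Count hits per user-agent string."""
    return Counter(agent for _, agent, _ in parse_hits(lines))


# Placeholder log lines: IP, agent and request come from the table above,
# everything else (timestamp, status, size) is invented for the example.
SAMPLE = [
    '199.16.156.124 - - [01/Jan/2000:00:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "Twitterbot/1.0;"',
    '199.16.156.126 - - [01/Jan/2000:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "Twitterbot/1.0;"',
    '23.227.176.34 - - [01/Jan/2000:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "MetaURI API/2.0 +metauri.com"',
]

if __name__ == "__main__":
    for ip, agent, request in parse_hits(SAMPLE):
        print(f"{ip} | {agent} | {request}")
```

Grouping by user agent with `count_agents` makes repeat visitors such as Twitterbot stand out quickly.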
So what jumps out? Twitterbot makes several requests for /robots.txt, but never actually fetches the page itself. That's interesting, because there is indeed a prohibition in the supplied /robots.txt file.
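The behaviour of a well-behaved crawler can be sketched with Python's standard `urllib.robotparser`. The robots.txt contents below are an assumption (the post only says that a prohibition exists, not how it is worded), and `may_fetch` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: a blanket prohibition for every crawler.
# The real file on the site may be worded differently.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(user_agent, url, robots_txt=ASSUMED_ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite bot asks this question before requesting the page itself.
    print(may_fetch("Twitterbot", "http://transient.email/"))
```

Under that assumed file the check comes back negative, which would match a bot that reads /robots.txt and then goes away again.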
A surprise was that both Google and Yahoo seem to follow Twitter links in almost real time. The Yahoo crawler parsed and honoured /robots.txt, whereas the Google spider seemed to make only HEAD requests, and never actually looked for the content or the robots file.
In addition to this, a bunch of hosts from the Amazon EC2 address space made requests, which was perhaps not a surprise. Some automated processing and classification, no doubt.
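Checking whether a visitor comes from a given provider's address space is a simple membership test, roughly as sketched below with the standard `ipaddress` module. The two prefixes are placeholders picked only so that they cover two addresses from the table; they are assumptions, not verified EC2 ranges. Amazon publishes its current ranges, and a real check would load that list instead. `ASSUMED_EC2_NETWORKS` and `in_networks` are hypothetical names.

```python
import ipaddress

# Placeholder prefixes (assumptions, not an authoritative EC2 list).
ASSUMED_EC2_NETWORKS = [
    ipaddress.ip_network("54.246.0.0/16"),
    ipaddress.ip_network("50.18.0.0/16"),
]


def in_networks(ip, networks=ASSUMED_EC2_NETWORKS):
    """Return True if ip falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    for ip in ("54.246.137.243", "50.18.102.132", "199.16.156.124"):
        print(ip, in_networks(ip))
```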
Anyway, beer. It's been a rough weekend.
Tags: twitter
2 comments
http://www.wiggy.net
The HEAD requests are not from Google, but something running on AWS using a Google Java library.