Some history of the blogspam service:
Back in 2008 I was annoyed by the many spam comments being submitted to my Debian Administration website. I added some simple anti-spam measures, which reduced the flow, but it was a losing battle.
In the end I decided I should test comments, as users submitted them, via some kind of external service, the intention being that any improvements to that central service would benefit all of its users. (So I could move to testing comments on my personal blog too, for example.)
Ultimately I registered the domain name "blogspam.net", and set up a simple service on it which would test comments and judge them to be "SPAM" or "OK".
The current statistics show that this service has stopped 20 million spam comments since then. (We have to pretend I didn't wipe the counters once or twice.)
I've spent a while now re-implementing most of the old plugins in node.js, and I think I'll be ready to deploy the new service over the weekend. The new service will have to handle two different kinds of requests:
- New Requests
These will be submitted via HTTP POSTed JSON data, and will be handled by node.js. These should be nice and fast.
- Legacy Requests
These will come in via XML-RPC, and be proxied through the new node.js implementation. Hopefully this will mean existing clients won't even notice the transition.
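To make the "New Requests" path concrete, here is a minimal sketch of a node.js handler for POSTed JSON. It is not the real blogspam.net code: the field name `comment`, the toy link-counting rule, the helper names `judge` and `handleBody`, and the port in the final comment are all assumptions made for illustration. Only the "SPAM"/"OK" verdicts come from the description above.

```javascript
// Minimal sketch of a JSON-over-HTTP comment tester in node.js.
// Assumptions: the "comment" field, the link-counting rule, and the
// function names are hypothetical; this is not the real blogspam.net API.
const http = require("http");

// Toy stand-in for a chain of plugins: flag comments with too many links.
function judge(submission) {
  const text = String(submission.comment || "");
  const links = (text.match(/https?:\/\//g) || []).length;
  return links > 3
    ? { result: "SPAM", reason: "too many links" }
    : { result: "OK" };
}

// Turn a raw request body into an HTTP status plus a JSON-ready reply.
function handleBody(raw) {
  let submission;
  try {
    submission = JSON.parse(raw);
  } catch (err) {
    return { status: 400, body: { error: "invalid JSON" } };
  }
  if (submission === null || typeof submission !== "object") {
    return { status: 400, body: { error: "expected a JSON object" } };
  }
  return { status: 200, body: judge(submission) };
}

// Collect the POSTed body, then answer with the verdict as JSON.
const server = http.createServer((req, res) => {
  if (req.method !== "POST") {
    res.writeHead(405, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "POST required" }));
    return;
  }
  let raw = "";
  req.setEncoding("utf8");
  req.on("data", (chunk) => { raw += chunk; });
  req.on("end", () => {
    const reply = handleBody(raw);
    res.writeHead(reply.status, { "Content-Type": "application/json" });
    res.end(JSON.stringify(reply.body));
  });
});

// server.listen(8888); // uncomment to serve; the port is an arbitrary choice
```

Keeping the verdict logic in plain functions, separate from the HTTP plumbing, would also let an XML-RPC front end call the same code for legacy requests.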
I've not yet deployed the new code, but it is just a matter of time. Hopefully, with the service being node.js-based and significantly easier to install, update, and tweak, I'll get more contributions too. Happily, the dependencies are minimal:
- A redis-server for maintaining state:
  - The number of SPAM/OK comments for each submitting site.
  - An auto-expiring cache of blacklisted IP addresses. (I cache the results of various RBL lookups for 48 hours.)
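The two pieces of state listed above can be sketched without a running redis-server. The class below is an in-memory stand-in, not the real implementation: `StateStore`, its method names, and the key layout are hypothetical, and a real deployment would presumably lean on Redis commands such as INCR for the counters and SETEX for the expiring entries rather than JavaScript Maps.

```javascript
// In-memory stand-in for the state kept in Redis.
// Assumption: all names and the key layout here are hypothetical.
const TTL_MS = 48 * 60 * 60 * 1000; // 48 hours, as described above

class StateStore {
  constructor(now = () => Date.now()) {
    this.now = now;             // injectable clock, handy for testing
    this.counts = new Map();    // "site:verdict" -> number of comments
    this.blacklist = new Map(); // ip -> expiry timestamp in milliseconds
  }

  // Count one SPAM or OK verdict for a submitting site.
  recordResult(site, verdict) {
    const key = `${site}:${verdict}`;
    const total = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, total);
    return total;
  }

  // Remember a positive RBL result for 48 hours.
  blacklistIp(ip) {
    this.blacklist.set(ip, this.now() + TTL_MS);
  }

  // True while the cached entry is fresh; expired entries are dropped.
  isBlacklisted(ip) {
    const expiry = this.blacklist.get(ip);
    if (expiry === undefined) return false;
    if (this.now() >= expiry) {
      this.blacklist.delete(ip);
      return false;
    }
    return true;
  }
}
```

With Redis the expiry check disappears, because the server removes the key itself once its time-to-live runs out.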
The only significant outstanding issue is that I need to pick a node.js library for performing CIDR lookups - "Does 10.11.12.23 lie within 10.11.12.0/24?" - I'm surprised that functionality isn't available out of the box, but it is the only omission I've noticed.
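For IPv4 at least, the check itself is small enough to write by hand. The sketch below is one possible approach rather than any particular library's API; the names `ipv4ToInt` and `inCidr` are made up for this example, and IPv6 is deliberately ignored.

```javascript
// Sketch of an IPv4-only CIDR membership test.
// Assumption: the function names are hypothetical, and IPv6 is not handled.

// Convert a dotted-quad string to an unsigned 32-bit number, or null if invalid.
function ipv4ToInt(ip) {
  const parts = String(ip).split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Does `ip` lie within `cidr`, e.g. "10.11.12.0/24"?
function inCidr(ip, cidr) {
  const [base, bitsText = "32"] = String(cidr).split("/");
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const addr = ipv4ToInt(ip);
  const network = ipv4ToInt(base);
  if (addr === null || network === null || bits > 32) return false;
  // Divide away the host part and compare what is left (the network prefix).
  // Plain arithmetic avoids the signed 32-bit results of JavaScript's
  // bitwise operators.
  const hostSize = 2 ** (32 - bits);
  return Math.floor(addr / hostSize) === Math.floor(network / hostSize);
}
```

With this, `inCidr("10.11.12.23", "10.11.12.0/24")` returns `true`, while `inCidr("10.11.13.23", "10.11.12.0/24")` returns `false`.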
I've been keeping load & RAM graphs, so it will be interesting to see how the node.js service compares. I expect that if clients were using it directly, in preference to the XML-RPC version, then I'd get a hell of a lot more throughput, but with it hidden behind the XML-RPC proxy I'm less sure what will happen.
I guess I also need to write documentation for the new/preferred JSON-based API...