You didn't faint. That's always a good sign

18 January 2009 21:50

Joey has implemented blogspam support for ikiwiki, which you can find here.

He was also good enough to provide valuable feedback, so I can present a moderately updated API - most notably with the ability to train on comments when they are incorrectly marked.

I can imagine several other useful additions, but mostly I think the code should be cleaned up (it was written in an evening and posted to keep myself honest; I didn't really expect anybody to be terribly interested). After that it should just tick over until comment submitters adapt and improve to the extent that my sites suffer again ..

ObFilm: Kissed


Comments on this entry

Adam Trickett at 19:23 on 18 January 2009
ikiwiki is a really good wiki engine. It's nice to see it pick up some extra anti-spam features as spam is ruining everything on the Internet these days...
Anonymous at 19:50 on 18 January 2009
What stops a spammer from training a pile of spam messages as non-spam?
Steve Kemp at 20:01 on 18 January 2009

I've no idea how ikiwiki in particular suffers from spam problems, but I know that lots of wiki sites (especially those without user-registrations) suffer from spam.

As for bogus training? Currently there is nothing to stop it, but if it becomes a problem then the service will adapt.

The obvious way to do that is to require a "key" to be used to both detect and train comments - that would mean only the site owner would be able to train things.

Currently the Bayesian database is global, but another simple solution would be to make it per-client, or per-site, so that a spammy site owner, even registered, would only affect the comment scoring on their own site.
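
To make the two ideas above concrete, here is a minimal sketch of key-protected, per-site training. It is not the service's actual code or API: the class name `PerSiteTrainer`, its methods, the site names and the keys are all made up for illustration, and it only stores per-site token counts rather than doing any real Bayesian scoring.

```python
import hmac
from collections import defaultdict


class PerSiteTrainer:
    """Hypothetical per-site training store; all names here are assumptions."""

    def __init__(self, site_keys):
        # site -> shared secret that only the site owner should know
        self._site_keys = dict(site_keys)
        # site -> label -> token -> count
        self._counts = defaultdict(
            lambda: {"spam": defaultdict(int), "ok": defaultdict(int)}
        )

    def _authorised(self, site, key):
        expected = self._site_keys.get(site)
        if expected is None:
            return False
        # Constant-time comparison, so timing does not leak the key.
        return hmac.compare_digest(expected.encode(), key.encode())

    def train(self, site, key, comment, label):
        """Record a comment as 'spam' or 'ok' for one site only."""
        if label not in ("spam", "ok"):
            raise ValueError("label must be 'spam' or 'ok'")
        if not self._authorised(site, key):
            return False  # bogus training attempts are simply ignored
        for token in comment.lower().split():
            self._counts[site][label][token] += 1
        return True

    def token_count(self, site, label, token):
        """How often a token was seen with this label on this site."""
        return self._counts.get(site, {}).get(label, {}).get(token, 0)


trainer = PerSiteTrainer({"a.example": "secret-a", "b.example": "secret-b"})
trainer.train("a.example", "secret-a", "cheap pills here", "spam")
```

With this layout a wrong key trains nothing, and whatever the owner of `a.example` submits never touches the counts used for `b.example`.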


Daniel Pope at 10:39 on 29 January 2009
I had a look at blogspam.net because the other day I updated Mauvesoft (http://www.mauvesoft.co.uk/) and it picked up half a dozen spams within a couple of days. It's written with Django's standard comment app and evidently spammers are clever enough to avoid filling a hidden honeypot field named 'honeypot'. I've not hooked it up yet though.
I do have some feedback based on reviewing the code for the plugins.
1. You need to assess classification rates manually. It's not enough to notice that it sorts obvious spam from obvious ham: the test of a service like this is the misclassification rate.
2. stopwords.pm is a recipe for false positives. I don't know what your word list is, but methodologically, Bayes outperforms this hands down. The reason? Legitimate users say "Viagra" or "penis" sometimes.
3. This is perhaps the most subtle one. OR-ing together several different classifiers can produce a worse classifier (in terms of misclassification) than just relying on the better classifiers alone. When any single plugin can reject a message, the false-positive rate of the entire system is at least as bad as that of the WORST plugin, and usually worse. SpamAssassin gets around this by summing weighted scores, a simple way of combining classifiers, certainly, but one that's rather better than just accepting the first classifier to reject a message.
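
A rough worked example of point 3, assuming the plugins make their mistakes independently (a simplification) and using invented false-positive rates: if a comment is rejected as soon as any plugin flags it, a legitimate comment only survives when every plugin gets it right.

```python
from math import prod


def or_false_positive_rate(rates):
    """False-positive rate when ANY plugin flagging a comment rejects it.

    Assumes the plugins err independently of one another.
    """
    return 1.0 - prod(1.0 - r for r in rates)


# Hypothetical per-plugin false-positive rates: 1%, 2% and 5%.
rates = [0.01, 0.02, 0.05]
combined = or_false_positive_rate(rates)
print(f"{combined:.4f}")  # prints 0.0783
```

So three plugins that each look tolerable on their own reject nearly 8% of legitimate comments between them, which is worse than the 5% of the worst one.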
Steve Kemp at 12:07 on 29 January 2009

Feedback is always welcome, but I think this is probably not the best place to leave it.

Indeed the false positive, and false negative, rates are things to keep an eye on. I go through daily and reclassify comments that are caught - although this is too late to help the submitters. Short of a more complex API which would allow falling back to captchas or similar I can't think of a good solution to that.

Stopwords.pm - If you've found the code for that then in the file ../etc/stopwords you'll find the list.

Your point about OR-ing tests I'd not considered. I do run each comment through each plugin and terminate when one returns SPAM. I could sum and weight them, but thus far hadn't seen this as being an issue. Definitely food for thought though.
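
For comparison, the two approaches might be sketched like this. It is an illustration only: the plugin names, weights and threshold are invented and are not taken from the real plugins.

```python
from typing import Callable, List, Tuple

# (name, check, weight): check returns True when the comment looks like spam.
Plugin = Tuple[str, Callable[[str], bool], float]

# Hypothetical plugins and weights, for illustration only.
PLUGINS: List[Plugin] = [
    ("stopwords", lambda text: "viagra" in text.lower(), 2.0),
    ("too_many_links", lambda text: text.lower().count("http://") >= 3, 3.0),
]


def first_reject(comment: str, plugins: List[Plugin]) -> str:
    """Stop at the first plugin that flags the comment (an OR of all tests)."""
    for name, check, _weight in plugins:
        if check(comment):
            return f"SPAM:{name}"
    return "OK"


def weighted_sum(comment: str, plugins: List[Plugin], threshold: float = 4.0) -> str:
    """Run every plugin, add up the weights of those that fire, then compare."""
    score = sum(weight for _name, check, weight in plugins if check(comment))
    return "SPAM" if score >= threshold else "OK"


legit = "My doctor mentioned Viagra was originally a heart drug."
spam = "Cheap viagra http://a.example http://b.example http://c.example"

print(first_reject(legit, PLUGINS), weighted_sum(legit, PLUGINS))  # SPAM:stopwords OK
print(first_reject(spam, PLUGINS), weighted_sum(spam, PLUGINS))    # SPAM:stopwords SPAM
```

The legitimate comment trips only the stopword test, so the summed score stays under the threshold, while the real spam trips both tests and is still caught.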