The problem with the English language is all those pesky words

29 June 2008 21:50

There exist many Bayesian/statistical spam filters, ranging from spambayes and spamassassin to crm114. Each of them works in its own way. Having used and tested almost all of them, I've noticed a common flaw.

The vast majority of spam-filters struggle to correctly classify "419 scam" mails, lottery fraud, and similar mails.

Why is that? Having read hundreds of these mails, I can see several things that are common in this kind of mail:

  • Mention of currency in both numeric and word forms. ($1,000,000 + 1 million US dollars)
  • Mention of a country / nationality (Sierra Leone, Nigerian)
  • Mention of a reference/claim number and often "official address".
  • Christian references.
  • Greetings such as "dear friend", and mentions of discretion/secrecy.
  • Size. (A scam mail is typically greater in length than an average spam mail).

Whilst none of these is individually indicative of a scam mail, it is interesting to count their combined occurrence.

I've written a toy program to count these things, and so far the success rate is >60%, which is a reasonable start, provided this kind of detection occurs after normal filtering.
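The toy program itself isn't shown here, but the counting idea could be sketched roughly as follows. This is a minimal illustration, not the actual program: the regular expressions, the `MIN_LENGTH` cut-off, the `THRESHOLD` value, and the function names are all assumptions made up for the example, and it makes no claim to reproduce the >60% figure.

```python
import re

# Hypothetical patterns, loosely following the list of indicators above.
# None of these come from the original toy program.
INDICATORS = {
    "currency_numeric": re.compile(r"\$\s?\d[\d,]*"),
    "currency_words": re.compile(
        r"\b(?:thousand|million|billion)\s+(?:us\s+)?(?:dollars|pounds|euros)\b",
        re.I,
    ),
    "country": re.compile(r"\b(?:nigerian?|sierra leone)\b", re.I),
    "reference": re.compile(r"\b(?:reference|claim|ref)\s*(?:number|no\.?|#)", re.I),
    "religious": re.compile(r"\b(?:god|christ|bless(?:ed|ing)?|church)\b", re.I),
    "greeting": re.compile(r"\bdear\s+(?:friend|sir|madam|beloved)\b", re.I),
    "secrecy": re.compile(
        r"\b(?:confidential(?:ity)?|discreet|discretion|secre(?:t|cy))\b", re.I
    ),
}

MIN_LENGTH = 1500  # assumed: bodies at least this long count as "long"
THRESHOLD = 4      # assumed: distinct indicators needed to flag a mail


def scam_indicators(body):
    """Return the set of indicator names that occur in the mail body."""
    hits = {name for name, pattern in INDICATORS.items() if pattern.search(body)}
    if len(body) >= MIN_LENGTH:
        hits.add("long_body")
    return hits


def looks_like_scam(body, threshold=THRESHOLD):
    """Flag a mail when enough distinct indicators occur together."""
    return len(scam_indicators(body)) >= threshold
```

Each indicator is counted once per mail rather than once per match, so a single repeated word cannot trip the threshold on its own; it is the combination that matters.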

I may experiment further, but I figured a public query on scam detection might be appropriate.

Whilst detecting a scam mail is a subset of detecting spam, there are probably simplifications that can be made, and exploring those wouldn't be a bad thing.

ObQuote: Buffy.


Comments on this entry

cstamas at 18:12 on 29 June 2008
Did you try dspam? Maybe it's worth a try.
Marek at 09:17 on 30 June 2008
Hi Steve,
I filter my mailbox using just crm114, and I have to say it's a long time since I remember a 419 making it into my Ham folder. I glanced through the few days of Spam (15k messages) and there are plenty of them there, so I guess my crm114 CSS tables have learned "millions", "dollars", "dear friend", "confidential", "inheritance", "address" and "passport" in close proximity are bad signs.
Stay crispy,
Marek

Steve at 14:29 on 30 June 2008

I've tried dspam. In my use, and testing, CRM114 was better. But even that fails with a lot of the spam that I'm seeing.

Right now I'm detecting maybe 2 million spam mails a month, with 10,000-20,000 messages classified as "unsure".