Spammers Pillage Classic Novels

Looks like text from the Project Gutenberg archives can bypass the Google spam filter. I received a standard spam for a pump-and-dump penny stock scam today containing some strange prose. Turns out it’s the text of a 1928 novel by Joseph C. Lincoln called Silas Bradford’s Boy. The ad itself was an animated GIF (complete with split-second subliminal BUY BUY BUY flashes) sporting the details of “National Healthcare Logistics”, a hearty firm with shares at $.024.

I was confused a few months ago by these strange spam messages, some of which lacked any “Ad” at all. Slashdot asked what they were for as well, as did some particularly deranged (or satirical) message board posters.

The truth is these nonsensical or obtuse messages are an attempt to beat an algorithm, in this case the Bayesian filter used by spam blockers. Much as a Markov chain can generate scrambled text in the same “flavor” as its source, a Bayesian filter uses word statistics to estimate how likely an email is to be spam. Certain giveaways (Viagra, penny stocks, shady weblinks) push the score past a threshold, and the message lands in the spam folder. Padding a message with words that mostly show up in legitimate mail pulls the score back down. In this case, the ever-resourceful spammers have found a vast database of high-minded prose that’s pretty much the antithesis of lowbrow adverts.
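To make the trick concrete, here is a minimal sketch of a naive Bayes word filter in Python. It is not how Gmail (or any particular product) works; the tiny training sets, the equal priors, and the function names `train` and `spam_probability` are all assumptions made up for illustration. It only shows why tacking novel-like prose onto an ad can drag the score down.

```python
import math
from collections import Counter

def train(docs):
    """Count word occurrences across a list of example messages."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def spam_probability(message, spam_counts, ham_counts):
    """Naive Bayes estimate of P(spam | message), assuming equal priors."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_odds = 0.0
    for word in message.lower().split():
        # Add-one (Laplace) smoothing keeps unseen words from zeroing things out.
        p_spam = (spam_counts[word] + 1) / (spam_total + vocab_size)
        p_ham = (ham_counts[word] + 1) / (ham_total + vocab_size)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Toy training data (invented for this example).
spam_counts = train(["buy cheap stock now", "buy viagra now", "hot penny stock buy"])
ham_counts = train([
    "the captain walked along the shore",
    "she said the village was quiet",
    "meeting notes from the captain",
])

ad = "buy penny stock now"
padded = ad + " the captain walked along the quiet village shore"

ad_score = spam_probability(ad, spam_counts, ham_counts)
padded_score = spam_probability(padded, spam_counts, ham_counts)
print(f"ad alone:    {ad_score:.2f}")
print(f"with padding: {padded_score:.2f}")
```

With this toy data the bare ad scores well above a 0.9 cutoff, while the padded version drops below 0.5, even though the ad text itself has not changed.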

Of course, Google could always incorporate the Project Gutenberg archives into its spam filter, but this potentially leads to false positives – email I WANT to receive ending up as Spam. It’s a classic dilemma of information processing and artificial intelligence.

One potential solution would be to use the content of email I’ve already received as a test for validity: if I’ve received large chunks of classic novels in the past, let them stay. Also, let spam fall on a sliding scale, not into all-or-nothing folders. Unsolicited newsletters would be on a different “tier” than poorly spelled V14GR4 dumps. Gmail has done a good job of including a “Mark as Spam” button. This allows users to “train” the filter. Whether this button contributes to a global or personalized filter, I don’t know, but I think the latter would be ideal. As information on the net is distilled and remixed, it will be increasingly difficult to find any sort of universal solution.
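Here is a rough Python sketch of what a personalized, tiered filter along these lines might look like. Everything in it is hypothetical: the `PersonalFilter` class, the `mark_spam` / `mark_not_spam` methods, the tier names, and the 0.9 and 0.5 cutoffs are invented for illustration and say nothing about how Gmail actually handles its button.

```python
import math
from collections import Counter

class PersonalFilter:
    """Hypothetical per-user filter trained only by that user's feedback."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def mark_spam(self, message):
        self.spam.update(message.lower().split())

    def mark_not_spam(self, message):
        self.ham.update(message.lower().split())

    def score(self, message):
        """Naive Bayes style spam probability with add-one smoothing."""
        vocab_size = len(set(self.spam) | set(self.ham)) or 1
        spam_total = sum(self.spam.values())
        ham_total = sum(self.ham.values())
        log_odds = 0.0
        for word in message.lower().split():
            p_spam = (self.spam[word] + 1) / (spam_total + vocab_size)
            p_ham = (self.ham[word] + 1) / (ham_total + vocab_size)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

def tier(score):
    """Map a score onto a sliding scale instead of a single spam folder."""
    if score >= 0.9:
        return "junk"      # assumed cutoff
    if score >= 0.5:
        return "suspect"   # assumed cutoff
    return "inbox"

# A user who actually reads classic novels by email (toy feedback).
f = PersonalFilter()
f.mark_spam("buy v14gr4 now cheap")
f.mark_not_spam("weekly newsletter from the book club")
f.mark_not_spam("the captain walked along the shore")

for msg in ["buy v14gr4 cheap", "newsletter buy", "the captain walked along the shore"]:
    print(f"{tier(f.score(msg)):8} {msg}")
```

Because the word counts belong to one user, novel prose that this reader has marked as wanted stays in the inbox, while the same prose could still count against a message for someone who never receives it.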

The best approach is organically grown.

