A Plan for Spam: In the last few months, we’ve talked about Bayesian-this and Bayesian-that quite a bit, but what does “Bayesian” mean? Here’s a great article that explains the concepts behind Bayesian filtering for spam. It’s long, but a worthwhile read.
The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that. […]
Words that occur disproportionately rarely in spam (like “though” or “tonight” or “apparently”) contribute as much to decreasing the probability as bad words like “unsubscribe” and “opt-in” do to increasing it. So an otherwise innocent email that happens to include the word “sex” is not going to get tagged as spam.
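To see why one "bad" word does not condemn a message, here is a minimal sketch of one common way to combine per-word spam probabilities, in the naive Bayes style. The numbers are invented for illustration and the function name is my own. None of it comes from a real mail corpus or a particular filter.

```python
from math import prod

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities into one score between 0 and 1.

    Assumes the words are independent and that spam and non-spam are
    equally likely beforehand (both are simplifying assumptions).
    """
    spam_evidence = prod(word_probs)
    ham_evidence = prod(1.0 - p for p in word_probs)
    return spam_evidence / (spam_evidence + ham_evidence)

# Hypothetical per-word probabilities: "sex" looks very spammy,
# while the other three words rarely show up in spam.
word_probs = {
    "sex": 0.97,
    "though": 0.05,
    "tonight": 0.05,
    "apparently": 0.05,
}

score = combined_spam_probability(list(word_probs.values()))
print(f"combined spam probability: {score:.4f}")
```

With these made-up values the combined score comes out under 1%, even though "sex" alone scores 0.97. The three innocent words outweigh the one suspicious word.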
So, where did the term “Bayesian” come from? There was a preacher named Thomas Bayes who lived in the 18th century and developed something now called Bayes’ theorem. Sadly, this is where my accurate understanding leaves off, and my “fake it from context” understanding picks up…
From all my reading, my understanding is that Bayesian analysis scores the presence of words and word combinations against a known sample to estimate what a chunk of text is about.
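For the record, the theorem itself is short. Written for the spam case, it gives the probability that a message is spam given that it contains a particular word. The event names "spam", "ham" (non-spam), and "word" are labels I picked for illustration.

```latex
P(\text{spam} \mid \text{word}) =
  \frac{P(\text{word} \mid \text{spam})\, P(\text{spam})}
       {P(\text{word} \mid \text{spam})\, P(\text{spam})
        + P(\text{word} \mid \text{ham})\, P(\text{ham})}
```

Each quantity on the right can be estimated by counting words in mail you have already sorted into spam and non-spam.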
For instance, if you search for the word “penguin,” you may get results back about Linux, because a penguin is the Linux mascot, or about hockey, because there’s a team named the Penguins. But if your search tool knows that you want information about the animal penguin, then it would assign more weight to results that contained both “penguin” and “zoo” or “Antarctica.”
How does the filter know that “penguin” is likely to be about the animal when combined with “zoo” or “Antarctica”? Because you’ve given it a sample of 200 documents about penguins (the animal) and it analyzed those documents to figure out what you felt was an accurate result for a search for “penguin” (the animal).
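The idea of learning from sample documents can be sketched as a tiny naive Bayes classifier. This is an illustration under loud assumptions: four invented one-line "documents" stand in for the 200-document sample, the labels and function names are made up, and real search tools or filters are far more involved.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (deliberately crude)."""
    return text.lower().split()

def train(samples):
    """Count words per label and documents per label from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label whose sample documents best explain the text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Start from how common this label is among the samples.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training sample: invented sentences, not real documents.
SAMPLES = [
    ("animal", "penguin colony in antarctica feeds on fish"),
    ("animal", "the zoo opened a new penguin exhibit"),
    ("linux", "the penguin mascot of the linux kernel"),
    ("hockey", "the penguins won the hockey game tonight"),
]

word_counts, doc_counts = train(SAMPLES)
print(classify("penguin zoo antarctica", word_counts, doc_counts))
print(classify("penguin linux kernel", word_counts, doc_counts))
```

With this toy sample, the first query is labeled "animal" and the second "linux". The word "penguin" alone does not decide it. The neighboring words seen in the training documents do.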
Again, let me tell you that the above explanation is a stunning simplification, and may, in fact, be stunningly wrong. So to make it simpler and safer, let’s just stick to the fact that a Bayesian filter reads text and knows what it’s looking for, so it knows when it finds it.
Given that, let me apologize in advance to those of you who know what you’re talking about and implore you to comment with more intelligence than I’ve demonstrated here.