Spam: filtering

This page looks at some fundamentals of SPAM and SPAM filtering.

In particular, it uses the idea of a simple Bayesian filter. SPAM is junk email. By most estimates, about 85% of email traffic is SPAM. SPAM filtering software can be used to keep users' inboxes from being overloaded with it.

A SPAM rating is an estimate of the probability that a given email message is SPAM. This rating is often prepended to the subject line of the email message, which allows the SPAM rating software to work independently of the actual email system used. Since the nature and recognition of SPAM change over time, either a service is needed to identify SPAM or the user must somehow "train" the SPAM recognizer. This training is often done with what is called Bayesian filtering because, at its core, it uses Bayes' rule to update probabilities.
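The Bayes' rule update and the subject-line prefix can be sketched as follows. This is only a minimal illustration, not the method of any particular filter: the function names, the word likelihoods, the 0.5 prior, the 0.9 threshold, and the "[SPAM nn%]" prefix format are all assumptions made for the example, and treating words as independent evidence is a simplification.

```python
def posterior_spam(prior_spam, p_word_given_spam, p_word_given_ham):
    """Bayes' rule: P(spam | word) from a prior and two likelihoods."""
    numerator = p_word_given_spam * prior_spam
    evidence = numerator + p_word_given_ham * (1.0 - prior_spam)
    return numerator / evidence


def rate_message(words, likelihoods, prior_spam=0.5):
    """Update the SPAM probability once per known word.

    Assumes (naively) that words are independent pieces of evidence.
    likelihoods maps word -> (P(word | spam), P(word | not spam)).
    """
    p = prior_spam
    for word in words:
        if word in likelihoods:
            p_spam, p_ham = likelihoods[word]
            p = posterior_spam(p, p_spam, p_ham)
    return p


def tag_subject(subject, rating, threshold=0.9):
    """Prepend the rating to the subject when it meets the threshold."""
    if rating >= threshold:
        return f"[SPAM {rating:.0%}] {subject}"
    return subject


# Hypothetical likelihoods, as if learned from messages the user marked.
likelihoods = {"free": (0.20, 0.01), "meeting": (0.01, 0.10)}

rating = rate_message(["free", "offer"], likelihoods)
tagged = tag_subject("Special offer", rating)
```

"Training" in this sketch would amount to re-estimating the likelihoods table from messages the user has marked as SPAM or not SPAM.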

The difficult part of SPAM filtering is that two kinds of error are possible.

A meaningful message might be classified as SPAM (type 1 error, false positive, not good).
A SPAM message might not be classified as SPAM (type 2 error, false negative, annoying).

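In the absence of perfect information (i.e., in the real world), reducing one of these errors generally increases the other: a stricter filter catches more SPAM but also flags more meaningful messages, and a more lenient filter does the reverse.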
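The trade-off between the two errors can be seen by moving the cutoff at which a rating counts as SPAM. The sketch below uses made-up ratings and labels (not real data) and a hypothetical helper, count_errors, purely to illustrate the effect.

```python
def count_errors(scored, threshold):
    """Return (false_positives, false_negatives) for a given cutoff.

    scored is a list of (spam_rating, is_actually_spam) pairs.
    A message is classified as SPAM when its rating >= threshold.
    """
    false_positives = sum(
        1 for rating, is_spam in scored if rating >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for rating, is_spam in scored if rating < threshold and is_spam
    )
    return false_positives, false_negatives


# Made-up ratings for six messages; True means the message really is SPAM.
scored = [
    (0.95, True), (0.80, True), (0.60, False),
    (0.55, True), (0.30, False), (0.10, False),
]

for threshold in (0.5, 0.7, 0.9):
    fp, fn = count_errors(scored, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With these numbers, raising the cutoff from 0.5 to 0.9 removes the one false positive but lets two SPAM messages through.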

A common strategy is to filter out most SPAM messages while giving users a way to review what was filtered, so that they can "recover" a meaningful message that was classified as SPAM by mistake.

Google's Gmail does a very good job of SPAM identification and elimination because it can use global knowledge from a large number of email messages. Once one source (e.g., IP and message) is identified as sending SPAM to multiple targets, it can be eliminated for every user of Gmail. Local systems have much less information with which to make SPAM rating decisions.