The Spam Folder

Server-side vs. client-side spam filtering.

After my post about what causes mail to go to the spam folder, a reader[1] asked:

So why did I have to tell my new computer and new email system a dozen times that Facebook posts of various types were not spam before I could get it to stop throwing them all in my spam folder.

This raises the issue of server-side vs. client-side filtering, particularly in the context of spam filtering.

My earlier post addressed the server side of things only. Once you get past any anti-spam protections and filtering a particular ISP may provide at the server level, your mail is in the end-user’s INBOX (this is in all caps for a reason). If the end-user views their email via a web interface (how most users of Gmail, Yahoo!, and Hotmail read their mail), there’s not much more you (as a sender) need to worry about. All of these services allow users to configure their own filters, which automatically shunt mail that meets user-defined criteria to other folders. These can include filters that send certain types of mail to the junk folder, such as all mail from a certain email address or domain, or any mail with certain words or specific combinations of words in the subject or body of the message.
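The user-defined filters described above are easy to picture in code. What follows is only a minimal sketch of the idea, not how any particular webmail service implements it; the `Rule` class, the `file_message` function, and the example addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """One hypothetical user-defined rule: if every condition matches,
    the message is filed in `folder`."""
    folder: str
    sender_contains: str = ""    # an address or a domain, e.g. "@example.net"
    subject_words: tuple = ()    # every word must appear in the subject

    def matches(self, sender, subject):
        if self.sender_contains and self.sender_contains.lower() not in sender.lower():
            return False
        subject_lower = subject.lower()
        return all(word.lower() in subject_lower for word in self.subject_words)


def file_message(rules, sender, subject, default="INBOX"):
    """Return the folder chosen by the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(sender, subject):
            return rule.folder
    return default


# Two illustrative rules: junk everything from one domain, and junk any
# subject that contains both "free" and "prize".
rules = [
    Rule(folder="Junk", sender_contains="@example.net"),
    Rule(folder="Junk", subject_words=("free", "prize")),
]

print(file_message(rules, "promo@example.net", "Hello"))
print(file_message(rules, "friend@example.org", "Claim your FREE prize"))
print(file_message(rules, "friend@example.org", "Lunch on Friday?"))
```

The first two messages match a rule and go to the junk folder; the third matches nothing and stays in the INBOX. Note that nothing here "learns": the rules do exactly what the user wrote, no more and no less.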

Those using an email program on their computer, such as Microsoft Outlook, Windows Live Mail (the successor to Outlook Express), Thunderbird, Apple Mail, The Bat, etc., are a different matter. These programs, typically called Mail User Agents (MUAs) by mail administrators, allow the end-user to download mail from their INBOX on the mail server directly to their computer, where they typically read the mail, then either delete it or manually move it to folders inside the program. All of these MUAs include their own spam filters.

These days these filters typically use what is known as “Bayesian filtering”. The mathematics behind it is beyond my ability to explain (see the link to Wikipedia), but it essentially boils down to a filter that examines various aspects of the email to calculate the probability that the email is spam. If that probability is above a certain threshold determined by the creators of the program, the program moves the message to the junk folder. These are usually called “learning” filters as well: almost all of these programs have a button the user can click while viewing a message to tell the system that it is spam. The message is usually moved to the junk folder, and the program remembers that the user marked it as spam. In the future, when a very similar message arrives, the Bayesian filter will assign it a higher probability of being spam. Once the user has flagged enough similar messages, the filter will eventually figure out that they are spam on its own and automatically toss them into the junk folder.
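To make the “calculate a probability, compare it to a threshold” idea a little more concrete, here is a rough sketch of one common way such a score can be combined from per-word evidence. It is not the code of any real mail program: the word probabilities, the neutral value for unknown words, and the 0.9 threshold are all numbers I made up for illustration.

```python
import math

# Hypothetical per-word spam probabilities, as if learned from earlier mail.
WORD_SPAM_PROBABILITY = {
    "free": 0.90, "prize": 0.95, "click": 0.80,
    "lunch": 0.05, "meeting": 0.10,
}
UNKNOWN_WORD = 0.5      # a word never seen before is treated as neutral
JUNK_THRESHOLD = 0.9    # assumed cut-off chosen by the program's authors


def spam_probability(text):
    """Combine the per-word probabilities, treating words as independent."""
    log_spam = 0.0
    log_ham = 0.0
    for word in set(text.lower().split()):
        p = WORD_SPAM_PROBABILITY.get(word, UNKNOWN_WORD)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Equivalent to spam / (spam + ham), computed from the log values.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def folder_for(text):
    """Junk the message only if its score exceeds the threshold."""
    return "Junk" if spam_probability(text) > JUNK_THRESHOLD else "INBOX"


print(folder_for("Click here for your FREE prize"))
print(folder_for("lunch meeting tomorrow"))
```

With these made-up numbers the first message scores well above the threshold and is junked, while the second scores close to zero and stays in the INBOX. A message made up entirely of unfamiliar words scores exactly 0.5, which is why a filter that has not learned much yet has to lean on whatever its makers built in.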

The reverse is also true, which brings us back around to the reader’s question: sometimes the filter makes a “false positive” decision, throwing something that isn’t spam into the junk folder. If the user checks their junk folder and finds mail they don’t consider spam, they can move it out of the junk folder and back into their Inbox (or another folder), or they may have a “This is not spam” button to click. Either way, the Bayesian filter examines the message and adjusts accordingly. In both cases, it “learns” what is, and is not, spam. Over time (in theory) the filter gets smarter and smarter, making ever more refined decisions about what is spam and what isn’t.
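Here is a toy version of that learning loop, again only a sketch under my own assumptions. The “factory default” training messages, the sample notifications, the smoothing, and the 0.9 threshold are invented, and a real MUA’s filter is far more elaborate. The point is just to show a “not spam” correction changing later decisions.

```python
from collections import Counter

JUNK_THRESHOLD = 0.9   # assumed cut-off
word_counts = {"spam": Counter(), "ham": Counter()}
message_counts = {"spam": 0, "ham": 0}


def learn(text, label):
    """Record one message as 'spam' or 'ham' (ham = not spam)."""
    message_counts[label] += 1
    word_counts[label].update(set(text.lower().split()))


def spam_probability(text):
    """Naive Bayes estimate with add-one smoothing over the words present."""
    odds = (message_counts["spam"] + 1) / (message_counts["ham"] + 1)
    for word in set(text.lower().split()):
        p_spam = (word_counts["spam"][word] + 1) / (message_counts["spam"] + 2)
        p_ham = (word_counts["ham"][word] + 1) / (message_counts["ham"] + 2)
        odds *= p_spam / p_ham
    return odds / (1 + odds)


def deliver(text):
    return "Junk" if spam_probability(text) > JUNK_THRESHOLD else "INBOX"


# Made-up "factory default" knowledge shipped with the program.
for text in ["new notification click here to view",
             "someone tagged you click here",
             "view your new message click here"]:
    learn(text, "spam")
for text in ["meeting agenda attached", "see you at lunch"]:
    learn(text, "ham")

# Made-up notifications arriving one after another.
notifications = [
    "someone commented on your post click here to view",
    "someone liked your photo click here to view",
    "someone shared your post click here to view",
]

folders = []
for note in notifications:
    folder = deliver(note)
    folders.append(folder)
    if folder == "Junk":
        learn(note, "ham")   # the user clicks "This is not spam"

print(folders)
```

In this tiny setup the first notification is junked, the user rescues it, and the later ones land in the INBOX. A real filter works with a much larger vocabulary and far more accumulated evidence, so it is not surprising that it can take many more corrections before it changes its mind.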

In this case, the reader is dealing with a brand new computer, with a fresh install of Microsoft Outlook. Thus the spam filter built into Outlook is set to its factory defaults and hasn’t yet “learned” anything beyond what it was originally programmed with. Facebook notifications, apparently, look very much like spam to the filter, so it tosses them in the junk folder. As the user kept finding these desired messages in the junk folder and moving them out, it eventually learned that messages from Facebook are OK and stopped tossing them in the junk folder. It can be frustrating at first, when mail you expect to see in your mailbox and think is clearly not spam keeps ending up in the junk folder, but eventually the filter “gets it” and stops throwing your desired mail away.

Note that all of these MUAs also allow the end-user to define their own filtering rules, just like on the web-based interfaces. These filters include the ability to throw matching email in the Junk folder, so even if the Bayesian filter never does figure out on its own that mail from “friend@public.com” is unwanted, you can still throw it away automatically.

Also note that the bulk of the mail filtering rules the big ISPs mentioned above use on the server are, in fact, Bayesian. They just have a LOT of mail (a “corpus”) to work with to “teach” their filters, so they tend to be REALLY smart. Even so, there’s only so much (current) computer intelligence can do.

[1] Full disclosure: this was my mother, in a comment on the cross-post to Facebook.