Recently a former colleague reached out to me on LinkedIn to ask:
I have a question regarding email delivery. What cause emails to go into someone’s spam email box? I understand that there maybe(sic) filters that looks at the content to make that determination. I would think there are many other factors.
Yes, there are quite a number of things that can cause mail to go to the spam folder. The contents of the message are a big factor. Of course every ISP applies different rules, so what causes mail to go into the spam folder of a Yahoo! mailbox will differ from what matches the rules on Gmail, or Hotmail, etc. Some ISPs will accept certain mail but put it in the spam folder, while other ISPs would reject the same mail outright when the sending mail server connects to deliver it.
Are you having a specific problem that you’re trying to solve?
I don’t have a specific problem. Just interested in understanding how spam filtering works. Since I know an expert, why not ask directly.
Are there headers the ISP look at to validate the email?
I wrote up a quick primer on some of the esoterica of spam filtering.
This is by no means comprehensive, and not guaranteed 100% accurate.
There are several headers ISPs will look at to detect spam, none of them 100% definitive. Every ISP uses different rules and puts more weight on some factors than others.
Almost all ISPs (Gmail is a large exception) use some form of IP address and/or domain name blacklists and whitelists. They usually maintain their own lists; many will also consult external lists such as the ones published by Barracuda Networks or Spamhaus. The IP address of a connecting mail server will be checked against these lists, and the connection may be rejected at any point during the SMTP transaction: immediately upon connect, after the helo/ehlo (depending on what hostname the connecting server uses to say helo), upon the “mail from”, or after the message data (which includes the “From:” header) has been sent.
- Blacklists and DNS checks
- Accepting or rejecting the message; SMTP codes
They will check whether the connecting IP address is listed on a blacklist (either theirs, or external), whether the connecting IP address has a reverse DNS record (PTR record), whether the hostname returned by the reverse lookup has a forward record (“A” record), and whether an IP address returned by the forward lookup matches the connecting IP address.
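The checks above can be sketched roughly as follows. This is a minimal illustration, not how any particular ISP implements it: the function names are made up for this example, the blacklist zone is a placeholder, and the DNS lookups are passed in as functions so the logic can be tried without touching the network. DNS-based blacklists are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone name.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def forward_confirmed_rdns(ip, reverse_lookup, forward_lookup) -> bool:
    """Return True if ip -> PTR hostname -> A records leads back to ip.

    reverse_lookup(ip) returns a hostname or None;
    forward_lookup(hostname) returns a list of IP address strings.
    """
    hostname = reverse_lookup(ip)
    if not hostname:
        return False  # no PTR record at all
    return ip in forward_lookup(hostname)


def system_reverse_lookup(ip):
    """PTR lookup using the system resolver (hypothetical helper)."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def system_forward_lookup(hostname):
    """A-record lookup using the system resolver (hypothetical helper)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []
```

In real use you would call `forward_confirmed_rdns(ip, system_reverse_lookup, system_forward_lookup)`; a receiving server would treat a failure as one more signal to weigh, not necessarily a reason to reject.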
Next, when the connecting server sends a “helo” or “ehlo” in response to the SMTP banner, the ISP’s systems will check the blacklists to see if they have a record for the hostname the connecting server sends. They will also check whether that hostname matches the DNS records they looked up in the previous step.
Next, the connecting server will say “mail from: <email@address>” and “rcpt to: <email@example.com>”. (These are referred to as the “envelope from” and “envelope to”. These are not the “From:” and “To:” you see when you read the message. More on that later.) At this point the ISP mail servers will check if the connecting host is authorized to send mail “from” the domain of the envelope-from address. They do this by checking the SPF record published in that domain’s DNS. Most will treat this as only “advisory” and will not bounce mail coming from a host that is not “authorized” by the domain’s SPF record. They will also check their domain-based blacklists for the “sending” domain.
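To make the SPF check concrete, here is a deliberately simplified sketch. It only understands the `ip4`, `ip6` and `all` mechanisms of an SPF record and silently skips everything else (`include`, `a`, `mx`, `redirect` and so on), so it is an illustration of the idea, not a compliant evaluator. The function name and the example record are invented for this example.

```python
import ipaddress


def check_spf_ip(record: str, client_ip: str) -> str:
    """Evaluate only the ip4/ip6/all mechanisms of an SPF record.

    Returns "pass", "fail", "softfail", "neutral", or "none" when the
    text is not an SPF record at all.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    qualifiers = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a qualifier means "pass"
        if term[0] in qualifiers:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name == "all":
            return qualifiers[qualifier]
        if name in ("ip4", "ip6"):
            network = ipaddress.ip_network(value, strict=False)
            if ip.version == network.version and ip in network:
                return qualifiers[qualifier]
        # Any other mechanism or modifier is ignored in this sketch.
    return "neutral"  # nothing matched and there was no "all"


record = "v=spf1 ip4:192.0.2.0/24 include:_spf.example.net -all"
print(check_spf_ip(record, "192.0.2.55"))    # pass
print(check_spf_ip(record, "198.51.100.7"))  # fail
```

A receiver that treats SPF as advisory would feed a "fail" or "softfail" result into its overall scoring instead of bouncing the message on the spot.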
At this point the receiving mail server will respond either “250 OK” and the sending server will move on to the next step, or the receiving server will respond with an SMTP status code indicating why they are not accepting the mail (recipient’s address doesn’t exist (550), recipient’s mailbox is full (552), receiving server is too busy to process the request, you’ve been blocked by our spam filters, you’re sending too many messages at a time / within the last minute / hour / day, etc). Usually there will be some text along with the SMTP status code explaining what the code means.
SMTP codes are much like HTTP response codes. They are 3 digits, and the first digit indicates the type of response: 2XX is success, 4XX is a temporary rejection and the sending server can reconnect and re-try sending the message later. 5XX is a permanent failure and the sending server should not bother trying again, and SHOULD send a bounce message back to the sender at that point. Of course, not all sending servers properly respond to the error codes, especially spam software. Many spam senders won’t ever bother to retry the message on a 4XX response (it’s just not economical for them to keep re-trying when they have thousands or millions of messages to get through); worse ones are the opposite and will keep retrying even on a 5XX response.
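As a small sketch of how a well-behaved sending server might act on these reply classes (the function names and the reply texts are invented for this example; only the meaning of the first digit comes from the SMTP standard):

```python
def classify_smtp_reply(line: str) -> str:
    """Classify an SMTP reply line by the first digit of its 3-digit code."""
    code = line[:3]
    if len(code) != 3 or not code.isdigit():
        return "unknown"
    return {
        "2": "success",
        "3": "intermediate",       # e.g. the go-ahead after "data"
        "4": "temporary failure",  # retry later
        "5": "permanent failure",  # do not retry; bounce to the sender
    }.get(code[0], "unknown")


def next_action(line: str) -> str:
    """What a well-behaved sender should do after this reply."""
    kind = classify_smtp_reply(line)
    if kind in ("success", "intermediate"):
        return "continue"
    if kind == "temporary failure":
        return "queue and retry later"
    if kind == "permanent failure":
        return "give up and send a bounce"
    return "treat as an error"


print(next_action("250 OK"))                         # continue
print(next_action("451 Try again later"))            # queue and retry later
print(next_action("550 No such user here"))          # give up and send a bounce
```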
A common anti-spam technique, called “greylisting”, is to respond with a 4XX code (“try again later”) the first time a given IP address connects and attempts to send FROM a given domain, TO a given domain. Most spam servers won’t bother trying again later. A well-behaved sender will, and if it has waited long enough (usually about 5 minutes), the receiving server will let the mail through (if it doesn’t fail for other reasons).
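A greylist can be sketched in a few lines. This is an in-memory toy under stated assumptions: the class name, the reply texts and the exact 451 code are choices made for this example, entries never expire, and a real implementation would persist its state and whitelist known-good senders. The clock is injectable so the behaviour can be demonstrated without waiting five minutes.

```python
import time


class Greylist:
    """Defer the first attempt from each (IP, sender domain, recipient domain)."""

    def __init__(self, min_delay_seconds=300, clock=time.time):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.first_seen = {}  # key -> timestamp of the first attempt

    @staticmethod
    def _domain(address: str) -> str:
        return address.rpartition("@")[2].lower()

    def check(self, client_ip: str, mail_from: str, rcpt_to: str) -> str:
        key = (client_ip, self._domain(mail_from), self._domain(rcpt_to))
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"
        return "451 Greylisted, please try again later"


fake_now = [1000.0]
greylist = Greylist(clock=lambda: fake_now[0])
print(greylist.check("192.0.2.10", "a@example.com", "b@example.org"))  # 451 ...
fake_now[0] += 360  # the sender retries six minutes later
print(greylist.check("192.0.2.10", "a@example.com", "b@example.org"))  # 250 OK
```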
If the mail is signed with DKIM, the receiving server will (once it has received the message) check the signature against the public key published in the DNS of the signing domain named in the signature.
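Where that public key lives follows from two tags in the “DKIM-Signature:” header: `d=` (the signing domain) and `s=` (the selector). The key is published as a TXT record at `<selector>._domainkey.<domain>`. The sketch below only parses the header and builds that DNS name; it does not fetch the key or verify the cryptographic signature, and the function names and header values are invented for illustration.

```python
def parse_dkim_tags(header_value: str) -> dict:
    """Split a DKIM-Signature header value into its tag=value pairs."""
    tags = {}
    for part in header_value.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            # Folding whitespace inside values is not significant here.
            tags[name.strip()] = "".join(value.split())
    return tags


def dkim_key_dns_name(header_value: str) -> str:
    """Return the DNS name where the signer's public key should be published."""
    tags = parse_dkim_tags(header_value)
    return f"{tags['s']}._domainkey.{tags['d']}"


signature = (
    "v=1; a=rsa-sha256; d=example.com; s=mail2024; "
    "h=from:to:subject:date; bh=abc=; b=def=="
)
print(dkim_key_dns_name(signature))  # mail2024._domainkey.example.com
```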
If the domain in the “From:” header publishes a DMARC record, the receiving server will check that the message passes SPF or DKIM for a domain that matches it. This is also generally treated as “advisory”, though in theory it should be definitive.
Next the sending server will send “data” and the receiving server should respond with a 354 code, meaning “go ahead”. At that point the sending server is clear to send the message. This will include the “headers”. There should always be “From:”, “To:”, “Subject:”, “Date:” and “Message-ID:”. There are usually a number of others. USUALLY the “envelope from” and the “From:” should match, as should the “envelope to” and “To:”, though not always. A “Date:” that hasn’t occurred yet is a pretty dead giveaway that the message is spam. As you might imagine, “Subject:” will be checked against spam filters, though not as much as many people think. After all the headers comes the body of the message, which is, of course, checked against spam filters. Various key words are looked for, and usually some sort of hash will be computed against the message content, and that hash checked in a database of hash values of previously identified spam. Also, of course, any file “attached” to the message will be hashed and checked, and will usually be run through anti-virus checks. Often certain extensions will be blocked outright (.exe, .pif, etc.) though this is more common on corporate mail servers than ISPs.
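Two of the checks just described, missing headers and a “Date:” in the future, are easy to sketch with Python's standard `email` package. The function names, the one-hour clock-skew allowance and the sample message are assumptions made for this example. The fingerprint function uses a plain SHA-256 over normalized text purely to illustrate the idea; an exact hash like this is defeated by changing a single character, so treat it as a toy.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

REQUIRED_HEADERS = ("From", "To", "Subject", "Date", "Message-ID")


def header_warnings(raw_message, now=None, max_skew=timedelta(hours=1)):
    """Return a list of header problems a filter might count against a message."""
    now = now or datetime.now(timezone.utc)
    msg = message_from_string(raw_message)
    warnings = [f"missing {h}" for h in REQUIRED_HEADERS if msg[h] is None]
    if msg["Date"] is not None:
        try:
            sent = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            warnings.append("unparseable Date")
        else:
            if sent.tzinfo is None:  # no usable zone given; assume UTC
                sent = sent.replace(tzinfo=timezone.utc)
            if sent - now > max_skew:
                warnings.append("Date is in the future")
    return warnings


def body_fingerprint(body_text: str) -> str:
    """Hash the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


raw = (
    "From: a@example.com\nTo: b@example.org\nSubject: hi\n"
    "Date: Fri, 01 Jan 2100 00:00:00 +0000\nMessage-ID: <1@example.com>\n\nHello\n"
)
print(header_warnings(raw, now=datetime(2024, 1, 1, tzinfo=timezone.utc)))
# ['Date is in the future']
```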
Major ISPs will also pay attention to what the recipient actually does with the message once it is delivered to their Inbox. If lots of users start clicking that your message is spam, that will weigh against you when you connect to send more mail. (Yahoo! and Gmail both are known for this. This will weigh VERY heavily in the overall rules with these ISPs.) Also, if your recipients don’t even bother opening your mail, but delete it without reading, that will affect you negatively (though not as badly as if they are sending it to the spam folder.) If recipients actually open the message, and even better click on your links in the message, that has the opposite effect: both indicate they are engaged and actually want your mail, so it will have a positive effect on your “score” in their filters.
He asked for more information about DKIM:
If the email is DKIM signed, how close is it to 100% not being spam? I would think very high like >90% from a secure domain.
Actually DKIM says nothing about whether or not the email is spam. It only confirms that the email comes from the domain it claims to come from, if that domain publishes a DKIM record. It could still be a spammy domain, or it could be a domain with an otherwise good reputation, with a compromised server that is sending through their legit mail servers. DKIM is about preventing spoofing.
An ISP will have high confidence that the email really did come from the sending domain if the domain publishes both an SPF record and a DKIM record, the connecting server conforms to the SPF record and the receiving server authenticates the DKIM signature on the email with the published DKIM public key.
Keep in mind that most major ISPs still treat both SPF and DKIM as merely advisory though. Gmail, for example, will NOT send mail to the spam folder that only violates SPF and DKIM. As long as there is nothing else “spammy” about the mail, it will go to the Inbox. (Actually I’m not so sure on DKIM. An unsigned email doesn’t violate DKIM, as the standard so far is along the lines of “IF you see a DKIM signature on mail purporting to be from us, here’s the public key”, but does not REQUIRE there to be a DKIM signature, even if they publish a key. Also, mail that is DKIM signed when the domain doesn’t publish a key is also not a violation.)
There is a more complex standard that ties DKIM, SPF and a few other things together, called DMARC, and a deprecated standard called ADSP. Both are standards whereby a sending domain can publish its policy on how mail purporting to come from it should be handled. E.g. in theory a domain can state “If it says it’s from us, it must pass SPF or DKIM for our domain. If it passes neither, you should treat it as spam (or reject it).” However the receiving server is still free to do as it wishes with the email, regardless of DMARC. (i.e., like both SPF and DKIM, it’s treated as advisory.)
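A rough sketch of how a receiver might apply a published DMARC record follows. Under the DMARC specification a message passes when at least one of SPF or DKIM passes for a domain “aligned” with the “From:” domain; otherwise the domain's requested policy (`p=none`, `quarantine` or `reject`) comes into play. Big assumption: real relaxed alignment compares organizational domains using the Public Suffix List, which this sketch crudely approximates as “same domain, or one is a subdomain of the other”. The function names and the example record are invented, and tags such as `sp`, `pct`, `adkim` and `aspf` are ignored.

```python
def parse_dmarc_record(record: str) -> dict:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        name, sep, value = part.partition("=")
        if sep:
            tags[name.strip().lower()] = value.strip()
    return tags


def aligned(from_domain: str, auth_domain: str) -> bool:
    """Crude stand-in for relaxed alignment (no Public Suffix List lookup)."""
    a = from_domain.lower().rstrip(".")
    b = auth_domain.lower().rstrip(".")
    return a == b or a.endswith("." + b) or b.endswith("." + a)


def dmarc_disposition(record, from_domain, spf_pass_domain=None, dkim_pass_domain=None):
    """Return "pass", or the policy the From: domain asks receivers to apply.

    spf_pass_domain / dkim_pass_domain are the domains for which SPF or DKIM
    already passed, or None if that check did not pass.
    """
    tags = parse_dmarc_record(record)
    if tags.get("v") != "DMARC1":
        return "no policy"
    for domain in (spf_pass_domain, dkim_pass_domain):
        if domain and aligned(from_domain, domain):
            return "pass"
    return tags.get("p", "none")


record = "v=DMARC1; p=quarantine; rua=mailto:reports@example.com"
print(dmarc_disposition(record, "example.com", dkim_pass_domain="mail.example.com"))  # pass
print(dmarc_disposition(record, "example.com", spf_pass_domain="bulk.example.net"))   # quarantine
```

The value returned on failure is only what the domain owner requests; as noted above, the receiving server remains free to do something else.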
To sum up SPF, DKIM and DMARC:
SPF and DKIM authenticate a message: it really was sent by the domain it says it was sent by.
DMARC repudiates messages, indicating the message was not sent by the domain it claims to be from.