How our SPAM detection systems work

February 22, 2013

ExchangeDefender General

ExchangeDefender SPAM protection is a layered process that includes many components which together determine if a message is SPAM or not. While there are too many layers to list and the combination of what we use changes on an hourly basis, we wanted to share some detail of how our process works so you can understand it and explain it if necessary.

RBL – We use several free, commercial and our proprietary realtime blacklists that contain IP addresses of known spammers. These IP addresses typically belong to dialup, cable and DSL/fiber clients that shouldn’t be running a mail server, compromised or infected mail servers, unused IP ranges, and compromised workstations and devices. These IP addresses account for the bulk of all SPAM mail and most of that mail will never even make it inside of the ExchangeDefender network.
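
As an illustration of the RBL idea, here is a minimal sketch of how a realtime blacklist is commonly queried. It assumes the list is published over DNS (the usual DNSBL convention: reverse the octets of the IP and prepend them to the list's zone). The zone rbl.example.org and the function names are hypothetical, and this is not necessarily how the ExchangeDefender gateways do it.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNS-based list."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolver=socket.gethostbyname):
    """Return True if the blacklist zone has a record for this IP.

    By DNSBL convention a listed address resolves (usually to 127.0.0.x)
    and an unlisted one fails to resolve.
    """
    try:
        resolver(dnsbl_query_name(ip, zone))
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


# Hypothetical zone; 192.0.2.10 is a documentation-only address.
print(dnsbl_query_name("192.0.2.10", "rbl.example.org"))
```

Because the check is a single DNS lookup, it can be done while the sending server is still connecting, which is why mail from listed addresses can be refused before it enters the network.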

Reputation blacklist – We use commercial and our proprietary reputation blacklists that indicate how trustworthy the IP address is – have they sent us a lot of SPAM in the past? Are they suddenly sending us hundreds of thousands of messages when they only accounted for 10 in the past month? Do they have legitimate DNS and reverse DNS records?
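
The sudden-volume question can be sketched as a simple ratio test. This is only an illustration: the function name and the 100x ratio are assumptions made for the example, not values taken from our reputation system.

```python
def is_volume_spike(current_count, past_month_count, ratio=100.0):
    """Flag a sender whose current volume dwarfs its history.

    `ratio` is a hypothetical threshold chosen for illustration.
    """
    baseline = max(past_month_count, 1)  # avoid dividing by zero for new senders
    return current_count / baseline >= ratio


# A sender that accounted for 10 messages last month and now sends 200,000.
print(is_volume_spike(200_000, 10))
```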

URL blacklist – We use several commercial and our proprietary URL blacklists that identify web site addresses used in previously confirmed SPAM. If you’ve ever received a SPAM message you know that it had an external image or a link to a web site: we look at all web site links in an email and compare them to a list of known SPAM site targets.
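
A minimal sketch of the URL check described above: pull every link out of the message body and compare its host name against a set of known SPAM hosts. The regular expression, the function names and the example hosts are all assumptions for illustration; a production scanner would also have to handle obfuscated links and redirects.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: http(s) links up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(body):
    """Return the set of lower-cased host names linked from a message body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:  # malformed URL, e.g. unbalanced IPv6 brackets
            continue
        if host:
            hosts.add(host.rstrip(".").lower())
    return hosts


def blacklisted_hosts(body, blacklist):
    """Return linked hosts that equal, or are subdomains of, a blacklisted host."""
    return {
        host
        for host in extract_hosts(body)
        if any(host == bad or host.endswith("." + bad) for bad in blacklist)
    }


body = 'Click <a href="http://cheap-pills.example/buy?id=1">here</a> today.'
print(blacklisted_hosts(body, {"cheap-pills.example"}))
```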

Distributed checksums – We use several commercial and proprietary statistics models (warehouses) to determine the likelihood of bulk mail. Because the only way spammers can distribute mail efficiently is through massive blasts, nearly all message bodies in a blast are identical. Each message body produces a signature (an MD5 checksum) that can be compared with the signatures of other messages, and when identical checksums are found it’s more likely that bulk mail is being sent.
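
The checksum comparison can be sketched in a few lines. The MD5 signature comes from the description above; the normalization step (lower-casing and collapsing whitespace), the function names and the threshold are assumptions added for the example.

```python
import hashlib
from collections import Counter


def body_checksum(body):
    """MD5 signature of a message body after a simple, assumed normalization."""
    normalized = " ".join(body.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def bulk_checksums(bodies, threshold):
    """Return the checksums seen at least `threshold` times across the bodies."""
    counts = Counter(body_checksum(body) for body in bodies)
    return {digest for digest, seen in counts.items() if seen >= threshold}


bodies = [
    "Win the LOTTO now",
    "win the   lotto now\n",
    "Quarterly report attached",
]
# The first two bodies normalize to the same text, so they share one checksum.
print(len(bulk_checksums(bodies, threshold=2)))
```

An exact checksum only matches identical bodies, which is one reason this check is a single layer among several rather than a verdict on its own.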

Proprietary header checks – We use proprietary header and message checks to determine if the message is a part of an existing conversation between a third party and our own client. We also check if messages have been spoofed or if they have made their way through several gateways, if the language in the message does not match the language of the machine it was sent from, etc.

SPAM keyword and heuristic checks – We use a wide array of SPAM characteristic checks that take into account the size of the message, subjects, fonts and images. For example, legitimate email typically doesn’t come without a subject or with a subject that contains a lot of special characters, and it doesn’t come without a person’s name in the From line or with other “weird stuff”. The “weird stuff” category is so wide and so contextual that it takes most of the time to process.

Now that you are familiar with some of our processes you’re probably getting the idea that SPAM filtering is very similar to the way virus scanning works – we use patterns from known SPAM messages and existing spammers to build a statistical model that tells us if a message is SPAM or not. It also explains why certain messages are delivered instantly while others may take up to a minute or longer to process (delayed header checks, a suspicious web site in the email, conversation thread checks, temporary DNS failures or large attachments that require scanning).

We build most of our proprietary infrastructure based on your feedback – we look at the pattern of messages that you release which builds a model that adjusts scores along the way specifically for you – if you release a TON of messages with impotency drugs you are more likely to receive a Viagra message than another user that only releases financial newsletters. This is why your feedback through ExchangeDefender Outlook Addin is so valuable to us. When you hit “Report SPAM” that message is dispatched to us and reviewed by a live human being that generates a scan rule to eliminate that specific message in the future.

The ExchangeDefender SPAM engine uses third-party scanning engines with realtime data feeds, and our own proprietary engine is updated hourly.

That may seem excessive but keep in mind that spammers adapt their message content for each batch – changing the subject, the web site and the spacing every few thousand messages. For us to keep up, both the realtime updates and the constant reengineering of the SPAM engine itself are crucial to eliminating as much of the annoyance as possible.

How come SPAM messages slip through?

There is no such thing as a “slip through” when it comes to SPAM: all mail that passes our virus scanning and is not sent by a sender on an RBL is considered legitimate until proven otherwise. Tens of thousands of checks (run within a split second) then calculate a score that identifies whether the message is SPAM (90% confidence) or SureSPAM (99.9%). There is no person on the gateway reading each message; the score is assigned by the computer based on the statistical model. So even if you looked at the message and could clearly tell it’s SPAM, the artificial intelligence is not quite there yet.
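
To show why a message that looks like obvious SPAM to a person can still be delivered, here is a minimal sketch of the final decision. The 90% and 99.9% thresholds come from the paragraph above; the function name and the idea of a single confidence value between 0 and 1 are simplifying assumptions.

```python
SPAM_THRESHOLD = 0.90       # "SPAM (90% confidence)"
SURESPAM_THRESHOLD = 0.999  # "SureSPAM (99.9%)"


def classify(confidence):
    """Map a model confidence (0.0 to 1.0) to a delivery category."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= SURESPAM_THRESHOLD:
        return "SureSPAM"
    if confidence >= SPAM_THRESHOLD:
        return "SPAM"
    # Legitimate until proven otherwise.
    return "legitimate"


# A message the model is only 89% sure about is delivered as legitimate mail.
print(classify(0.89))
```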

How come legitimate messages end up in SPAM quarantines?

Bad scanning methodology. For example, say you’re receiving a lot of “lotto” SPAM. Someone at ExchangeDefender may adjust your model to get really aggressive when it encounters the word “lotto” in the subject or headers. It works flawlessly, until you are sent a message by your business partner and the message is also copied to Lisa Lotto. This is clearly an oversimplification but it happens.

How come updates happen every hour and why does it take so long sometimes to make changes?

Most of the implied delay is due to the size and the scope of the ExchangeDefender network – to protect millions of people we have a very large network and a very sophisticated layered infrastructure. Smaller updates are done very frequently (within an hour) but are staggered (happening at different minutes of the hour) because during engine reloads individual nodes cannot process inbound mail. If we restarted the engine on every node at the exact same time we’d basically shut the network down for a few seconds every hour.
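
One simple way to stagger reloads is to spread the nodes evenly across the 60 minutes of the hour. The sketch below is an assumption about how such a schedule could be built, with hypothetical node names; it is not our actual scheduler.

```python
def stagger_reloads(nodes):
    """Assign each node a minute of the hour (0-59), spread as evenly as possible."""
    ordered = sorted(nodes)  # stable order so the schedule is repeatable
    return {
        node: (index * 60) // len(ordered)
        for index, node in enumerate(ordered)
    }


# Hypothetical node names.
print(stagger_reloads(["mx3", "mx1", "mx4", "mx2"]))
```

With four nodes the reloads land at minutes 0, 15, 30 and 45, so at most one node is unable to accept mail at any moment. With more than 60 nodes some would share a minute, and a real schedule would also need to keep nodes in the same region or cluster apart.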

Some large-scale changes (when we add a plugin or change our model analysis, whitelists, blacklists or detection procedures) can take on the order of weeks or months. Typically we don’t deploy new software across the whole network at once – it is done in stages to eliminate anomalies in the deployment. Working with large scale distributed systems is different than managing individual servers, clusters or even networks: changes to a globally distributed network with linked load balancers (while great for redundancy and service availability) are a challenge that requires careful monitoring and rollout procedures to minimize destabilization across the whole network.

We get hack attempts and DDoS attacks thousands of times an hour – the systems have an automated process for dealing with that activity – so when we roll out new software that makes the network overload quickly (and makes it think it’s under attack) we have a big mess on our hands. This is why the rollouts are staggered and have a procedure that is at times excruciatingly long (especially if you’re sitting in the NOC watching the queues and your blood pressure goes up with every uptick in the percent utilization).

How come whitelists don’t always work and I have to whitelist the whole domain?

This is an inconsistent behavior that is reported from time to time and it has to do with the sender’s email address and client software. ExchangeDefender looks at the From line (the same one you see in Outlook) and whitelists only the sender’s actual address. This works 99.9999% of the time.

Some senders use impersonation, send on behalf of a different email address, or use specific taglines in the From address that end up being randomly generated for each email. While Outlook may show the message as coming from Vlad Mazek <vlad@ownwebnow.com>, the message itself may have come From: vlad+2381ekr259@ownwebnow.com, with a fingerprint used to track read receipts.

Long story short, if you have a person that is using these systems or a CRM to alter the message for tracking/sales/marketing purposes, you can’t trust their From line for a whitelist; you will have to whitelist their whole domain. Good news is, this is typical for smaller domains and you’ll never be whitelisting all of @aol.com or @gmail.com.
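
The difference between the two kinds of whitelist entry can be sketched as follows. The addresses are the ones from the example above; the function name and the data structures are hypothetical and only illustrate the matching logic.

```python
from email.utils import parseaddr


def is_whitelisted(from_header, addresses, domains):
    """Check a From header against exact-address and whole-domain whitelists."""
    _display_name, address = parseaddr(from_header)
    address = address.lower()
    if "@" not in address:
        return False  # nothing usable in the From line
    if address in addresses:
        return True
    domain = address.rpartition("@")[2]
    return domain in domains


exact = {"vlad@ownwebnow.com"}
tagged = "vlad+2381ekr259@ownwebnow.com"

# The exact-address entry misses the tagged address; a domain entry catches it.
print(is_whitelisted(tagged, exact, set()))
print(is_whitelisted(tagged, set(), {"ownwebnow.com"}))
```

Because the tag after the plus sign changes with every message, no single address entry can ever match it, which is why the domain entry is the practical fix.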

How come there are delivery delays?

Our support team can help you with that.

Half the time the issue is related to the DNS and the other half is due to the temporary network or Exchange issues on the client side. Almost all of the tests we have done for our clients fall into these two categories and almost all of the things you can do to minimize delays are outlined in our deployment guide.

In the event that the issue is our fault (network congestion, filter failure, virus scanner malfunction, DDoS, routing issue, misconfiguration, etc.), the issue is noted at http://www.exchangedefender.com/noc – please subscribe to the feed, follow @xdnoc or post your cell phone number in our portal. We text, tweet and blog (everything except email, for obvious reasons) to notify you of any ongoing network issue that might impact service.

So much for tech..

Technology is only a piece of the whole puzzle.

This is going to sound cliché but I absolutely mean it 100%: At ExchangeDefender killing SPAM is our passion. Every piece of junk that passes through our network is unwelcome and we have people here around the clock working on keeping it from making its way to your inbox. It’s a human process and it’s a technology process, and the very practical implementation of artificial intelligence that learns and adapts in realtime to fight SPAM. But just like training a dog takes time, training a computer is a challenge that requires consistency and strict implementation of the process, rules, management and monitoring. With millions and millions of messages passing through our gateways, even a slight, seemingly insignificant modification can impact SPAM statistics models. It’s not something that’s “broken” that could be “fixed”, it’s just a process of continuous training that we take very seriously and enjoy very much.

Thank you for your business.

Sincerely,
Vlad Mazek, MCSE
CEO, Own Web Now Corp
vlad@ownwebnow.com
