ExchangeDefender LoadBalancer

One of the most pressing issues our customers faced was intermittent mail delivery delays. These delays affected only about 12% of the nodes; however, given the sheer volume of mail processed by our inbound network, that 12% inevitably caused our staff and customers considerable inconvenience.

As more and more companies come to depend on email as the main channel of communication for their business, mail delivery time becomes a major factor when choosing a hosting vendor. Because of the critical need for prompt delivery, we had to get ahead of our own growth quickly and produce an immediate solution.

The Unforeseen Weakness in Round Robin

After the ATS power outage earlier this quarter, we were forced to reevaluate our solutions across the board and make drastic improvements. One of the areas that desperately needed a redesign was our inbound mail load balancer.

In the early days of ExchangeDefender we used a round robin based load balancer. In short, the MX records for ExchangeDefender clients point to both our Dallas data center and our Los Angeles data center. Once an SMTP connection hit either data center, it was forwarded to a random inbound node in the virtual server list.

Picture1
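To make the weakness easier to see, here is a minimal Python sketch of the original routing as described above. It is an illustration only, not our production code: the site and node names are made up, and the random choice of data center stands in for whichever MX record the sending server happened to use.

```python
import random

# Hypothetical site and node names, for illustration only.
SITES = {
    "dallas": ["dal-in-1", "dal-in-2", "dal-in-3"],
    "los_angeles": ["lax-in-1", "lax-in-2", "lax-in-3"],
}

def route_connection(sites, rng=random):
    """Pick a data center, then a random inbound node inside it."""
    site = rng.choice(sorted(sites))   # stands in for the MX record the sender used
    node = rng.choice(sites[site])     # any node in the virtual server list
    return site, node

site, node = route_connection(SITES)
print(f"New SMTP connection sent to {node} in {site}")
```

Nothing in this selection looks at how many connections a node is already holding, so a node that is stuck with long-running conversations keeps receiving new ones at the same rate as an idle node.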

Until earlier this year, the round robin design worked quite well. However, as the number of messages being processed grew, so did the delivery delays. We started to notice that the load balancer, which had previously balanced connections well, was no longer balancing at all. Day after day, we saw some inbound servers holding upwards of 200 concurrent connections, while more than half of the other inbound nodes in the same data center had no connections at all.

The biggest issue preventing the round robin configuration from working was the randomized assignment of both the data center and the inbound server that would receive each connection. To begin to tackle the issue, we had to re-evaluate the entire load balancing solution, because round robin assignment takes no account of how busy each node already is. We switched our load balancers to a weighted least connection routing scheme. Upon activation it seemed to balance connections a bit better than round robin. Nearly an hour after activation, however, we saw large queues building up on a few inbound nodes.
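For readers unfamiliar with the term, weighted least connection routing can be sketched roughly as follows. This is a generic Python illustration of the technique, not the configuration of our load balancers; the `Node` class, the weights, and the node names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    weight: int                 # relative capacity; higher means more capable
    active_connections: int = 0

def pick_node(nodes):
    """Return the node with the lowest connections-to-weight ratio."""
    candidates = [n for n in nodes if n.weight > 0]
    if not candidates:
        raise ValueError("no eligible nodes")
    return min(candidates, key=lambda n: n.active_connections / n.weight)

# Hypothetical nodes: in-1 has twice the capacity of the others.
nodes = [Node("in-1", 2, 10), Node("in-2", 1, 4), Node("in-3", 1, 8)]
chosen = pick_node(nodes)
chosen.active_connections += 1
print(chosen.name)
```

The scheme only knows about open connections. It cannot tell a node that is quietly working through a large queue from one that is idle, which is consistent with the queue build-up we saw.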

Brand New Logic

To completely resolve the issue, we had to introduce additional logic to the load balancer. The recurring issue we faced was the basic nature of SMTP. A single “message” in transit could account for four concurrent open connections, so the raw connection count cannot be relied upon for load balancing. We then decided to leverage our queue reporting service, which reports the number of queued messages and open conversations with unique IPs. Finally, we created a PHP script that runs on the load balancer and splits itself to check the connection counts across all nodes every 10 seconds. We used a very simple formula for load balancing:

if (($numActiveServersInSite >= ($numServersInSite / 2)) && ($nodeConnections >= ($nodeAverageConnectionCount + 50)) && ($nodeConnections <= $highThreshold)) { /* shut off new connections to this node */ }

  Picture2

In non-code terms, we now count the servers in each site (data center). If more than half of the servers in a site are offline, the load balancer will no longer shut off new connections to nodes in that site. Otherwise, if a node's current connection count is either greater than the high threshold or at least 50 connections above the average for the servers in the site, the load balancer shuts off new connections to that node.
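The rule can be sketched in Python roughly as follows. This is an illustration, not the PHP script itself. It follows the plain-language description, in which exceeding the high threshold triggers a shutoff; the one-line formula compares against the threshold the other way around, so treat that part as an assumption. The function names, the node names, and the threshold value of 300 are hypothetical.

```python
from statistics import mean

def should_shut_off(node_connections, site_connections,
                    active_servers, total_servers,
                    high_threshold, margin=50):
    """Decide whether to stop sending new connections to one node."""
    # Safety valve: with more than half the site offline, never shut off a node.
    if active_servers < total_servers / 2:
        return False
    average = mean(site_connections)
    return (node_connections > high_threshold
            or node_connections >= average + margin)

def nodes_to_shut_off(site, high_threshold):
    """site maps node name -> connection count, or None if the node is offline."""
    active = {name: c for name, c in site.items() if c is not None}
    if not active:
        return []
    counts = list(active.values())
    return [name for name, c in active.items()
            if should_shut_off(c, counts, len(active), len(site), high_threshold)]

# One polling pass over a hypothetical site; the threshold value is made up.
site = {"in-1": 210, "in-2": 12, "in-3": 8, "in-4": 10}
print(nodes_to_shut_off(site, high_threshold=300))
```

In this example the site average is 60 connections, so only in-1, at 210, crosses the average-plus-50 line. A scheduler would repeat a pass like this every 10 seconds and allow new connections again once a node drops back under the limits.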

The End Result

After implementing the new load balancer algorithm, we saw drastic improvements in both connection balance and delivery times for inbound messages. The most notable improvement came on September 10th, when we processed 2.8 million messages, saw only minuscule delays across the board, and received zero complaints about delivery delays.

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com