As you may be aware, we had our first network-wide outage in our primary data center (Dallas) back in August, and our redundancy plan did not deliver on its promise. Our CEO went over the details of the outage in his webinar, where we promised to revamp our network and infrastructure so that an event like this would not impact our partners again. Once all services were restored, we began playing the outage backwards to figure out exactly what went wrong and how we can eliminate the possibility of it ever happening again.
The items that floated to the top were LiveArchive’s failure and an unforeseen level of reliance on Dallas by our Los Angeles ExchangeDefender network. Our Exchange Architect and lead engineer, Travis Sheldon, outlines the changes made to our LiveArchive structure to ensure that if Dallas goes lights out, LiveArchive can keep on ticking without missing a beat.
That leaves changes to our mail processing. Once the outage happened, our ExchangeDefender NOC team was overseeing the mail flow going to Los Angeles, but the failure point was the processing speed. The processing delay blew out all of the disaster recovery metrics that had shown us LA could handle the load on its own. The team’s task quickly became finding the delay within the process. In this case, our transparency became our worst enemy: the fact that we log everything, to make it accessible to our partners for their clients, was causing the delay. Our Los Angeles nodes were attempting to write data back to an unresponsive database. Think about that for a second. Every step of the way, for every email we processed, a node was attempting to write data to an unresponsive network. Needless to say, this delay in processing caused all of our queues to grow explosively, which in turn caused massive delays in delivery.
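To get a feel for why a blocked log write at every step hurts so much, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name and every number in it (arrival rate, steps per message, worker count, timeout) are made-up assumptions for illustration, not measurements from our network.

```python
def queue_backlog(arrival_rate, steps_per_message, work_seconds,
                  log_write_seconds, workers, duration_seconds):
    """Estimate how many messages pile up in the queue over a period.

    Simple fluid model: every processing step does some work and then
    waits for a synchronous log write before moving on.
    """
    # Time one worker spends on one message, including the log waits.
    seconds_per_message = steps_per_message * (work_seconds + log_write_seconds)
    # Messages per second that all workers together can finish.
    service_rate = workers / seconds_per_message
    # The queue only grows when mail arrives faster than it is processed.
    growth_per_second = max(0.0, arrival_rate - service_rate)
    return growth_per_second * duration_seconds


# Hypothetical figures: 200 messages/second arriving, 5 logged steps per
# message, 50 workers, observed over one hour.
healthy = queue_backlog(200, 5, 0.02, 0.005, 50, 3600)   # log write takes 5 ms
degraded = queue_backlog(200, 5, 0.02, 2.0, 50, 3600)    # log write blocks for 2 s

print(f"backlog with a responsive log database:    {healthy:,.0f}")
print(f"backlog with an unresponsive log database: {degraded:,.0f}")
```

With these assumed numbers the healthy case keeps up with incoming mail, while the degraded case falls behind by hundreds of thousands of messages in an hour, because each message now waits on the log write at every step.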
The other major part of the outage was our outbound network. At the time, our multihomed smart host cluster was housed primarily in Dallas, with standby nodes in Los Angeles. Here’s the problem: since these nodes weren’t in production, they did not have active access tables, which basically rendered them useless. So now we had our trail of blood and tears showing how and why we failed, but we also had a blueprint for ensuring history didn’t repeat itself. Here is what the LiveArchive side of LA looks like now:
So in addition to the new LiveArchive Exchange 2010 cluster in Los Angeles, here’s what we did:
1. We increased the hardware and bandwidth capacity by 80% in Los Angeles. This brought it on par with our Dallas network, which has in the past handled the peak load on its own with ease and without delays. In addition, we increased our capacity in Dallas by an additional 20% to accommodate future growth without processing delays.
2. We deployed master-to-master replication of all of our core processing databases, so our ExchangeDefender Los Angeles nodes no longer rely on any Dallas resources for mail processing. This closed two big gaps: if there were a Dallas outage, processing speed would not suffer, and our logging would not need to be sacrificed in the name of processing speed.
3. We doubled our outbound capacity in Los Angeles, and those servers are now live instead of standby. This way, if an outage occurs, all routing/access/archiving/encrypting rules are already up to date and ready to go. As an added benefit, this expansion increased our outbound processing speed and capacity by 80%.
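The idea behind item 2 can be pictured with a toy model in Python. This is a simplified sketch, not our database setup: the `SiteDatabase` class, its fields and the sample keys are hypothetical, and conflict handling (what happens when both sites change the same row) is deliberately left out.

```python
from collections import deque


class SiteDatabase:
    """Toy stand-in for one site's copy of a processing database."""

    def __init__(self, name):
        self.name = name
        self.rows = {}          # local copy of the data
        self.outbox = deque()   # changes not yet shipped to the peer
        self.reachable = True   # whether the peer can currently reach us

    def write(self, key, value):
        # Writes always land locally, so they never wait on the other site.
        self.rows[key] = value
        self.outbox.append((key, value))

    def replicate_to(self, peer):
        # Ship pending changes only when the peer is reachable; otherwise
        # keep them queued for the next attempt.
        if not peer.reachable:
            return 0
        shipped = 0
        while self.outbox:
            key, value = self.outbox.popleft()
            peer.rows[key] = value
            shipped += 1
        return shipped


dallas = SiteDatabase("dallas")
los_angeles = SiteDatabase("los-angeles")

dallas.reachable = False                          # simulate a Dallas outage
los_angeles.write("msg-1", "delivered")           # LA keeps logging locally
during_outage = los_angeles.replicate_to(dallas)  # nothing can be shipped yet

dallas.reachable = True                           # Dallas comes back
after_recovery = los_angeles.replicate_to(dallas) # queued change catches up

print(during_outage, after_recovery, dallas.rows == los_angeles.rows)
```

The point of the sketch is that a local write never blocks on the remote site, so processing and logging continue during an outage and the other copy catches up once it is reachable again.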
The lessons we learned from this event will allow us to provide you with better and faster service, so you can in turn deliver the same to your clients. We’ve already stress tested our new infrastructure on multiple occasions with great success, and we’re confident about the future of our solutions.
I have worked with many of our partners through the years and, if you don’t mind, I wanted to share some resources that I find most of our partners are not aware of. As the person who oversees all of our support services, I can tell you that frustration is a part of the game, and we can extend a much better service when you’re plugged into everything that the support side of ExchangeDefender offers. Here are some tips:
1. Make sure all of your employees are reading our NOC site. http://www.exchangedefender.com/noc While the NOC alerts sync up with Twitter, Facebook and our support portals, our NOC site provides a lot of useful information about how to explain the problems that happen. Our clients are aware things will go down from time to time so keeping them in the loop is critical.
2. Make sure all your employees are in the support portal and that the information is correct. Have you ever had the frustrating experience of opening a ticket only to be asked for more information? We find that most tickets that aren’t addressed by us immediately (and our average response time is a topic for a different blog) can drag on and a quick call can often resolve it. We’re in support, we won’t call you to sell you stuff. But if we have a number to call we can figure things out very rapidly.
3. Rely on the portal. https://support.ownwebnow.com is where we live. All of our communications, alerts, NOC and service monitoring run through it, and our staff have it open whenever they are at their desks. Monitors around the office show how many issues are being worked on, what everyone is up to and which changes are being made. While we love to talk to our partners, we have to take your security into account, so make sure that everything goes through there. Yes, we do escalate stuff to Vlad from time to time.
4. Most importantly: Save the info below.
I have been with Own Web Now for years and I often like to tell people that I’ve known Vlad Mazek since before he knew how to speak English. If there is anything on the support side that I can help with, please feel free to contact me. So much for the introductions. I look forward to working with you all.
VP Support Services, ExchangeDefender
(877) 546-0316 x737