ExchangeDefender USA cloud outage incident report

August 11, 2011



Dear Own Web Now Partners & Clients,

At approximately noon yesterday, the data center hosting the hub of our US operations suffered a major power failure that interrupted all network services. Our more redundant services, such as ExchangeDefender, resumed operations immediately, while others started recovering through emergency systems around 5 PM EST. We are incredibly sorry for the impact this has had on your operations, and we promise not just to compensate you through service credits but also to enhance service redundancy so that this never happens again.

On behalf of the whole ExchangeDefender team, I am sorry we put you into this situation. Much like you, we spent part of yesterday in the dark, with no clear ETA on when the services would be restored and no explanation of how this failure could have happened in the first place.

The facility that houses our main US operations boasts N+2 redundancy and over 30 days of generator fuel, and it is one of the largest and most reliable in the world. As a matter of fact, we consolidated our central operations here due to the issues we had in California and Florida. The facility has had an incredible service record and has allowed us to provide the same level of service to you. Yesterday the facility experienced a failure in the Automatic Transfer Switch (ATS) equipment, which is designed to switch the power feed from live utility power to the power generators in the event of power loss. This is the piece of equipment that is meant to keep power available. While it is also redundant with A/B feeds, the data center distribution routers were not connected to both power banks. As a result, our equipment remained powered on but network connectivity remained down. This was the technical root of the issue.
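To illustrate the kind of gap described above, here is a minimal Python sketch of an audit that flags equipment plugged into only one of the A/B power feeds. The device names and the inventory layout are hypothetical and do not describe the actual data center inventory; this is a sketch of the idea, not the facility's tooling.

```python
# Hypothetical inventory: device name -> set of power feeds it is plugged into.
# Names and feed assignments are illustrative, not the real facility layout.
INVENTORY = {
    "distribution-router-1": {"A"},
    "distribution-router-2": {"A"},
    "customer-rack-12": {"A", "B"},
}

# Every device is expected to draw from both feeds.
REQUIRED_FEEDS = {"A", "B"}


def single_fed_devices(inventory):
    """Return sorted names of devices missing at least one required feed."""
    return sorted(
        name for name, feeds in inventory.items()
        if not REQUIRED_FEEDS.issubset(feeds)
    )


def survives_feed_loss(inventory, lost_feed):
    """Map each device to True if it still has power after losing one feed."""
    return {name: bool(feeds - {lost_feed}) for name, feeds in inventory.items()}


if __name__ == "__main__":
    print(single_fed_devices(INVENTORY))
    print(survives_feed_loss(INVENTORY, "A"))
```

In this toy inventory, losing feed A leaves the customer rack powered while both routers go dark, which mirrors servers staying on while the network was unreachable.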

Thankfully, a spare ATS was available, and the utility, data center and supporting vendors were all on site within the hour and completed the replacement within 5 hours of the service interruption. We did our best to keep everyone informed of everything we knew every step of the way through our Facebook page (http://www.facebook.com/ExchangeDefender) and our Twitter accounts @xdnoc and @ExchangDefender.

Our operating procedures also call for the use of emergency failover systems should the primary systems be down for more than 4 hours. At roughly 3:30 PM we began restoring services to our web sites and redundant Exchange clusters, and we continued restoring services well into the night as the data center facility restored full operations.

ExchangeDefender inbound service was not affected by this incident, as it’s massively redundant across multiple data centers. However, the disruption to the major central control in Dallas effectively flooded the failover sites, and some of our partners reported email delays ranging from minutes to as long as two hours. Worse, our Exchange hosting clients were impacted for 4-6 hours, and with ExchangeDefender LiveArchive unavailable to back them up, the service completely failed them. Again, we are sorry for this issue and will address it immediately.

At approximately 10 PM EST I held a webinar for all our partners to explain in detail what happened, how we responded, what we learned and what we intend to do to fix it going forward. You can watch the recording through the webinar link under Resources below (requires the GoToMeeting codec).

Going Forward

First of all, we will be providing service credits to everyone who was affected, which includes our entire USA client base.

Second, we will begin deployment of redundant control systems for ExchangeDefender: placing additional admin servers across our failover sites, adding more capacity to the existing ones and, most importantly, providing geographic redundancy to ExchangeDefender LiveArchive.

Finally, we will be adding redundancy to our Exchange 2010 networks in the USA.

Expect to see major changes this quarter. While this issue has never occurred before and we don’t expect it to occur again, we have learned the hard way that we need to greatly improve certain areas of the product, in particular LiveArchive.

Personally, I put my name and reputation on the product and on the service we deliver. I believe we are the best and that the solutions we offer in LiveArchive and ExchangeDefender are, feature-wise, without comparison. We will make sure that all of the features, not just the inbound mail processing, live up to the 100% uptime expectation that you should have and that we have maintained on our inbound service for the past decade.

My staff has worked tirelessly throughout the day and night to keep you informed and restore service as fast as possible. I want to personally thank you for your professionalism and the way you treated us during this difficult time. While it’s easy to lose composure and patience when services are down and there is limited visibility/ETA on resolution, almost universally the comments included “Well that sucks but I’m glad it’s you dealing with this and not me.” While I appreciate it, I do feel we failed you.

Our operations will remain in Dallas at the existing data center facility because, simply put, they are the best. The power incident was the first one that we’ve experienced in nearly a decade of working with them, and it is the kind of issue that Microsoft, Google and Amazon experience on a weekly basis. Cloud services are about providing an affordable IT solution through massively scalable equipment that is incredibly complex. It is neither foolproof nor easy to fix when it goes down, but the benefit is that you and your clients are not on the hook for a repair bill, equipment or the amount of manpower required to manage it all. Best of all, problems such as those experienced yesterday can be minimized, and we will begin to work on that today.

Again, I apologize for the inconvenience this has caused you and your clients. I know we are fortunate to earn your business and the trust you put in us and in our features that are designed to keep you up and running when your systems are affected. I’m attaching some resources that you can pass on to your staff or your clients as you see fit. I look forward to talking to everyone who was affected by this, and while I work my way through the messages, emails and callback requests, I hope the videos and resources we have provided so far offer some clarity as to what happened and what we intend to do next.

Thank you for your business.

Sincerely,

Vlad Mazek, MCSE

CEO Own Web Now Corp

Resources:

ExchangeDefender NOC: http://www.ownwebnow.com/noc

XD Twitter Feed: http://www.twitter.com/xdnoc

ExchangeDefender Twitter Feed: http://www.twitter.com/exchangdefender

Partner Webinar about the incident: http://www.ownwebnow.com/media/Cloudpocalypse.wmv