On 7/20/2011 around 3:35 PM Eastern we started experiencing random packet loss across various services including Hosted Exchange and OWN Websites. Roughly around 3:45 PM Eastern, the random packet loss turned into a wide-spread service outage and lasted until 4:12 PM Eastern.
The incident appears to be faulted network driver on a Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately the internal floods prevented access to the network analytic servers behind the DMZ. Since all machines received and responded to the request, all machines showed up as ‘flooding’ to the router and IDS was unable to determine the ‘source’ IP.
All services were essentially taken offline when the IDS started blocking traffic from the internal hosts. After we disabled the offending machine from the network and cleared IDS we were able to resume service across the board.
The biggest area of concern was the inability to contact us as the outage was occurring as the outage took down our support board and primary phone lines. We deeply apologize for the grief and trouble that this unexpected event caused and without saying, this has been the most impacting network event that we’ve experienced. We’ve implemented a new redundancy plan to our phone systems to handle global outages as this was the first time our phone systems were completely offline during a critical event.
We appreciate everything that our partners do for us and the patience that was extended yesterday as we definitely know that it was a very stressful event for our partners and their end users. As always we will continue to bring improvements to our solution stacks and address the areas where we may fall short.