On February 9th – February 11th 2012 the ExchangeDefender staff performed maintenance on the Rockerduck cluster in which two separate individual ‘outages’ affected client access on the late half of the evenings and early half of the following morning on February 9th -10th and February 10th-11th .
On the eve of Feb. 9th 2012 (~11:30 PM Eastern) we began upgrades on a failed/failing VPN device that is used to connect ROCKERDUCK:DAL and ROCKERDUCK:LA active directory and internal communication between sites. During the upgrade we began to notice random network related events in which communication seemed saturated and sluggish and randomly affected across the entire network. After various attempts (and configurations) to bring the new VPN router online we determined that the new VPN device was occasionally malfunctioning and flooding the network with ‘dead packets’. Unfortunately the massive flood of packets from the VPN device caused the Database Availability Group (DAG) on ROCKERDUCK to lose communication between nodes and eventually lose quorum. Once quorum was lost between nodes all databases between both sites were automatically dismounted as the DAG was considered unhealthy to Exchange. For the next few hours we worked to restore service to RD clients by replacing the failed VPN routers with our backup VPNs (new vendor) and restoring communication with Los Angeles. After communication was re-established clients were able to access their mailboxes. This outage affected all clients and lasted between the hours of midnight and roughly 3:15 AM.
On the eve of Feb. 10th (~10:30 PM Eastern) we began work to finalize the VPN communication by consolidating both VPN devices in California to the one backup vendor VPN device. The reason we elected to replace the ‘working’ VPN device in California was due to the fear of the abnormal workings of the similar VPN device in Dallas. As part of our protocol to ‘down’ a data center in Exchange hosting we paused SMTP services on Rockerduck. After replacing the VPN device in California we resumed all services (including SMTP) and mail resumed normal flow. Around 5:30 AM Eastern we started to receive alerts about back pressured queues in Rockerduck which would amount to delivery delays. Upon investigation it was discovered that the issue was mail delivery between the EDGE server network and the HUB server network on RD. After two hours of investigating the issue internally (and opening a case with Microsoft) we were able to determine that our course of action would be reapplying the SP2 update to the edge networks. Once SP2 was reapplied to all EDGE nodes mail delivery returned on ROCKERDUCK by 9:15 AM Eastern.
Finally there were about 5% of users who were left in a disconnected state through Outlook but had service through OWA (and some through active sync) between Saturday and Sunday as the database their mailboxes were housed was moved to Los Angeles for the content index database in Dallas to rebuild for RDDB9. Service was restored to these users by noon Eastern.
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757