Over my next two blog posts I will provide an overview of the failover procedures for Rockerduck and what clients should expect should a failover occur. This post covers the back-end process and the factors that influence whether we activate our failover procedure. The next post will review the client experience once an outage occurs, the failover itself, and the recovery.
First, let’s qualify the difference between an “issue” and an “outage”. Issues are typically minor inconveniences or temporary unavailability, such as a router reboot, a brief power outage, or a network ‘blip’. Outages/failures can occur outright or can develop from a minor issue. As a rule of thumb, if service is expected to be impacted for more than an hour, we consider the situation to require a failover. Our failover procedure is not automated, as we’ve elected to run the Database Availability Group (DAG) for Rockerduck in DAC (Datacenter Activation Coordination) mode. When a DAG runs in DAC mode, the secondary data center must be manually activated to mitigate an outage. This prevents ‘split-brain syndrome’, where both data centers concurrently activate the same mailbox database.
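The one-hour rule of thumb above can be sketched as a simple decision function. This is purely illustrative; the function and threshold names are hypothetical and not part of our actual tooling:

```python
# Illustrative sketch of the issue-vs-outage rule of thumb described above.
# The name and threshold constant are hypothetical, not actual tooling.

FAILOVER_THRESHOLD_MINUTES = 60  # impact longer than an hour warrants a failover

def requires_failover(expected_impact_minutes: int) -> bool:
    """Return True when the expected service impact exceeds one hour."""
    return expected_impact_minutes > FAILOVER_THRESHOLD_MINUTES

# A brief router reboot is an "issue"; a multi-hour network loss is an "outage".
print(requires_failover(10))   # brief blip -> False
print(requires_failover(180))  # extended outage -> True
```

The point of the threshold is that short blips resolve themselves faster than a manual failover could complete, which is exactly the risk discussed next.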
There is a very specific reason we do not activate our failover procedure for minor issues: the failover procedure is inherently risky, and it can lead to longer downtime if the issue resolves before the failover completes or if an unforeseen event occurs mid-failover.
For instance, imagine that our Dallas data center has a network issue and goes completely offline from the internet. Before receiving complete details on the outage from our data center, we decide to activate our Los Angeles data center. During the process of activating the LA copies, we switch DNS records to point away from Dallas to Los Angeles. Shortly after we modify DNS, imagine that Dallas comes back online and tries to take back control of the DAG (as communication was only lost to the internet). Dallas would then control the DAG databases while our entry-point records point to Los Angeles. This would yield poor results for clients, as their requests would be proxied through LA back to Dallas.
So what really goes on during a failover?
Once we determine that an issue requires activation of our failover procedure, we immediately notify partners of the failover activation. Before any changes are made, we review the health of our Los Angeles network and servers to ensure the failover will be stable. Once everything passes this review, we perform the following steps:
Step 1 – Modify cas.rockerduck.exchangedefender.com to point to the IP for cas.la.rockerduck.exchangedefender.com (TTL 5 Minutes)
Step 2 – Stop services on all Dallas mailbox servers
Step 3 – Restore DAG quorum in California
Step 4 – Mount databases in California
Step 5 – Modify inbound.rockerduck.exchangedefender.com to point to the multihomed MX record for Rockerduck LA (TTL 5 Minutes).
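The steps above can be modeled as an ordered runbook that refuses to start unless the standby site passes its health review. This is a hypothetical sketch for illustration only; as noted earlier, the actual failover is performed manually, not by automation:

```python
# Hypothetical sketch of the manual failover runbook described above.
# Step descriptions mirror the blog post; the health gate is a stand-in
# for the manual review of the LA network and servers.

FAILOVER_STEPS = [
    "Point cas.rockerduck.exchangedefender.com at the LA CAS IP (TTL 5 min)",
    "Stop services on all Dallas mailbox servers",
    "Restore DAG quorum in California",
    "Mount databases in California",
    "Point inbound.rockerduck.exchangedefender.com at the LA MX (TTL 5 min)",
]

def run_failover(la_site_healthy: bool) -> list:
    """Walk the runbook in order, but only if the standby site is healthy."""
    if not la_site_healthy:
        raise RuntimeError("Aborting: LA network/servers failed the health review")
    executed = []
    for step in FAILOVER_STEPS:
        executed.append(step)  # in reality each step is carried out by hand
    return executed

completed = run_failover(la_site_healthy=True)
print(len(completed))  # all five steps, in order
```

Ordering matters here: DNS is cut over first so client caches start expiring while the databases are being moved, and mail routing is switched last, after the California copies are mounted and ready to accept delivery.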
By keeping the TTL on cas.rockerduck.exchangedefender.com at 5 minutes, clients should automatically connect to the California data center and resume service without any modifications on their end. By the same token, mail flow should automatically queue up in ExchangeDefender, and once the DNS records update, queued and new mail should deliver to Rockerduck LA.
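To see why the short TTL matters, consider the worst case: a well-behaved resolver that cached the record just before the cutover may serve the stale Dallas address for up to one full TTL before re-resolving and learning the LA address. A quick back-of-the-envelope sketch:

```python
# Back-of-the-envelope: worst-case time for clients to follow a DNS change.
# A well-behaved resolver caches a record for at most its TTL, so a client
# that resolved just before the cutover can hold the stale Dallas IP for up
# to one full TTL before asking again.

def worst_case_stale_minutes(ttl_seconds: int) -> float:
    """Upper bound, in minutes, that a resolver can serve the old record."""
    return ttl_seconds / 60

print(worst_case_stale_minutes(300))    # 5-minute TTL -> 5.0 minutes
print(worst_case_stale_minutes(86400))  # 1-day TTL    -> 1440.0 minutes
```

With a more typical day-long TTL, clients could keep hitting the dead Dallas endpoint for hours; the 5-minute TTL bounds the reconnection window to roughly the length of the failover itself.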
VP Network Operations, ExchangeDefender
(877) 546-0316 x757