The Database Availability Group is Supposed to be Completely Fault Tolerant…

Earlier this week we created a NOC entry/notification for partners about a maintenance interval we scheduled for ROCKERDUCK. The entry outlined an issue we faced where one database (DB7) was located on the logs drive instead of the DB drive, along with our proposed outline of the work to be completed. Unfortunately, because the issue affects all database copies, correcting it would involve reducing DB7 to a single mailbox server, moving the database (which would take DB7 offline), and then re-seeding the copies to all passive nodes.

Shortly after posting the NOC entry I received an email from a partner demanding that I explain to them why the Database Availability Group (DAG) could not prevent service interruption for users on DB7.

So why doesn’t the DAG protect against every possible event?

Simply put: all servers in a DAG must be identical in terms of the storage locations for databases and logs. Within a DAG, only one copy of a mailbox database can be the “active” database at any given time; all other copies on other nodes are purely passive “database copies” that can be switched over to become the active/primary database.
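For illustration, here is roughly how that looks from the Exchange Management Shell; DB7 is real, but the commands are a sketch and any server names are placeholders rather than our actual ROCKERDUCK nodes:

    # The EDB file and log folder paths are properties of the database itself,
    # so every copy on every node uses the exact same paths.
    Get-MailboxDatabase DB7 | Format-List Name,EdbFilePath,LogFolderPath

    # Only one copy reports "Mounted" (active); the rest report "Healthy" (passive).
    Get-MailboxDatabaseCopyStatus -Identity DB7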

In the case of moving the database path, we cannot switch the current active database over to a passive node, move the DB, and then switch it back to the original primary, as this would break the DAG and leave us with split copies of ‘active’ data. We cannot use passive copies to keep service online while we physically modify the database properties/layout of the ‘active’ copy.

If this were a case where the active copy of a database experienced a failure, or there were a network communication issue, the DAG would mount a passive copy of the database and continue providing service to users.
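That kind of failover can also be triggered manually as a switchover, and it is a single online operation. As a rough sketch from the Exchange Management Shell, with a placeholder server name (MBX2):

    # Activate the passive copy of DB7 on another DAG member; users stay online
    # because an identical passive copy is simply mounted on that node.
    Move-ActiveMailboxDatabase DB7 -ActivateOnServer MBX2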

So all this jibber-jabber means what?

In short, we would remove all copies of DB7 across all nodes except the primary node. After all copies are removed, we would move DB7 to the proper location and then remount the database. Based on the size of the database, we estimate service would be interrupted for about 10-15 minutes. Finally, after the move completes, we would re-add the database copy on each node and bring service back to full redundancy. A fifteen-minute outage is unfortunately a necessary evil to provide an overall more redundant solution to our partners and their clients.
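For those curious about the mechanics, the plan translates roughly into the Exchange Management Shell steps below. Server names and drive letters are placeholders for illustration, not the exact commands that will run against ROCKERDUCK:

    # 1. Remove the passive copies of DB7 from every node except the active one.
    Remove-MailboxDatabaseCopy -Identity "DB7\MBX2" -Confirm:$false
    Remove-MailboxDatabaseCopy -Identity "DB7\MBX3" -Confirm:$false

    # 2. Move the database file to the correct drive. The database is dismounted
    #    for the duration of the move and remounted when it completes -- this is
    #    the 10-15 minute outage window.
    Move-DatabasePath -Identity DB7 -EdbFilePath "D:\DB7\DB7.edb"

    # 3. Re-add a copy of DB7 on each passive node; Exchange re-seeds each copy
    #    from the active database, restoring full redundancy.
    Add-MailboxDatabaseCopy -Identity DB7 -MailboxServer MBX2
    Add-MailboxDatabaseCopy -Identity DB7 -MailboxServer MBX3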

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com