First off, we’d like to apologize for the issues on LOUIE earlier today, and we appreciate your patience with us while we worked on the problem.
Earlier today we experienced an issue with one of the components of the Database Availability Group (DAG) cluster for LOUIE. The intermittent connectivity was tied to the Windows Cluster service randomly stopping in a round-robin fashion across the LOUIE DAG network.
Generally this is tied to things like network IP conflicts, network latency, or simultaneous failure across multiple members of the Cluster. With our deployment, none of these was a possibility: no new hardware was added recently, the DAG network is handled via two VLANs connected with a dedicated VPN, and all members of the DAG network were online during the problems.
It took some time to parse the logs for the cluster, and it appears that the NIC used by the Replication network for LOUIEMBOX7 (we segregate Replication and Client Access) was causing other interfaces on that range to unregister and then re-register. Since we have multiple live copies of the Mailbox Databases, we decided to remove the Node from the Cluster outright to stabilize the replication network. This took care of all the connectivity issues.
To conclude the maintenance window tonight, we replaced the motherboard and network components on the server. Replication across the entire network has been re-established and is stable. In addition, to help keep this from happening in the future, we’ve added an exception in monitoring for this type of error.
We would like to take this opportunity to remind everyone about LiveArchive. It’s for issues like these that we make it available in the Outlook Banner via our add-in, available here.
One of the features that differentiates our platform is that, on top of the uptime promise, we include Business Continuity. The ability to continue working is literally one click away inside Outlook.