We’re investigating an issue with one of the CAS nodes on LOUIE. All metrics currently appear healthy, but we’re looking into end-user reports of erratic connectivity.
[2:10 PM] We’ve tracked the issue down to sluggish Domain Controller (DC) lookups. We’re working to address these as quickly as possible.
[2:41 PM] We’re still working to address the Domain Controller lookup issue for the Outlook clients. This is a courtesy update to let our partners know we’re treating the issue as urgent.
[2:58 PM] We’ve traced the issue to a problem with the HUB role interfacing with the Domain Controller. We’re putting a change through and should have more information in the next 10 minutes.
Please remember that LiveArchive is accessible.
[3:06 PM] There are reports of mail delivery delays, which unfortunately are to be expected as clients begin to reconnect and mail is pushed from local clients. We are also going to restart the information store on LOUIEMBOX2, as that appears to be the only server with excessive queues.
[3:17 PM] We will also restart the load balancer to rebalance connections before the information store comes back up.
[3:27 PM] We are remounting the databases for LOUIEMBOX2 and will shortly resume mail flow.
[3:38 PM] We have resumed mail flow for users on LOUIEMBOX1. The database files are replaying the log files from the past half hour to ensure all data is present before we resume service.
[3:48 PM] Service has been fully restored and we are monitoring the traffic as the queues are processed.
We’ve concluded our investigation. The root cause of the latency on LOUIE, which prompted us to stop service, was a backup job on MBOX2 that had been running since 12:30 AM. The jobs are configured to terminate if they do not complete by 7 AM Eastern, which does not appear to have happened. We will work with the backup vendor to resolve the issue of backup jobs not terminating. Because the original reports from partners and our monitoring related to RPC / OWA / Outlook performance, we started with the CAS servers and worked backwards. We will reevaluate our monitoring checks to avoid confusing CAS-related latency with overall RPC and network call latency.
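
As an illustration only, an independent check for backup jobs that run past their cutoff could look like the minimal Python sketch below. It is not the backup vendor's mechanism. The function names `next_cutoff` and `backup_overran` are hypothetical, and the sketch assumes naive datetimes already expressed in Eastern time and a 7 AM cutoff.

```python
from datetime import datetime, time, timedelta


def next_cutoff(started_at: datetime, cutoff: time = time(7, 0)) -> datetime:
    """Return the first cutoff moment at or after the job's start time.

    Assumes naive datetimes in the same local (Eastern) time zone.
    """
    candidate = datetime.combine(started_at.date(), cutoff)
    if candidate < started_at:
        # Job started after today's cutoff, so its deadline is tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def backup_overran(started_at: datetime, now: datetime,
                   cutoff: time = time(7, 0)) -> bool:
    """True if a job that is still running has passed its cutoff."""
    return now > next_cutoff(started_at, cutoff)


if __name__ == "__main__":
    # Hypothetical date; start time taken from the summary above.
    started = datetime(2024, 1, 1, 0, 30)
    print(backup_overran(started, datetime(2024, 1, 1, 14, 0)))
```

A check like this would run separately from the backup software, so that a job which fails to terminate itself still raises an alert.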