July 28, 2011
We will be performing a database switchover on the DAG for ROCKERDUCK, from RDMBOX2 to RDMBOX1. Users currently serviced by RDMBOX2 will automatically move to RDMBOX1. No changes are required on the end user's side to continue service, and the switchover should be transparent to end users. The switch is estimated to last about an hour while we investigate performance counter issues on RDMBOX2.
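For reference, a switchover of this kind is driven from the Exchange Management Shell. A minimal sketch, assuming Exchange 2010; the database name "RDDB1" is illustrative, not the actual database:

    # Confirm the passive copy on RDMBOX1 is healthy and caught up before switching
    # ("RDDB1" is a hypothetical database name)
    Get-MailboxDatabaseCopyStatus -Identity "RDDB1" |
        Format-Table Name,Status,CopyQueueLength,ReplayQueueLength

    # Activate the copy on RDMBOX1; clients are redirected with no action on their part
    Move-ActiveMailboxDatabase -Identity "RDDB1" -ActivateOnServer RDMBOX1 -Confirm:$false

    # Verify the copy on RDMBOX1 now shows Mounted and the copy on RDMBOX2 shows Healthy
    Get-MailboxDatabaseCopyStatus -Identity "RDDB1" | Format-Table Name,Status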
July 27, 2011
At 3 PM Eastern (8 PM GMT) we will be rebooting servers in the DELLA cluster to apply Windows updates that require a restart. We anticipate a brief service interruption of up to 10 minutes while the reboots occur.
[3:26 PM] We’ve prepared the servers for the reboot and will begin rebooting the MBOX servers shortly.
[3:41 PM] We are bringing all servers in DELLA back online.
[3:50 PM] Service has been restored to DELLA.
July 26, 2011
We’re currently investigating an issue with one of the CAS nodes on LOUIE. All metrics currently appear healthy, but we’re investigating end-user reports of erratic connectivity.
[2:10 PM] We’ve tracked the issue down to sluggish Domain Controller (DC) lookups. We’re working to address these as quickly as possible.
[2:41 PM] We’re still working to address the Domain Controller lookup issue for the Outlook clients. This is a courtesy update to let our partners know we’re still working on the issue urgently.
[2:58 PM] We’ve traced the issue to a problem with the HUB role’s interface with the Domain Controller. We’re putting a change through and should have more information in the next 10 minutes.
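A change of this sort is typically made by pinning the Exchange server to known-good Domain Controllers while the slow one is investigated. A hedged sketch of what is involved; the server name "LOUIEHUB1" and DC name are illustrative:

    # See which Domain Controllers and Global Catalogs the server is currently using
    # ("LOUIEHUB1" is a hypothetical server name)
    Get-ExchangeServer -Identity LOUIEHUB1 -Status |
        Format-List CurrentDomainControllers,CurrentGlobalCatalogs

    # Pin the server to a known-good DC/GC until the sluggish lookups are resolved
    Set-ExchangeServer -Identity LOUIEHUB1 -StaticDomainControllers dc02.example.local -StaticGlobalCatalogs dc02.example.local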
Please remember that LiveArchive is accessible.
[3:06 PM] There are reports of mail delivery delays, which will naturally occur as clients reconnect and mail is pushed from local clients. We are also going to restart the information store on LOUIEMBOX2, as its queues are the only ones in excess.
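A minimal sketch of the commands behind this step, assuming Exchange 2010 (the hub server name is illustrative):

    # Confirm that only queues destined for LOUIEMBOX2 are backed up
    # ("LOUIEHUB1" is a hypothetical server name)
    Get-Queue -Server LOUIEHUB1 | Sort-Object MessageCount -Descending |
        Format-Table Identity,Status,MessageCount

    # Restart the Information Store service on the affected mailbox server
    Invoke-Command -ComputerName LOUIEMBOX2 -ScriptBlock { Restart-Service MSExchangeIS }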
[3:17 PM] We will also restart the load balancer to rebalance connections before the information store comes back up.
[3:27 PM] We are remounting the databases for LOUIEMBOX2 and will shortly resume mail flow.
[3:38 PM] We have resumed mail flow for users on LOUIEMBOX1. The databases are replaying the transaction logs from the past half hour to ensure all data is present before we fully resume service.
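The remount and log replay can be watched from the shell; a sketch, with "LODB2" standing in for the actual database name:

    # Remount the database once the information store is back
    # ("LODB2" is a hypothetical database name)
    Mount-Database -Identity "LODB2"

    # Watch the replay queue drain as the past half hour of transaction logs is replayed
    Get-MailboxDatabaseCopyStatus -Identity "LODB2" |
        Format-Table Name,Status,ReplayQueueLength,LastInspectedLogTime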
[3:48 PM] Service has been fully restored and we are monitoring traffic as the queues are processed.
We’ve concluded our investigation. The root cause of the latency on LOUIE, which prompted us to stop service, was a backup job on MBOX2 that had been running since 12:30 AM. These jobs are configured to terminate if they do not complete by 7 AM Eastern, which did not happen in this case. We will be working with the backup vendor to resolve the issue of backup jobs not terminating. Since the original reports from partners and from our monitoring concerned RPC/OWA/Outlook performance, we started with the CAS servers and worked backwards. We will reevaluate our monitoring checks to better distinguish CAS-related latency from overall RPC and network call latency.
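As a sketch of the distinction we want our checks to make: Exchange ships synthetic-transaction cmdlets that exercise each tier separately, so CAS-level latency can be compared against latency measured directly at the store (assuming Exchange 2010 and that the standard test mailbox has been provisioned):

    # Run on a CAS server: exercises the client path through the CAS tier
    Test-OwaConnectivity

    # Run on a mailbox server: logs on to the store directly, bypassing the CAS tier
    Test-MapiConnectivity

If the second check is slow while the first is healthy, the latency lies behind the CAS tier rather than in it.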
July 25, 2011
Today we will be conducting failover scenarios for our PBX system in response to the outage from last week. Throughout the test we will be disabling our main PBX to allow calls to fail over to our backup numbers. Clients should not experience any issues reaching parties on our end during the PBX testing.
July 24, 2011
At 10 PM EST tonight we will begin the upgrade of our network to PHP 5.3. This security upgrade is required by many modern PHP applications such as WordPress, phpMyAdmin, and Joomla; in order to provide the latest applications and the security patches they require, the PHP upgrade is necessary.
We will begin the upgrade process at 10 PM EST tonight and expect it to take roughly an hour. Between 10 PM and 11 PM EST there may be intermittent outages on our web servers while the upgrades are performed and services are restarted.
We will update this NOC post once all the work has been completed.
Update: 10:45 PM EST – Upgrade to PHP 5.3 has failed. We will attempt again shortly and update this advisory.
July 21, 2011
On 7/20/2011, at around 3:35 PM Eastern, we started experiencing random packet loss across various services, including Hosted Exchange and OWN Websites. At roughly 3:45 PM Eastern, the random packet loss turned into a widespread service outage that lasted until 4:12 PM Eastern.
The incident appears to have been caused by a faulty network driver on an Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately, the internal flood prevented access to the network analytics servers behind the DMZ. Since all machines received and responded to the requests, all machines appeared to be ‘flooding’ from the router’s perspective, and the IDS was unable to determine the ‘source’ IP.
All services were effectively taken offline when the IDS started blocking traffic from the internal hosts. After we disconnected the offending machine from the network and cleared the IDS, we were able to resume service across the board.
The biggest area of concern was the inability to contact us while the outage was occurring, as it took down our support board and primary phone lines. We deeply apologize for the grief and trouble this unexpected event caused; it goes without saying that this has been the most impactful network event we’ve experienced. We’ve implemented a new redundancy plan for our phone systems to handle global outages, as this was the first time our phone systems were completely offline during a critical event.
We appreciate everything our partners do for us, and the patience extended yesterday; we know it was a very stressful event for our partners and their end users. As always, we will continue to improve our solution stacks and address the areas where we fall short.
July 20, 2011
We’re experiencing packet loss in our primary datacenter. Services appear to be coming back online, but we do not yet consider the issue resolved.
[4:57 PM] All services have been restored, with the exception of LiveArchive. We should have this service back online shortly.
July 18, 2011
We’ve received alerts about queues on ROCKERDUCK backing up to 100 messages. Upon investigation, we found the Edge server stuck in a reboot phase after Windows updates. We’ve restarted the server, and mail is once again flowing on ROCKERDUCK.
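The alert behind this is straightforward to reproduce by hand; a minimal sketch (the hub server name and the 100-message threshold are illustrative):

    # Flag any transport queue that has backed up past 100 messages
    # ("RDHUB1" is a hypothetical server name)
    Get-Queue -Server RDHUB1 | Where-Object { $_.MessageCount -gt 100 } |
        Format-Table Identity,Status,MessageCount,NextHopDomain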
We’ve been working on MATILDA throughout the night. It appears to be an ISP-level issue, and we’re awaiting further details from our AUS datacenter. Please remember to use LiveArchive during these times. It is available at https://livearchive.exchangedefender.com
[7:14 AM] This is a courtesy update; unfortunately, we still have not received a resolution to the issue from our datacenter. Please continue to use LiveArchive.
This issue has been resolved.