We’re performing maintenance on Backup73 and expect it to last 3 hours. We’re performing the maintenance during business hours because our usage records show that most backup jobs are scheduled for before and after business hours.
Between 4:30 PM and 5:30 PM a few clients may have received bounce-back messages referencing delivery to xd286#### from LiveArchive.
These notices were sent in error and were caused by tests we are conducting to set up LiveArchive redundancy in Los Angeles.
We’ve added maintenance tonight for the ROCKERDUCK cluster, specifically on the DAG. Users should not experience any interruption in service during the maintenance; however, a brief interruption is possible. We will monitor connection statuses during the maintenance and will update this post if any interruption is detected.
Update 1:45 AM Eastern: We will have to stop the cluster service on all nodes which will cause a momentary interruption in service as we repair communication.
Update 2:23 AM Eastern: The DAG on Rockerduck has been repaired and is back online.
Tonight we will be performing maintenance on DEWEYMBOX2 to address the performance issues reported by partners throughout the day.
Maintenance is expected to last between 10PM and 2AM.
1:46 AM EST – All maintenance on DEWEYMBOX2 has been completed successfully. All services have been restored and queued messages are flushing into users’ mailboxes.
Again, we’d like to take this opportunity to apologize for this issue, and we appreciate your patience with us throughout the process.
Update 8:05 AM: We are going to reboot the CAS servers for DEWEY to clear up any issues before the start of business. Service may be interrupted for 15 minutes.
Update 8:15 AM: The domain controllers for DEWEY have come back online from the restart. We are still waiting for the primary CAS and MBOX server to come back online.
Update 8:45 AM: The primary CAS and MBOX server came up at 8:30 AM and we’ve confirmed service access for RPC, MAPI, ActiveSync and POP/IMAP.
99% of our services have been restored. Unfortunately, three servers are still affected by the outage. Because this was a power outage, and Windows Server’s ability to handle such outages is not on par with Linux’s, some servers have taken longer to come back online. Services currently affected:
Dewey MBOX2 – A secondary mailbox server on the DEWEY cluster; it hosts a small user count, but if your clients are on it, their mailboxes are inaccessible.
Daisy – Legacy Exchange 2003 server
VS4 – One of our Virtual servers
These servers affect a small but equally important portion of the client base, so we are all hands on deck restoring service on them ASAP. For the Hosted Exchange users still affected, please remember that LiveArchive is online; if you’re having any issues authenticating to it or otherwise, please open a support request and we’ll get you back online on LiveArchive as quickly as possible.
All services have been restored. There is still spooled mail being delivered, but new mail is being delivered in real time. With that said, our CEO Vlad Mazek held a GoToWebinar outlining all the important facts about this event. We invite you to listen to the points covered and to use the information provided in the PPTx. Once again, we’d like to apologize for the inconvenience caused to you and your customers.
We’re executing an emergency recovery plan; essential services will be coming online, but they are not running on their full infrastructure. Please expect services to come back online slowly, as they are running on our emergency failover (the datacenter has NOT fully resolved the outage). Expect latency on the servers that are online as customers try to reconnect. We’ll provide more details as they become available, and we will be keeping staff available on the phone to relay the information below live. Again, this is a continuation of the information found on our Twitter feed at twitter.com/XDNOC
We’re currently polling all the data from the twitter feed to present it all in one view.
Below you will find a redacted version of the Twitter feed so updates can be read without cross-conversations from partners during the outage. Please note that, like the feed itself, these read chronologically from the bottom up:
Our failover systems are kicking in and service is restored to support portals, web sites, outbound ExchangeDefender, louie, & rockerduck.
Please note: THESE ARE EMERGENCY FAILOVER systems, not the real thing. Full service will be restored by the utility/power/electricians/etc.
ExchangeDefender outbound service has been re-established as well as Exchange 2010 LOUIE and ROCKERDUCK
DC/Electrical teams have established a provisional return of services for 6:30 PM EST. We will update this advisory at that time.
We are working with the DC to move around equipment for a temporary solution. OWN sites and ED outbound will be up soon.
DC Update “There has been an issue affecting one of our 6 service entrances. The actual ATS is having an issue and all vendors are on site.”
The datacenter staff has confirmed an outage with the power plant and has individuals on staff attempting to redirect power around the core
Service is still affected and the latest from the DC reports that the backup EPO overloaded and tripped. The issue is still being addressed
The issue has been identified as power related in the DC. Services are slowly coming online. We will update when service is fully restored.
Routing issues in Dallas at the moment. If you’re having issues accessing and have Level3 in your way, it’s going to take some patience today.
In addition we’re polling updates from our DC’s status to ensure that we’re providing as much detail as possible on the outage itself (times are CST):
Our team and electricians are working diligently to get the temporary ATS installed, wired and tested to allow power to be restored. As the ATS involves high-voltage power, we are following the necessary steps to ensure the safety of our personnel and your equipment housed in our facility.
Based on current progress the electricians expect to start powering the equipment on between 6:15 – 7:00pm Central. This is our best estimated time currently. We have thoroughly tested and don’t anticipate any issues in powering up, but there is always the potential for unforeseen issues that could affect the ETA so we will keep you posted as we get progress reports. Our UPS vendor has checked every UPS, and the HVAC has checked every unit and found no issues. Our electrical contractor has also checked everything.
We realize how challenging and frustrating it has been to not have an ETA for you or your customers, but we wanted to ensure we shared accurate and realistic information. We are working as fast as possible to get our customers back online and to ensure it is done safely and accurately. We will provide an update again within the hour.
While the team is working on the fix, I’ve answered some of the questions or comments that have been raised:
1. ATSs are pieces of equipment and can fail as equipment sometimes does, which is why we do 2N power in the facility in case the worst case scenario happens.
2. There is no problem with the electrical grid in Dallas or the heat in Dallas that caused the issue.
3. Our website and one switch were connected to two PDUs, but ultimately the same service entrance. This was a mistake that has been corrected.
4. Bypassing an ATS is not a simple fix, like putting on jumper cables. It is detailed and hard work. Given the size and power of the ATS, the safety of our people and our contractors must remain the highest priority.
5. Our guys are working hard. While we all prepare for emergencies, it is still quite difficult when one is in effect. We could have done a better job keeping you informed. We know our customers are also stressed.
6. The ATS could be repaired, but we have already made the decision to order a replacement. This is certainly not the cheapest route to take, but is the best solution for the long-term stability.
7. While the solution we have implemented is technically a temporary fix, we are taking great care and wiring as if it were permanent.
8. Colo4 does have A/B power for our routing gear. We identified one switch that was connected to A only which was a mistake. It was quickly corrected earlier today but did affect service for a few customers.
9. Some customers with A/B power had overloaded their circuits, which is a separate, individual issue rather than a network issue. (For example, if we offer A/B 20 amp feeds and the customer has 12 amps on each, if one trips, the other will not be able to handle the load.)
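The arithmetic behind point 9 can be sketched as a quick capacity check (a hypothetical illustration, not Colo4 tooling): a surviving feed can only absorb its partner’s load if the combined draw stays within its rating.

```python
def failover_ok(a_load_amps, b_load_amps, feed_rating_amps):
    """Return True if one feed could carry the combined A+B load
    after the other feed trips (ignoring derating margins)."""
    combined = a_load_amps + b_load_amps
    return combined <= feed_rating_amps

# The example from the advisory: 12 A on each side of a 20 A A/B pair.
# 12 + 12 = 24 A > 20 A, so the surviving feed would overload and trip.
```

This is why A/B redundancy only works when each side is loaded to at most half the feed rating.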
As you could imagine, this is the top priority for everyone in our facility. We will provide an update as quickly as possible.
Thank you for your patience as we work to address the ATS issue with our #2 service entrance. We apologize for the situation and are working as quickly as possible to restore service.
We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site as part of our emergency recovery plan. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online.
Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business.
Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers. Please expect another update within the hour.
It has been determined that the ATS will need repairs that will take time to perform. Fortunately Colo4 has another ATS that is on-site that can be used as a spare. Contractors are working on a solution right now that will allow us to safely bring that ATS in and use it as a spare while that repair is happening.
That plan is being developed now and we should have an update soon as to the time frame to restore temporary power. We will need to schedule another window when the temp ATS is brought offline and replaced by the repaired ATS.
There has been an issue affecting one of our 6 service entrances. The actual ATS (Automatic Transfer Switch) is having an issue and all vendors are on site. Unfortunately, this is affecting service entrance 2 in the 3000 Irving facility so it is affecting a lot of the customers that have been here the longest.
The other entrance in 3000 is still up and working fine as well as the 4 entrances in 3004. Customers utilizing A/B should have access to their secondary link. It does appear that some customers were affected by a switch that had a failure in 3000. That has been addressed and should be up now.
This is not related to the PDU maintenance we had in 3004 last night. Separate building, service entrance, UPS, PDU, etc.
We will be updating customers as we get information from our vendors so that they know the estimated time for the outage. Once this has been resolved we will also distribute a detailed RFO to those affected.
Our electrical contractors, UPS maintenance team and generator contractor are all on-site and working to determine what the best course of action is to get this back up.
One of our POP/IMAP servers, used primarily for WebHosting freebie mailboxes, appears to have been attacked. We’re resolving the issue as we speak; there may be some delays in mail flow from today. Measures have been taken to ensure the root cause does not repeat itself.
We have received reports of a 5 minute window where there was packet loss to our Datacenter in Dallas. It appears this issue was on the ISP level of the Network and has been resolved in its entirety.
Additional Info from our DC:
The network issues experienced today began at approximately 12:22 PM CST and were caused by an issue within Level3’s network. This issue affected Level3 customers nationwide and was not isolated to Colo4.
We will be performing a database switch on the DAG for ROCKERDUCK from RDMBOX2 to RDMBOX1. Users currently being serviced by RDMBOX2 will automatically switch to RDMBOX1. No changes are required by the end user to continue service, and the switchover should be transparent to end users. This switch is estimated to last for an hour as we investigate performance counter issues on RDMBOX2.
At 3 PM Eastern (8 PM GMT) we will be rebooting servers in the DELLA cluster to apply windows updates which require a restart. We anticipate service to be slightly interrupted for up to 10 minutes while the reboot occurs.
[3:26 PM] We’ve prepared the servers for the reboot and will begin rebooting the MBOX servers shortly.
[3:41 PM] We are bringing all servers in DELLA back online.
[3:50 PM] Service has been restored to DELLA
We’re currently investigating an issue with one of the CAS nodes on LOUIE. Currently all metrics appear in good health, but we’re investigating end user reports of erratic connectivity.
[2:10 PM] We’ve tracked the issue down to sluggish DC lookups. We’re working to address these as quickly as possible.
[2:41 PM] We’re still working on addressing the Domain Controller lookup issue for the Outlook clients. This is a courtesy update to let our partners know we’re still working on the issue urgently.
[2:58 PM] We’ve traced the issue to a problem with the HUB role interfacing with the Domain Controller. We’re putting a change through and should have more information in the next 10 minutes.
Please remember that LiveArchive is accessible.
[3:06 PM] There are reports of mail delivery delays, which unfortunately will naturally happen after clients begin to reconnect and mail is pushed from local clients. We are also going to restart the information store on LOUIEMBOX2, as that server holds the only queues that are in excess.
[3:17 PM] We will also restart the load balancer to rebalance connections before the information store comes back up.
[3:27 PM] We are remounting the databases for LOUIEMBOX2 and will shortly resume mail flow.
[3:38 PM] We have resumed mail flow for users on LOUIEMBOX1. The database files are replaying the log files from the past half hour to ensure all data is present before we resume service.
[3:48 PM] Service has been fully restored and we are monitoring the traffic as the queues get processed
We’ve concluded our investigation: the root cause of the latency on LOUIE, which prompted us to stop service, was a backup job on MBOX2 that had been running since 12:30 AM. The jobs are configured to terminate if they do not complete by 7 AM Eastern, which did not occur. We will be working with the backup vendor to resolve the issue of backup jobs not terminating. Since the original reports from partners and our monitoring related to RPC / OWA / Outlook performance, we started with the CAS servers and worked backwards. We will reevaluate our monitoring checks in an attempt to avoid mistaking overall RPC and network call latency for CAS-related latency.
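The termination deadline described above can be sketched as a simple wrapper (a hypothetical illustration; the actual backup software and its configuration are not named in this advisory): compute the time remaining until the 7 AM cutoff and kill the job if it is still running when the budget runs out.

```python
import subprocess
from datetime import datetime, timedelta

def seconds_until(hour, now):
    """Seconds from `now` until the next occurrence of `hour`:00."""
    deadline = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if deadline <= now:
        deadline += timedelta(days=1)  # deadline already passed today
    return (deadline - now).total_seconds()

def run_backup_with_deadline(cmd, deadline_hour=7, now=None):
    """Run the backup command, killing it if still running at the deadline."""
    budget = seconds_until(deadline_hour, now or datetime.now())
    try:
        subprocess.run(cmd, timeout=budget, check=True)
        return "completed"
    except subprocess.TimeoutExpired:
        return "terminated at deadline"
```

A wrapper like this enforces the cutoff externally, so a vendor bug in the job’s own termination logic cannot let it run into production hours.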
Today we will be conducting failover scenarios for our PBX system in response to the outage from last week. Throughout the test we will be disabling our main PBX to allow calls to fail over to our failover numbers. Clients should not experience any issues in reaching parties on our end during the PBX testing.
At 10 PM EST tonight we will begin the upgrade of our network to PHP 5.3. This security upgrade is required for many modern PHP applications such as WordPress, phpMyAdmin and Joomla. In order to provide the latest applications and the security patches they require, the PHP upgrade is necessary.
We will begin the upgrade process at 10 PM EST tonight and expect it to take roughly an hour. Between 10PM and 11 PM EST there may be intermittent outages on our web servers while the upgrades are being performed and services restarted.
We will update this NOC post once all the work has been completed.
Update: 10:45 PM EST – Upgrade to PHP 5.3 has failed. We will attempt again shortly and update this advisory.
On 7/20/2011 around 3:35 PM Eastern we started experiencing random packet loss across various services, including Hosted Exchange and OWN Websites. At roughly 3:45 PM Eastern, the random packet loss turned into a widespread service outage that lasted until 4:12 PM Eastern.
The incident appears to have been caused by a faulty network driver on an Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately, the internal floods prevented access to the network analytic servers behind the DMZ. Since all machines received and responded to the requests, all machines showed up as ‘flooding’ to the router, and the IDS was unable to determine the ‘source’ IP.
All services were essentially taken offline when the IDS started blocking traffic from the internal hosts. After we disabled the offending machine from the network and cleared IDS we were able to resume service across the board.
The biggest area of concern was the inability to contact us while the outage was occurring, since it took down our support board and primary phone lines. We deeply apologize for the grief and trouble that this unexpected event caused; it goes without saying that this has been the most impactful network event we’ve experienced. We’ve implemented a new redundancy plan for our phone systems to handle global outages, as this was the first time our phone systems were completely offline during a critical event.
We appreciate everything that our partners do for us and the patience that was extended yesterday as we definitely know that it was a very stressful event for our partners and their end users. As always we will continue to bring improvements to our solution stacks and address the areas where we may fall short.
We’re experiencing packet loss in our primary DC. It appears that services are beginning to come online but we have not yet considered the issue resolved.
[4:57 PM] All services have been restored with the Exception of LiveArchive. We should have this service online shortly.
We’ve received alerts about queues on Rockerduck backing up to 100 messages. Upon investigation, the Edge server was stuck in a reboot phase after Windows updates. We’ve restarted the server and mail is once again flowing on Rockerduck.
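The queue alert that caught this can be sketched as a threshold check (a hypothetical illustration; our actual monitoring stack is not described here): poll per-server queue depths and flag any server at or above the alert level.

```python
def queues_over_threshold(queue_depths, threshold=100):
    """Return server names whose mail queue depth has reached the alert threshold.

    queue_depths: mapping of server name -> queued message count.
    """
    return sorted(name for name, depth in queue_depths.items()
                  if depth >= threshold)
```

A static threshold like the 100-message level mentioned above is the simplest form; it trades sensitivity during quiet hours for simplicity.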
We’ve been working on Matilda throughout the night. It appears to be an ISP level issue as we’re awaiting word from our AUS DC on further details. Please remember to use LiveArchive during these times. It is available at https://livearchive.exchangedefender.com
7:14 AM: This is a courtesy update; unfortunately, we still have not received a resolution to the issue from our DC. Please continue to use LiveArchive.
This issue has been resolved.
12:28 PM (EST) on 6-22 – Our engineers discovered an issue on a couple of individual nodes within the ExchangeDefender network that may have caused some temporary delay to both inbound and outbound messages. Our ExchangeDefender engineers are working diligently to resolve this issue. From all of us here at OWN Web Now, we offer all of our partners our sincerest apologies for this unforeseen issue, but please rest assured the issue will be resolved.
12:40 PM (EST) This issue has been resolved and all spooled mail has been delivered. It was caused by a delay in response between two of our core systems within ExchangeDefender and should not recur.
Per our previous NOC posting, we’ve been redesigning our maintenance plan for rebalancing the user distribution on HUEY. The original plan to defrag the database was abandoned as the timeframe for completion was not acceptable.
Tonight starting at 9 PM Eastern we will be taking the HUEY database offline for about 5 minutes as we clear out the memory cache in preparation for tonight’s mailbox moves. Throughout the night we will be moving users between two new databases to even out the load. During the move, mailboxes that are actively moving will be inaccessible to their users, as Exchange 2007 does not support online moves. Upon completion, users will be able to access their mailbox on the new database. Move times will depend on mailbox size and item count.
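Evening out the load across two databases amounts to a balancing pass. A minimal greedy sketch (hypothetical; mailbox moves in Exchange 2007 are actually driven through the management shell, and the selection criterion here is assumed to be size): place each mailbox, largest first, into whichever database is currently lighter.

```python
def balance_mailboxes(mailbox_sizes_mb):
    """Greedily split mailboxes into two databases, largest first,
    always placing the next mailbox into the lighter database."""
    db1, db2 = [], []
    size1 = size2 = 0
    for name, size in sorted(mailbox_sizes_mb.items(), key=lambda kv: -kv[1]):
        if size1 <= size2:
            db1.append(name)
            size1 += size
        else:
            db2.append(name)
            size2 += size
    return db1, db2
```

Largest-first greedy placement keeps the two totals close without the cost of exact partitioning, which matters when move time scales with mailbox size.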
Update 6:45 PM Eastern: After our previous update, our metric test completed and we noticed write-lock delays on the OS drive for HUEY. We’ve made an adjustment to the outline above. Prior to starting the mailbox moves we will perform a full database backup at the NTFS file level. Unfortunately this means we will have to take the database offline, as we are capturing a raw file backup instead of a VSS backup. After the backup is completed we will run a surface scan on the OS drive for HUEY to check for any corruption. We anticipate this entire process will take up to 4 hours to complete. We will update this post as progress is made, starting at 9 PM when work begins.
Update 8:30 PM Eastern: The backup job of the OS is taking a bit longer than expected. We are pushing back dismounting the database to 9:30 PM. We will update this blog after 9:00 PM if we anticipate the backup taking longer.
Update 8:50 PM Eastern: We’ve received a request from a few west coast customers asking us to postpone maintenance until 10 PM. In the interest of disturbing service as little as possible, we will be postponing maintenance until 10 PM Eastern.
Update 10:45 PM Eastern: The backup is estimated to complete in the next hour. We will then begin the surface error test on the OS drive. This is estimated to be the longest part of the process and will require a disruption of service as we take the server offline. We estimate the entire process to be 4 hours as described earlier. We will update this post once the work begins.
Update 11:30 PM Eastern: We are beginning to dismount the mailbox databases and stop Exchange services.
Update 2:05 AM Eastern: The surface test has revealed issues on the OS drive. We are running a repair on the drive and monitoring the progress.
Update 4:17 AM Eastern: We’ve replaced a bad drive in HUEY on the OS drive and we are proceeding to perform integrity checks before turning on any services.
Update 5:47 AM Eastern: The integrity check failed and we will be restoring from the backup image taken prior to maintenance. We will continue to update this blog as progress is made.
To clarify the issue is specifically with the operating system and not the database integrity.
Update 9:44 AM Eastern: The restoration process is proceeding as planned; this is a courtesy update to assure partners that work is continuing.
Update 12:20 PM Eastern: The restoration surface test is underway and we are looking to confirm data consistency on the OS drive.
Update 1:57 PM Eastern: In order to achieve resolution in the fastest manner possible, we are beginning to concurrently restore the backup image on a spare server to eliminate any potential issues that may be affecting the physical host.
Update 7:00 PM Eastern: The integrity check has processed half of the files on the OS drive and overall progress is about 25 percent complete.
Update 9:00 PM Eastern: This is a courtesy update as the process above is still continuing successfully without any halts. We understand this is an urgent issue and we appreciate your patience with this process.
Update 1:30 AM: The integrity check has processed about 90% of the files on the drive and overall progress is near 75% completed.
Update 2:35 AM: The integrity check has completed and we’ve successfully booted Windows into safe mode. We are now proceeding to boot normally and resume services on HUEY.
Update 3:15 AM: Service on HUEY has been restored and all queued mail is being delivered to user mailboxes.
Tonight starting at 9:00 PM Eastern we will be taking the mailbox databases on HUEY offline to perform an offline defragmentation. We anticipate the scan will take up to 4 hours, which will leave mailbox access offline until the database is remounted. Clients can use LiveArchive during the maintenance window to continue working with live mail.
Update 8:35 PM Eastern: We will begin work in 30 minutes, starting with dismounting the database and copying it to a temporary storage drive, and then starting the offline defrag. After the defrag completes, we will mount the database from the temporary location and stress test the integrity. After we’ve assured integrity we will copy the database back to the active RAID controller.
We estimate that each step may take up to 2 hours to complete, but we will update this post along the way.
Update 9:10 PM Eastern: We are pushing back maintenance one hour as we rearrange the temporary storage iSCSI server to increase overall speed, expecting to lower the overall time.
Update 10:09 PM Eastern: We will begin the above outlined process in 5 minutes.
Update 11:00 PM Eastern: Based on current progress, we do not feel that we have enough time allocated for this process even with our earlier changes. We’ve remounted the current mailbox database and are formulating a new plan to deliver a solution that can run in parallel.