We’ve received alerts from our monitoring software about memory faults in HUEY. We’ve dismounted the database to avoid any corruption while we test the memory and replace it if needed. We currently do not have an ETA for service restoration, but we will update this blog as information becomes available.
Update 2:50 AM Eastern: Service was restored at 2:30 AM and all services have been confirmed online.
We’re going to perform an emergency reboot on DEWEY. While we’re unable to replicate the reported issues connecting to profiles on DEWEY, there are enough reports that we need to ensure that all of our clients can access their profiles. We’ll be doing this in 10 minutes, and we should be back online within 15 minutes after that.
In a few moments we will begin the upgrade of our network to ExchangeDefender 7. The process will be seamless to our users and we don’t expect any issues. We are prepared for unexpected issues, and here are some suggestions:
1. Sign up for the Network Operations Blog updates (http://www.ownwebnow.com/noc)
2. If you don’t like web / RSS feeds, check out our Twitter feed (http://www.twitter.com/XDNOC)
3. Make sure you have distributed the User Guides to your end users so they can navigate the new User Interface. Link to end user collateral is here.
During The Upgrade
Check here first. We will be posting updates here and to the Twitter feed as soon as we identify and document a bug.
If you see an issue that has not been addressed, open a support request at https://support.ownwebnow.com
Call us at +1 (407) 209-3276 if the issue is urgent. Note: This number will be discontinued after the launch on June 1, 2011.
May 30, 11 PM EST: All services have been upgraded to ExchangeDefender 7. All tests have confirmed that sites and services are completely operational.
At 11:00 PM Eastern we will be rebooting the LOUIEMBOX1 server to repair the Windows Server Backup feature. During the reboot, mailboxes hosted on LOUIEMBOX1 will experience a brief service interruption. We understand that this is short notice; however, in the interest of maintaining a healthy backup history, we must reboot the server before the next scheduled backup.
Update 11:30 PM Eastern: We are now rebooting the LOUIEMBOX1 server.
Update 11:45 PM Eastern: Service has been restored to users on LOUIEMBOX1.
We are performing Active Directory maintenance on the LIVEARCHIVE network in order to resolve issues with database latency. During maintenance, users may receive a warning about their mailbox being unavailable. We appreciate your patience as we make the necessary modifications in order to restore service to 100% operation.
We will be installing Exchange 2010 SP1 RU3 on the LOUIE cluster tonight beginning at 9PM Eastern.
During the upgrade on the CAS servers, users may see a brief disconnection while individual CAS nodes are upgraded; however, they should automatically connect to another available CAS server.
During the upgrade on the Mailbox servers, users may see a brief disconnection from public folders while individual Mailbox nodes are upgraded; however, they should automatically connect to the next available Public Folder Replica.
During the upgrade on the HUB servers, users may see a brief delay for incoming mail; however, outgoing mail should not be affected.
This upgrade will allow us to seamlessly move mailboxes from LOUIE to our newest Exchange Cluster if partners would like to move.
Update 9:10 PM Eastern: We are beginning to upgrade the CAS servers in LOUIE.
Update 10:15 PM Eastern: We have completed the upgrade to the CAS servers and will now begin on the HUB and Mailbox servers.
Update 11:04 PM Eastern: Service has been restored to the LOUIE network and we’ve confirmed access to our newest Exchange cluster.
We’re currently investigating some issues with a specific database on DEWEY. There may be a quick service interruption while we research this.
[9:57 am] Reports of slowness have increased, and it appears that the system will require a restart. Please stand by; we’ll be restarting in 10 minutes. Please remember LiveArchive is available during these windows.
[10:21 am] Mail is now delivering as expected. Please allow time for the queued mail to deliver. Thank you for your patience.
Huey is experiencing some issues with some client connection services. We’ve gone through all of the services carefully to avoid this. However, it is necessary to complete this in the next 5 minutes.
Work on Huey is continuing. Please remember LiveArchive is available during these windows of time.
Per our previous NOC posting about LOUIE, tonight we will be performing maintenance as we physically move the servers around. Unfortunately we are running behind schedule and work was not able to begin at the scheduled 10PM mark. We estimate that we will be able to begin around 3AM Eastern and will update this NOC posting once work has begun.
Update 3:30 AM Eastern: We’ve received the green light to begin the move process. At 4:00 AM Eastern we will be powering off the LOUIE network to physically move the servers to our newest cage in our Dallas data center.
Update 4:05 AM Eastern: We’re beginning the move for the LOUIE servers. Customers on the LOUIE network will be unable to access their mailboxes for the next 20 minutes as the move occurs.
Update 5:05 AM Eastern: The move completed about 15 minutes ago. We are running our performance tests to ensure that the post-move process completed successfully.
We’re currently working on LiveArchive as some Mail databases are not accepting delivery of new mail. Up to this point access to the Mailbox had not been affected. We’re in the process of performing emergency maintenance. Please stand by for an update as to when the work is complete.
Update: 70% of the Databases are remounted and beginning to receive mail. We’ll update shortly when the work has completed.
Update: 100% of the Databases have remounted with the exception of DB8 which was already under maintenance.
On 5/19/11 at 10 PM Eastern we will be performing maintenance on the LOUIE Exchange 2010 network as we physically move the servers into a new cabinet. During the maintenance period, users will briefly lose connection and access to their mailboxes; however, service is expected to be restored shortly thereafter.
Prior to maintenance we will update our NOC blog with a more refined timeline.
Please note we’ve received a note about a possible outage for some upstream connections to our Australian DC. They’re saying only certain routes are affected. There is no current ETA.
We do apologize for any inconvenience this issue with Louie may have caused your client. Our engineers discovered that the backup jobs on the server went to truncate the log files and it forced a dismount of the database. Our monitoring software attempted to remount the database, but the backup job would continue to dismount it. As of this moment the engineers have confirmed that the backup has been completed and all services have been fully restored and are currently active.
We’re currently researching a network outage that seems to have affected our Level3 network connection into our data center.
Update 11:50: This has been fully mitigated. Everything is back up and running! We apologize for any inconvenience to our Level3 clients; this shouldn’t happen again. We’ll be posting additional details as we complete our investigation.
Our monitoring software has alerted us to an abnormally high page count on LOUIEMBOX2. We were able to dehydrate a few processes to alleviate the pressure, however, we will need to restart the storage device for users on DB3.
At 10:00 PM Eastern we will be restarting the storage device for DB3 on LOUIEMBOX2. During the restart, users will be unable to access their mailboxes for up to 10 minutes. We expect service to be completely restored by 10:15 PM Eastern.
Update 10:00 PM Eastern: We are beginning the process of dismounting DB3 and restarting the storage array.
Update 10:10 PM Eastern: Service has been restored to users on DB3.
After reviewing performance metrics for the week, we’ve discovered an issue with the load balancer for LOUIE that would cause issues with profile creation in Outlook. By design, the load balancer ties the affinity of an entire Class C IP subnet to a single CAS node. Normally, this isn’t an issue, but with the addition of Global Catalogs to the forest, Outlook clients would automatically try to break affinity on NSPI proxy requests. Unfortunately, this made the issue harder to diagnose, as our staff were often unable to replicate the profile creation issues.
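The subnet affinity described above can be illustrated with a minimal Python sketch. It is not the load balancer’s actual algorithm, which this post does not describe: the node names, the CRC32 hash, and the functions `subnet_key` and `pick_node` are all assumptions made for illustration. It only shows how every client in the same /24 (Class C sized) subnet ends up pinned to one CAS node.

```python
import ipaddress
import zlib

# Hypothetical CAS node names, for illustration only.
CAS_NODES = ["cas1", "cas2", "cas3", "cas4"]


def subnet_key(client_ip: str) -> str:
    """Return the /24 network address that contains client_ip."""
    network = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    return str(network.network_address)


def pick_node(client_ip: str, nodes=CAS_NODES) -> str:
    """Pin every client in the same /24 subnet to the same node.

    Assumption: a deterministic hash of the subnet selects the node;
    a real load balancer may use a different affinity table or hash.
    """
    key = subnet_key(client_ip).encode("ascii")
    return nodes[zlib.crc32(key) % len(nodes)]


if __name__ == "__main__":
    # Two clients behind the same office subnet land on the same node.
    print(pick_node("203.0.113.10"), pick_node("203.0.113.200"))
```

Because the node is chosen per subnet rather than per connection, a client that opens a second connection expecting a different server still lands on the same node, which is one way to picture why breaking affinity was difficult for clients.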
Starting at 11:00 PM Eastern tonight (4/29/11) we will begin to take down the current load balancer for replacement. Clients using Outlook and ActiveSync will be disconnected during the replacement. Clients can utilize OWA by logging into specific CAS nodes, e.g., https://louiecas4.louie.exchangedefender.com/owa
Maintenance is expected to last until 11:30 PM Eastern. We understand this is extremely short notice; however, in the interest of providing solid performance by Monday, we need to use as much time as possible over the weekend to stress test the new load balancer. We appreciate your patience as we strive to continue bringing a solid Hosted Exchange experience.
Update 11:00 PM Eastern: We are beginning maintenance to replace the CAS load balancer for LOUIE.
Update 11:54 PM Eastern: Service was restored near 11:35 PM Eastern and we’ve confirmed all services are active and online. We are going to continue to monitor the traffic and connection rates over the weekend.
This post is a continuation of http://www.ownwebnow.com/noc/2011/04/19/huey-maintenance/
Over the past 24 hours we’ve been collecting metrics and performance reports on HUEY. Unfortunately our metrics have shown a slight soft corruption on DB1. Tonight, 4/20/11 at 10:30 PM Eastern we will be taking DB1 on HUEY offline to perform an integrity check and repair any mismatched entries in the database.
Clients on DB1 should expect to be disconnected from their mailbox for at least 2 hours as the check completes. During the check, all mail will be delivered to LiveArchive and spooled on HUEY, awaiting delivery after maintenance completes.
Update 10:36 PM Eastern: We are beginning the checks for mailboxes on DB1.
Update 6:40 AM Eastern: Unfortunately checks are still running on DB1. We are continuing to monitor the progress.
Update 7:30 AM Eastern: In the interest of service availability as we enter the start of normal business hours, we’ve cancelled the DB integrity check just shy of 50% completion. Any corruption encountered prior to the cancellation was repaired, and we will continue to monitor the health and performance of the database.
In the last hour our monitoring software has alerted us about slow write response times on the database log drive. In order to resolve the issue, we will have to stop all mailbox access on HUEY to test, and if needed, replace the affected drive.
At 10:00 PM Eastern tonight we will stop all network access to mailboxes for the HUEY network. Unfortunately we are unable to estimate the time required for the maintenance to complete, but we will provide estimates as progress is made.
Clients requiring access to their mailboxes after maintenance begins are urged to utilize LiveArchive during the maintenance cycle.
Update 9:40 PM Eastern: We are preparing to disable access to mailboxes in the HUEY network. We expect to restore service by 10:20 PM Eastern.
Throughout the day we’ve encountered issues with the hub transport service on HUEY where messages would no longer submit. We are now receiving reports from a few partners about the inability to log in via Outlook.
We are performing an emergency reboot on HUEY in order to clear out any lingering issues. Service is expected to be impacted for up to 15 minutes.
Clients requiring access to their mail during the reboot can leverage LiveArchive to check their mailboxes.
We’ve received reports from a few partners with mailboxes on LOUIEMBOX1 that searching for items via OWA returns missing or empty results. In order to diagnose the issue, we’ll have to install logging software for verbosity. Unfortunately, this will require a reboot of LOUIEMBOX1.
At 9:00 PM Eastern on 4/19/11 we will reboot LOUIEMBOX1. The impact of the reboot is expected to last no more than 15 minutes. Clients requiring access to their mailbox during this period can utilize LiveArchive to monitor live mail during the reboot.
Update 4/19/11 8:55 PM Eastern: We are beginning to restart LOUIEMBOX1.
Update 9:25 PM Eastern: The reboot completed on time and service has been verified as running.