Our monitoring software has alerted us to an abnormally high page count on LOUIEMBOX2. We were able to dehydrate a few processes to alleviate the pressure; however, we will need to restart the storage device for users on DB3.
At 10:00 PM Eastern we will be restarting the storage device for DB3 on LOUIEMBOX2. During the restart, users will be unable to access their mailboxes for up to 10 minutes. We expect service to be completely restored by 10:15 PM Eastern.
Update 10:00 PM Eastern: We are beginning the process of dismounting DB3 and restarting the storage array.
Update 10:10 PM Eastern: Service has been restored to users on DB3.
After reviewing performance metrics for the week, we’ve discovered an issue with the load balancer for LOUIE that would cause problems with profile creation in Outlook. By nature, the load balancer ties an entire class C IP subnet to a single CAS node through affinity. Normally, this isn’t an issue, but with additions to Global Catalogs in the forest, Outlook clients would automatically try to break affinity on NSPI proxy requests. Unfortunately, this made the issue harder to diagnose, as our staff were often unable to reproduce the profile creation failures.
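As a rough illustration of the subnet affinity described above, the sketch below maps every client address in the same /24 (class C sized) network to the same node. It is not the load balancer's actual algorithm: the node names and the hash-based selection are assumptions made purely for the example.

```python
import hashlib
import ipaddress

# Hypothetical node names, used only for this illustration.
CAS_NODES = ["cas-a", "cas-b", "cas-c", "cas-d"]


def subnet_key(client_ip: str) -> str:
    """Return the /24 network that contains client_ip."""
    network = ipaddress.ip_network(f"{client_ip}/24", strict=False)
    return str(network)


def pick_node(client_ip: str, nodes=CAS_NODES) -> str:
    """Pick a node from the /24 only, so a whole subnet shares one node."""
    # Assumption: a stable hash of the subnet stands in for the real
    # affinity table kept by the load balancer.
    digest = hashlib.sha256(subnet_key(client_ip).encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]
```

With this kind of mapping, two workstations behind the same office subnet always land on the same CAS node, which is why a client that breaks affinity on its own can behave differently from a test machine on another network.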
Starting at 11:00 PM Eastern tonight (4/29/11) we will begin to take down the current load balancer for replacement. Clients using Outlook and ActiveSync will be disconnected during the replacement. Clients can utilize OWA by logging into specific CAS nodes, e.g., https://louiecas4.louie.exchangedefender.com/owa
Maintenance is expected to last until 11:30 PM Eastern. We understand this is extremely short notice; however, in the interest of providing solid performance by Monday, we need to use as much time over the weekend as possible to stress test the new load balancer. We appreciate your patience as we strive to continue bringing a solid Hosted Exchange experience.
Update 11:00 PM Eastern: We are beginning maintenance to replace the CAS load balancer for LOUIE.
Update 11:54 PM Eastern: Service was restored near 11:35 PM Eastern and we’ve confirmed all services are active and online. We are going to continue to monitor the traffic and connection rates over the weekend.
This post is a continuation of http://www.ownwebnow.com/noc/2011/04/19/huey-maintenance/
Over the past 24 hours we’ve been collecting metrics and performance reports on HUEY. Unfortunately, our metrics have shown a slight soft corruption on DB1. Tonight, 4/20/11, at 10:30 PM Eastern, we will be taking DB1 on HUEY offline to perform an integrity check and repair any mismatched entries in the database.
Clients on DB1 should expect to be disconnected from their mailbox for at least 2 hours while the check runs. During the check, all mail will be delivered to LiveArchive and spooled on HUEY, awaiting delivery after maintenance completes.
Update 10:36 PM Eastern: We are beginning the checks for mailboxes on DB1.
Update 6:40 AM Eastern: Unfortunately, checks are still running on DB1. We are continuing to monitor the progress.
Update 7:30 AM Eastern: In the interest of service availability as we enter the start of normal business hours, we’ve cancelled the DB integrity check just shy of 50% completion. Any corruption encountered before the cancellation was repaired, and we will continue to monitor the health and performance of the database.
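To put the timestamps above in perspective, here is a minimal sketch that projects how long the full check would have taken. It assumes the check progresses linearly and treats "just shy of 50%" as exactly 0.5; both are simplifications, so the result is a rough estimate only.

```python
from datetime import datetime, timedelta


def projected_total(start: datetime, now: datetime, fraction_done: float) -> timedelta:
    """Project the total run time, assuming progress is linear."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - start
    return elapsed / fraction_done


# Times taken from the updates above (Eastern, 4/20/11 into 4/21/11).
started = datetime(2011, 4, 20, 22, 36)
cancelled = datetime(2011, 4, 21, 7, 30)

total = projected_total(started, cancelled, 0.5)  # 0.5 is an approximation
remaining = total - (cancelled - started)

print("projected total:", total)
print("projected remaining:", remaining)
```

Under those assumptions the check would have needed roughly another nine hours, well into the business day.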
In the last hour our monitoring software has alerted us about slow write response times on the database log drive. In order to resolve the issue, we will have to stop all mailbox access on HUEY to test, and if needed, replace the affected drive.
At 10:00 PM Eastern tonight we will stop all network access to mailboxes for the HUEY network. Unfortunately, we are unable to estimate the time required for the maintenance to complete, but we will provide estimates as progress is made.
Clients requiring access to their mailboxes after maintenance begins are urged to utilize LiveArchive during the maintenance cycle.
Update 9:40 PM Eastern: We are preparing to disable access to mailboxes in the HUEY network. We expect to restore service by 10:20 PM Eastern.
Throughout the day we’ve encountered issues with the hub transport service on HUEY where messages would no longer submit. We are now receiving reports from a few partners about the inability to log in via Outlook.
We are performing an emergency reboot on HUEY in order to clear out any lingering issues. Service is expected to be impacted for up to 15 minutes.
Clients requiring access to their mail during the reboot can leverage LiveArchive to check their mailboxes.
We’ve received reports from a few partners with mailboxes on LOUIEMBOX1 that searching for items via OWA returns missing or empty results. In order to diagnose the issue, we’ll have to install verbose logging software. Unfortunately, this will require a reboot of LOUIEMBOX1.
At 9:00 PM Eastern on 4/19/11 we will reboot LOUIEMBOX1. The impact of the reboot is expected to last no more than 15 minutes. Clients requiring access to their mailbox during this period can utilize LiveArchive to monitor live mail during the reboot.
Update 4/19/11 8:55 PM Eastern: We are beginning to restart LOUIEMBOX1.
Update 9:25 PM Eastern: The reboot completed on time and service has been verified as running.
We’re in the process of conducting a disk-intensive metric test on the LOUIE mailbox and CAS servers to ensure that issues discovered earlier this week are no longer present.
During the next 10 minutes, clients with large mailboxes or with multiple instances of their mailbox opened (Outlook, ActiveSync, etc.) may notice brief timeouts or sporadic error messages as we pool all resources toward completing the test as soon as possible.
About 10 minutes ago we started to see larger amounts of packet loss on the RPC channel for LOUIE. To restore service immediately, we’ve disabled the CAS array and bound the IP to a specific node. Users may see a brief 2-5 minute disconnect as the ARP routing flushes from the CAS array to the specific node.
Update 11:29 AM Eastern: The ARP flush was completed and full CAS service has been restored to LOUIE. We will continue to monitor the channel for any signs of RPC latency or packet loss.
We’ve received reports from our monitoring servers about massive packet loss on the Exchange hosting network. Upon investigation, it appears to be a routing issue with one of the main providers to the core router. We’ve initiated failover mode and traffic is now routing through our alternate provider. Customers may have experienced a brief disconnection as Outlook lost connection through the old routes and discovered the new routing for the Exchange network.
Tonight we will be rebooting the DEWEYMbox2 server as it is no longer allowing users to resolve profile names on new mailbox setups and is randomly disconnecting users from the directory service. The reboot is scheduled for 10:00 PM Eastern and is anticipated to only take 15 minutes to complete.
On Saturday, March 19th at 10:00 PM Eastern we will begin work on LOUIE BlackBerry to improve email delivery speed from Exchange to the handset. During maintenance, we will be off-loading BES users to new BES servers in the network. Between the hours of 10:00 PM and 1:00 AM Eastern, users may see intermittent delays in email delivery.
About 30 minutes ago Autodiscover started pushing updates of the node array list that included a previously removed node. The IP that was previously used by the removed node is now in use by another server, and when the update was pushed, some Outlook clients would have tried to connect to louiecas3 and received SSL warnings from outbound-jr.
If any users received this pop-up, you can safely tell them to click “No” on the SSL prompt, as the connection is not needed. After the user closes the dialog box, they may continue to work as normal.
We will be performing service restarts (mainly, information store) on LOUIE mailbox servers to force AD permission replication. Users may experience a momentary disconnect from Outlook or Mailbox Unavailable errors in OWA. Service is expected to be fully restored by 9:45 PM Eastern.
Update 9:27 PM Eastern: Service has been restored to customers on LOUIE. Users may continue to see temporary errors while all clients reconnect; service will return to normal speed once the reconnections settle.
Tonight starting at 11:00 PM Eastern we will be performing maintenance on the LOUIE BlackBerry server to improve overall service delivery speed. During the maintenance, users should expect to see a delay in inbound and outbound email or operation timeouts. Service is expected to be restored by 12:30 AM Eastern.
We are in the process of installing a mandatory update for BlackBerry Enterprise Server which is set to resolve recent activation issues with new handsets. Users on BES may notice a slight delay in items syncing with the server until 10:45 AM Eastern.
Update 10:46 AM Eastern: The update has been successfully installed and service for LOUIE BES is coming back online.
Around 2:15 PM Eastern we had to do a quick dismount and remount of the mailbox databases in the LOUIE network to apply changes to the replication service. Unfortunately, this was a security policy update and could not wait until after hours. Users on LOUIE may have experienced an outage of up to 5 minutes. All services are back online and login was confirmed by our staff.
Over the weekend we will be performing maintenance on EUROPE and DEWEY to resolve issues that have occurred twice this week. In short, drive space is being exhausted on both servers and preventing new messages from entering or leaving the network.
Saturday (1/29/11) at 9:00 PM Eastern we will be performing the following maintenance:
DEWEY: Moving the database path for users on DB1 from the current storage to a new storage array.
EUROPE: Moving the database path for users on DB2 from the current storage to the secondary storage array.
Both moves are anticipated to take two hours to complete. During the moves, users will be unable to access their mailboxes through any medium. Mail that arrives during the transition will remain in queue until the work is completed.
Partners are strongly urged to tell clients who require access to their mailbox during the maintenance window about LiveArchive.
We are currently going through the public folder databases and replicas to discover the server with the latest copies of all customer content. During the scan, clients will be unable to access their public folders. We anticipate that this task will take up to 2 hours.
Below is the new schedule for the maintenance that was originally scheduled for December 3rd.
Saturday December 11th:
2:00 AM Eastern: The BlackBerry BES server will be shut down until the upgrades are complete.
2:30 AM Eastern: The HUB role will be transferred to MBOX2 as the primary HUB server and delivery point for ExchangeDefender.
*Estimated* 3:00 AM Eastern: DB1 should be moved over and access to the mailboxes for users on DB1 will be restored. We will then start the move on DB2.
*Estimated* 5:00 AM Eastern: DB2 should be moved and access to all user mailboxes should be restored.
6:00 AM Eastern: The reinstallation of OS on MBOX1 will begin.
7:30 AM Eastern: The installation of Exchange 2010 will begin on MBOX1.
8:00 AM Eastern: The installation of Exchange will be complete and we will verify server settings.
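For partners planning around the timeline above, the following sketch computes the gap between consecutive milestones. The step labels are shortened paraphrases of the schedule, and each gap is simply the time until the next listed milestone, which is not necessarily how long that step itself will run.

```python
from datetime import datetime

# Milestones copied from the schedule above (all times Eastern).
SCHEDULE = [
    ("02:00", "BES server shut down"),
    ("02:30", "HUB role transferred to MBOX2"),
    ("03:00", "DB1 moved (estimated)"),
    ("05:00", "DB2 moved (estimated)"),
    ("06:00", "OS reinstall begins on MBOX1"),
    ("07:30", "Exchange 2010 install begins on MBOX1"),
    ("08:00", "Exchange install complete"),
]


def step_gaps(schedule):
    """Return (label, minutes until the next milestone) for each step."""
    times = [datetime.strptime(clock, "%H:%M") for clock, _ in schedule]
    gaps = []
    for i in range(len(schedule) - 1):
        minutes = int((times[i + 1] - times[i]).total_seconds() // 60)
        gaps.append((schedule[i][1], minutes))
    return gaps


for label, minutes in step_gaps(SCHEDULE):
    print(f"{label}: {minutes} min until next milestone")
```

The longest gap is the two-hour window set aside for the DB2 move, and the whole timeline spans six hours from the BES shutdown to the end of the Exchange installation.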
The roadmap after the installation is complete will vary based on the time of completion. To minimize customer downtime, we will begin the move back to LOUIEMBOX1 early Sunday morning.
We will provide updates in the post that will be opened when maintenance begins.
The maintenance that was scheduled for LOUIE this weekend has been postponed until next weekend (December 11th-12th). An updated timeline will be posted next week.