Exchange Hosting – ExchangeDefender Network Operations

May 29, 2012

Maintenance May 29th

Filed under: Exchange Hosting — travis @ 4:31 pm

The work described below is scheduled to begin at 9:00 PM Eastern on May 29th, 2012:

  • LOUIE – LOUIEMBOX1 & LOUIEMBOX2 – Update Network Driver

Will cause a brief interruption of 15-30 seconds while the driver updates.

  • LOUIE – Update Exchange to Service Pack 2

Will not cause interruption to clients

  • ROCKERDUCK – Reseeding databases between RDMBOX1 and RDMBOX2 for failover

Will not cause interruption, but OWA users may see slight delays in accessing content (including public folders), since the replication will use the MAPI NICs instead of the replication NIC. This will run only at night so as not to flood the network during the day.

  • ROCKERDUCK – Redistributing disk layout on RDMBOX3

RDMBOX3 is one of the additional failover clusters and does not actively hold any mailbox databases.


April 23, 2012

LOUIE Public Folder Changes

Filed under: Exchange Hosting — travis @ 2:55 pm

Update 5/2/2012: The reinstallation has completed and service has resumed normal operation.

Update 5/1/2012: We are in the process of reloading the operating system on LOUIEMBOX1. As of now the only affected service is mail-enabled public folders; however, all mail will be queued until the reinstallation of Exchange has completed. We anticipate being done with all work by midnight.

Update: This has been scheduled for Friday at 3:00 PM Eastern.

In preparation for the reload of LOUIEMBOX1 we’ve moved all mailboxes hosted on it to other mailbox servers in the cluster. Last month we replicated all Public Folder content from LOUIEMBOX1 to all mailbox servers in LOUIE. The final step is to change all ExchangeDefender delivery points for LOUIE customers away from LOUIEMBOX1. This step should fall in line with our seamless upgrades and should not be noticed by the client. Unfortunately, in past experiences some public folders would not receive messages from the outside after a replica was taken offline. We believe this was a bug in Exchange that has since been resolved; however, we’d like to make all possibilities known ahead of time.

Due to the number of public folders and the amount of public folder content, we will be unable to validate mail delivery across all mail-enabled public folders during the reload.

If any clients experience mail delivery delays to their public folders, please open a support request with the email address of the mail-enabled public folder and our support team will immediately look into the issue.


February 28, 2012

LOUIE Upgrades

Filed under: Exchange Hosting — travis @ 5:16 pm

 

During the first half of March we will be performing upgrades to the LOUIE network which include adding mailbox servers, phasing out older servers, upgrading Exchange to SP2, and most importantly, DAG redesign.

On the first week (March 5th-9th 2012) we will add two new mailbox servers for LOUIE (one intended to phase out LOUIEMBOX1).

On the second week (March 12th – 16th 2012) we will create a new DAG for LOUIE and add two new mailbox databases into the DAG. Throughout the week users hosted on LOUIEMBOX1 will be moved to the new databases in the DAG. Finally once all users are moved from LOUIEMBOX1 we will begin replicating public folder content to the new mailbox servers.

All changes are intended to be transparent to users and should not interrupt service access.


February 23, 2012

Maintenance: Rockerduck and Livearchive:Los Angeles

Filed under: Exchange Hosting — travis @ 9:52 am

Tonight beginning at 10:30 PM Eastern we will be performing the following maintenance

  • Rockerduck Load Balancer
    • Increasing physical resources. Restart required.
      • Clients will be disconnected from their mailboxes for up to 5 minutes.
  • LiveArchive: Los Angeles SP2
    • Installing Service Pack 2 on Exchange 2010
      • Client access will not be affected as we are currently running LiveArchive out of Dallas.

February 20, 2012

Della Upgrade SP2 & DAG Activation

Filed under: Exchange Hosting — travis @ 10:30 am

This weekend (02/24/12 – 02/25/12 19:00 Eastern [00:00 GMT]) we will be performing SP2 upgrades to the Europe Exchange 2010 Cluster: Della. The upgrade to Exchange 2010 SP2 will be performed on all passive nodes in Della. Upon successful upgrade, clients will be moved from the active server to a passive node. This upgrade is not expected to impact customer access; however, there will be critical changes prior to the upgrade.

02/21/2012:

  • A new load balancer will be activated across the passive nodes.

  • The IP address for cas.della.exchangedefender.com will be changed to the new load balancer (expected to be 213.229.89.253).

On Friday evening users on the active node will be moved to the passive nodes. The switch over from active to passive should be transparent to users.

Unfortunately BES services may be interrupted as BES does not detect and handle upgrades seamlessly. If BES service is interrupted we will work on restoring service after SP2 has been successfully applied.

Update 3:05 AM: Exchange 2010 SP2 has successfully been applied to DELLA.


February 15, 2012

Matilda Exchange 2010 SP2 Install

Filed under: Exchange Hosting — travis @ 9:09 am

This Friday (02/18/12) beginning at 07:00 Eastern (-5 GMT) [Saturday, 23:00 NSW] we will be upgrading the Exchange 2010 cluster Matilda to SP2. Clients should expect to see minimal downtime as services are restarted. Maintenance is expected to last one hour.
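The time conversion in the announcement above can be double-checked with a short script. This is only a minimal sketch using fixed offsets: UTC-5 for Eastern is taken from the post, while UTC+11 for NSW (daylight time in February) and the helper name `convert_window` are assumptions made for illustration. The date is used exactly as written in the post.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: "-5 GMT" comes from the post; +11 for NSW is an assumption
# (New South Wales daylight time in February).
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")
NSW_DAYLIGHT = timezone(timedelta(hours=11), "AEDT")


def convert_window(start_local, duration, targets):
    """Return {zone name: (start, end)} for the window in each target zone."""
    end_local = start_local + duration
    return {
        tz.tzname(None): (start_local.astimezone(tz), end_local.astimezone(tz))
        for tz in targets
    }


# 07:00 Eastern on the date given in the post, lasting one hour.
start = datetime(2012, 2, 18, 7, 0, tzinfo=EASTERN_STANDARD)
window = convert_window(start, timedelta(hours=1), [timezone.utc, NSW_DAYLIGHT])

for name, (begin, end) in window.items():
    print(name, begin.strftime("%A %H:%M"), "to", end.strftime("%A %H:%M"))
```

With these offsets the start lands at 12:00 UTC and 23:00 in NSW on the same calendar day, matching the bracketed time in the announcement.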

Update 2/17/12 6:50 AM Eastern: We are preparing the server for installation of SP2. We are expecting to begin the installation around 7:30 AM.

Update 2/17/12 7:30 AM Eastern: We are beginning the installation of SP2 on matilda.

Update 2/18/12 8:55 AM Eastern: SP2 installation has completed. We are testing services to verify a successful installation.

Update 9:07 AM Eastern: The installation was successfully verified.


January 25, 2012

Rockerduck Maintenance

Filed under: Exchange Hosting — travis @ 10:35 am

On Friday (1/27/12 11:00 PM Eastern – 1:00 AM Eastern) and Saturday (1/28/12 11:00 PM Eastern – 2:30 AM Eastern) we will be performing maintenance on Rockerduck to wrap up new additions to the mailbox server high availability, which will disrupt service to a small population of users.

During maintenance we will be moving 5 active mailbox databases across new storage arrays to improve overall performance. Databases will be moved one by one, and only one database will lose service availability to clients at a time. During each database move, users on the respective database will be unable to access their mailbox on Rockerduck. Since we will be making architectural changes to the active mailbox database, we will be unable to activate the standby copy, as the passive and active copies must reside in the same location across all nodes.

The current time estimates include a 30 minute buffer in case of unforeseen events. During maintenance users should expect to be disconnected from their mailbox; however, clients can utilize LiveArchive during the maintenance interval.

Update 11:15 PM: We are beginning work on RDMBOX1 and moving the path of RDDB1


November 1, 2011

DEWEY Outage Report

Filed under: Exchange Hosting — travis @ 10:06 am

Update 11:08 PM 11/17/11

DB3 has been mounted successfully on DEWEY. We’ve switched all users from the temporary DB back to the original DB3. We will be seeding in data from the temporary mailboxes to the primary mailboxes.

Update 11:10 AM 11/17/11

The integrity check on DB3 completed around 10:00 PM Eastern on 11/16/11. Upon completion we began the process of running isinteg before mounting the database to ensure any fixed corruption gets remapped properly in the database. The check is currently at 22% completion and is estimated to complete tonight. Upon completion we will switch all users that were on DB3 back to the live running DB3 and we will then merge mail from the tempDB3 to DB3.

Update 9:24 AM 11/07/11

The integrity check and repair on DB2 completed early Sunday morning. After completing eseutil, we ran isinteg, which completed around 6PM Eastern. Once we mounted DB2 and confirmed data, we began to seed the data from the temporary database back to the original user database. Unfortunately, some partners imported their previous cached data into the temporary mailbox instead of attaching it as an archived PST on the user’s computer. We understand partners wanted to restore their customers to a normal state, but that wasn’t the intention or purpose of the temporary mailbox. The restore process must now check some mailboxes with 36k+ items in the temporary mailbox, which greatly delays the restore.
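To illustrate why imported cached data slows the restore, here is a toy model of the merge. It is not the actual Exchange restore process: the function `merge_temp_into_original`, the use of a message ID as the lookup key, and the item counts are all hypothetical and chosen only to show the effect.

```python
def merge_temp_into_original(original, temporary):
    """Copy items from the temporary mailbox into the original one.

    Both arguments are dicts mapping a (hypothetical) message ID to a message
    body. Returns (merged, added, checked).
    """
    merged = dict(original)
    added = 0
    checked = 0
    for message_id, body in temporary.items():
        checked += 1                      # every temporary item must be examined
        if message_id not in merged:      # skip items the original already has
            merged[message_id] = body
            added += 1
    return merged, added, checked


# The original mailbox with its historical mail.
original = {"old-%d" % i: "cached mail" for i in range(36000)}

# Case 1: the temporary mailbox only holds mail received since the outage.
clean_temp = {"new-1": "hello", "new-2": "world"}

# Case 2: cached data was also imported into the temporary mailbox.
bloated_temp = {**original, **clean_temp}

_, added_clean, checked_clean = merge_temp_into_original(original, clean_temp)
_, added_bloated, checked_bloated = merge_temp_into_original(original, bloated_temp)

print("clean:  ", added_clean, "added,", checked_clean, "checked")
print("bloated:", added_bloated, "added,", checked_bloated, "checked")
```

In both cases only the two new messages are added, but the bloated mailbox forces a check of every imported item first.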

Update 9:59 AM 11/04/11

Users on DEWEY experiencing slow speeds can switch their Outlook Anywhere server to deweycas2.dewey.exchangedefender.com for an immediate performance improvement.

 

 

Update 1:09 PM Eastern

The dial tone migration has completed and users are now able to access their mailboxes on the temporary database.

Update 12:30 PM Eastern

We will be performing a dial tone migration to DEWEYMBOX2 for users on the affected databases. A dial tone migration will allow users to reconnect to their user mailbox on DEWEYMBOX2 via Outlook, OWA and ActiveSync; however, the mailbox will contain only the mail from the previous day when the outage occurred and any new live mail.

Users will see the following prompt after restarting Outlook

If the user wants to access their new mail they’ll select “Use Temporary Mailbox”

After the databases are back online we will move users back to their original databases and then restore mail from the temporary mailbox.

If the user does not receive the dial tone prompt or they stay disconnected after restarting Outlook then open their profile settings in Outlook and select ‘Check Name’ on the user.

 

——-

Original Post:

On 10/31/11 at 3:45 PM, DEWEYMBOX1 suffered a major outage affecting the databases hosted on it. Around 3:15 PM our staff replaced a failed drive on the OS RAID for the server. As a result, the server began to rebuild the array and we saw slightly increased queue sizes, to which we responded by issuing the original NOC report. Shortly after the rebuild began, the controller detected the new drive as bad and activated the global hot spare policy. Unfortunately, this action is what caused the DEWEY outage.

A few months ago, a drive failed in the RAID array holding the Information Store logs for DEWEYMBOX1. The RAID hot spare policy activated and automatically repaired the array.

Yesterday when the outage occurred, the global hot spare policy overrode the log array’s own hot spare policy and forcibly took the drive to become a spare for the DB RAID array (as this had a higher weight). Once the drive was removed from the logs array, the controller faulted and the log array went offline, causing the databases to shut down dirty.
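The failure mode described above can be pictured with a toy model. This is not the RAID controller’s actual logic, which we cannot reproduce here; the `Drive` class, the `claim_spare` function, the drive names and the weights are all hypothetical and exist only to show how a weight-based global policy can pull a drive that another array already depends on.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Drive:
    name: str
    flagged_spare: bool              # provisioned as a hot spare at some point
    member_of: Optional[str] = None  # array currently relying on the drive


def claim_spare(drives: List[Drive], weights: Dict[str, int],
                needy: str, allow_in_use: bool) -> Optional[Drive]:
    """Pick a spare for the array named `needy`.

    With allow_in_use=True the policy may take a spare-flagged drive away
    from a lower-weight array, which mirrors the behaviour in the report.
    With allow_in_use=False only idle spares are eligible.
    """
    for drive in drives:
        if not drive.flagged_spare:
            continue
        if drive.member_of is None:
            return drive
        if allow_in_use and weights[needy] > weights[drive.member_of]:
            return drive
    return None


weights = {"DB": 10, "LOGS": 5}  # hypothetical weights; DB outranks LOGS
drives = [
    Drive("db-0", flagged_spare=False, member_of="DB"),
    Drive("log-0", flagged_spare=False, member_of="LOGS"),
    # The former hot spare that repaired the log array months earlier.
    Drive("spare-0", flagged_spare=True, member_of="LOGS"),
]

aggressive = claim_spare(drives, weights, "DB", allow_in_use=True)
conservative = claim_spare(drives, weights, "DB", allow_in_use=False)

print("weight-based policy takes:", aggressive.name if aggressive else None)
print("idle-only policy takes:   ", conservative.name if conservative else None)
```

In this model the weight-based policy takes the drive out of the log array, while a policy restricted to idle spares finds nothing to take and leaves the log array intact.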

This series of events led to roughly 10 uncommitted log files being lost per database. Because each database knows that there are uncommitted logs, the information store wouldn’t mount any databases after we replayed the available logs. Unfortunately, the only way to recover was to repair the databases.
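As a rough illustration of why the databases would not mount, consider a database that expects a contiguous run of numbered log files. The sketch below assumes the database records the first and last log generation it needs; the function names and the generation numbers are hypothetical, picked so that 10 logs are missing as in the report.

```python
def missing_generations(required_first, required_last, available):
    """Return the log generation numbers that are required but not present."""
    have = set(available)
    return [g for g in range(required_first, required_last + 1) if g not in have]


def can_mount(required_first, required_last, available):
    """A clean mount needs every required generation to be replayable."""
    return not missing_generations(required_first, required_last, available)


# Hypothetical numbers: the database expects generations 0x1A0 through 0x1B3,
# but only 0x1A0 through 0x1A9 survived on disk.
required_first, required_last = 0x1A0, 0x1B3
surviving = list(range(0x1A0, 0x1AA))

lost = missing_generations(required_first, required_last, surviving)
print("lost log files:", len(lost))
print("mountable after replay:", can_mount(required_first, required_last, surviving))
```

Replaying the surviving logs still leaves a gap at the end of the required range, which is the situation where a repair becomes the remaining option.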

Due to the sizes of the databases, the repair is an extremely lengthy process as each record in the database gets checked for corruption.

At this point we know of roughly 30 emails across all clients (not each) that were lost because of the automated forced removal; however, these emails can be recovered from LiveArchive. Any mail that wasn’t committed to the transport server and delivered to DEWEYMBOX1 is still in queue and pending delivery once the databases are activated.


August 25, 2011

ROCKERDUCK Exchange Rollup Install

Filed under: Exchange Hosting — travis @ 5:28 pm

We will be updating MBOX1 and CAS2 on the ROCKERDUCK cluster tonight starting at 9:00 PM Eastern to Exchange 2010 SP1 RU4-v2

During the upgrade, users connected to CAS2 will automatically switch over to CAS1 as we will begin to drain the server connections around 8:45 PM Eastern.

Users with mailboxes on MBOX1 will have the passive copy of their database on MBOX2 activated around 8:50 PM Eastern.

The maintenance schedule should be transparent to users and should not interrupt or disrupt any service.


August 16, 2011

ROCKERDUCK DAG Maintenance

Filed under: Exchange Hosting — travis @ 2:41 pm

We’ve added maintenance tonight for the ROCKERDUCK cluster, specifically on the DAG. During maintenance users should not experience any interruption in service, however, service interruption is possible. We will monitor connection statuses during the maintenance and will update this post if any interruption is detected.

Update 1:45 AM Eastern: We will have to stop the cluster service on all nodes which will cause a momentary interruption in service as we repair communication.

Update 2:23 AM Eastern: The DAG on Rockerduck has been repaired and is back online.


August 11, 2011

DEWEYMBOX2 Maintenance

Filed under: Exchange Hosting — travis @ 3:44 pm

Tonight we will be performing maintenance on DEWEYMBOX2 to address the performance issues reported by partners throughout the day.

Maintenance is expected to last between 10PM and 2AM.

1:46 AM EST – All maintenance on DEWEYMBOX2 has been completed successfully. All services have been restored and queued messages are flushing into user mailboxes.

Again we’d like to take this opportunity to apologize for this issue and we appreciate your patience with us throughout the process.

Update 8:05 AM: We are going to reboot the CAS servers for DEWEY to clear up any issues before the start of business. Service may be interrupted for 15 minutes.

Update 8:15 AM: The domain controllers for DEWEY have come back online from the restart. We are still waiting for the primary CAS and MBOX server to come back online.

Update 8:45 AM: The primary CAS and MBOX server came up at 8:30AM and we’ve confirmed service access for RPC, MAPI, ActiveSync and POP/IMAP


July 28, 2011

ROCKERDUCK DAG Switch

Filed under: Exchange Hosting — travis @ 11:39 am

We will be performing a Database switch on the DAG for ROCKERDUCK from RDMBOX2 to RDMBOX1. Users currently being serviced by RDMBOX2 will automatically switch to RDMBOX1. No changes are required by the end user to continue service and the switch over should be transparent to the end users. This switch is estimated to last for an hour as we investigate performance counter issues on RDMBOX2


July 27, 2011

DELLA Reboot

Filed under: Exchange Hosting — travis @ 10:48 am

At 3 PM Eastern (8 PM GMT) we will be rebooting servers in the DELLA cluster to apply Windows updates that require a restart. We anticipate that service will be slightly interrupted for up to 10 minutes while the reboot occurs.

[3:26 PM] We’ve prepared the servers for the reboot and will begin rebooting the MBOX servers shortly.

[3:41 PM] We are bringing all servers in DELLA back online.

[3:50 PM] Service has been restored to DELLA


July 21, 2011

Post Incident Report: 7/20/2011

On 7/20/2011 around 3:35 PM Eastern we started experiencing random packet loss across various services including Hosted Exchange and OWN Websites. Roughly around 3:45 PM Eastern, the random packet loss turned into a wide-spread service outage and lasted until 4:12 PM Eastern.

The incident appears to have been caused by a faulted network driver on an Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately, the internal floods prevented access to the network analytic servers behind the DMZ. Since all machines received and responded to the requests, all machines showed up as ‘flooding’ to the router and the IDS was unable to determine the ‘source’ IP.

All services were essentially taken offline when the IDS started blocking traffic from the internal hosts. After we disabled the offending machine from the network and cleared IDS we were able to resume service across the board.

The biggest area of concern was the inability to contact us while the outage was occurring, as it took down our support board and primary phone lines. We deeply apologize for the grief and trouble that this unexpected event caused and, it goes without saying, this has been the most impactful network event that we’ve experienced. We’ve implemented a new redundancy plan for our phone systems to handle global outages, as this was the first time our phone systems were completely offline during a critical event.

We appreciate everything that our partners do for us and the patience that was extended yesterday as we definitely know that it was a very stressful event for our partners and their end users. As always we will continue to bring improvements to our solution stacks and address the areas where we may fall short.


June 9, 2011

HUEY Maintenance Continued

Filed under: Exchange Hosting — travis @ 5:27 pm

Per our previous NOC posting, we’ve been redesigning our maintenance plan for rebalancing the user distribution on HUEY. The original plan to defrag the database was abandoned as the timeframe for completion was not acceptable.

Tonight starting at 9PM Eastern we will be taking the HUEY database offline for about 5 minutes as we clear out the memory cache in preparation for mailbox moves tonight. Throughout the night we will be moving users between two new databases to even out the load. During the move, mailboxes that are actively moving will be inaccessible to users as Exchange 2007 did not feature Online moves. Upon completion, users will be able to access their mailbox on the new database. Move times will depend on the mailbox size and item count.

Update 6:45 PM Eastern: After our previous update our metric test completed and we’ve noticed that there are write lock delays on the OS drive for HUEY. We’ve made an adjustment to our above outline. Prior to starting the mailbox moves we will be performing a full database backup at the NTFS file level. Unfortunately, this means we will have to take the database offline, as we are capturing a raw file backup instead of a VSS backup. After the backup is completed we will run a surface error scan on the OS drive for HUEY to check for any corruption. We anticipate this entire process will take up to 4 hours to complete. We will update this post as progress is made starting at 9PM when work begins.

Update 8:30 PM Eastern: The backup job of the OS is taking a bit longer than expected. We are pushing back dismounting the database to 9:30 PM. We will update this blog after 9:00 PM if we anticipate the backup taking longer.

Update 8:50 PM Eastern: We’ve received a request from a few west coast customers asking us to postpone maintenance until 10. In the interest of disturbing service as little as possible, we will be postponing maintenance until 10 PM Eastern.

Update 10:45 PM Eastern: The backup is estimated to complete in the next hour. We will then begin the surface error test on the OS drive. This is estimated to be the longest part of the process and will require a disruption of service as we take the server offline. We estimate the entire process to be 4 hours as described earlier. We will update this post once the work begins.

Update 11:30 PM Eastern: We are beginning to dismount the mailbox databases and stop Exchange services.

Friday 6/10/11

Update 2:05 AM Eastern: The surface area test has revealed issues on the OS drive. We are running a repair on the drive and monitoring the progress.

Update 4:17 AM Eastern: We’ve replaced a bad drive in HUEY on the OS drive and we are proceeding to perform integrity checks before turning on any services.

Update 5:47 AM Eastern: The integrity check failed and we will be restoring from the backup image taken prior to maintenance. We will continue to update this blog as progress is made.

To clarify, the issue is specifically with the operating system and not the database integrity.

Update 9:44 AM Eastern: The restoration process is proceeding as planned. This is a courtesy update to assure partners that work is continuing.

Update 12:20 PM Eastern: The restoration surface test is underway and we are looking to confirm data consistency on the OS drive.

Update 1:57 PM Eastern: In order to achieve resolution in the fastest manner possible, we are beginning to concurrently restore the backup image on a spare server to eliminate any potential issues that may be affecting the physical host.

Update 7:00 PM Eastern: The integrity check has processed half of the files on the OS drive and overall progress is about 25 percent complete

Update 9:00 PM Eastern: This is a courtesy update as the process above is still continuing successfully without any halts. We understand this is an urgent issue and we appreciate your patience with this process.

Saturday 6/11/11

Update 1:30 AM: The integrity check has processed about 90% of the files on the drive and overall progress is near 75% completed.

Update 2:35 AM: The integrity check has completed and we’ve successfully booted Windows into safe mode. We are now proceeding to boot normally and resume services on HUEY.

Update 3:15 AM: Service on HUEY has been restored and all queued mail is being delivered to user mailboxes.


June 8, 2011

HUEY Maintenance tonight

Filed under: Exchange Hosting — travis @ 12:46 pm

Tonight starting at 9:00 PM Eastern we will be taking the mailbox databases on HUEY offline to perform an offline defragmentation. We anticipate the scan will take up to 4 hours, which will leave mailbox access offline until the database is remounted. Clients are able to utilize LiveArchive during the maintenance schedule to continue working with live mail.

Update 8:35 PM Eastern: We will begin work in 30 minutes, starting with dismounting the database and copying it to a temporary storage drive, and then starting the offline defrag. After the defrag completes, we will mount the database from the temporary location and stress test the integrity. After we’ve assured integrity we will copy the database back to the active RAID controller.

We estimate that each step may take up to 2 hours to complete, but we will update this post along the way.

Update 9:10 PM Eastern: We are pushing back maintenance one hour as we rearrange the temporary storage iSCSI server to increase overall speed, expecting to lower the overall time.

Update 10:09 PM Eastern: We will begin the above outlined process in 5 minutes.

Update 11:00 PM Eastern: Based on current progress, we do not feel that we have enough time allocated for this process even with our earlier changes. We’ve remounted the current mailbox database. We are formulating a new plan to bring a solution that will run in parallel.


HUEY Memory replacement

Filed under: Exchange Hosting — travis @ 1:16 am

We’ve received alerts on our monitor software about faults in the memory in HUEY. We’ve dismounted the database to avoid any corruption as we test the memory and replace if needed. We currently do not have an ETA for service restoration, but we will update this blog as information is obtained.

Update 2:50 AM Eastern: Service was restored at 2:30 AM and all services have been confirmed online.


May 26, 2011

LOUIE SP1 RU3 Install

Filed under: Exchange Hosting — travis @ 7:46 pm

We will be installing Exchange 2010 SP1 RU3 on the LOUIE cluster tonight beginning at 9PM Eastern.

During the upgrade on the CAS servers, users may see a brief disconnection while individual CAS nodes are upgraded, however, they should automatically connect to another available CAS server.

During the upgrade on the Mailbox servers, users may see a brief disconnection from public folders while individual Mailbox nodes are upgraded, however, they should automatically connect to the next available Public Folder Replica.

During the upgrade on the HUB servers, users may see a brief delay for incoming mail, however, outgoing mail should not be affected.

This upgrade will allow us to seamlessly move mailboxes from LOUIE to our newest Exchange Cluster if partners would like to move.

Update 9:10 PM Eastern: We are beginning to upgrade the CAS servers in LOUIE

Update 10:15 PM Eastern: We have completed the upgrade to the CAS servers and will now begin on the HUB and Mailbox servers.

Update 11:04 PM Eastern: Service has been restored to the LOUIE network and we’ve confirmed access to our newest Exchange cluster.


May 11, 2011

LOUIE Maintenance

Filed under: Exchange Hosting — travis @ 4:46 pm

On 5/19/11 at 10 PM Eastern we will be performing maintenance on the LOUIE Exchange 2010 network as we physically move the servers into a new cabinet. During the maintenance period, users will briefly lose connection and access to their mailboxes, however, service is expected to be restored shortly thereafter.

Prior to maintenance we will update our NOC blog with a more refined timeline.


May 2, 2011

LOUIEMBOX2 High Page Count

Filed under: Exchange Hosting — travis @ 6:26 pm

Our monitoring software has alerted us to an abnormally high page count on LOUIEMBOX2. We were able to dehydrate a few processes to alleviate the pressure, however, we will need to restart the storage device for users on DB3.

At 10:00 PM Eastern we will be restarting the storage device for DB3 on LOUIEMBOX2. During the restart, users will be unable to access their mailboxes for up to 10 minutes. We expect service to be completely restored by 10:15 PM Eastern.

Update 10:00 PM Eastern: We are beginning the process of dismounting DB3 and restarting the storage array.

Update 10:10 PM Eastern: Service has been restored to users on DB3

