DEWEY Outage Report – ExchangeDefender Network Operations

November 1, 2011


Filed under: Exchange Hosting — admin @ 10:06 am

Update 11:08 PM 11/17/11

DB3 has been mounted successfully on DEWEY. We have switched all users off the temporary database and back onto the original DB3. We will now be seeding data from the temporary mailboxes into the primary mailboxes.

Update 11:10 AM 11/17/11

The integrity check on DB3 completed around 10:00 PM Eastern on 11/16/11. Upon completion we began running isinteg before mounting the database, to ensure any repaired corruption is remapped properly in the database. The isinteg check is currently 22% complete and is estimated to finish tonight. Once it completes, we will switch all users who were on DB3 back to the live DB3 and then merge mail from tempDB3 into DB3.
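For context, the repair sequence described in these updates generally corresponds to the following Exchange 2007 commands. This is a sketch, not the exact commands run; the database path is illustrative.

```powershell
# Hard-repair the dismounted database; discards any pages that cannot be
# fixed (database path is illustrative)
eseutil /p "D:\ExchData\SG1\DB3.edb"

# Check and fix logical (mailbox-level) integrity before mounting;
# typically rerun until it reports zero fixes
isinteg -s DEWEYMBOX1 -fix -test alltests
```

Both passes touch every record in the file, which is why repair time scales with database size.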

Update 9:24 AM 11/07/11

The integrity check and repair on DB2 completed early Sunday morning. After eseutil finished, we ran isinteg, which completed around 6 PM Eastern. Once we mounted DB2 and confirmed the data, we began seeding data from the temporary database back to the original user database. Unfortunately, some partners imported their previously cached data into the temporary mailbox instead of attaching it as an archive PST on the user's computer. We understand partners wanted to restore their customers to a normal state, but that was not the intention or purpose of the temporary mailbox. The restore process must now check some temporary mailboxes containing 36k+ items, which adds an extreme delay to the restore time.
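The seeding step works roughly like this in the Exchange 2007 Management Shell, assuming the temporary database has been mounted in a Recovery Storage Group. The storage group and database names here are hypothetical.

```powershell
# Merge mail from each user's temporary (dial tone) mailbox, now mounted
# in the Recovery Storage Group, back into the repaired primary mailbox
Get-Mailbox -Database "DEWEYMBOX1\SG2\DB2" |
    Restore-Mailbox -RSGDatabase "DEWEYMBOX1\RSG\TempDB2"
```

The merge compares items one by one, which is why a temporary mailbox bloated with tens of thousands of imported items slows the restore so dramatically.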

Update 9:59 AM 11/04/11

Users on DEWEY experiencing slow speeds can switch their Outlook Anywhere server to for an immediate performance improvement.



Update 1:09 PM Eastern

The dial tone migration has completed and users are now able to access their mailboxes on the temporary database.

Update 12:30 PM Eastern

We will be performing a dial tone migration to DEWEYMBOX2 for users on the affected databases. A dial tone migration allows users to reconnect to their mailbox on DEWEYMBOX2 via Outlook, OWA and ActiveSync; however, the mailbox will contain nothing other than mail from the previous day, when the outage occurred, plus any new incoming mail.
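In Exchange 2007 terms, a dial tone migration like this is typically a configuration-only rehome of the affected mailboxes onto an empty database. The storage group and database names below are hypothetical; this is a sketch, not the exact procedure used.

```powershell
# Point each affected mailbox at the empty dial tone database on
# DEWEYMBOX2 without moving any data (the original data stays in the
# offline database)
Get-Mailbox -Database "DEWEYMBOX1\SG1\DB3" |
    Move-Mailbox -ConfigurationOnly -TargetDatabase "DEWEYMBOX2\SG1\TempDB3"
```

Because only the Active Directory pointer changes, clients can reconnect immediately, but the mailbox starts out empty apart from newly arriving mail.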

Users will see the following prompt after restarting Outlook:

If the user wants to access their new mail, they should select "Use Temporary Mailbox".

After the databases are back online we will move users back to their original databases and then restore mail from the temporary mailbox.

If the user does not receive the dial tone prompt, or remains disconnected after restarting Outlook, open the user's profile settings in Outlook and select "Check Name".



Original Post:

On 10/31/11 at 3:45 PM, DEWEYMBOX1 suffered a major outage affecting the databases hosted on it. Around 3:15 PM, our staff had replaced a failed drive in the server's OS RAID array. The server began to rebuild the array, and we saw slightly increased queue sizes, to which we responded by issuing the original NOC report. Shortly after the rebuild began, the controller detected the new drive as bad and activated the global hot spare policy. Unfortunately, this action is what caused the DEWEY outage.

A few months ago, the RAID array holding the Information Store logs for DEWEYMBOX1 had a drive fail. The RAID hot spare policy activated and automatically repaired the array.

Yesterday, when the outage occurred, the global hot spare policy overrode the log array's own hot spare policy and forcefully claimed that array's drive as a spare for the DB RAID array (which carried a higher weight). Once the drive was removed from the log array, the controller faulted and the log array went offline, causing the databases to shut down dirty.

This series of events led to roughly 10 uncommitted log files being lost per database. Because each database knew there were uncommitted logs, the Information Store would not mount any database after we replayed the available logs. Unfortunately, the only way to recover was to repair the databases.

Due to the sizes of the databases, the repair is an extremely lengthy process as each record in the database gets checked for corruption.

At this point we know of roughly 30 emails across all clients (not each) that were lost because of the automated forced removal; however, these emails can be recovered from LiveArchive. Any mail that was accepted by the transport server but not yet delivered to DEWEYMBOX1 is still in queue and pending delivery once the databases come back online.
