Rebooting Scrooge – ExchangeDefender Network Operations

October 17, 2010

Filed under: Exchange Hosting — admin @ 2:05 pm

We are currently rebooting the servers in the SCROOGE network. Service should be restored momentarily.

Update 6:47 AM: We’ve identified a severe network issue and are currently replacing the affected hardware (storage array). To avoid any issues during business hours or later in the week, we’re performing an emergency hard drive swap right now.

Update 7:40 AM: We’ve replaced the hardware and are stress testing the new hardware before allowing client connections. We’re on schedule to finish within the next hour.

Update 8:30 AM: Service has been restored to SCROOGE and our stress tests were a success.

Update 9:19 AM: We had to perform an emergency reboot on the mailbox server for SCROOGE to finalize the drive sync. The mailbox server is expected to be up and fully operational after the reboot.

Update 9:45 AM: After the reboot the DB RAID array did not mount on POST. We are going to reseat all drives and reboot the server.

Update 11:30 AM: After mounting the databases, we received alerts that 10% of the mailboxes could not be loaded. We are in the process of restoring from a snapshot that was taken at 3 AM, and then we will merge any new data.

Update 12:04 PM: Because SCROOGE has been in production for so long, the integrity check is taking a long time to process all the log files. At this point there is no ETA for a resolution. Customers are urged to use LiveArchive. Alternatively, for an immediate solution, partners can ask to be moved to louie via a support request, and can export the previous mailbox if the user was in cached mode.

Update 7:20 PM: After the log replay finished, we noticed there were still CRC errors when accessing the RAID array. We are in the process of making a snapshot backup before we migrate the database to a new external RAID array. The next NOC update is scheduled for 1AM Eastern.


Update 1:24 AM: This morning we are restoring from a recent backup. As it stands now, no mail is expected to be lost. Once the restore is complete, we will start an integrity check to ensure reliability before the start of business. Unless there is a change in progress, the next update is scheduled for around 4AM.

Update 6:20 AM: The integrity check is underway. The restore job was stopped after one hour because it showed an estimated completion time beyond 9AM Eastern. We then reconfigured the attached drive setup, which allowed for a faster restore. The integrity check will provide an estimated time in about 20 minutes.

Update 6:50 AM: The integrity check is estimated to take around 2.5 hours, putting us extremely close to the desired service restore time of 9AM Eastern. We will keep this post updated as the estimate changes. We will give the integrity check until 10AM Eastern before making a decision. We do not anticipate the check failing, but if it does, we will run the day with blank mailboxes, delivering the spooled mail from yesterday. As the integrity check finishes, we will begin to migrate mail. We appreciate everyone’s patience as we work as fast as we can to restore service without introducing more instability.
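As a rough illustration of the timing in this update, the sketch below adds the 2.5 hour estimate to the 6:50 AM update time and compares the result with the 9AM target and the 10AM cutoff. It assumes the 2.5 hours are counted from the time of this update, which the post does not state. The calendar date is a placeholder, and the function and label names are hypothetical.

```python
from datetime import datetime, timedelta


def classify_eta(eta, target, cutoff):
    """Compare an estimated finish time with a target and a hard cutoff."""
    if eta <= target:
        return "on target"
    if eta <= cutoff:
        return "past target, within cutoff"
    return "past cutoff, use fallback"


# Placeholder date: only the times of day come from the update above.
day = datetime(2010, 10, 18)

# Assumption: the 2.5 hour estimate is measured from the 6:50 AM update.
estimate_posted = day.replace(hour=6, minute=50)
eta = estimate_posted + timedelta(hours=2.5)

target = day.replace(hour=9)    # desired service restore time
cutoff = day.replace(hour=10)   # decision point named in the update

status = classify_eta(eta, target, cutoff)
print(eta.strftime("%I:%M %p"), "-", status)
```

Under that assumption the check would finish slightly after the 9AM target but before the 10AM decision point.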

Update 7:42 AM: The integrity check on the smallest database completed successfully. Once the final two databases complete, we will attempt to mount all three databases. The estimated completion time is still around 10AM Eastern.

Update 9:18 AM: The integrity check has completed successfully on all the databases. We’re in the process of mounting them now. We are still on track for the previous ETA of 10AM Eastern.

Update 10:05 AM: After the integrity check completed with no errors, we attempted to mount the database and we were presented with errors. We are in the process of running a background scan, but in the interest of restoring service to clients, we’ve moved all users to a temporary database.

Users running Outlook in Cached Mode will see a new prompt when they open Outlook, warning them that their mailbox has moved and offering a choice between the new profile and the temporary profile. If users select the temporary profile, Outlook will open the cached profile (the profile from before the outage that is cached on the local machine).

After hours we are going to try a couple of different methods, but as far as we can see, all data is still intact.

Update 5:45 PM: We ran a more aggressive integrity check, which reported no errors, but the database still would not mount. We’ve restored different combinations of backups; however, since this was the first Exchange 2007 server network for OWN, its backups ran as incrementals to our secondary datacenter. Sadly, given the number of backups, it would take nearly four days to rebuild the files from backup. At this point, the quickest way to restore data is to run a local repair on the database and then seed the previous mailbox into the new, live mailbox. We began a repair on a copy of the previous database around noon Eastern, and it is currently only at 15%. We believe the repair may last into tomorrow. To be on the safe side, we are sending the backups from the secondary datacenter overnight to Dallas so we can begin rebuilding the files from backup in the background.
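For a sense of why the repair may run into the next day, the sketch below extrapolates a finish time from the two figures in this update: a start around noon Eastern and 15% progress at 5:45 PM. It assumes the repair advances at a constant rate, which database repairs often do not, so treat the result as a rough guide only. The calendar date is a placeholder and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def linear_eta(started_at, observed_at, fraction_done):
    """Extrapolate a finish time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = observed_at - started_at
    total = elapsed / fraction_done  # projected total duration
    return started_at + total


# Placeholder date: only the times of day and the 15% come from the update.
started = datetime(2010, 10, 18, 12, 0)    # "around noon Eastern"
observed = datetime(2010, 10, 18, 17, 45)  # time of this update

eta = linear_eta(started, observed, 0.15)
print("Projected total:", eta - started)
print("Projected finish:", eta.strftime("%Y-%m-%d %I:%M %p"))
```

At that pace, 5 hours 45 minutes for 15% projects to a little over 38 hours in total, so a linear estimate alone would place the finish well past the following business day.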

Update 10:17 PM: The repair has finally taken a step forward. Although it’s not a grand step, the repair is finally showing actual progress. We will gauge its speed and post an estimated resolution time around 2 AM Eastern.



Update 4:30 AM: The repair hasn’t made any new progress, but the database has a modified timestamp from seven minutes ago. Unfortunately we cannot make an estimate on time until the next bit of progress.



Update 12:01 AM Eastern: The database repair has completed and we are now proceeding with the isinteg check before attempting to mount the database.

Update 2:06 PM Eastern: The integrity check completed and we’ve successfully mounted the databases in an RSG (Recovery Storage Group). We are proceeding with seeding back mailbox content.



Update 10:48 AM Eastern: We are still continuing the seeding process. Everything has been successful thus far. Unfortunately, the seeding process does not give us a measurable progress bar. Please rest assured that users are getting their data seeded, and we will update this blog once all seeding has been completed.



Update 2:49 PM Eastern: We have completed the mailbox data seeding process for everyone on the server.
