Users on RDDB1 are unable to access their mailbox data because the mailbox database is currently offline. We are performing a surface check on the database because its health status indicates minor corruption. We cannot activate the passive copies of the database until the surface check completes: we do not know when the corruption occurred, and it could have been copied to the passive copies. We anticipate that the database will be offline until at least 3 PM Eastern. Only 9% of our user base is affected, and we are working diligently to restore access to customer mailboxes on RDDB1.
Update 3:06 PM Eastern: We are running two processes in parallel to restore access as soon as possible: (1) repairing the live database, and (2) restoring the last backup (7/16/12) and then replaying the committed data into the restored database to bring it up to date. We will go with whichever process completes first, as both will bring the database back online in a healthy state. The restored backup is currently copying log files and should complete around 5 PM. The live database repair is at 50% of step 1 of 5, and its ETA is unknown.
Update 4:13 PM Eastern: We’ve successfully mounted RDDB1 on ROCKERDUCK and confirmed user access.
Unfortunately, this could not have been avoided by any of the redundancy we have in place, as this was software-level corruption, which a DAG does not protect against. Essentially, a DAG protects against a mailbox node becoming unavailable and against networking issues, but when a corrupted log file gets committed, it is then distributed to the passive nodes. Since we couldn’t confirm whether the corrupted file had been distributed to specific nodes by the time the corruption was caught, we felt it wasn’t safe to remount the databases with the lingering possibility of encountering a more serious issue in the near future.
We are planning to introduce a lagged mailbox database copy, which deliberately delays (‘lags’) committing log files to that passive copy. With a lagged copy in place, if we ever experience corruption again, we can restore instantly to the previous day’s state, as the lagged copy will not yet have committed the corrupted data.
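To illustrate why a lagged copy helps, here is a toy Python model of the behaviour described above. It is only a sketch and not how the mail server itself is implemented: the class and function names (LogFile, DatabaseCopy, simulate) are hypothetical, and the one-log-per-hour rate and the 24-hour lag are assumptions chosen to match "the previous day". Both copies receive every log file right away, but each copy only commits a log once that log is older than the copy's replay lag.

```python
from dataclasses import dataclass, field


@dataclass
class LogFile:
    seq: int
    created_hour: int
    corrupted: bool = False


@dataclass
class DatabaseCopy:
    """Hypothetical model of a passive database copy."""
    name: str
    replay_lag_hours: int = 0  # 0 = regular passive copy, >0 = lagged copy
    received: list = field(default_factory=list)   # shipped, not yet committed
    committed: list = field(default_factory=list)  # replayed into the copy

    def receive(self, log: LogFile) -> None:
        # Log shipping is immediate for every copy.
        self.received.append(log)

    def replay(self, now_hour: int) -> None:
        # Commit only the logs that are at least replay_lag_hours old.
        waiting = []
        for log in self.received:
            if now_hour - log.created_hour >= self.replay_lag_hours:
                self.committed.append(log)
            else:
                waiting.append(log)
        self.received = waiting

    def is_clean(self) -> bool:
        return not any(log.corrupted for log in self.committed)


def simulate(total_hours: int = 30, corrupt_at: int = 28, lag_hours: int = 24):
    """Generate one log per hour (assumed rate); one of them is corrupted."""
    passive = DatabaseCopy("passive")
    lagged = DatabaseCopy("lagged", replay_lag_hours=lag_hours)
    for hour in range(total_hours):
        log = LogFile(seq=hour, created_hour=hour, corrupted=(hour == corrupt_at))
        for copy in (passive, lagged):
            copy.receive(log)
            copy.replay(now_hour=hour)
    return passive, lagged


passive, lagged = simulate()
for copy in (passive, lagged):
    print(copy.name, "committed:", len(copy.committed), "clean:", copy.is_clean())
```

In this model the regular passive copy commits the corrupted log almost as soon as it is written, while the lagged copy is still holding it uncommitted when the corruption is noticed, so it can be activated from a known-good point. In Exchange itself the delay is configured as a replay lag on the database copy rather than coded by hand.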