Learning from Experience…
The past two weeks have been stressful and unpleasant for our partners and their end users on DEWEY. The recent outage left 17% of users without full access to their mailboxes for two weeks. Not only was this our longest outage with any service, it was also one of the bumpiest recoveries we’ve ever experienced. Our partners and their end users could compare almost every step of the recovery to pulling teeth; every step of the way, we let them down.
We’re extremely sorry…this was (and still is) extremely unfortunate, but we have learned a lot from this experience.
Here’s a quick breakdown of the experience:
10/31/11 – Users on two databases lose connection to Exchange.
11/01/11 – Users are switched to dial tone recovery mode. Users with mobile phones lose contacts and calendar events on next sync.
11/06/11 – Users on DB2 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.
11/07/11 – Mailbox data for users on DB2 is fully restored; new and old data merged.
11/17/11 – Users on DB3 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.
11/19/11 – Mailbox data for users on DB3 is fully restored; new and old data merged.
In speaking with partners at every step, we heard about every issue end users experienced; the biggest was dial tone recovery. Once dial tone is activated, any user with an ActiveSync-based connection loses all Exchange contacts and calendar items on the next sync.
When you break down our responsibilities and duties to our clients, at the very minimum we need to provide a live, running service as quickly as possible. Technology and software unfortunately have issues and can break, but as long as we can minimize the direct impact on end users, we can generally get through issues without upsetting a lot of partners. By switching users to the temporary dial tone mailbox, and thereby wiping the contacts and calendar data cached on their mobile phones, we put a lot of stress on our partners.
We are considering a disaster recovery policy control for partners.
The policy controls will give partners control over how we treat each mailbox during an outage. Giving partners the ability to control how we respond will greatly improve the overall experience during an outage and will allow the partner to set direct expectations with their clients.
For instance, say partner ABC123 Computers has a client, Big Electric Company, with 10 mailboxes.
The partner wants to mark 3 users as “Do not activate dial tone” as these users are mobile and depend on their contacts and calendars and they cannot afford to lose them once dial tone is activated.
The partner then marks the CEO and CFO as “Do not reimport data after outage” as the partner plans to reimport cached data because the CFO and CEO cannot work on dial tone mail alone.
By allowing the partner to directly control our recovery process for mailboxes, the partner will then be able to set direct SLA expectations for their end users during the outage.
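To make the example concrete, here is a minimal sketch of how such per-mailbox settings might be represented. Nothing here reflects an actual ExchangeDefender feature or API: the class name, field names, and mailbox addresses are all hypothetical and only mirror the two options described above.

```python
from dataclasses import dataclass


@dataclass
class MailboxPolicy:
    """Hypothetical per-mailbox outage settings (illustrative names only)."""
    mailbox: str
    activate_dial_tone: bool = True      # False = "Do not activate dial tone"
    reimport_after_outage: bool = True   # False = "Do not reimport data after outage"


def build_policies(mailboxes, no_dial_tone=(), no_reimport=()):
    """Build a policy for every mailbox, applying the partner's opt-outs."""
    return {
        m: MailboxPolicy(
            mailbox=m,
            activate_dial_tone=m not in no_dial_tone,
            reimport_after_outage=m not in no_reimport,
        )
        for m in mailboxes
    }


# Big Electric Company: 10 mailboxes (made-up addresses).
mailboxes = [f"user{i}@bigelectric.example" for i in range(1, 9)]
mailboxes += ["ceo@bigelectric.example", "cfo@bigelectric.example"]

policies = build_policies(
    mailboxes,
    # Three mobile users who cannot afford to lose contacts and calendars.
    no_dial_tone=mailboxes[:3],
    # The partner will reimport cached data for the CEO and CFO themselves.
    no_reimport=["ceo@bigelectric.example", "cfo@bigelectric.example"],
)
```

Every mailbox defaults to today's behavior (dial tone on, data reimported), so a partner who sets nothing would see no change.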
Plans for the future…
In the event that we experience another catastrophic failure (anticipated 6+ hours of downtime), we will wait at least one hour before activating dial tone recovery for mailboxes that are not opted out.
During the first hour, we will reach out by telephone to all partners whose mailboxes are not opted out of dial tone recovery to make them aware of the expected experience for end users. If the partner does not wish to activate dial tone recovery, we will set the mailbox option in service manager to opt out of dial tone recovery. Additionally, partners can ask (via support ticket) that dial tone activation be postponed for a few hours if they want to advise end users to disable ActiveSync first.
Once the original (or backup) database is back online, we will once again reach out to all partners who have not opted out of mailbox data restore to let them know of the expected experience for end users. If the partner wishes to opt the mailbox out of data restore, then the mailbox will remain on the dial tone database until the recovery of data for all users on the affected database is completed.
If you think the outage policy control would be a beneficial add-on to our Hosted Exchange service, please let me know at email@example.com.
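The plan above can be sketched as a small decision routine. This is only an illustration of the rules as described in this post, not our actual tooling: the function names, the policy dictionary shape, and the `postponed_until` parameter are hypothetical, and returning nothing for shorter outages is an assumption, since this plan only covers catastrophic failures.

```python
from datetime import datetime, timedelta

CATASTROPHIC_HOURS = 6                # "anticipated 6+ hours of downtime"
DIAL_TONE_WAIT = timedelta(hours=1)   # minimum wait before activating dial tone


def dial_tone_targets(policies, outage_start, now, anticipated_hours,
                      postponed_until=None):
    """Return the mailboxes that should be switched to dial tone at `now`.

    policies: dict of mailbox -> {"dial_tone": bool, "restore": bool}
              (hypothetical shape; False means the partner opted out).
    postponed_until: optional dict of mailbox -> datetime for delays
              requested via support ticket.
    """
    postponed_until = postponed_until or {}
    if anticipated_hours < CATASTROPHIC_HOURS:
        return []  # assumption: shorter outages are outside this plan
    if now - outage_start < DIAL_TONE_WAIT:
        return []  # still inside the first hour, partners are being called
    return [
        m for m, p in policies.items()
        if p["dial_tone"] and now >= postponed_until.get(m, outage_start)
    ]


def restore_targets(policies):
    """Return the mailboxes whose old data is merged back after the outage."""
    return [m for m, p in policies.items() if p["restore"]]


# Usage with made-up mailboxes: "b" opted out of dial tone, "c" of restore.
start = datetime(2011, 10, 31, 9, 0)
example = {
    "a": {"dial_tone": True, "restore": True},
    "b": {"dial_tone": False, "restore": True},
    "c": {"dial_tone": True, "restore": False},
}
after_one_hour = dial_tone_targets(example, start, start + timedelta(hours=1), 8)
```

Mailboxes left out of `restore_targets` simply stay on the dial tone database until recovery for the whole affected database is complete.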
VP Network Operations, ExchangeDefender
(877) 546-0316 x757