Support – ExchangeDefender Blog

December 14, 2011

If 2 is good, 50 is better, (Or is it)?

Filed under: Software,Support — vlad @ 12:24 pm

Over the weekend (12/09/11 – 12/10/11) we performed critical, preemptive upgrades for Rockerduck. During our upgrade cycle we were able to increase memory resources for Mailbox servers, rebalance resource distribution on Client Access servers and add additional Mailbox servers for quorum retention and additional high availability.

Mailbox, mailbox, mailbox…

By utilizing the current mailbox server layout, we were able to increase memory in the Rockerduck mailbox servers in a staggered pattern without disrupting service to clients on Rockerduck. As each mailbox server was prepared for the upgrade, we moved all active mailboxes from that server to a passive mailbox node and then blocked the server from activating any database copy. After each memory upgrade was completed, we ran a memory stress test on the server for 8 hours to check for consistency. Once a node passed, we brought it back into the DAG and back up to availability.
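The staggered procedure can be sketched as a loop over the mailbox servers. This is a simplified model for illustration, not the tooling used on Rockerduck: the dictionary layout, the server names and the `upgrade` callback are all hypothetical.

```python
def rolling_upgrade(servers, upgrade):
    """Upgrade one mailbox server at a time while the others keep serving.

    servers: dict of server name -> {"active": set of DB names, "blocked": bool}
    upgrade: callable invoked with the server name once it holds no active DBs
    """
    for name in list(servers):
        node = servers[name]
        # Pick any other node that is allowed to host active copies.
        target = next(n for n in servers if n != name and not servers[n]["blocked"])
        # 1. Move every active database to the other node.
        servers[target]["active"] |= node["active"]
        node["active"] = set()
        # 2. Block the node from activating any database copy.
        node["blocked"] = True
        # 3. Do the work (memory upgrade plus stress test) while the node is idle.
        upgrade(name)
        # 4. Bring the node back into rotation.
        node["blocked"] = False
    return servers
```

At every point in the loop each database is active on exactly one unblocked node, which is why clients see no interruption.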

Labs vs. Real World Results

Mailbox servers were not the only servers in Rockerduck to be upgraded. Over the past two weeks we’ve been monitoring the response statistics on CAS servers with a new memory / processor configuration.

Originally, when we performed the initial testing and scaling of Rockerduck, we saw the lowest overall latency and response time for RPC and Web Services with fewer CAS servers that had more RAM and processor power. Over time, we noticed that real-world RPC latency was significantly outside the scope of our original lab results, causing us to reevaluate our delivery of CAS services.

All CAS servers for Rockerduck sit behind a hardware-based load balancer. Each client that connects to the load balancer gets assigned to a specific CAS node for up to 5 hours on certain services (RPC, EWS) based on the client WAN IP. The original design for the CAS nodes was 3 nodes with 8GB of RAM and 4 processor cores available.

 


Unfortunately, this “least connected” model had the potential to (and sometimes did) tie larger groups of users from different IP addresses to the same node, essentially choking that server with queued requests.
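A small sketch of how this kind of persistence can pile users onto one node. It assumes the balancer pins each WAN IP for 5 hours and places new IPs on the node with the fewest pinned IPs; the class, node names and addresses are hypothetical, and the real load balancer may count connections differently.

```python
PERSISTENCE_SECONDS = 5 * 60 * 60  # "up to 5 hours" from the post


class AffinityBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.pins = {}  # client WAN IP -> (node, expiry timestamp)

    def route(self, client_ip, now):
        """Return the CAS node for this client IP at time `now` (seconds)."""
        pin = self.pins.get(client_ip)
        if pin and pin[1] > now:
            return pin[0]  # still pinned: same node regardless of its load
        # New or expired client: choose the node with the fewest pinned IPs.
        counts = {n: 0 for n in self.nodes}
        for node, expiry in self.pins.values():
            if expiry > now:
                counts[node] += 1
        node = min(self.nodes, key=lambda n: counts[n])
        self.pins[client_ip] = (node, now + PERSISTENCE_SECONDS)
        return node
```

The balancer only sees IP addresses, not how many users sit behind each one. If a 200-user office and a 150-user office both land on the same node while two 5-user offices take the other nodes, the pinned-IP counts look even but one server carries nearly all of the requests.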

 


The new setup for the CAS nodes is a balance of 6GB of RAM with 3 Processor cores available. This new configuration allowed us to introduce two new CAS servers to more efficiently process requests across multiple nodes without any additional “upgrades” to the CAS roles.

During our statistical collection phase, the new configuration nodes had a 40% reduction in response time on RPC requests and Address Book requests:

Originally: 22 ms

Now: 13.2 ms
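As a quick check, the two measurements are consistent with the stated 40% figure. The function name is just for illustration.

```python
def percent_reduction(before_ms, after_ms):
    """Relative improvement of `after_ms` over `before_ms`, as a fraction."""
    return (before_ms - after_ms) / before_ms


print(f"{percent_reduction(22.0, 13.2):.0%}")  # prints 40%
```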

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com


December 7, 2011

The Database Availability Group is Supposed to be Completely Fault Tolerant…

Filed under: Development,Support — vlad @ 11:08 am

Earlier this week we created a NOC entry/notification for partners about a maintenance interval we scheduled for ROCKERDUCK. The entry outlined an issue where one DB (DB7) was running on the logs drive instead of the DB drive, along with our proposed outline of the work to be completed. Unfortunately, because the issue affects all database copies, correcting the issue would involve reducing DB7 to a single mailbox server, moving the database, which would take DB7 offline, and then re-seeding the copies to all passive nodes.

Shortly after posting the NOC entry I received an email from a partner demanding that I explain to them why the Database Availability Group (DAG) could not prevent service interruption for users on DB7.

So why does the DAG not protect from every single event possible?

Simply put, all servers in a DAG must use identical storage locations for databases and logs. In a DAG, only one copy of a mailbox database can act as the “active” copy; all copies on other nodes are purely “database copies” that can be switched to become the active/primary database.

In the case of moving the database path, we cannot switch the current active database over to a passive node, move the DB, then switch it back to the original primary as this would break the DAG and we would then have split copies of ‘active’ data. We cannot use passive copies to keep service active while we physically modify the database properties/layout of the ‘active’ copy.

If this were a case where a database experienced a failure on the active copy or there was a network communication issue, the DAG would mount the passive copy of the database and continue providing service to users.

So all this jibber-jabber means what?

In short, we would remove all copies of DB7 across all nodes except the primary node. After all copies are removed, we would start the move of DB7 to the proper location and then remount the database. By calculation of the DB size, service would be interrupted for about 10-15 minutes. Finally, after the move completes we would re-add the database copy across each node and then bring service back into full redundancy. A fifteen minute outage is unfortunately a necessary evil to provide an overall more redundant solution to our partners and their clients.
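The sequence can be laid out as a rough state model that tracks which nodes hold a copy and whether the database is mounted at each step. The step order follows the description in this post; the function and node names are hypothetical.

```python
def relocate_database(copies, active_node):
    """Yield (step, copy_holders, mounted) for a path move on a replicated DB."""
    passive = [n for n in copies if n != active_node]
    # 1. Remove every passive copy so only the active node holds the database.
    yield "remove passive copies", [active_node], True
    # 2. Dismount and move the files; users are offline during this step.
    yield "move database files", [active_node], False
    # 3. Remount on the same node at the corrected path.
    yield "remount", [active_node], True
    # 4. Re-seed the copies to restore full redundancy.
    yield "re-seed copies", [active_node] + passive, True


for step, holders, mounted in relocate_database(["MBX1", "MBX2", "MBX3"], "MBX1"):
    print(f"{step}: copies on {holders}, mounted={mounted}")
```

During the one step where the database is dismounted there is only a single copy left, so there is nothing for the DAG to fail over to.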

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com


December 5, 2011

LiveArchive: Why? Where? How?

Filed under: ExchangeDefender,Support — Carlos @ 10:28 am

I’m going to address an age-old question from folks that do not like to read our feature pages, in hopes that you read this blog. As part of DR (Disaster Recovery) we have two primary items that can help during and after an outage. This post will help educate your teams on how things work, so that your expectations, as well as your clients’, are managed to the correct level.

During an outage


During an outage, the surefire way to get access is to type https://livearchive.exchangedefender.com into your browser. Regardless of which cluster is live (Dallas or Los Angeles), your clients can get to it. A best practice is having a shortcut ready for your clients on their Desktop or Start Menu. If I had a penny for each time that someone’s server caught fire and it was only at that juncture that a tech asked “How do I get to LiveArchive?”... At that point you are already putting yourself in front of the barrel. If you don’t have a solution in hand and you have to “call someone else”, that is the point at which your client’s confidence starts eroding.

Where is LiveArchive?

LiveArchive is located at https://livearchive.exchangedefender.com

What are my LiveArchive credentials?

Your LiveArchive credentials are the same as your ExchangeDefender credentials: your email address and your ExchangeDefender password. Remember, if you forgot this password and your email is down, your best bet in an outage scenario is to open a ticket for your client in our portal and request their passwords. Sadly, folks often try their email passwords and assume that something is wrong (see above: more erosion). The key to all of this is to get the right answer on the first try.

So let’s move forward. Either you knew everything above upfront and only had to deal with your end users once, or you had to go back and forth a few times to get it hammered out. Regardless, your clients now have access to all of their internet mail, and your hard job starts. Get the defibrillator and resurrect their Exchange server; obviously this can range from a simple reboot to a week-long, painstaking process. One thing you have in the back of your mind is, thank goodness ExchangeDefender is holding all of my mail. The most important thing to remember while you and your team are doing your best to perform thoracic surgery on the server: make sure the server is offline!!

Here’s why: by RFC rules we can only hold mail that is being deferred by your server. If your server is online and “REJECTING” mail due to bad configuration or your troubleshooting, all that mail is purged because your client’s server is telling our software that this is a permanent rejection. This is the biggest key in the process. Luckily this doesn’t happen often, but there are teams that will have the server permanently rejecting mail for a week and then ask for their mail. And even though this is digging yourself a grave, we MAY still be able to help you.
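The deferred versus rejected distinction maps onto SMTP reply code classes: 4xx replies are transient failures and 5xx replies are permanent ones. A minimal sketch of the decision a relay makes, with a hypothetical function name and return labels (what ExchangeDefender actually does internally is not shown in this post):

```python
def handle_delivery_reply(code):
    """Decide what a relay does with a message given the SMTP reply code.

    `code` is None when the destination server could not be reached at all.
    """
    if code is None:
        return "spool"      # server offline: nothing was refused, so keep retrying
    if 200 <= code < 300:
        return "delivered"
    if 400 <= code < 500:
        return "spool"      # transient failure: keep the message and retry later
    if 500 <= code < 600:
        return "drop"       # permanent failure: the server refused the message
    raise ValueError(f"unexpected SMTP reply code: {code}")
```

This is why an offline server is safe while a misconfigured online server is not: no answer or a 4xx keeps the mail in the spool, but a 5xx tells the sending side to stop trying.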

First off, our mail “Spooling” or “Bagging” service is in place for up to 7 days. The way it works is, after the initial real-time attempt to deliver your mail, your mail is moved to a retry queue. In an effort to not hammer client servers, this queue reattempts delivery from each node every 20 minutes or so, staggered. This process is fully automated and constantly running; you don’t have to call us or open a ticket saying, “Our server is up, release our mail”. If your server really is up and accepting mail from our servers, your mail will start to flow on its own, but it can take up to a couple of hours for all of your mail to deliver depending on your queued volume. Again, we don’t want to pound your client’s server into submission and cause it to trigger the Exchange backpressure mechanism.

Now, if you made the unfortunate mistake of bringing a server back online after a rebuild without the proper IP restrictions and anonymous delivery settings, and all of your spool was lost, there is still one possibility. If the mail is in LiveArchive, due to our hub and transport design you can actually forward all that mail to your individual client’s mailboxes one by one. This is a fully manual process that is pretty time consuming, but when faced with the choice of telling a client you lost all their mail for the past x number of days or telling them you need a couple more hours to make them whole, the choice becomes easy.
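To get a feel for the retry schedule, here is a sketch that takes the 7 day window and the 20 minute interval at face value. The even per-node offset is an assumption made for illustration; the post only says the retries are staggered, and the real interval is approximate.

```python
from datetime import datetime, timedelta

SPOOL_LIFETIME = timedelta(days=7)      # "up to 7 days" from the post
RETRY_INTERVAL = timedelta(minutes=20)  # "every 20 minutes or so"


def retry_times(first_failure, node_index=0, node_count=1):
    """Yield retry timestamps for one node until the spool window closes."""
    # Offset each node so the nodes do not all retry at the same moment.
    offset = RETRY_INTERVAL * node_index / node_count
    deadline = first_failure + SPOOL_LIFETIME
    attempt = first_failure + RETRY_INTERVAL + offset
    while attempt <= deadline:
        yield attempt
        attempt += RETRY_INTERVAL
```

Under these assumptions a single node makes 504 attempts over the week (3 per hour for 168 hours), so a server that comes back online is normally picked up within about 20 minutes without anyone opening a ticket.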

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737


November 28, 2011

Learning from Experience…

Filed under: Hosted Services,Support — vlad @ 10:09 am

The past two weeks have been rather stressful and unpleasant for our partners and their end users on DEWEY due to the recent outage that left 17% of the users without full access to their entire mailbox for two weeks. Not only has this been our longest outage with any service, but it also has been one of the most “bumpy road” recoveries we’ve ever experienced. Our partners and their end users could compare almost every step of the recovery to pulling teeth; every step of the way, we let them down.

We’re extremely sorry…this was (and still is) extremely unfortunate, but we have learned a lot from this experience.

Here’s a quick breakdown of the experience:

10/31/11 – Users on two databases both lost connection to Exchange.

11/01/11 – Users are switched to dial tone recovery mode. Users with mobile phones lose contacts and calendar events on next sync.

11/06/11 – Users on DB2 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.

11/07/11 – Mailbox data for users on DB2 is fully restored; new and old data merged.

11/17/11 – Users on DB3 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.

11/19/11 – Mailbox data for users on DB3 is fully restored; new and old data merged.

In speaking to partners along every step of the way we heard every issue experienced by end users with the biggest issue being the dial tone recovery. During dial tone recovery any users with ActiveSync based connections will lose all Exchange contacts and calendar items on next sync after dial tone is activated.

When you break down our responsibilities and duties to our clients, at the very minimum, we need to provide a live running service as quickly as possible. Technology and software unfortunately have issues and can break, but as long as we can minimize the direct impact faced by end users we can generally get through issues without upsetting a lot of partners. By erasing all cached access by mobile phones with the temporary mailbox, we put a lot of stress on our partners.

We are considering a disaster recovery policy control for partners.

The policy controls will give partners control of how we treat each mailbox during an outage. Giving partners the ability to control how we respond will greatly improve the overall experience during an outage and will allow the partner to set direct expectations with their clients.

For instance, say partner ABC123 Computers has a client, Big Electric Company, with 10 mailboxes.

The partner wants to mark 3 users as “Do not activate dial tone” as these users are mobile and depend on their contacts and calendars and they cannot afford to lose them once dial tone is activated.

The partner then marks the CEO and CFO as “Do not reimport data after outage” as the partner plans to reimport cached data because the CFO and CEO cannot work on dial tone mail alone.

By allowing the partner to directly control our recovery process for mailboxes, the partner will then be able to set direct SLA expectations for their end users during the outage.
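The proposed controls amount to two per-mailbox flags. A minimal sketch of the Big Electric Company example, assuming the three mobile users and the two executives are different people; the class, field names and addresses are made up for illustration and are not an existing ExchangeDefender interface.

```python
from dataclasses import dataclass


@dataclass
class MailboxPolicy:
    address: str
    activate_dial_tone: bool = True     # False = "Do not activate dial tone"
    reimport_after_outage: bool = True  # False = "Do not reimport data after outage"


def dial_tone_targets(policies):
    """Mailboxes that get a temporary dial tone mailbox during an outage."""
    return [p.address for p in policies if p.activate_dial_tone]


def reimport_targets(policies):
    """Mailboxes whose old data is merged back once the database returns."""
    return [p.address for p in policies if p.reimport_after_outage]


# Big Electric Company: 10 mailboxes (addresses are made up).
policies = [MailboxPolicy(f"user{i}@bigelectric.example") for i in range(1, 11)]
for p in policies[:3]:   # three mobile users
    p.activate_dial_tone = False
for p in policies[3:5]:  # CEO and CFO, assumed distinct from the mobile users
    p.reimport_after_outage = False

print(len(dial_tone_targets(policies)), len(reimport_targets(policies)))  # prints 7 8
```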

Plans for the future…

In the event that we experience another catastrophic failure (anticipated 6+ hours of downtime), we will wait at least one hour before activating dial tone recovery for mailboxes that are not opted out.

During the first hour, we will reach out by telephone to all partners whose mailboxes are not opted out from dial tone recovery to make them aware of the expected experience for end users. If the partner wishes to not activate dial tone recovery, we will activate the mailbox option in service manager to opt out of dial tone recovery. Additionally, partners could ask (via support ticket) that dial tone activation be postponed for a few hours if the partner wants to advise the end user to disable ActiveSync.

Once the original (or backup) database is back online we will once again reach out to all partners who have not opted out of mailbox data restore to let them know of the expected experience for end users. If the partner wishes to opt the mailbox out of data restore then the mailbox will remain on the dial tone database until the recovery of data for all users on the affected database is completed.

If you think the outage policy control would be a beneficial add-on to our Hosted Exchange service, please let me know at travis@exchangedefender.com.

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com


November 22, 2011

Trusting Senders and Whitelisting

Filed under: General,Support — Carlos @ 1:40 pm

In the coming weeks I’m going to be taking a “Did you know?” approach to my blog, as I keep running across features we offer that surprise folks and that would have saved them time had they known about them. These items are part of ExchangeDefender that folks don’t come across often, but they are there. The first topic we’ll touch on is whitelisting and trusting senders.

One of the more common inquiries about Trusted Sender is why it seems to not work consistently. Here’s the logic behind the Trust Sender function in ExchangeDefender: Trust Sender will, in essence, whitelist that specific sender address to that specific recipient address. The times when folks overlook these details and it turns into a problem are as follows:

Domain administrators often “Trust Sender” from SPAM Czar under the assumption that the sender is being trusted for the whole domain. The feature has not changed: it will still whitelist that sender to that specific recipient only.

Senders that use disposable email addresses like carlos+is+cool+3425223@exchangedefender.com work around the mechanism, since the from address is always different, so this type of sender is better suited for a domain level whitelist.
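Both pitfalls follow from the trust being stored as a sender and recipient pair. A minimal sketch of that logic, using a hypothetical class and made-up addresses:

```python
class TrustList:
    """Per-recipient trusted senders: each entry is a (sender, recipient) pair."""

    def __init__(self):
        self._pairs = set()

    def trust(self, sender, recipient):
        self._pairs.add((sender.lower(), recipient.lower()))

    def is_trusted(self, sender, recipient):
        return (sender.lower(), recipient.lower()) in self._pairs


trust = TrustList()
trust.trust("news@vendor.example", "alice@client.example")
print(trust.is_trusted("news@vendor.example", "alice@client.example"))      # True
print(trust.is_trusted("news@vendor.example", "bob@client.example"))        # False
print(trust.is_trusted("news+123@vendor.example", "alice@client.example"))  # False
```

The second lookup fails because the pair was only created for one recipient, and the third fails because a disposable address is a different sender string every time.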

That takes us to the options for whitelist formatting. ExchangeDefender will accept the following types of entries:

user@domain.com    Basic per user whitelist
domain.com         Domain level whitelist
a.domain.com       Third level domain whitelist
123.123.123.123    IP address whitelist
123.123.123        /24 whitelist
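A rough sketch of how the five entry formats could be told apart and matched. This is not ExchangeDefender's actual matcher: the function names are hypothetical, and it assumes a domain entry matches only that exact domain (which is why a.domain.com would need its own entry).

```python
import ipaddress
import re


def classify_entry(entry):
    """Return the kind of whitelist entry, following the five formats listed."""
    entry = entry.strip().lower()
    if "@" in entry:
        return "address"
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", entry):
        ipaddress.IPv4Address(entry)  # raises ValueError if an octet is > 255
        return "ip"
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){2}", entry):
        ipaddress.IPv4Network(entry + ".0/24")  # validates the three octets
        return "subnet24"
    return "domain"


def matches(entry, sender, sender_ip):
    """Check one whitelist entry against a sender address and connecting IP."""
    kind = classify_entry(entry)
    entry = entry.strip().lower()
    if kind == "address":
        return sender.lower() == entry
    if kind == "ip":
        return sender_ip == entry
    if kind == "subnet24":
        network = ipaddress.IPv4Network(entry + ".0/24")
        return ipaddress.IPv4Address(sender_ip) in network
    # Domain entry: compare against the part after "@" (exact match assumed).
    return sender.lower().rpartition("@")[2] == entry
```

With a domain entry, a disposable address such as carlos+is+cool+3425223@exchangedefender.com matches no matter what tag follows the plus sign, which is why that kind of sender fits a domain level whitelist better than Trust Sender.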

All this information can actually be found in our Knowledge Base inside of our support portal.

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737

