Support – ExchangeDefender Blog

Over the past couple of weeks we have been researching some reports of encryption not handling attachments correctly. During the process, the error that kept printing in the back end processing was “Content-Type: application/ms-tnef; name="winmail.dat" Content-Transfer-Encoding: base64”. If Outlook sends a message using its Rich Text format (which is not very common outside Outlook) for bold text and other text enhancements, it includes the formatting commands in the winmail.dat file. Receiving email clients that do not understand the code therein display it as a stale attachment. To make matters worse, Outlook may also pack other, regular file attachments into the winmail.dat file. That’s the bad news; the good news is that the fix is a piece of cake.
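If you want to check whether a given message carries one of these winmail.dat parts before you change any Outlook settings, here is a minimal sketch using only Python’s standard email package. The function name and the sample message are made up for illustration; only the Content-Type value comes from the error above.

```python
from email import message_from_string

TNEF_TYPE = "application/ms-tnef"  # the content type from the error above


def find_tnef_parts(msg):
    """Return the filenames of all TNEF (winmail.dat) parts in a message."""
    hits = []
    for part in msg.walk():
        if part.get_content_type() == TNEF_TYPE:
            # get_filename() falls back to the name= parameter of Content-Type
            hits.append(part.get_filename() or "(unnamed)")
    return hits


# Made-up sample message with one TNEF attachment
SAMPLE = """\
From: sender@example.com
To: recipient@example.com
Subject: TNEF demo
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain

Hello
--BOUNDARY
Content-Type: application/ms-tnef; name="winmail.dat"
Content-Transfer-Encoding: base64

AAAA
--BOUNDARY--
"""

if __name__ == "__main__":
    print(find_tnef_parts(message_from_string(SAMPLE)))
```

If the list comes back non-empty, the sender’s Outlook is still using Rich Text for Internet recipients and the settings below need to be changed on their side.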

In Outlook 2010 you go through File, then Options and check the box below:

[Screenshot: Outlook 2010 options]

In Outlook 2007 you go through Tools, then Options:

[Screenshot: Outlook 2007 options]

1. Go to the Mail Format tab.

2. Under Compose in this message format:, make sure either HTML or Plain Text is selected.

3. Click Internet Format.

4. Make sure either Convert to Plain Text format or Convert to HTML format is selected under When sending Outlook Rich Text messages to Internet recipients, use this format:

5. Click OK to submit.

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737

Imagine it’s 4:00 PM, you’re getting ready to close for the day, and your cell phone starts ringing off the hook; it is your biggest client’s CFO and he is very upset. The caller complains that “email is slow” and “it is taking forever to do xyz”, but any attempt to get more information is greeted with either hostility or an abrupt “I don’t know”.

Does that sound familiar?

Almost every request that comes in with very little detail about what is “slow” magically gets fixed, and then the client is convinced that the issue is Exchange… how do you fight back? How do you know for sure that a server or network outside your control is performing up to par? How do you know that your hosting vendor is keeping redundancy healthy and performing backups? For the most part you can’t… or can you?

One of the most common inquiries that we receive from partners is “Is XYZ server experiencing delays today?” after the partner gets alerted by their client that things seem to be “slow”. Our staff then tries to qualify the phrase “slow”…is it email? Is it Outlook response? How about OWA? After we have an idea of what the client is reporting as slow then we have to dig through logs and statistics files for performance data to provide back to the client…this process takes forever.

What if we could automate it? What if we could provide partners with an “at a glance” view of the server’s health and their client’s statistics? What if we could provide you with a list of available backup dates so you can choose what date you’d like to restore from? What if we could provide you the number of messages in queue for Exchange or overall latency for clients and response times?

What if we could provide you with up-to-the-minute stats on the CAS server your user is on, the CPU percentage used by the client, and the amount of latency experienced by the client?

We can, and we will…

As far as I know this level of information and statistics has never been provided by a service provider before…

Below is the draft version of the User Monitor that we will be adding to our Staff control panel and that will more than likely find its way into service manager.

 

[Screenshot: draft version of the User Monitor]

 

Travis Sheldon
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com

In my previous blog entry I gave an overview of the failover procedure for Rockerduck and what ‘technically’ goes on in the background during a failover. This blog entry will focus more on the client experience during and after an outage.

Imagine that Jim and Kelly are both a part of “ABC Company LLC”. Jim is very hip with his new Apple laptop using Office 2011 and his iPhone 4S. Kelly still uses Windows along with Outlook 2007, and when she is out of the office she uses her BlackBerry Torch connected through BlackBerry Enterprise Server.

Currently, everything is working properly and all systems are operational.

If MBOX2 were to go offline, MBOX1 would take over actively hosting DB2 (which was hosted by MBOX2). This type of failure is an inter-site failure and results in an immediate switch to the passive copies. Customers will see no downtime as long as there is a good copy of the database available.

 

What happens if Dallas goes offline?

 

As described in my previous blog entry, disastrous failures are not automatically failed over. At this point, both clients would be offline from their mailboxes and unable to access, create or modify items.

However, by following the steps from my previous blog entry, we would be able to activate our fail over procedure.

Within 15 minutes of our electing to activate the fail over procedure, clients should receive the updated DNS records for cas.rockerduck.exchangedefender.com, which will point to Los Angeles. All clients would then be able to reconnect to their mailboxes, and service should resume as normal out of Los Angeles, with the exception of BlackBerry Enterprise Server, which cannot be set up for fault tolerance in our network design.

After repairing/resolving any issues in Dallas, we would then begin to resynchronize the databases from Los Angeles to Dallas. Once all database copies are up to date we would then reconfigure DNS to point to Dallas and resume service as normal. All in all with a disastrous failure we would be able to recover from the event in 15 minutes once the recovery process is executed.

Travis Sheldon
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com

I’ll take this week to discuss one of the more recent patterns that seemed to snowball in the past few weeks. We’ve received an influx of feedback regarding messages getting picked off by the filter that weren’t SPAM. Fortunately, I was able to find a couple of partners who had the time to cooperate with us beyond the original complaint. Thanks to these folks I was able to find a couple of patterns, and thus a couple of rules that needed tweaking.

The first tweak: we removed a rule that took into account certain special characters in the header information. This rule used to work well, but as more MTAs have begun using and customizing header information, those characters have become more common than not, so that rule basically got the boot, period. The rate at which it was showing up in false positives was climbing to an unacceptable level.

The second tweak was a bit more peculiar, because this rule has an excellent hit rate on money-natured SPAM. It picks off anything from the Nigerian prince, to ancient treasure, to someone’s grandma needing money for surgery. What we found was that at the end of the year a lot of folks were sending legitimate proposal-type emails that included large amounts of currency, and these were getting picked off by this rule. On this particular rule we just toned down the scoring, the logic being that if the email possesses any other “SPAMMY” qualities we’re still going to tag it as such.

We’ve seen a huge decrease in false positives since we enacted these changes 2 weeks ago, and we have not seen an increase in the SPAM flow going through because of it. So as the lesson behind this fable, I’d recommend that if you ever have false positive SPAM issues, please always attach the .msg file of the original messages to your tickets. If you provide 5 or more, it increases our chances of an effective resolution.
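To make the “toned down the scoring” idea concrete, here is a minimal sketch of score-based filtering in Python. It is not our actual rule set: the rule names, weights, and threshold are all hypothetical, picked only to show how lowering one rule’s weight lets a legitimate proposal through while a message that trips a second rule still gets tagged.

```python
# Hypothetical threshold: a message is tagged as SPAM at or above this score
THRESHOLD = 5.0

# Hypothetical weights before the tweak: the currency rule alone hits the threshold
OLD_WEIGHTS = {
    "large_currency_amount": 5.0,
    "forged_sender": 3.0,
    "suspicious_link": 2.5,
}

# After the tweak: same rules, but the currency rule is toned down
NEW_WEIGHTS = dict(OLD_WEIGHTS, large_currency_amount=2.5)


def spam_score(rule_hits, weights):
    """Add up the weights of every rule the message triggered."""
    return sum(weights[rule] for rule in rule_hits)


def is_spam(rule_hits, weights, threshold=THRESHOLD):
    """Tag the message when its total score reaches the threshold."""
    return spam_score(rule_hits, weights) >= threshold


# A legitimate year-end proposal only mentions large amounts of money
PROPOSAL = ["large_currency_amount"]
# A scam mentions money and also has another "SPAMMY" quality
SCAM = ["large_currency_amount", "forged_sender"]

if __name__ == "__main__":
    for name, hits in (("proposal", PROPOSAL), ("scam", SCAM)):
        print(name, is_spam(hits, OLD_WEIGHTS), is_spam(hits, NEW_WEIGHTS))
```

The point of the design is that no single “money” hit decides the outcome any more; it only counts when something else about the message also looks wrong.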

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737

Over my next two blogs I will be giving an overview of the fail over procedures for Rockerduck and what clients should expect should a fail over occur. This blog post will go over the actual back end process and what factors influence whether we activate our fail over procedure. The next blog post will review the client experience once an outage occurs, the fail over and the recovery.

First, let’s qualify the differences between an “issue” and an “outage”. Issues are typically minor inconveniences or temporary “unavailability” such as a router reboot, temporary power outage, or network ‘blip’. Outages/failures can occur outright or can manifest from a minor issue. As a rule of thumb, if the service is expected to be impacted for more than an hour, we consider the situation to require a fail over. Our fail over procedure is not ‘automated’, as we’ve elected to run the Database Availability Group for Rockerduck in DAC (Datacenter Activation Coordination) mode. When DAGs run in DAC mode, the secondary data center must be manually activated to mitigate an outage. This is done to prevent ‘split brain syndrome’, where both data centers concurrently activate the same mailbox database.

There is a very specific reason we do not activate our fail over procedure for minor ‘issues’.

The fail over procedure by nature is risky and can lead to longer ‘down time’ if the issue is resolved before the fail over procedure completes or if an unforeseen event occurs during fail over.

For instance, imagine that our Dallas data center has a network issue and goes completely offline from the internet. Before receiving complete details on the outage from our data center, we decide to activate our Los Angeles data center. During the process of activating the LA copy, we switch DNS records to point away from Dallas to Los Angeles. Shortly after modifying DNS, imagine that our Dallas data center comes online and tries to take back control of the DAG (as communication was only lost to the internet). Dallas would then control the DAG databases while our entry point records would point to Los Angeles. This would yield poor results for clients as they would be proxying requests through LA to Dallas.

So what really goes on during a fail over?

Once we have determined that an issue requires activation of our fail over procedure, we immediately notify partners about the fail over activation. Before any changes get made, we review the health of our Los Angeles network and servers to ensure the stability of the fail over. Once all services receive approval, we perform the following steps:

Step 1 –  Modify cas.rockerduck.exchangedefender.com to point to the IP for cas.la.rockerduck.exchangedefender.com (TTL 5 Minutes)

Step 2 – Stop services on all Dallas mailbox servers

Step 3 – Restore DAG quorum in California

Step 4 – Mount databases in California

Step 5 – Modify inbound.rockerduck.exchangedefender.com to point to the multihomed MX record for Rockerduck LA (TTL 5 Minutes).

By keeping the TTL for cas.rockerduck.exchangedefender.com at 5 minutes, clients should automatically connect to the California data center to resume service without any modifications. By the same token, mail flow should automatically queue up in ExchangeDefender, and once the DNS records update, queued mail and new mail should deliver to Rockerduck LA.
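The ordering of the steps above matters, so here is a minimal sketch of how such a manual runbook could be modeled in Python. This is an illustration only, not our tooling: the Step class, run_failover function, and the stub actions that simply return True are all hypothetical. Only the step descriptions, the Los Angeles health check before any change, and the 5 minute TTL come from this post.

```python
from dataclasses import dataclass
from typing import Callable, List

DNS_TTL_SECONDS = 5 * 60  # TTL 5 minutes, as listed in Steps 1 and 5


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # hypothetical hook; returns True on success


def run_failover(la_healthy: Callable[[], bool], steps: List[Step]) -> List[str]:
    """Run the steps in order and return the descriptions of those completed.

    Nothing is changed unless the Los Angeles health check passes, and the
    run stops at the first step that fails so later steps are never attempted.
    """
    if not la_healthy():
        return []
    completed = []
    for step in steps:
        if not step.action():
            break
        completed.append(step.description)
    return completed


# Stub actions stand in for the real (manual) work
STEPS = [
    Step("Point cas.rockerduck.exchangedefender.com at the LA CAS IP", lambda: True),
    Step("Stop services on all Dallas mailbox servers", lambda: True),
    Step("Restore DAG quorum in California", lambda: True),
    Step("Mount databases in California", lambda: True),
    Step("Point inbound.rockerduck.exchangedefender.com at the LA MX", lambda: True),
]

if __name__ == "__main__":
    for line in run_failover(lambda: True, STEPS):
        print("done:", line)
    print("clients may keep the old DNS answer for up to", DNS_TTL_SECONDS, "seconds")
```

Stopping at the first failed step mirrors the reason the procedure is manual in the first place: a half-finished fail over is riskier than one that was never started.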

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com

We’ve been calling partners a lot recently to get a pulse directly from our client base on what we can do to improve our service. One thing that came up, and that’s pretty easy to address, is that our documentation is hard to find and navigate.

We’ll go in order of relevance.

ExchangeDefender University – ExchangeDefender University is a very basic how-to guide that includes some documentation links. This link is meant for a new partner who wants to know how to order our services and deploy them.

http://exchangedefender.com/XDUniversity.php

ExchangeDefender Documentation – The documentation resources are a set of instructions that is a bit more advanced, as it goes beyond the standard deployments. These guides are scoped down to specific features, so they’re more detailed, and they’re geared more towards the guy who likes to print a doc and go do it to 20 machines, phones, etc.

http://exchangedefender.com/documentation.php

Support Knowledge Base – The Knowledge Base articles are mainly for advanced users. They provide details for various custom deployments, repeat issues, and advanced configurations. In the past, while it held a lot of information, this option wasn’t as appealing because it was not searchable. We recently made the Knowledge Base searchable, which should improve its usefulness. The search uses the same search box on the top right, which will now yield matching KB articles within the search results.

https://support.ownwebnow.com/list.asp?item=kb

This link does require authentication; please use your partner portal credentials.

I hope you found this information useful!

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737

Over the weekend (12/09/11 – 12/10/11) we performed critical, preemptive upgrades for Rockerduck. During our upgrade cycle we were able to increase memory resources for Mailbox servers, rebalance resource distribution on Client Access servers, and add Mailbox servers for quorum retention and additional high availability.

Mailbox, mailbox, mailbox…

By utilizing the current mailbox server layout, we were able to increase memory in Rockerduck mailbox servers in a staggered pattern without disrupting service to clients on Rockerduck. As each mailbox server was prepared for the upgrade, we moved all active mailboxes from the server to any passive mailbox node and then blocked the mailbox server from activating any database copy. After the memory upgrades were completed, we stress tested each server for 8 hours with a memory stress test for consistency. Once the upgrades were completed on a node, we brought the node back into the DAG and back up to availability.

Labs vs. Real World Results

Mailbox servers were not the only servers in Rockerduck to be upgraded. Over the past two weeks we’ve been monitoring the response statistics on CAS servers with a new memory / processor configuration.

Originally, when we performed initial testing / scaling of Rockerduck, we saw the lowest overall latency and response time for RPC and Web Services from having fewer CAS servers with more RAM and processor. Over time, we noticed that the real world latency on RPC was significantly outside the scope of our original Lab results, causing us to reevaluate our delivery of CAS services.

All CAS servers for Rockerduck sit behind a hardware based load balancer. Each client that connects to the load balancer gets assigned to a specific CAS node for up to 5 hours on certain services (RPC, EWS) based off of the client WAN IP. The original design for the CAS nodes was 3 nodes with 8GB of RAM and 4 Processor cores available.

 

[Diagram: original CAS node layout]

Unfortunately, this “least connected” model had the potential to (and sometimes did) tie larger groups of users together from different IP addresses, essentially choking the server with queued requests.
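To illustrate how that can happen, here is a minimal Python sketch of a balancer that pins each WAN IP to the node with the fewest pinned IPs. It is a simplified model under stated assumptions, not the behavior of our hardware load balancer: the StickyBalancer class, the node names, the office IPs, and the user counts are all hypothetical. Only the 5 hour affinity window and the 3 node starting point come from this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

STICKY_SECONDS = 5 * 60 * 60  # "up to 5 hours" affinity per WAN IP


@dataclass
class StickyBalancer:
    nodes: List[str]
    # WAN IP -> (assigned node, time in seconds when the assignment expires)
    table: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def assign(self, wan_ip: str, now: float) -> str:
        """Return the node for this WAN IP, pinning it if it has no live entry."""
        entry = self.table.get(wan_ip)
        if entry is not None and entry[1] > now:
            return entry[0]
        # Count live assignments per node; each WAN IP counts once,
        # no matter how many users sit behind it.
        counts = {node: 0 for node in self.nodes}
        for node, expires_at in self.table.values():
            if expires_at > now:
                counts[node] += 1
        chosen = min(self.nodes, key=lambda node: counts[node])
        self.table[wan_ip] = (chosen, now + STICKY_SECONDS)
        return chosen


def users_per_node(balancer: StickyBalancer, offices: Dict[str, int], now: float = 0.0) -> Dict[str, int]:
    """Total users landing on each node once every office has connected."""
    load = {node: 0 for node in balancer.nodes}
    for wan_ip, users in offices.items():
        load[balancer.assign(wan_ip, now)] += users
    return load


# Hypothetical offices: WAN IP -> number of users behind it
OFFICES = {
    "203.0.113.10": 120,
    "203.0.113.20": 5,
    "203.0.113.30": 4,
    "203.0.113.40": 110,
}

if __name__ == "__main__":
    print(users_per_node(StickyBalancer(["cas1", "cas2", "cas3"]), OFFICES))
    print(users_per_node(StickyBalancer(["cas1", "cas2", "cas3", "cas4", "cas5"]), OFFICES))
```

With these made-up numbers, three nodes leave cas1 carrying 230 of the 239 users because both large offices look like a single connection each when they are assigned. Adding nodes spreads the large offices apart, which is the effect we were after with the new configuration.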

 

[Diagram: new CAS node layout]

The new setup for the CAS nodes is a balance of 6GB of RAM with 3 Processor cores available. This new configuration allowed us to introduce two new CAS servers to more efficiently process requests across multiple nodes without any additional “upgrades” to the CAS roles.

During our statistical collection phase, the new configuration nodes had a 40% reduction in response time on RPC requests and Address Book requests:

Originally: 22 ms

Now: 13.2 ms

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com

Earlier this week we created a NOC entry/notification for partners about a maintenance interval we scheduled for ROCKERDUCK. The entry outlined an issue we faced where one DB (DB7) was running on the logs drive instead of the DB drive, along with our proposed outline of the work to be completed. Unfortunately, because the issue affects all database copies, correcting it would involve reducing DB7 to a single mailbox server, moving the database (which would take DB7 offline), and then re-seeding the copies to all passive nodes.

Shortly after posting the NOC entry I received an email from a partner demanding that I explain to them why the Database Availability Group (DAG) could not prevent service interruption for users on DB7.

So why does the DAG not protect from every single event possible?

Simply said: all servers in a DAG must be identical in terms of storage location for databases and logs. In a DAG, only one copy of a mailbox database can act as the “active” copy, and all other copies on other nodes are purely “database copies” that can be switched to become the active/primary database.

In the case of moving the database path, we cannot switch the current active database over to a passive node, move the DB, then switch it back to the original primary as this would break the DAG and we would then have split copies of ‘active’ data. We cannot use passive copies to keep service active while we physically modify the database properties/layout of the ‘active’ copy.

If this were a case where a database experienced a failure on the active copy or there was a network communication issue, the DAG would mount the passive copy of the database and continue providing service to users.

So all this jibber-jabber means what?

In short, we would remove all copies of DB7 across all nodes except the primary node. After all copies are removed, we would start the move of DB7 to the proper location and then remount the database. By calculation of the DB size, service would be interrupted for about 10-15 minutes. Finally, after the move completes we would re-add the database copy across each node and then bring service back into full redundancy. A fifteen minute outage is unfortunately a necessary evil to provide an overall more redundant solution to our partners and their clients.

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com

I’m going to address an age old question from folks that do not like to read our feature pages, in hopes that you read this blog. As part of DR (Disaster Recovery) we have two primary items that can help during and after an outage. This post will help educate your teams on how things work, so that your expectations, as well as your clients’, are managed to the correct level.

During an outage

During an outage, the surefire way to get access is to type https://livearchive.exchangedefender.com into your browser. This ensures that regardless of which cluster is live (Dallas or Los Angeles) your clients can get to it. A best practice is having a shortcut ready for your clients on their Desktop or Start Menu. If I had a penny for each time that someone’s server catches fire and it’s at that juncture that a tech asks “How do I get to LiveArchive?” By then you are already putting yourself in front of the barrel. If you don’t have a solution in hand and you have to “call someone else”, that’s the point at which your client’s confidence starts eroding.

Where is LiveArchive?

LiveArchive is located at  https://livearchive.exchangedefender.com

What are my LiveArchive credentials?

Your LiveArchive credentials are the same as your ExchangeDefender credentials, which are your email address and your ExchangeDefender password. Remember, if you forgot this password and your email is down, your best bet in an outage scenario is to open a ticket for your client in our portal and request their passwords. Sadly, folks often try their email passwords and assume that something is wrong (see above: more erosion). The key to all of this is to get the right answer on the first try.

So let’s move forward. Either you knew everything above upfront and only had to deal with your end users once, or you had to go back and forth a few times to get it hammered out. Regardless, your clients have access to all of their internet mail now, and now your hard job starts. Get the defibrillator and resurrect their Exchange server; obviously this can range from a simple reboot to a week long painstaking process. One thing you have in the back of your mind is, thank goodness ExchangeDefender is holding all of my mail. The most important thing to remember while you and your team are doing your best to perform thoracic surgery on the server is to make sure the server is offline!!

Here’s why: by RFC rules we can only hold mail that is being deferred by your server. If your server is online and “REJECTING” mail due to bad configuration or your troubleshooting, all that mail is purged, because your client’s server is telling our software this is a permanent rejection. This is the biggest key in the process. Luckily this doesn’t happen often, but there are teams that will have the server permanently rejecting mail for a week and then ask for their mail. And even though this is digging yourself a grave, we MAY still be able to help you.

First off, our Mail “Spooling” or “Bagging” service is in place for up to 7 days. The way it works is, after the initial real-time attempt to deliver your mail, your mail is moved to a retry queue. This queue, in an effort to not hammer client servers, reattempts delivery from each node every 20 minutes or so, staggered. This process is fully automated and constantly running; you don’t have to call us or open a ticket saying, “Our server is up, release our mail”. If your server really is up and accepting mail from our servers, your mail will start to flow on its own, but it can take up to a couple of hours for all of your mail to deliver depending on your queued volume. Again, we don’t want to pound your client’s server into submission and cause it to trigger the Exchange backpressure mechanism.

Now, if you made the unfortunate mistake of bringing a server back online after a rebuild without the proper IP restrictions and anonymous delivery settings, and all of your spool was lost, there is still one possibility. If the mail is in LiveArchive, due to our hub and transport design you can actually forward all that mail to your individual client’s mailboxes one by one. This is a fully manual process that is pretty time consuming, but when faced with the choice of telling a client you lost all their mail for the past x number of days or telling them you need a couple more hours to make them whole, the choice becomes easy.
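The difference between a deferral and a rejection comes down to the SMTP reply class: 4xx replies are temporary failures and 5xx replies are permanent ones. Here is a minimal Python sketch of that decision, assuming a simplified queue. The handle_reply function and its return values are hypothetical and not our actual software; the 20 minute retry interval and 7 day spool limit are the figures from this post.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

RETRY_INTERVAL = timedelta(minutes=20)  # "every 20 minutes or so"
MAX_SPOOL_AGE = timedelta(days=7)       # spooling is in place for up to 7 days


def handle_reply(
    reply_code: Optional[int], first_queued: datetime, now: datetime
) -> Tuple[str, Optional[datetime]]:
    """Decide what happens to a queued message after a delivery attempt.

    reply_code is the SMTP reply from the destination server, or None when
    the server could not be reached at all (offline).
    Returns (action, next_attempt_time).
    """
    if reply_code is None or 400 <= reply_code < 500:
        # Temporary failure or no answer: the mail is being deferred, keep it
        if now - first_queued >= MAX_SPOOL_AGE:
            return ("expired", None)
        return ("retry", now + RETRY_INTERVAL)
    if 200 <= reply_code < 300:
        return ("delivered", None)
    if 500 <= reply_code < 600:
        # Permanent rejection: the destination told us not to try again
        return ("purged", None)
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")


if __name__ == "__main__":
    queued = datetime(2011, 11, 1, 9, 0)
    now = datetime(2011, 11, 1, 10, 0)
    print(handle_reply(None, queued, now))  # server offline
    print(handle_reply(451, queued, now))   # server deferring
    print(handle_reply(550, queued, now))   # server rejecting
```

An offline server and a deferring server end up in the same “retry” branch, which is exactly why a broken server is safer unplugged than answering with rejections.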

Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737

The past two weeks have been rather stressful and unpleasant for our partners and their end users on DEWEY due to the recent outage that left 17% of the users without full access to their entire mailbox for two weeks. Not only has this been our longest outage with any service, but it also has been one of the most “bumpy road” recoveries we’ve ever experienced. Our partners and their end users could compare almost every step of the recovery to pulling teeth; every step of the way, we let them down.

We’re extremely sorry…this was (and still is) extremely unfortunate, but we have learned a lot from this experience.

Here’s a quick break down of the experience:

10/31/11 – Users on two databases both lost connection to Exchange.

11/01/11 – Users are switched to dial tone recovery mode. Users with mobile phones lose contacts and calendar events on next sync.

11/06/11 – Users on DB2 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.

11/07/11 – Mailbox data for users on DB2 is fully restored; new and old data merged.

11/17/11 – Users on DB3 regain access to old mail, but lose access to any new mail, contacts, or calendar entries since the outage began.

11/19/11 – Mailbox data for users on DB3 is fully restored; new and old data merged.

In speaking to partners every step of the way, we heard every issue experienced by end users, with the biggest issue being the dial tone recovery. During dial tone recovery, any users with ActiveSync based connections will lose all Exchange contacts and calendar items on the next sync after dial tone is activated.

When you break down our responsibilities and duties to our clients, at the very minimum, we need to provide a live running service as quickly as possible. Technology and software unfortunately have issues and can break, but as long as we can minimize the direct impact faced by end users we can generally get through issues without upsetting a lot of partners. By erasing all cached access by mobile phones with the temporary mailbox, we put a lot of stress on our partners.

We are considering a disaster recovery policy control for partners.

The policy controls will give partners control of how we treat each mailbox during an outage. Giving partners the ability to control how we respond will greatly improve the overall experience during an outage and will allow the partner to provide direct expectations to their clients.

For instance, say partner ABC123 Computers has a client, Big Electric Company, with 10 mailboxes.

The partner wants to mark 3 users as “Do not activate dial tone” as these users are mobile and depend on their contacts and calendars and they cannot afford to lose them once dial tone is activated.

The partner then marks the CEO and CFO as “Do not reimport data after outage” as the partner plans to reimport cached data because the CFO and CEO cannot work on dial tone mail alone.

By allowing the partner to directly control our recovery process for mailboxes, the partner will then be able to set direct SLA expectations for their end users during the outage.
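Here is a minimal Python sketch of what those per-mailbox policy flags could look like. Nothing here is a shipped feature: the MailboxPolicy class, its field names, and the addresses are hypothetical. The two opt-out options and the 10 mailbox example come from this post, and the sketch assumes the CEO and CFO are not among the three mobile users.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MailboxPolicy:
    address: str
    activate_dial_tone: bool = True     # False = "Do not activate dial tone"
    reimport_after_outage: bool = True  # False = "Do not reimport data after outage"


def dial_tone_targets(policies: List[MailboxPolicy]) -> List[str]:
    """Mailboxes we would switch to dial tone during an outage."""
    return [p.address for p in policies if p.activate_dial_tone]


def reimport_targets(policies: List[MailboxPolicy]) -> List[str]:
    """Mailboxes whose data we would merge back in after the outage."""
    return [p.address for p in policies if p.reimport_after_outage]


# Big Electric Company: 10 mailboxes (hypothetical addresses)
DOMAIN = "bigelectric.example"
POLICIES = (
    # CEO and CFO: the partner will reimport cached data themselves
    [MailboxPolicy(f"{name}@{DOMAIN}", reimport_after_outage=False) for name in ("ceo", "cfo")]
    # Three mobile users who cannot afford to lose contacts and calendars
    + [MailboxPolicy(f"mobile{i}@{DOMAIN}", activate_dial_tone=False) for i in range(1, 4)]
    # Everyone else keeps the default behavior
    + [MailboxPolicy(f"user{i}@{DOMAIN}") for i in range(1, 6)]
)

if __name__ == "__main__":
    print(len(dial_tone_targets(POLICIES)), "mailboxes get dial tone")
    print(len(reimport_targets(POLICIES)), "mailboxes get data reimported")
```

Defaulting both flags to True keeps today’s behavior for any mailbox the partner never touches, so opting out is always an explicit choice.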

Plans for the future…

In the event that we experience another catastrophic failure (anticipated 6+ hours of downtime), we will wait at least one hour before activating dial tone recovery for mailboxes that are not opted out.

During the first hour, we will reach out via telephone to all partners whose mailboxes are not opted out of dial tone recovery to make them aware of the expected experience for end users. If the partner does not wish to activate dial tone recovery, we will activate the mailbox option in service manager to opt out of dial tone recovery. Additionally, partners could ask (via support ticket) that the dial tone activation be postponed for a few hours if the partner wants to advise the end user to disable ActiveSync.

Once the original (or backup) database is back online we will once again reach out to all partners who have not opted out of mailbox data restore to let them know of the expected experience for end users. If the partner wishes to opt the mailbox out of data restore then the mailbox will remain on the dial tone database until the recovery of data for all users on the affected database is completed.

If you think the outage policy control would be a beneficial add-on to our Hosted Exchange service, please let me know at travis@exchangedefender.com.

Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com
