Rockerduck Maintenance
From February 9th to February 11th, 2012, the ExchangeDefender staff performed maintenance on the Rockerduck cluster, during which two separate ‘outages’ affected client access late in the evening and early the following morning on February 9th-10th and February 10th-11th.
On the eve of Feb. 9th 2012 (~11:30 PM Eastern) we began upgrades on a failed/failing VPN device that carries Active Directory and internal communication between ROCKERDUCK:DAL and ROCKERDUCK:LA. During the upgrade we began to notice random network-related events in which communication across the entire network seemed saturated and sluggish. After various attempts (and configurations) to bring the new VPN router online we determined that the new VPN device was occasionally malfunctioning and flooding the network with ‘dead packets’. Unfortunately the massive flood of packets from the VPN device caused the Database Availability Group (DAG) on ROCKERDUCK to lose communication between nodes and eventually lose quorum. Once quorum was lost, all databases at both sites were automatically dismounted because Exchange considered the DAG unhealthy. For the next few hours we worked to restore service to RD clients by replacing the failed VPN routers with our backup VPNs (new vendor) and restoring communication with Los Angeles. After communication was re-established clients were able to access their mailboxes. This outage affected all clients and lasted from midnight to roughly 3:15 AM.
On the eve of Feb. 10th (~10:30 PM Eastern) we began work to finalize the VPN communication by consolidating both VPN devices in California onto the one backup-vendor VPN device. We elected to replace the ‘working’ VPN device in California because we feared it would show the same abnormal behavior as the similar VPN device in Dallas. As part of our protocol to ‘down’ a data center in Exchange hosting we paused SMTP services on Rockerduck. After replacing the VPN device in California we resumed all services (including SMTP) and mail resumed normal flow. Around 5:30 AM Eastern we started to receive alerts about back-pressured queues in Rockerduck, which amounted to delivery delays. Upon investigation we discovered that the issue was mail delivery between the EDGE server network and the HUB server network on RD. After two hours of investigating the issue internally (and opening a case with Microsoft) we determined that our course of action would be to reapply the SP2 update to the EDGE network. Once SP2 was reapplied to all EDGE nodes, mail delivery returned on ROCKERDUCK by 9:15 AM Eastern.
Finally, about 5% of users were left in a disconnected state in Outlook but had service through OWA (and some through ActiveSync) between Saturday and Sunday, because the database housing their mailboxes (RDDB9) was moved to Los Angeles while its content index in Dallas was rebuilt. Service was restored to these users by noon Eastern.
Travis Sheldon
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com  
Maintenance Weekend
This weekend a big part of our team will be doing some massive infrastructure upgrades to improve our network performance and stability in our Dallas DC. These changes should not have any effect on service. If we find that a planned change may impact a service of some sort, we will be sure to update the NOC blog accordingly. Remember that it’s available at http://www.exchangedefender.com/noc and it’s RSS enabled at http://www.exchangedefender.com/noc/feed/ so that you and your staff can subscribe.
We will also be increasing our power capability by 20%, which will help us with any growth we may need due to business growth. We’ll also begin deploying a new 2010 cluster, but I can’t provide any additional details on that, as Vlad will share its purpose and target once it’s live.
 The last change will impact one of our backup servers, backup90 as it will receive a storage upgrade to accommodate increasing demand for that service in the upcoming months. The service impact there should be minimal and will be blogged.
So we are basically doing a big push to create a big buffer on availability and performance so that your teams can continue to focus on just moving the product. We will take care of the ugly part of the business for you!
Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737
ExchangeDefender Essentials Emergency
Last quarter we launched a slimmed down version of ExchangeDefender to be packaged with Exchange Essentials 2010 and as a standalone product. This product was launched to provide a similar price point and feature set as some less robust Spam & Security solutions out in the channel. However, we were never comfortable with not offering a bundled-in business continuity solution. Enter… ExchangeDefender Essentials Emergency.
Emergency is the business continuity solution that is now bundled in with ExchangeDefender Essentials. It will capture a copy of all incoming email in similar fashion, but only inbound mail, with a retention policy of 5 days for all items. This email is accessible via web portal and POP3/IMAP4 (although currently we are limiting POP/IMAP access to just downloading messages, to avoid open relay situations). So your clients will be able to continue to do business with the ability to receive, reply to, and create new emails from their real email address during an outage. Remember that our 7 day spooling/mail bagging system is still in place, so a combination of the two should minimize your client’s inconvenience.
The web portal is available at:
https://emergency.exchangedefender.com
Credentials:
Your primary user email address with ExchangeDefender (you cannot log in with an alias address) and your current ExchangeDefender password.
 
  
Once you log in all of your email will be available, with your identity pre-configured for use. There is no additional set up required. You can start reading and firing off emails as quickly as you can type.
Setting up Outlook (remember Read/DL message access only currently)
Fill out the information as below:
 
  
Your email address and user name are the same as your primary address in ExchangeDefender. The POP3/IMAP4 server is emergency.exchangedefender.com. Everything is on the standard ports for both SSL and non.
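As a rough sketch of the settings described above, the snippet below builds the values a mail client would need. The port numbers are an assumption: the post only says "standard ports", so the conventional POP3 (110/995) and IMAP4 (143/993) ports are used here. The function name and dictionary layout are hypothetical.

```python
# Conventional ports, assumed because the post only says "standard ports".
STANDARD_PORTS = {
    ("pop3", False): 110,
    ("pop3", True): 995,
    ("imap4", False): 143,
    ("imap4", True): 993,
}

EMERGENCY_HOST = "emergency.exchangedefender.com"


def emergency_settings(primary_address, protocol="pop3", use_ssl=True):
    """Return hypothetical client settings for the Emergency mailbox.

    The user name is the primary ExchangeDefender address (aliases
    cannot log in), and access is read/download only for now.
    """
    key = (protocol.lower(), use_ssl)
    if key not in STANDARD_PORTS:
        raise ValueError("protocol must be 'pop3' or 'imap4'")
    return {
        "server": EMERGENCY_HOST,
        "port": STANDARD_PORTS[key],
        "protocol": key[0],
        "ssl": use_ssl,
        "username": primary_address,
    }


settings = emergency_settings("jim@example.com", protocol="imap4")
```

The address jim@example.com is a placeholder. Substitute the user’s primary ExchangeDefender address.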
 
  
Our CEO, Vlad Mazek, will be providing a broader overview on emergency and its feature set when we officially roll it out in the next 2 weeks.
Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737
Encrypted Attachment Issues
Over the past couple of weeks we have been researching some reports regarding encryption not handling attachments correctly. During the process, the error that kept printing in the back end processing was “Content-Type: application/ms-tnef; name="winmail.dat" Content-Transfer-Encoding: base64”. If Outlook sends a message using the RTF format (which is not very common outside Outlook) for bold text and other text enhancements, it includes the formatting commands in the winmail.dat file. Receiving email clients that do not understand the code therein display it as a stale attachment. To make matters worse, Outlook may also pack other, regular file attachments into the winmail.dat file. That’s the bad news. The good news is that the fix is a piece of cake.
In Outlook 2010 you go through File, then Options and check the box below:
In Outlook 2007 you go through Tools, then Options:
1. Go to the Mail Format tab.
2. Under Compose in this message format:, make sure either HTML or Plain Text is selected.
3. Click Internet Format.
4. Make sure either Convert to Plain Text format or Convert to HTML format is selected under When sending Outlook Rich Text messages to Internet recipients, use this format:
5. Ok to submit.
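To check whether a particular message was sent this way, you can look for the application/ms-tnef content type quoted above. Below is a minimal sketch using Python’s standard email package. The sample message and the function name are made up for illustration, and this is not the code our back end runs.

```python
from email import message_from_string


def has_tnef_attachment(raw_message):
    """Return True if any MIME part looks like an Outlook winmail.dat."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        # get_content_type() returns the type lower-cased.
        if part.get_content_type() == "application/ms-tnef":
            return True
        filename = part.get_filename()
        if filename and filename.lower() == "winmail.dat":
            return True
    return False


# Hypothetical message with a plain-text body and a TNEF part.
SAMPLE = (
    "From: kelly@example.com\n"
    "To: jim@example.com\n"
    "Subject: proposal\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attached.\n"
    "--b1\n"
    'Content-Type: application/ms-tnef; name="winmail.dat"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "eJ8+IgAAAAA=\n"
    "--b1--\n"
)

PLAIN = "From: a@example.com\nSubject: hi\n\nJust text.\n"

tnef_found = has_tnef_attachment(SAMPLE)
plain_found = has_tnef_attachment(PLAIN)
```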
Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737
Are Your Exchange Servers Experiencing Delays?
Imagine it’s 4:00 PM, you’re getting ready to close for the day and your cell phone starts ringing off the hook; it is your biggest client’s CFO and he is very upset. He complains that “email is slow” and “it is taking forever to do xyz”, but any attempt to get more information is greeted with either hostility or an abrupt “I don’t know”.
Does that sound familiar?
Almost every request where the client gives very little detail about what is “slow” magically gets fixed, and then the client is convinced that the issue is Exchange… how do you fight back? How do you know for sure that a server / network out of your control is performing up to par? How do you know that your hosting vendor is keeping redundancy healthy and performing backups? For the most part you can’t… or can you?
One of the most common inquiries that we receive from partners is “Is XYZ server experiencing delays today?” after the partner gets alerted by their client that things seem to be “slow”. Our staff then tries to qualify the phrase “slow”…is it email? Is it Outlook response? How about OWA? After we have an idea of what the client is reporting as slow then we have to dig through logs and statistics files for performance data to provide back to the client…this process takes forever.
What if we could automate it? What if we could provide partners with an “at a glance” view of the server’s health and their client’s statistics? What if we could provide you with a list of available backup dates so you can choose what date you’d like to restore from? What if we could provide you the number of messages in queue for Exchange or overall latency for clients and response times?
What if we could provide you with up-to-the-minute stats on the CAS server your user is on, the CPU percentage used by the client, and the amount of latency experienced by the client?
We can, and we will…
As far as I know this level of information and statistics has never been provided by a service provider before…
Below is the draft version of the User Monitor that we will be adding to our Staff control panel and that will more than likely find its way into Service Manager.
Travis Sheldon
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com  
Rockerduck: What Will My Client See During an Outage?
In my previous blog entry I overviewed the failover procedure for Rockerduck and what ‘technically’ goes on in the background during a failover. This blog entry will focus more on the client experience during and after an outage.
Imagine that Jim and Kelly are both a part of “ABC Company LLC”. Jim is very hip with his new Apple laptop using Office 2011 and his iPhone 4s. Kelly still uses Windows along with Outlook 2007 and when she is out of the office she uses her Blackberry Torch connected through Blackberry Enterprise Server.
Currently, everything is working properly and all systems are operational.
 
  
If MBOX2 was to go offline, MBOX1 would take over actively hosting DB2 (Which was hosted by MBOX2). This type of failure is an inter-site failure and results in an immediate switch to the passive copies. Customers will see no downtime as long as there is a good copy of the database available.
 
  
What happens if Dallas goes offline?
 
  
As described in my previous blog entry, disastrous failures are not automatically failed over. At this point, both clients would be offline from their mailbox and unable to access, create or modify items.
However, in following my previous blog entry we would be able to activate our fail over procedure.
About 15 minutes after we elect to activate our fail over procedure, clients should receive the updated DNS records for cas.rockerduck.exchangedefender.com, which will point to Los Angeles. All clients would then be able to reconnect to their mailboxes and service should resume as normal out of Los Angeles, with the exception of Blackberry Enterprise Server, which cannot be set up for fault tolerance in our network design.
 
  
After repairing/resolving any issues in Dallas, we would then begin to resynchronize the databases from Los Angeles to Dallas. Once all database copies are up to date we would then reconfigure DNS to point to Dallas and resume service as normal. All in all with a disastrous failure we would be able to recover from the event in 15 minutes once the recovery process is executed.
Travis Sheldon
VP, Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com
Rules…Rules…Rules…
I’ll take this week to discuss one of the more recent patterns that seemed to snowball in the past few weeks. We’ve received an influx of feedback regarding messages getting picked off by the filter that weren’t SPAM. Fortunately, I was able to find a couple of partners that were able to have the time to cooperate with us beyond the original complaint. Thanks to these folks I was able to find a couple of patterns and thus rules that needed tweaking.
For the first tweak, we removed a rule that took into account certain special characters in the header information. This rule “used” to work well, but as more MTAs have begun customizing header information these characters have become more common than not, so that rule basically got the boot, period. The rate at which it was showing up in false positives was climbing to an unacceptable level.
The second tweak was a bit more peculiar, because this rule has an excellent hit rate on money-natured SPAM. It picks off anything from the Nigerian prince, to ancient treasure, to someone’s grandma needing money for surgery. What we found was that at the end of the year a lot of folks were sending legitimate proposal-type emails that included large amounts of currency, and these were getting picked off by this rule. On this particular rule we just toned down the scoring, the logic being that if the email possesses any other “SPAMMY” qualities we’re still going to go ahead and tag it as such.
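To illustrate why toning down one rule’s score helps, here is a minimal sketch of additive scoring. It assumes the filter sums per-rule scores and tags a message once the total reaches a threshold. The rule names, weights and threshold are all invented for the example and are not our real values.

```python
SPAM_THRESHOLD = 5.0  # hypothetical cut-off


def spam_score(rule_hits, weights):
    """Sum the scores of every rule the message triggered."""
    return sum(weights[rule] for rule in rule_hits)


def is_spam(rule_hits, weights, threshold=SPAM_THRESHOLD):
    return spam_score(rule_hits, weights) >= threshold


# Hypothetical weights before and after the money rule was toned down.
OLD_WEIGHTS = {"large_currency_amount": 5.0, "urgent_plea": 2.0}
NEW_WEIGHTS = dict(OLD_WEIGHTS, large_currency_amount=3.0)

legit_proposal = ["large_currency_amount"]
prince_scam = ["large_currency_amount", "urgent_plea"]

old_verdict = is_spam(legit_proposal, OLD_WEIGHTS)    # tagged on one rule alone
new_verdict = is_spam(legit_proposal, NEW_WEIGHTS)    # no longer tagged
scam_verdict = is_spam(prince_scam, NEW_WEIGHTS)      # still tagged
```

With the lower weight, the money rule can no longer tag a message by itself, but it still pushes a message over the line when another rule fires as well.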
We’ve seen a huge decrease in false positives since we enacted these changes 2 weeks ago and we have not seen an increase in the SPAM flow going through because of it. So as the lesson behind this fable, I’d recommend that if you ever have false positive SPAM issues, please always attach the .msg file of the original messages to your tickets. If you provide 5 or more, it increases our chances of an effective resolution.
Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737  
Rockerduck: What to Expect During an Outage
 Over my next two blogs I will be overviewing the fail over procedures for Rockerduck and what clients should expect should a fail over occur. This blog post will go over the actual back end process and what factors influence whether we activate our fail over procedure. The next blog post will review the client experience once an outage occurs, the fail over and the recovery.
First, let’s qualify the differences between an “issue” and an “outage”. Issues are typically minor inconveniences or temporary “unavailability” such as a router reboot, temporary power outage, or network ‘blip’. Outages/failures can occur outright or can manifest from a minor issue. As a rule of thumb, if the service is expected to be impacted for more than an hour, we consider the situation to require a fail over. Our fail over procedure is not ‘automated’, as we’ve elected to run the Database Availability Group for Rockerduck in DAC (Datacenter Activation Coordination) mode. When DAGs run in DAC mode the secondary data center must be manually activated to mitigate an outage. This is done to prevent ‘split brain syndrome’, where both data centers concurrently activate the same mailbox database.
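The quorum idea behind this can be sketched with a simple majority rule: a group of nodes keeps its databases mounted only while it can reach more than half of all voters. The voter counts below are invented for illustration, since the number of voters in the Rockerduck DAG is not stated here.

```python
def has_quorum(reachable_voters, total_voters):
    """Majority rule: strictly more than half of all voters are reachable."""
    return reachable_voters > total_voters // 2


# Hypothetical layout: 5 voters in total, 3 in Dallas and 2 in Los Angeles.
TOTAL_VOTERS = 5

# The site link drops: each side only sees its own voters.
dallas_keeps_quorum = has_quorum(3, TOTAL_VOTERS)
la_keeps_quorum = has_quorum(2, TOTAL_VOTERS)

# A network flood isolates every node: each one sees only itself.
isolated_node_keeps_quorum = has_quorum(1, TOTAL_VOTERS)
```

In this layout only Dallas could keep quorum after a clean site split, so both sites can never hold a majority at once. Los Angeles stays passive until it is activated by hand, which is the manual step described above. When every node is isolated, nobody has a majority and all databases dismount.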
There is a very specific reason we do not activate our fail over procedure for minor ‘issues’.
The fail over procedure by nature is risky and can lead to longer ‘down time’ if the issue is resolved before the fail over procedure completes or if an unforeseen event occurs during fail over.
For instance, imagine that our Dallas data center has a network issue and goes completely offline from the internet. Before receiving complete details on the outage from our data center, we decide to activate our Los Angeles data center. During the process of activating the LA copy, we switch DNS records to point away from Dallas to Los Angeles. Shortly after modifying DNS, imagine that our Dallas data center comes online and tries to take back control of the DAG (as communication was only lost to the internet). Dallas would then control the DAG databases while our entry point records would point to Los Angeles. This would yield poor results for clients as they would be proxying requests through LA to Dallas.
So what really goes on during a fail over?
Once we have qualified that an issue requires activation of our fail over procedure, we immediately notify partners about the fail over activation. Before any changes get made, we review the health of our Los Angeles network and servers to ensure stability of the fail over. Once all services receive approval, we perform the following steps:
Step 1 – Modify cas.rockerduck.exchangedefender.com to point to the IP for cas.la.rockerduck.exchangedefender.com (TTL 5 Minutes)
Step 2 – Stop services on all Dallas mailbox servers
Step 3 – Restore DAG quorum in California
Step 4 – Mount databases in California
Step 5 – Modify inbound.rockerduck.exchangedefender.com to point to the multihomed MX record for Rockerduck LA (TTL 5 Minutes).
By keeping the TTL for cas.rockerduck.exchangedefender.com at 5 minutes, clients should automatically connect to the California data center to resume service without any modifications. By the same token, mail flow should automatically queue up in ExchangeDefender and, once the DNS records are updated, queued mail and new mail should deliver to Rockerduck LA.
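The effect of the 5 minute TTL can be sketched with a toy client-side DNS cache: a client keeps using the cached Dallas answer until the TTL expires and only then picks up the Los Angeles record. The class and variable names are hypothetical, and real resolvers are more involved than this.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    target: str
    expires_at: float  # seconds on an arbitrary clock


class CachingResolver:
    """Toy client cache that honours a fixed TTL (in seconds)."""

    def __init__(self, zone, ttl):
        self.zone = zone      # stands in for the authoritative DNS data
        self.ttl = ttl
        self.cache = {}

    def resolve(self, name, now):
        record = self.cache.get(name)
        if record is None or now >= record.expires_at:
            # Cache miss or expired entry: fetch the current answer.
            record = CachedRecord(self.zone[name], now + self.ttl)
            self.cache[name] = record
        return record.target


CAS = "cas.rockerduck.exchangedefender.com"
zone = {CAS: "dallas"}
client = CachingResolver(zone, ttl=300)  # TTL 5 minutes

before = client.resolve(CAS, now=0)      # cached until t=300
zone[CAS] = "los-angeles"                # Step 1 of the fail over at t=10
during = client.resolve(CAS, now=200)    # still the cached Dallas answer
after = client.resolve(CAS, now=300)     # TTL expired, picks up LA
```

In this model, a client that cached the record just before the change follows the new record at most one TTL (5 minutes) later.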
Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com  
References…References…References…
We’ve been calling partners a lot recently to get a pulse directly from our client base on what we can do to improve our service. One item that came up, and that’s pretty easy to address, is that our documentation is hard to find and navigate.
We’ll go in order of relevance.
ExchangeDefender University – ExchangeDefender University is a very basic how-to guide that includes some documentation links. This link is meant for a new partner who wants to know how to order our services and deploy them.
http://exchangedefender.com/XDUniversity.php
ExchangeDefender Documentation – The documentation resources are a set of instructions that is a bit more advanced, as it goes beyond the standard deployments. However, these guides are scoped down to specific features, so they’re more detailed and more geared towards the guy who likes to print a doc and go do it on 20 machines, phones, etc.
http://exchangedefender.com/documentation.php
Support Knowledge Base – The Knowledge Base articles are mainly for advanced users. They provide details on various custom deployments, repeat issues, and advanced configurations. In the past, while it held a lot of information, this option wasn’t as appealing because it was not searchable. We recently made the Knowledge Base searchable, which should improve its usefulness. The search uses the same search box on the top right, which will now yield matching KB articles within the search results.
This link does require authentication, please use your partner portal credentials.
I hope you found this information useful!
Carlos Lascano
VP Support Services, ExchangeDefender
carlos@ownwebnow.com
(877) 546-0316 x737
If 2 is good, 50 is better, (Or is it)?
Over the weekend (12/09/11 – 12/10/11) we performed critical, preemptive upgrades for Rockerduck. During our upgrade cycle we were able to increase memory resources for Mailbox servers, rebalance resource distribution on Client Access servers and add additional Mailbox servers for quorum retention and additional high availability.
Mailbox, mailbox, mailbox…
By utilizing the current mailbox server layout, we were able to increase memory in Rockerduck mailbox servers in a staggered pattern without disrupting service to clients on Rockerduck. As each mailbox server was prepared for the upgrade, we moved all active mailboxes from the server to any passive mailbox node and then blocked the mailbox server from activating any database copy. After the memory upgrades were completed we stress tested each server for 8 hours with a memory stress test for consistency. Once the upgrades were completed on a node, we brought the node back into the DAG and back up to availability.
Labs vs. Real World Results
Mailbox servers were not the only servers in Rockerduck to be upgraded. Over the past two weeks we’ve been monitoring the response statistics on CAS servers with a new memory / processor configuration.
Originally, when we performed initial testing / scaling of Rockerduck, we saw the lowest overall latency and response time for RPC and Web Services from having fewer CAS servers with more RAM and processor. Over time, we’ve noticed that the real world RPC latency was significantly outside the scope of our original lab results, causing us to reevaluate our delivery of CAS services.
All CAS servers for Rockerduck sit behind a hardware based load balancer. Each client that connects to the load balancer gets assigned to a specific CAS node for up to 5 hours on certain services (RPC, EWS) based off of the client WAN IP. Original design for the CAS nodes was 3 nodes with 8GB of RAM and 4 Processor cores available.
Unfortunately, this “least connected” model had the potential to (and sometimes did) tie larger groups of users from different IP addresses together on the same node, essentially choking the server with queued requests.
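A simplified sketch of how this can happen is below. It assumes the balancer picks the least connected node the first time it sees a client WAN IP and then keeps that IP on the same node. The 5 hour expiry is left out, and the class, node names and IP addresses are hypothetical.

```python
class StickyBalancer:
    """Least-connected choice on first contact, then sticky per WAN IP."""

    def __init__(self, nodes):
        self.load = {node: 0 for node in nodes}
        self.affinity = {}  # WAN IP -> node

    def connect(self, wan_ip):
        node = self.affinity.get(wan_ip)
        if node is None:
            # First contact: pick the node with the fewest connections,
            # breaking ties by node name.
            node = min(self.load, key=lambda n: (self.load[n], n))
            self.affinity[wan_ip] = node
        self.load[node] += 1
        return node


lb = StickyBalancer(["cas1", "cas2", "cas3"])

# Four offices each open one connection while the nodes are quiet.
for ip in ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]:
    lb.connect(ip)

# Later, two of those offices get busy. Both were pinned to cas1.
for _ in range(20):
    lb.connect("203.0.113.1")
    lb.connect("203.0.113.4")

final_load = dict(lb.load)
```

The assignment was made when connection counts were nearly equal, so it says nothing about how busy each office becomes later. Here cas1 ends up with 42 connections while cas2 and cas3 hold one each.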
The new setup for the CAS nodes is a balance of 6GB of RAM with 3 Processor cores available. This new configuration allowed us to introduce two new CAS servers to more efficiently process requests across multiple nodes without any additional “upgrades” to the CAS roles.
During our statistical collection phase, the new configuration nodes had a 40% reduction in response time on RPC requests and Address Book requests:
Originally: 22 ms
Now: 13.2 ms
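As a quick check of the 40% figure, the reduction can be computed from the two response times above (the function name is just for illustration):

```python
def percent_reduction(before_ms, after_ms):
    """Relative drop in response time, as a percentage of the old value."""
    return (before_ms - after_ms) / before_ms * 100


reduction = percent_reduction(22, 13.2)  # roughly 40 percent
```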
Travis Sheldon
VP Network Operations, ExchangeDefender
(877) 546-0316 x757
travis@ownwebnow.com