Today we had an issue where a Mailbox server did not fail over to our secondary or tertiary domain controllers. When this happens, the server stops reporting the status of its databases to the cluster. As a result, users on those databases were unable to connect. We have made changes to the domain controller logic within Exchange to reduce the possibility of a recurrence. In addition, because certain databases were dismounted, the transport would not accept mail, which caused a bottleneck in mail delivery for users on LOUIE.
As always, please remember that during these times you have https://admin.exchangedefender.com/livearchive.php at your fingertips to ensure that your clients continue to transact business. Remember, with ExchangeDefender they are never down.
Here’s our original post on Facebook regarding this:
Note from our CEO
Earlier today we received a DDoS attack aimed specifically at our DNS infrastructure. This caused our message processing to grind to a halt. We have made changes in our design to account for the attack pattern we encountered, and this type of issue should not repeat itself. We would like to apologize sincerely, especially as this came during a week when we finalized big changes that had shown EXCELLENT metrics and alleviated issues we were experiencing with whitelists and processing delays.
Please rest assured these are not the same issues, but rather a bout of terrible timing. As always, we will continue to make changes to improve the services we deliver to you and your clients. We have resolved the issues; however, we are now processing through a massive backlog.
mail2.exchangedefender.com is scheduled for a migration this weekend to new hardware that has been fully provisioned.
Any interruption in connectivity to the mail service for new mail is not expected to last beyond 5 minutes. However, the migration of the emails will take a few hours. This blog post will contain updates throughout the process. Again, we would like to reiterate that real-time usage will not be affected: your clients will see new email arrive immediately and send immediately.
We have had some questions regarding a slight increase in SPAM emails and, to a certain extent, latency. We have been making multiple hardware and software changes to mitigate the increase that has come into the system worldwide. This increase in mail/junk flow is largely due to the cPanel hack that infected thousands of servers with malware.
We have made it our top priority to monitor these large waves to ensure they’re mitigated in a timely fashion. We believe that the majority of these issues have been addressed by the changes we have made across the system. If the issues continue to arise, please rest assured that we are monitoring this and have made it our top priority to mitigate.
If you’d like to read further on the root of the problem, here are a couple of links.
We’re currently in the process of doing an infrastructure and design upgrade on this service. These upgrades will improve performance and stability, as well as squash some bugs that we currently have in the system. Please be aware that certain aspects of the service will be unavailable intermittently.
3/14/2014 Just a clarification, this is extended maintenance of an intermittent nature. We will provide updates as the progress continues.
We’ve experienced explosive growth in the current quarter and some of you have experienced a few of those growing pains with us. This weekend we will be working on a massive build out to our secondary data center on the mail processing side of the services. Los Angeles will not be available for mail processing this weekend. This should NOT have any impact on standard MTAs that can handle multiple DNS results.
The only true impact expected is on mail releases. Releases that reside in LA, which are generally far fewer because the sites are not at parity, will not release until the expansion is complete. Releases will remain queued throughout, and should go out once the expansion reaches its last stage and the servers begin to come online.
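To illustrate why a standard MTA should be unaffected, here is a minimal sketch of MX fallback: the sender tries the hosts returned by DNS in preference order and moves on when one does not answer. The hostnames and the `try_host` callable are hypothetical stand-ins for illustration, not our actual records or any particular MTA's code. Real MTAs also randomize among hosts of equal preference, which this sketch omits.

```python
def deliver_with_fallback(mx_records, try_host):
    """Try each MX host in preference order and return the first that accepts.

    mx_records: list of (preference, hostname) tuples, lower preference first.
    try_host:   callable taking a hostname, returning True if delivery worked.
    Returns the hostname that accepted the message, or None if all failed.
    """
    for _preference, host in sorted(mx_records):
        if try_host(host):
            return host
    return None


# Hypothetical example: the LA site is offline, the other site answers.
records = [(10, "la.mx.example"), (20, "east.mx.example")]
accepted_by = deliver_with_fallback(records, lambda h: h != "la.mx.example")
```

In this example the message is simply handed to the second host, which is the behavior we expect from any sender that handles multiple DNS results.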
Please follow the NOC above in case anything unexpected becomes affected.
-Update 4:55 EST this work has started
-Update 5:40 EST XD Data Replication is online, nodes to follow
-Update 6:55 EST 80% capacity moved for exchangedefender
-Update 11:09 EST All work completed across all services.
Sunday Post Report
We located an issue with the SPAM actions not being correct for all users, and we are working to resolve it this morning. If you receive a report of this for a message timestamped after noon today, please open a ticket. If the timestamp of the message is before noon today, please consider the issue resolved for your end users.
We received additional reports that some SPAM actions still weren’t functioning correctly. We have resolved this issue. If you see any affected emails timestamped after 7 PM EST, please let us know. If the timestamp is before then, please consider the issue resolved.
This weekend we will be performing maintenance on the DEWEY cluster. This will be done to improve the performance and overall experience for users who opted to remain on the legacy Exchange 2007 servers. Throughout the weekend users may notice a disconnect from Outlook while we move their mailbox to a new database. Note that this will not affect all users. When the move has been completed, Outlook will prompt the user with the following message:
Once the user clicks “OK” and restarts Outlook, they will be automatically connected to their mailbox on the new database.
We have launched phase 1 of LiveArchive4 successfully. This means that real-time email is available through https://admin.exchangedefender.com/livearchive.php.
Phase 2, which is in progress and will take a few weeks, is the migration of old items. To avoid any confusion we’ll cut to the bottom line: if your mail is not at the new URL, then it’s still at the old URL. So your email for up to a year is still with us.
Current gremlins we have noticed:
Dead: There WAS (as it was resolved yesterday) an issue with the RBL zone reloads that would cause some DNS timeouts and rejections to livearchive.
Current: This one is more of an annoyance than anything else but be aware. Due to the fact that we’re moving platforms, there’s no “database=>database” move. We’re moving content. Unfortunately, this is triggering read-receipts.
Phase 3. Vlad will blog about it once it’s ready.
We are currently doing massive redesigns to Compliance Archiving, so some aspects are not available. Your clients’ data is still safe; much of the front end is being re-balanced from the back end. By that I mean it’s inaccessible for viewing at times, but your data is fine. We’re working on balancing the rendering as well.
One of the benefits of ExchangeDefender is that we allow our clients to process both their inbound and outbound mail through ExchangeDefender and eliminate SPAM in both directions. Until now we allowed IP-based relay on our outbound network even if the domain name was not protected by ExchangeDefender. While this was a matter of convenience, the explosion of compromised servers and virus distribution is starting to create an “open relay” problem for ExchangeDefender, where third-party domains are relaying through our clients’ servers and forging the From: address.
Effective March 22nd, ExchangeDefender will only permit outbound relay from email domains that are protected by ExchangeDefender and have their MX records pointed to ExchangeDefender nodes. Any attempt to send email through ExchangeDefender by a third-party domain not protected by ExchangeDefender will be rejected.
If you have a business case scenario that requires you to relay third party domains that are not protected by ExchangeDefender (because you don’t own the domain name, do not have administrative control, shared domain with another organization) we recommend that you relay those messages directly via your SMTP service / IP address instead of attempting to route them through the ExchangeDefender smarthost.
Note: This policy change will not affect any legitimate traffic going through our outbound network. It only picks up illegal spoofed traffic based on the From: line (if it includes a domain name not protected by ExchangeDefender)
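As a rough illustration of the From:-based check described above, the sketch below extracts the domain from a From: header and compares it against a list of protected domains. The domain list and function names are hypothetical, and this is not our production filter. It also leaves out the MX-record requirement, which would need a DNS lookup.

```python
from email.utils import parseaddr

# Hypothetical set of domains protected by the service.
PROTECTED_DOMAINS = {"example.com", "clientdomain.example"}


def sender_domain(from_header):
    """Return the lowercased domain of a From: header, or '' if none."""
    _name, address = parseaddr(from_header)
    if "@" not in address:
        return ""
    return address.rsplit("@", 1)[1].lower()


def relay_allowed(from_header, protected=PROTECTED_DOMAINS):
    """Allow relay only when the From: domain is a protected domain."""
    domain = sender_domain(from_header)
    return domain != "" and domain in protected
```

With this list, `relay_allowed("Alice <alice@Example.com>")` is allowed, while a message with `From: spoof@thirdparty.example` would be rejected.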
There was an issue earlier today with one of the RBLs we use listing non-spam URLs. This resulted in issues where some clients’ outgoing messages were being blocked with a message from ExchangeDefender saying the message contained a Spam URL.
That issue was resolved, but a problem in the RBL’s own process then caused its lookups to time out and stop functioning entirely, which led to sporadic blocks on inbound mail. The fault was not on our servers, nor on the recipient or sender MTAs.
The issue is resolved completely now and mail flow should be back to normal. You may still get complaints for the next few hours as the issues get ironed out completely. We would like to extend our deepest apologies for the inconvenience caused by this to your clients. We’re in the process of rewriting some of the code to ensure that our dependence on some of our providers is not so absolute.
As an edit to address follow up questions:
This DNS issue would affect certain mail delivery and processing speeds. But we’re flushing through it as quickly as possible.
We’re working on UI enhancements for the Service Manager within our Support Portal. If you’re continuing to see any issues, please open a ticket with screenshots so we can make sure the development team is aware of any bugs they may have to address.
Our SIP provider is currently having issues. Remember, our portal is SLA backed 24/7. If you need help, open a ticket or a LiveChat within the portal for support issues.
Update: Our provider has resolved their issues.
We’re currently working on an issue with DB4 which is similar to the issues faced previously with DB3_1. The issue is related to the storage controller, and we are currently running an integrity check/repair on the database.
We sincerely apologize for the length of this issue. We understand this is extremely frustrating, as it is extremely frustrating for us to provide this level of service. This is currently not only our top priority but our only priority, and our top Exchange engineers are on this task and this task alone until it is resolved.
Update 6:00 PM Eastern: We are still performing work on DB4 for the affected mailboxes. We would like to extend the option for users to utilize “recovery mode” / dial tone. Recovery mode is a temporary new “blank” mailbox where users can work with their live mail (mail received after the outage occurred) as they normally would. Since the recovery mailbox is a “new” mailbox, there won’t be contacts, calendars, etc. If you have ActiveSync users or iPhone users, you should have them set their Email, Calendar, and Contact sync to manual, as the phone will otherwise sync with the empty mailbox.
When a mailbox is manually switched to a temp database, Outlook will prompt the user to either use their Temporary data (new data) or open their previous cached mailbox data.
The Outlook “Recovery Mode” prompt is a feature introduced in Outlook 2007 to safely handle mailbox or server-side issues that lead to mailbox configuration changes. After the server goes through a dial tone migration, Outlook notices the content change in the mailbox and locks the old OST cache file so it does not download any new mail. Once Outlook goes into Recovery Mode, the user receives a prompt upon launching an Outlook profile asking whether to use “Old data” or “Temp data”. In simple terms, Old data refers to the OST cache file on the client machine, and Temp data provides online access to the mailbox to view new items that arrive. While in Temp data, Outlook runs in Online mode rather than Cached mode, so the overall experience will be slower for users who are used to the speed of Cached mode.
If a partner would like to utilize a recovery mailbox for mailboxes in their domain, we need a new support request opened with the following subject:
Dialtone Request: domain.com
Where domain.com is the client domain. By opening the request as a new request this will allow us to ensure all requests are properly completed and documented.
January 2 2013
Update 4:30 PM Eastern: The repair / integrity check is near 55% completed on the last step. We anticipate being able to mount the database by 8 PM Eastern tonight, after which we will deliver all queued mail to the mailboxes. Once all queues are flushed, we will swap users who are in dial tone mode back to their home database and then merge the new and existing data. We want to thank everyone again for their patience, as we’ve had a positive response to the dial tone mailbox setups.
Update 5:45 PM Eastern: We’ve successfully mounted DB4 on LOUIE. We’ll be resuming mail delivery shortly.
Users on LOUIE DB3_1 are currently unable to access their mailboxes due to a hardware-level issue with the storage device. The database was a temporary holding spot for mailboxes that were being moved off DB3, which was being phased out on LOUIE. Unfortunately, since this was a temporary database, there was no anticipated need to make multiple database copies available, as the extra overhead could cause production performance issues. To rectify the issue we’ve been running a hardware RAID-level resync to ensure all data is in the best state possible. We do anticipate the issue being resolved before the start of business on Monday.
As a reminder, partners are able to use LiveArchive (https://livearchive.exchangedefender.com) to access their live running mail during the outage.
Update 9:07 AM: We are currently running an eseutil integrity check on the log files for DB3_1. This check will ensure that all the data present in the database is correct and that any uncommitted data can finally be committed. This process is expected to take up to two hours to complete. If the replay and check complete without error, we will then be able to safely mount the database. All mail that hasn’t been delivered is still queued and waiting for delivery.
Update 5:30 PM: The database was successfully mounted after the integrity check completed. In order to prevent this from recurring, we are adding a database copy for this database. Please keep in mind that this database was a transitional database, used temporarily to move users from DB3, which was in the process of being phased out. We’re extremely sorry for the inconvenience this has caused, and we will include transitional databases in DAG setups going forward.
The servers in the Australia network (Exchange 2007) have suffered a major failure, which has ultimately led to the decision to decommission them, as a restoration would not bring service back online seamlessly (without recovery and re-setup by clients).
Unfortunately, the recovery from Australia is not something we believe will be completed in the next 48 hours at a minimum. To accommodate this and provide users with an immediate solution to continue working, we’ve proactively recreated all users from Australia (2007) on Matilda (2010). Matilda has redundancy and we’ve already sent over an additional server to add to the Matilda cluster.
The new server address is cas.matilda.exchangedefender.com and the autodiscover record should be set up as autodiscover.matilda.exchangedefender.com.
We’ve changed the target delivery location in ExchangeDefender to Matilda. You can also log in to LiveArchive to work with mail while the changeover to Matilda occurs.
Partners who have clients with cached Outlook profiles can open the current profile (on Australia) and then export the cached data to .pst files. For users who do not have cached data, and for any public folder data, we will not have an ETA on restoration for at least 48 hours (until we can process the drive’s data).
Update 10/25/12 9:45 PM Eastern: We are in the process of copying over items from livearchive to matilda mailboxes to import the past three days of mail from livearchive. We anticipate the process being completed in under 4 hours.
This morning around 3:30 AM Eastern we received an alert that the hosted Exchange 2010 network in Australia (MATILDA) went offline. Upon logging in via KVM we noticed the server was repeatedly rebooting after faulting to a blue screen; however, no diagnostic information was provided. It was soon determined that the operating system needed to be repaired, which completed around 5:20 AM Eastern. Once we were able to successfully boot into Windows, we performed a database integrity check to ensure no actual data was corrupted or lost. The check completed around 7:45 AM Eastern and service was restored.
Once service was restored we looked into the server logs which provided no information or logged entries regarding the server fault, however, a memory dump was created. Unfortunately the memory dump wasn’t of much help and the issue appeared to possibly be hardware related. We temporarily switched service for Matilda over to the backup node as we tested the hardware components. We were able to determine that the issue was related to a power supply that randomly dropped output voltage during high load. After replacing the power supply we ruled that the server faulted while running a backup. With regard to the OS repair we can only deduce that a system file was corrupted from the fault and needed to be replaced.
Over the weekend our Linux web hosting infrastructure was upgraded to PHP 5.3. The previous release, PHP 5.1.6, was getting out of date for a lot of the social applications that dominate today’s deployment base. You will now be able to run the latest WordPress, Joomla and other CMS and shopping cart sites without a problem.
If you have any legacy PHP applications that were written in PHP4 and early PHP5 days, you may notice that some functionality does not work. Consult your software vendor for an update that functions with PHP 5.3 and above.
If your application is not supported on the new PHP platform and you do not have the resources to make an immediate switch to an alternative, we do have a www2legacy web infrastructure in place that will support older applications. We will make the accommodations and the switch to the legacy platform for a small one-time fee, but do keep in mind that the legacy platform will be discontinued in March of 2013. We have provided more than a year’s notice of our intention to upgrade our network infrastructure to the latest PHP, so if this caught you by surprise we will still do what we can to make sure you have a smooth transition.
For a list of features in PHP 5.3.x and old deprecated functions:
This is probably an excellent place to start your troubleshooting on outdated code. If you decide you’d like to stay with the legacy code, or if you’d like further information, please open a support request at https://support.ownwebnow.com.
This Friday night we will be performing maintenance on the ROCKERDUCK cluster that may interrupt client connections for up to 10 minutes. The purpose of this upgrade is to provide more intelligent network routing for our mailbox database replication service. We will work our hardest to make the required changes without interrupting client connections.
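If you want a quick first pass over your own code before contacting your vendor, a rough scan along the following lines may help. This is only a sketch: the list is a subset of the functions the PHP manual marks as deprecated in 5.3, the function name `find_deprecated_calls` is our own, and the simple pattern match can flag false positives (for example, a user-defined function named `split`).

```python
import re

# Subset of functions the PHP manual lists as deprecated in PHP 5.3.
DEPRECATED_IN_53 = (
    "ereg", "eregi", "ereg_replace", "eregi_replace", "split", "spliti",
    "session_register", "session_unregister", "session_is_registered",
    "set_magic_quotes_runtime", "mysql_db_query", "mysql_escape_string",
    "call_user_method", "call_user_method_array",
)

# Match a bare call such as `ereg(`; skip names that are part of a longer
# identifier (preg_split), a variable ($split) or a method call (->split).
_CALL = re.compile(
    r"(?<![\w$>:])(" + "|".join(DEPRECATED_IN_53) + r")\s*\(",
    re.IGNORECASE,
)


def find_deprecated_calls(php_source):
    """Return (line_number, function_name) pairs for suspect calls."""
    hits = []
    for number, line in enumerate(php_source.splitlines(), start=1):
        for match in _CALL.finditer(line):
            hits.append((number, match.group(1).lower()))
    return hits
```

Calling `find_deprecated_calls(open("page.php").read())` on each file (the filename here is just an example) gives you a list of line numbers to review by hand.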
We anticipate starting the work around 9PM on Friday August 24th.