May 4, 2017
Dear ExchangeDefender Partners,
Thank you for your patience and time spent working with us during the attacks we’ve been fighting for the past week. While I am happy to report that the issues that caused performance problems for our clients are largely contained, I cannot even begin to express how sorry I am for the business impact this has caused our clients. I am writing you this letter to explain what happened and how we responded.
On Monday, April 24th, the ExchangeDefender network received an unprecedented DDoS (distributed denial of service) attack, followed by a SPAM storm that we continue to mitigate as part of ExchangeDefender's core function: protecting your mailbox.
Late on Monday, April 24th, we were targeted by a 0-day Exchange exploit which attempted to load virus and ransomware content across our Windows network. Neither Microsoft nor our multiple antivirus vendors had an answer, leaving us to fight the potential infection and outage on our own. This particular exploit would attempt to change name servers (so that antivirus updates could not be downloaded) and compile the virus to spread across the network. Thanks to ExchangeDefender security and a quick response, it was not able to access data or compromise security, but out of an abundance of caution we started quarantining individual cluster nodes, mapping the exploit and responding to it. This was a manual process that required physical access to affected systems, emergency patching of other infrastructure involved in delivering Exchange services, additional resources for inspection and tracking of hacking activity, etc.
By Thursday, April 27th, we were in full incident containment mode, racing ahead of the attack. The load balancing solution had to be redone because users were not being properly distributed across the Microsoft CAS servers (the systems that handle Outlook Web App, Outlook connectivity and ActiveSync). Through the weekend and from Monday, May 1st to Wednesday, May 3rd, we continued to add servers and resources across the network to mitigate the combination of the attack and clients reconnecting to the network at new CAS access points.
During this time, access to the ExchangeDefender CAS servers was below our standards, and for that we apologize. We have expanded our capacity seven-fold (7x) across the Rockerduck, Louie and Gladstone clusters while addressing this issue. ExchangeDefender LiveArchive continued to perform unaffected throughout the incident window, and clients were able to continue communicating through it as a failover.
We sincerely apologize for the impact this had on your business and your clients. While there are certainly enough vendors to point to and blame for the combination of issues we faced, it is our responsibility to deliver the service we promise. The attack, and its severity, was truly out of our control, and we have done everything possible to contain it, preserve mail security and continue delivering service. Attacks, hack attempts, viruses, DDoS and SPAM storms are nothing new, and that is why you outsource this problem to us. The same issue has happened, and continues to happen, at other providers, as we all use the same technology; the difference here was scale. Everyone on my staff has put in extraordinary effort and hours to combat this issue, and we are truly sorry that even with everything we've done, some clients experienced excessive SPAM, Outlook disconnects and repeated password prompts, and had to resort to using Outlook Web App and ExchangeDefender LiveArchive to continue working.
Our biggest regret in this entire episode is communication. While we did everything we could to communicate our response and mitigation through @xdnoc on Twitter, our NOC site, and the Service Advisory section of our Support Portal tickets, we did not effectively communicate the strategy and our response as it unfolded during the attack. In 20+ years of providing email and security services, we have never seen anything of this scale, and our typical response changed as we continued to fight the various aspects of this attack. While our communication strategy was sufficient for isolated incidents, it was not good enough for a week of coordinated DDoS, SPAM storms, viruses and persistent hack attempts. In everything we do, your security and the privacy of your data are our first and primary concern, and the entire staff focused on that. As a result, we have changed how we communicate and advise clients on the extensive work that goes on behind the scenes.
The extended performance problems and Outlook/ActiveSync/OWA issues also exposed what a poor job we have done at promoting LiveArchive, our business continuity service designed to allow clients to work unaffected during outages and maintenance windows. The amount of resources and service redundancy that goes into delivering our Exchange services is staggering, but at the core it is still Exchange, and when Exchange is having issues we point clients to LiveArchive. We will prioritize extensive promotion of it to our partners and clients, as many we talked to over the past week were simply unaware of it.
The extent and severity of this attack was unprecedented, and the amount of resources we threw at solving the issues it caused was extensive. While these attacks and hack attempts were truly out of our control, they are why you outsource your Exchange to us, and we are deeply sorry that we did not better communicate our incident response and mitigation strategies as we fought them. We gave every effort and every resource we possibly could to mitigate the outbreak, but we failed to communicate as extensively as necessary to assure our partners and clients of every complex aspect we were addressing at the time. We apologize that this left our partners uninformed and many of you unaware of everything that was going on behind the scenes.
We have already made changes to our process and will communicate shortly on additional ways we will be handling communications going forward.
Own Web Now Corp
May 29, 2012
The work described below is scheduled to begin at 9:00 PM Eastern, May 29th, 2012.
- LOUIE – LOUIEMBOX1 & LOUIEMBOX2 – Update Network Driver
Will cause brief interruption, 15-30 seconds while the driver updates
- LOUIE – Update Exchange to Service Pack 2
Will not cause interruption to clients
- ROCKERDUCK – Reseeding databases between RDMBOX1 and RDMBOX2 for fail over
Will not cause interruption, but OWA users may see slight delays in accessing content (including public folders) since the replication will use the MAPI NICs instead of the replication NIC. This will run only at night so as not to flood the network during the day
- ROCKERDUCK – Redistributing disk layout on RDMBOX3
RDMBOX3 is one of the additional failover nodes and does not actively hold any mailbox databases
April 23, 2012
Update 5/2/2012: The reinstallation has completed and service has resumed normal operation.
Update 5/1/2012: We are in the process of reloading the operating system on LOUIEMBOX1. As of now the only affected service is mail enabled public folders, however, all mail will be queued until the reinstallation of Exchange has completed. We anticipate being done with all work by midnight.
Update: This has been scheduled for Friday at 3:00 PM Eastern.
In preparation for the reload of LOUIEMBOX1, we've moved all mailboxes hosted on it to other mailbox servers in the cluster. Last month we replicated all Public Folder content from LOUIEMBOX1 to all mailbox servers in LOUIE. The final step is to change all ExchangeDefender delivery points for LOUIE customers away from LOUIEMBOX1. This step should fall in line with our seamless upgrades and should not be noticed by clients. Unfortunately, in past experiences some public folders would not receive messages from the outside after a replica was taken offline. We believe this was a bug in Exchange that has since been resolved; however, we'd like to make all possibilities known ahead of time.
Due to the number of public folders and public folder content, we will be unable to validate mail delivery across all mail enabled public folders during the reload.
If any clients experience mail delivery delays to their public folders, please open a support request with the email address of the mail public folder and our support team will immediately look into the issue.
February 28, 2012
During the first half of March we will be performing upgrades to the LOUIE network which include adding mailbox servers, phasing out older servers, upgrading Exchange to SP2, and most importantly, DAG redesign.
On the first week (March 5th-9th 2012) we will add two new mailbox servers for LOUIE (one intended to phase out LOUIEMBOX1).
On the second week (March 12th – 16th 2012) we will create a new DAG for LOUIE and add two new mailbox databases into the DAG. Throughout the week users hosted on LOUIEMBOX1 will be moved to the new databases in the DAG. Finally once all users are moved from LOUIEMBOX1 we will begin replicating public folder content to the new mailbox servers.
All changes are intended to be transparent to users and should not interrupt service access.
February 23, 2012
Tonight beginning at 10:30 PM Eastern we will be performing the following maintenance
- Rockerduck Load Balancer
- Increasing physical resources. Restart required.
- Clients will be disconnected from their mailboxes for up to 5 minutes.
- LiveArchive: Los Angeles SP2
- Installing Service Pack 2 on Exchange 2010
- Client access will not be affected as we are currently running LiveArchive out of Dallas.
February 20, 2012
This weekend (02/24/12 – 02/25/12, 19:00 Eastern [00:00 GMT]) we will be performing SP2 upgrades to the Europe Exchange 2010 cluster, Della. The upgrade to Exchange 2010 SP2 will be performed on all passive nodes in Della. Upon successful upgrade, clients will be moved from the active server to a passive node. This upgrade is not expected to impact customer access; however, there will be critical changes prior to the upgrade.
· New load balancer will be activated across the passive nodes.
· IP address for cas.della.exchangedefender.com will be modified to the new load balancer (Expected to be 126.96.36.199)
On Friday evening users on the active node will be moved to the passive nodes. The switch over from active to passive should be transparent to users.
Unfortunately BES services may be interrupted as BES does not detect and handle upgrades seamlessly. If BES service is interrupted we will work on restoring service after SP2 has been successfully applied.
Update 3:05 AM: Exchange 2010 SP2 has successfully been applied to DELLA.
February 15, 2012
This Friday (02/18/12) beginning at 07:00 Eastern (-5 GMT) [Saturday, 23:00 NSW] we will be upgrading the Exchange 2010 cluster Matilda to SP2. Clients should expect to see minimal downtime as services are restarted. Maintenance is expected to last one hour.
Update 2/17/12 6:50 AM Eastern: We are preparing the server for installation of SP2. We are expecting to begin the installation around 7:30 AM.
Update 2/17/12 7:30 AM Eastern: We are beginning the installation of SP2 on matilda.
Update 2/18/12 8:55 AM Eastern: SP2 installation has completed. We are testing services to verify a successful installation.
Update 9:07 AM Eastern: The installation was successfully verified.
January 25, 2012
On Friday (1/27/12, 11:00 PM Eastern – 1:00 AM Eastern) and Saturday (1/28/12, 11:00 PM Eastern – 2:30 AM Eastern) we will be performing maintenance on Rockerduck to wrap up new additions to mailbox server high availability, which will disrupt service for a small population.
During maintenance we will be moving 5 active mailbox databases to new storage arrays to improve overall performance. Databases will be moved one by one, and only one database will lose availability at a time. During each database move, users on the respective database will be unable to access their mailbox on Rockerduck. Since we will be making architectural changes to the active mailbox databases, we will be unable to activate the standby copies, as the passive and active copies must reside in the same location across all nodes.
The current time estimates include a 30-minute buffer in case of unforeseen events. During maintenance users should expect to be disconnected from their mailbox; however, clients can utilize LiveArchive during the maintenance interval.
Update 11:15 PM: We are beginning work on RDMBOX1 and moving the path of RDDB1
November 1, 2011
Update 11:08 PM 11/17/11
DB3 has been mounted successfully on DEWEY. We've switched all users back to the original DB3 from the temporary DB. We will be seeding data from the temporary mailboxes back into the primary mailboxes.
Update 11:10 AM 11/17/11
The integrity check on DB3 completed around 10:00 PM Eastern on 11/16/11. Upon completion we began running isinteg before mounting the database to ensure any repaired corruption gets remapped properly in the database. The check is currently at 22% completion and is estimated to complete tonight. Upon completion we will switch all users that were on DB3 back to the live running DB3 and then merge mail from tempDB3 into DB3.
Update 9:24 AM 11/07/11
The integrity check and repair on DB2 completed early Sunday morning. After completing eseutil, we ran isinteg, which completed around 6 PM Eastern. Once we mounted DB2 and confirmed data, we began to seed the data from the temporary database back into the original user database. Unfortunately, some partners imported their users' previously cached data into the temporary mailbox instead of attaching it as an archive PST on the user's computer. We understand partners wanted to restore their customers to a normal state, but that wasn't the intention or purpose of the temporary mailbox. The restore process now must check some mailboxes with 36k+ items in the temporary mailbox, which puts an extreme delay on the restore time.
Update 9:59 AM 11/04/11
Users on DEWEY experiencing slow speeds can switch their Outlook anywhere server to deweycas2.dewey.exchangedefender.com for an immediate performance improvement.
Update 1:09 PM Eastern
The dial tone migration has completed and users are now able to access their mailboxes on the temporary database.
Update 12:30 PM Eastern
We will be performing a dial tone migration to DEWEYMBOX2 for users on the affected databases. A dial tone migration will allow users to reconnect to their mailbox on DEWEYMBOX2 via Outlook, OWA and ActiveSync; however, the mailbox will contain no information other than mail from the previous day, when the outage occurred, and any new live mail.
Users will see the following prompt after restarting Outlook
If the user wants to access their new mail they’ll select “Use Temporary Mailbox”
After the databases are back online we will move users back to their original databases and then restore mail from the temporary mailbox.
If the user does not receive the dial tone prompt, or stays disconnected after restarting Outlook, open their profile settings in Outlook and select 'Check Name' on the user.
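Once the original databases are restored, mail received in the temporary mailbox has to be merged back without duplicating the restored items. The sketch below is a simplification of that merge step (not the actual Exchange restore tooling; the message IDs and fields are hypothetical), and hints at why cached-data imports into the temp mailbox slowed restores: every temporary item must be checked against the original mailbox.

```python
# Illustrative sketch of merging a dial tone (temporary) mailbox back into
# the restored original mailbox. Not the actual Exchange restore process;
# message IDs and fields are hypothetical.

def merge_dial_tone(original, temporary):
    """Append items from the temporary mailbox that the restored original
    does not already contain, keyed by message ID."""
    seen = {msg["id"] for msg in original}
    merged = list(original)
    merged.extend(m for m in temporary if m["id"] not in seen)
    return merged

original = [{"id": "a1", "subj": "Pre-outage mail"}]
temporary = [
    {"id": "b2", "subj": "Mail received during dial tone"},
    {"id": "a1", "subj": "Pre-outage mail"},  # duplicate from a cached import
]

# The merge keeps both unique messages and skips the duplicate; each extra
# item imported into the temp mailbox adds another comparison to this pass.
mailbox = merge_dial_tone(original, temporary)
```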
On 10/31/11 at 3:45 PM, DEWEYMBOX1 suffered a major outage affecting the databases hosted on it. Around 3:15 PM our staff replaced a failed drive in the OS RAID for the server. As a result, the server began to rebuild the array, and we saw slightly increased queue sizes, to which we responded by issuing the original NOC report. Shortly after the rebuild began, the controller detected the new drive as bad and activated the global hot spare policy. Unfortunately, this action is what caused the DEWEY outage.
A few months ago, a drive failed in the RAID array holding the Information Store logs for DEWEYMBOX1. The hot spare policy activated and automatically repaired the array.
Yesterday, when the outage occurred, the global hot spare policy overrode the log array's own hot spare policy and forcefully took its drive to serve as a spare for the DB RAID array (which had a higher weight). Once the drive was removed from the logs array, the controller faulted and the log array went offline, causing the databases to shut down dirty.
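The precedence behavior can be illustrated with a small model (a simplification, not the controller's actual firmware logic; the array names and weights are hypothetical): each array's hot spare policy carries a weight, and the global policy hands a spare to whichever degraded array has the highest weight, even if that drive is already serving a lower-weight array.

```python
# Illustrative model of weighted global hot-spare reassignment. Not the
# actual RAID controller firmware; names and weights are hypothetical.

def pick_spare_target(arrays):
    """Return the degraded array with the highest hot-spare weight."""
    degraded = [a for a in arrays if a["degraded"]]
    return max(degraded, key=lambda a: a["weight"]) if degraded else None

def reassign_global_spare(arrays, spare_pool):
    """Global policy: give a spare to the highest-weight degraded array,
    stealing an in-use spare from a lower-weight array if the pool is empty."""
    target = pick_spare_target(arrays)
    if target is None:
        return None
    if not spare_pool:
        # No free spare: take the in-use spare from the lowest-weight array.
        donor = min((a for a in arrays if a is not target and a["spare_in_use"]),
                    key=lambda a: a["weight"], default=None)
        if donor:
            donor["spare_in_use"] = False
            donor["online"] = False  # losing its drive takes the donor offline
            spare_pool.append("donor-drive")
    if spare_pool:
        target["spare_in_use"] = True
        return target["name"]
    return None

arrays = [
    {"name": "DB",   "weight": 10, "degraded": True,  "spare_in_use": False, "online": True},
    {"name": "LOGS", "weight": 5,  "degraded": False, "spare_in_use": True,  "online": True},
]

# The DB array wins the spare; the LOGS array loses its drive and goes
# offline, mirroring the sequence that forced a dirty database shutdown.
winner = reassign_global_spare(arrays, spare_pool=[])
```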
This series of events led to roughly 10 uncommitted log files being lost per database. Since each database knows that there are uncommitted logs, the Information Store wouldn't mount any database after we replayed the available logs. Unfortunately, the only way to recover was to repair the databases.
Due to the sizes of the databases, the repair is an extremely lengthy process as each record in the database gets checked for corruption.
At this point we know of roughly 30 emails across all clients (not each) that were lost because of the automated forced removal; however, these emails can be recovered from LiveArchive. Any mail that wasn't committed by the transport server and delivered to DEWEYMBOX1 is still queued and pending delivery once the databases are activated.
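Why the store refused to mount can be sketched as a simple continuity check (an illustrative sketch, not Exchange's actual ESE recovery code; the generation numbers are hypothetical): the database records the last committed log generation, and mounting requires replaying an unbroken sequence of logs up to that point.

```python
# Illustrative sketch of transaction-log continuity checking. Not Exchange's
# actual ESE recovery logic; log generation numbers are hypothetical.

def can_mount(last_committed_gen, available_log_gens):
    """A database can only replay its logs if every generation up to the
    last committed one is present; any gap forces a repair instead."""
    required = set(range(min(available_log_gens), last_committed_gen + 1))
    missing = sorted(required - set(available_log_gens))
    return (len(missing) == 0, missing)

# Roughly 10 log generations were lost when the log array went offline:
logs_on_disk = list(range(100, 191))        # generations 100..190 survived
ok, missing = can_mount(200, logs_on_disk)  # header expects up to gen 200

# ok is False and `missing` lists the lost generations, so the store refuses
# to mount and the database must be repaired and integrity-checked instead.
```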
August 25, 2011
We will be updating MBOX1 and CAS2 on the ROCKERDUCK cluster tonight starting at 9:00 PM Eastern to Exchange 2010 SP1 RU4-v2
During the upgrade, users connected to CAS2 will automatically switch over to CAS1 as we will begin to drain the server connections around 8:45 PM Eastern.
Users with mailboxes on MBOX1 will have the passive copy of their database on MBOX2 activated around 8:50 PM Eastern.
The maintenance schedule should be transparent to users and should not interrupt or disrupt any service.
August 16, 2011
We've scheduled maintenance tonight for the ROCKERDUCK cluster, specifically on the DAG. We do not expect users to experience any interruption in service; however, a brief interruption is possible. We will monitor connection statuses during the maintenance and will update this post if any interruption is detected.
Update 1:45 AM Eastern: We will have to stop the cluster service on all nodes which will cause a momentary interruption in service as we repair communication.
Update 2:23 AM Eastern: The DAG on Rockerduck has been repaired and is back online.
August 11, 2011
Tonight we will be performing maintenance on DEWEYMBOX2 to address the performance issues reported by partners throughout the day.
Maintenance is expected to last between 10PM and 2AM.
1:46 AM EST – All maintenance on DEWEYMBOX2 has been completed successfully. All services have been restored and queued messages are flushing into users' mailboxes.
Again, we'd like to take this opportunity to apologize for this issue, and we appreciate your patience with us throughout the process.
Update 8:05 AM: We are going to reboot the CAS servers for DEWEY to clear up any issues before the start of business. Service may be interrupted for 15 minutes.
Update 8:15 AM: The domain controllers for DEWEY have come back online from the restart. We are still waiting for the primary CAS and MBOX server to come back online.
Update 8:45 AM: The primary CAS and MBOX server came up at 8:30AM and we’ve confirmed service access for RPC, MAPI, ActiveSync and POP/IMAP
July 28, 2011
We will be performing a database switch on the DAG for ROCKERDUCK from RDMBOX2 to RDMBOX1. Users currently being serviced by RDMBOX2 will automatically switch to RDMBOX1. No changes are required by end users to continue service, and the switchover should be transparent. The switch is estimated to last about an hour as we investigate performance counter issues on RDMBOX2.
July 27, 2011
At 3 PM Eastern (8 PM GMT) we will be rebooting servers in the DELLA cluster to apply windows updates which require a restart. We anticipate service to be slightly interrupted for up to 10 minutes while the reboot occurs.
[3:26 PM] We’ve prepared the servers for the reboot and will begin rebooting the MBOX servers shortly.
[3:41 PM] We are bringing all servers in DELLA back online.
[3:50 PM] Service has been restored to DELLA
July 21, 2011
On 7/20/2011, around 3:35 PM Eastern, we started experiencing random packet loss across various services, including Hosted Exchange and OWN Websites. Around 3:45 PM Eastern, the random packet loss turned into a widespread service outage that lasted until 4:12 PM Eastern.
The incident appears to have been caused by a faulty network driver on an Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately, the internal flooding prevented access to the network analytics servers behind the DMZ. Since all machines received and responded to the requests, all machines showed up as 'flooding' to the router, and the IDS was unable to determine the 'source' IP.
All services were essentially taken offline when the IDS started blocking traffic from the internal hosts. After we disconnected the offending machine from the network and cleared the IDS, we were able to resume service across the board.
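The ambiguity the IDS faced can be modeled in a few lines (a deliberate simplification of per-source anomaly counting; the hostnames and threshold are hypothetical): when every host replies to the flood, every host's packet counter exceeds the threshold, so counting alone cannot isolate the true origin.

```python
# Illustrative model of why per-source counting could not isolate the flood
# source. A simplification; hostnames and thresholds are hypothetical.
from collections import Counter

def flooded_sources(packets, threshold):
    """Return every source whose packet count exceeds the threshold."""
    counts = Counter(src for src, _ in packets)
    return {src for src, n in counts.items() if n > threshold}

hosts = ["mon1", "web1", "mail1", "db1"]
packets = []
for _ in range(50):
    packets.append(("mon1", "broadcast"))  # the real offender floods
    for h in hosts[1:]:
        packets.append((h, "reply"))       # every other host responds in kind

# All four hosts exceed the threshold, so the IDS flags everyone and the
# true origin is indistinguishable without pulling hosts off the network.
suspects = flooded_sources(packets, threshold=40)
```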
The biggest area of concern was the inability to contact us while the outage was occurring, as it took down our support board and primary phone lines. We deeply apologize for the grief and trouble that this unexpected event caused; it goes without saying that this has been the most impactful network event we've experienced. We've implemented a new redundancy plan for our phone systems to handle global outages, as this was the first time our phone systems were completely offline during a critical event.
We appreciate everything that our partners do for us and the patience that was extended yesterday as we definitely know that it was a very stressful event for our partners and their end users. As always we will continue to bring improvements to our solution stacks and address the areas where we may fall short.
June 9, 2011
Per our previous NOC posting, we’ve been redesigning our maintenance plan for rebalancing the user distribution on HUEY. The original plan to defrag the database was abandoned as the timeframe for completion was not acceptable.
Tonight starting at 9 PM Eastern we will be taking the HUEY database offline for about 5 minutes as we clear out the memory cache in preparation for tonight's mailbox moves. Throughout the night we will be moving users between two new databases to even out the load. During the move, mailboxes that are actively moving will be inaccessible to users, as Exchange 2007 does not support online mailbox moves. Upon completion, users will be able to access their mailbox on the new database. Move times will depend on mailbox size and item count.
Update 6:45 PM Eastern: After our previous update, our metric test completed and we noticed write lock delays on the OS drive for HUEY. We've made an adjustment to the outline above. Prior to starting the mailbox moves we will be performing a full database backup at the NTFS file level. Unfortunately, this means we will have to take the database offline, as we are capturing a raw file backup instead of a VSS backup. After the backup is completed we will scan the OS drive surface on HUEY for any corruption. We anticipate this entire process will take up to 4 hours to complete. We will update this post as progress is made, starting at 9 PM when work begins.
Update 8:30 PM Eastern: The backup job of the OS is taking a bit longer than expected. We are pushing back dismounting the database to 9:30 PM. We will update this blog after 9:00 PM if we anticipate the backup taking longer.
Update 8:50 PM Eastern: We've received a request from a few west coast customers asking us to postpone maintenance until 10. In the interest of disturbing service as little as possible, we will be postponing maintenance until 10 PM Eastern.
Update 10:45 PM Eastern: The backup is estimated to complete in the next hour. We will then begin the surface error test on the OS drive. This is estimated to be the longest part of the process and will require a disruption of service as we take the server offline. We estimate the entire process to be 4 hours as described earlier. We will update this post once the work begins.
Update 11:30 PM Eastern: We are beginning to dismount the mailbox databases and stop Exchange services.
Update 2:05 AM Eastern: The surface test has revealed issues on the OS drive. We are running a repair on the drive and monitoring its progress.
Update 4:17 AM Eastern: We’ve replaced a bad drive in HUEY on the OS drive and we are proceeding to perform integrity checks before turning on any services.
Update 5:47 AM Eastern: The integrity check failed and we will be restoring from the backup image taken prior to maintenance. We will continue to update this blog as progress is made.
To clarify the issue is specifically with the operating system and not the database integrity.
Update 9:44 AM Eastern: The restoration process is proceeding as planned; this is a courtesy update to assure partners that work is continuing.
Update 12:20 PM Eastern: The restoration surface test is underway and we are looking to confirm data consistency on the OS drive.
Update 1:57 PM Eastern: In order to achieve resolution in the fastest manner possible, we are beginning to concurrently restore the backup image on a spare server to eliminate any potential issues that may be affecting the physical host.
Update 7:00 PM Eastern: The integrity check has processed half of the files on the OS drive; overall progress is about 25 percent complete.
Update 9:00 PM Eastern: This is a courtesy update as the process above is still continuing successfully without any halts. We understand this is an urgent issue and we appreciate your patience with this process.
Update 1:30 AM: The integrity check has processed about 90% of the files on the drive; overall progress is about 75% complete.
Update 2:35 AM: The integrity check has completed and we've successfully booted Windows into safe mode. We are now proceeding to boot normally and resume services on HUEY.
Update 3:15 AM: Service on HUEY has been restored and all queued mail is being delivered to user mailboxes.
June 8, 2011
Tonight starting at 9:00 PM Eastern we will be taking the mailbox databases on HUEY offline to perform an offline defragmentation. We anticipate the scan will take up to 4 hours, which will leave mailbox access offline until the database is remounted. Clients can utilize LiveArchive during the maintenance window to continue working with live mail.
Update 8:35 PM Eastern: We will begin work in 30 minutes, starting with dismounting the database and copying it to a temporary storage drive, and then starting the offline defrag. After the defrag completes, we will mount the database from the temporary location and stress test the integrity. After we’ve assured integrity we will copy the database back to the active RAID controller.
We estimate that each step may take up to 2 hours to complete, but we will update this post along the way.
Update 9:10 PM Eastern: We are pushing back maintenance one hour as we rearrange the temporary storage iSCSI server to increase overall speed, expecting to lower the overall time.
Update 10:09 PM Eastern: We will begin the above outlined process in 5 minutes.
Update 11:00 PM Eastern: Based on current progress, we do not feel that we have enough time allocated for this process even with our earlier changes. We've remounted the current mailbox database and are formulating a new plan for a solution that can run in parallel.
We’ve received alerts on our monitor software about faults in the memory in HUEY. We’ve dismounted the database to avoid any corruption as we test the memory and replace if needed. We currently do not have an ETA for service restoration, but we will update this blog as information is obtained.
Update 2:50 AM Eastern: Service was restored at 2:30 AM and all services have been confirmed online.
May 26, 2011
We will be installing Exchange 2010 SP1 RU3 on the LOUIE cluster tonight beginning at 9PM Eastern.
During the upgrade on the CAS servers, users may see a brief disconnection while individual CAS nodes are upgraded, however, they should automatically connect to another available CAS server.
During the upgrade on the Mailbox servers, users may see a brief disconnection from public folders while individual Mailbox nodes are upgraded; however, they should automatically connect to the next available Public Folder replica.
During the upgrade on the HUB servers, users may see a brief delay for incoming mail, however, outgoing mail should not be affected.
This upgrade will allow us to seamlessly move mailboxes from LOUIE to our newest Exchange Cluster if partners would like to move.
Update 9:10 PM Eastern: We are beginning to upgrade the CAS servers in LOUIE
Update 10:15 PM Eastern: We have completed the upgrade to the CAS servers and will now begin on the HUB and Mailbox servers.
Update 11:04 PM Eastern: Service has been restored to the LOUIE network and we’ve confirmed access to our newest Exchange cluster.
May 11, 2011
On 5/19/11 at 10 PM Eastern we will be performing maintenance on the LOUIE Exchange 2010 network as we physically move the servers into a new cabinet. During the maintenance period, users will briefly lose connection and access to their mailboxes; however, service is expected to be restored shortly thereafter.
Prior to maintenance we will update our NOC blog with a more refined timeline.