Friday – Sunday, January 5-7, we will be conducting routine network equipment maintenance.
Friday – Sunday, January 19-21, we will be conducting routine network equipment maintenance.
Remaining LiveArchive clusters will be moving from Los Angeles to Dallas. The infrastructure and the network layer are 100% owned by us and the DC exceeds our regulatory compliance requirements. We are moving our LiveArchive infrastructure away from our primary DC/network operations to improve fault tolerance and prepare for critical/catastrophe features that will be coming to LiveArchive in 2018.
Impact: We expect minimal service interruptions during this time as these systems are redundant. However, per our policy, any time there is a significant move of network resources and a potential for isolated outages, we put up a notice.
Our primary DC will be performing maintenance this weekend as outlined below:
EVENT ID: COLO402112012-NM
WINDOW START TIME: 1:00 AM CST
ESTIMATED END TIME: 4:00 AM CST
SERVICES/EQUIPMENT: Colo4 IP Plus/Value Network
PURPOSE OF WORK: Hardware maintenance/code upgrade
IMPACT OF WORK: Momentary loss of connectivity as the IP Plus primary router is rebooted to accommodate the code upgrade.
Colo4 will be performing hardware maintenance and a software image upgrade. A reboot of the router will be required.
IP Plus/Value customers can expect a brief period of connectivity loss while the router is rebooted.
Over the course of the next few days Own Web Now Corp will be undergoing massive upgrades to our network infrastructure across our Dallas and Los Angeles data centers. We will take every precaution to limit service interruptions. All maintenance will be performed well outside of business hours, and because the upgrades are being made to redundant systems, many users will not notice any issues at all.
The upgrades will impact all of our global services: Web hosting, ExchangeDefender, Exchange Hosting and more. Keep in mind that these are not individual computers – the work is extremely complex and the equipment is very sensitive. While we have planned our course of action very carefully, there is always a possibility that the availability of certain services will be impacted or degraded. We will work diligently to communicate those issues here so you can stay in the loop. Here is a summary of maintenance work:
Thursday, September 15th, 2011
9 PM EST – ExchangeDefender in Los Angeles will be receiving new processing clusters for mail logs, inbound and outbound mail as well as LiveArchive.
Friday, September 16th, 2011
Noon – ExchangeDefender in Los Angeles will receive a new redundant LiveArchive network to improve geographic redundancy of our business continuity systems.
Saturday, September 17th, 2011
Midnight EST – Power maintenance. We will be upgrading our power feeds, PDUs and UPS infrastructure across ExchangeDefender, hosting and more. We anticipate the work to be completed by 9 AM EST.
Saturday Noon EST – ExchangeDefender inbound network upgrade. We are adding 25% more capacity to compensate for the growth in subscriber base.
Sunday, September 18th, 2011
3 AM – 8 AM EST – ExchangeDefender will introduce geographic redundancy to our encryption, web file sharing and inbound mail routing capabilities.
All of these enhancements have been on the drawing board for months but were obviously reprioritized after the outage our Exchange 2010 network experienced in August. We moved extremely aggressively to make sure we provide a fully geographically redundant network. Yes, experiencing one power-related outage in 10 years is not bad, but we expect 100% uptime and this upgrade will help ensure it.
On 7/20/2011 around 3:35 PM Eastern we started experiencing random packet loss across various services including Hosted Exchange and OWN Websites. At roughly 3:45 PM Eastern, the random packet loss turned into a widespread service outage that lasted until 4:12 PM Eastern.
The incident appears to have been caused by a faulted network driver on an Exchange monitoring server. Upon automatic recovery of the driver, the machine began to flood nearby network switches with invalid requests. Unfortunately the internal floods prevented access to the network analytic servers behind the DMZ. Since all machines received and responded to the requests, all machines showed up as ‘flooding’ to the router and the IDS was unable to determine the ‘source’ IP.
All services were essentially taken offline when the IDS started blocking traffic from the internal hosts. After we disconnected the offending machine from the network and cleared the IDS, we were able to resume service across the board.
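For readers who want the timeline as numbers, the times quoted above can be turned into durations with a few lines of Python. This is only an illustrative sketch: the three timestamps come from this post and are approximate ("around", "roughly"), and the variable and function names are made up for the example.

```python
from datetime import datetime

# Approximate times quoted in this post, all Eastern on 7/20/2011.
packet_loss_start = datetime(2011, 7, 20, 15, 35)  # random packet loss begins
outage_start = datetime(2011, 7, 20, 15, 45)       # widespread outage begins
service_restored = datetime(2011, 7, 20, 16, 12)   # outage ends


def minutes_between(start, end):
    """Return the whole number of minutes between two datetimes."""
    return int((end - start).total_seconds() // 60)


degraded_minutes = minutes_between(packet_loss_start, outage_start)
outage_minutes = minutes_between(outage_start, service_restored)
total_minutes = minutes_between(packet_loss_start, service_restored)

print(f"Packet loss before the outage: {degraded_minutes} min")
print(f"Widespread outage: {outage_minutes} min")
print(f"First symptom to recovery: {total_minutes} min")
```

Going by the quoted times, that is about 10 minutes of packet loss followed by about 27 minutes of widespread outage.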
The biggest area of concern was that we could not be contacted while the outage was occurring, because it took down our support board and primary phone lines. We deeply apologize for the grief and trouble that this unexpected event caused; it goes without saying that this has been the highest-impact network event that we’ve experienced. Because this was the first time our phone systems were completely offline during a critical event, we’ve implemented a new redundancy plan for our phone systems to handle global outages.
We appreciate everything that our partners do for us and the patience that was extended yesterday as we definitely know that it was a very stressful event for our partners and their end users. As always we will continue to bring improvements to our solution stacks and address the areas where we may fall short.
As of 9:15 AM EST all services are back online.
We are addressing several performance issues on mail1.ownwebnow.com. We expect the resolution in a few minutes and will update this post at 9:15. As of 9:00, ability to send/receive mail from Own Web Now Linux Web hosted mailboxes is down.
Cause: Kernel page fault
On November 25th between 17:00 – 19:00 Eastern OwnWebNow will be upgrading the OS version on its Value bandwidth router. Estimated period of impact is no more than fifteen minutes. During this time, the core router will require a reboot, interrupting service momentarily.
We have posted a remarkable quarter and have focused exclusively on the features and onboarding support. Traditionally, that has been the biggest pain point expressed by our partners and the biggest obstacle to us as a growing organization. In order to provide the level of service and assurance required going forward, we’ve had to reprioritize how we do business in Q4 2009 and Q1 2010.
This weekend we will be releasing the first set of patches for the issues that have plagued nearly all of our products. Additionally, we are migrating away from several technologies that have, frankly, failed us and failed our partners for the level of quality you should expect from Own Web Now. We are not willing to point any fingers. You and your customers trust us to deliver rock solid solutions and at the end of the day it’s our decision to make the right choices.
Unfortunately, the reality is that we haven’t always made the right choices in the past two quarters, and in April we will be moving to address those issues across all products and services.
There will be extended maintenance hours each weekend outside of business hours as we move to address the many problems noted in the NOC blog over the past few months.
As always, I appreciate your business, and I speak every day to partners who love what we’re able to do. That is what drives us and my team strives to deliver more every single day. That is what we are good at – competing. We have dedicated 2010 to raising Own Web Now to the next level, so I hope you will pardon the dust throughout April as we hold extended maintenance hours to bring a level of consistency.
I promise you that each and every step taken, to address some of the issues the few of you have been very vocal about, will be fully communicated through this blog.
We will be performing a power (electricity) upgrade during the regularly scheduled monthly interval tomorrow at 11 AM EST – Noon EST (Saturday, July 24). This upgrade will affect a portion of our ExchangeDefender hosting network, including the Exchange hosting service on the HUEY network.
While we do not expect a service outage, we will be testing the remote reboot switch and service recovery of the new power system, including a lights-out test (full power outage, followed by a power restore).
We are conducting these tests and maintenance intervals to make sure we never interrupt you during work hours and keep our streak of rock solid Exchange hosting performance. It comes with a price tag of expensive hardware and extensive maintenance. Thank you for your understanding and your business.
At roughly 5:10 PM EST we experienced a routing issue that resulted in massive packet loss on our corporate subnet. It took approximately 4 minutes to restore services to 100%, although access was available almost immediately.
This issue would not have affected services as our corporate subnet only hosts OWN services (Shockey Monkey, support portals, monitoring systems).
While most people likely never noticed, this is now an open issue because the BGP connection should have failed over instantly. It was covered in real time on our Twitter feed.
We experienced a brief power failure at 10:20 PM Central in our Dallas 3 data center. The outage affected roughly 20 servers and was caused by a PDU (power strip) that tripped. It affected a very small section of the ExchangeDefender network; because that network is fully redundant, the service itself was not affected at all. As a precaution, we have taken the affected nodes out of the scanning pool until their hourly reload of network configuration, at which point they will resume normal operations.
ExchangeDefender uptime, availability and load were not affected as this is a very low activity time window and a very small portion of the scanning network. As our storage arrays are not on the same PDU as the scanning nodes, there was no interruption or delay in scanning service.
We must have angered the Internet gods because this Monday has been nothing short of tremendously disappointing. Pictured below is my staff working on the issues:
On to the specifics:
ExchangeDefender reports did not run last night and will likely remain offline until close of business today. We have had two switch crashes on our load balancers in front of our shared mail1 and www1 hosting services. Our offsite backup upgrade does not seem to be validating the certificate requests, so https:// requests are failing. (http:// still works fine, and data is encrypted on the client side so the transport mechanism isn’t as relevant, but if you’ve set https:// your backups are failing, so we are treating this as a very serious issue.)
Somehow, the roof is still above us and we have power. For now.
My teams are working through all the outstanding issues, and service will be restored to 100% across the entire product portfolio by the end of business today.
Update: As of 5 PM EST the ExchangeDefender reporting is back online, all the network issues have been resolved. The Offsite Backup service is still available via http:// but we are still working with AhSay to get the certificate issue resolved. Will update further on this as soon as I have more information.
Update: As of 11 PM EST all offsite backup grids now respond with the valid SSL certificates on the SSL port.
Looks like the ugly Monday is finally behind us.
Vlad Mazek, CEO
As you may be aware, we have two data centers in Los Angeles on Wilshire Blvd. Earlier today, this area suffered a 5.8 magnitude earthquake. No systems were affected, no impact on any power feeds or network connections. Earthquakes tend to be followed by smaller “aftershocks” and we will be updating this post with details of any relevant information that may become available.
Our Los Angeles data center carrier has suffered an HVAC failure, and the connectivity to the network has been severed for the time being. The facilities team is in touch with the building owner, and service restoration is under way. All services provided by this data center are unfortunately affected and down at the moment.
Services affected: some ExchangeDefender, some SharePoint Hosting, some Virtual Servers.
We will update this ticket when all services have been restored. This ticket is ranked urgent. Our priority will be to restore services that are not redundant first: virtual servers, followed by SharePoint hosting.
Update (@ 3:00 AM PST -8 GMT, 6 AM EST -5 GMT): We expect SharePoint and Virtual Server services to be restored around 6 AM PST (-8 GMT). ExchangeDefender services are not impacted (please be patient with SPAM releases however). We will update this ticket at 6 AM or when services start coming back online.
Update (@ 3:44 AM PST -8 GMT, 6:44 AM EST -5 GMT): All services have been restored.
Total LA DC1 outage: 53 minutes.
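The updates above quote each time in both PST (-8 GMT) and EST (-5 GMT). For anyone translating those times into another zone, here is a minimal Python sketch of the conversion using fixed offsets. The calendar date is a hypothetical placeholder, since the post does not state one, and the helper name `pst_to_est` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets exactly as quoted in the updates: PST is -8 GMT, EST is -5 GMT.
PST = timezone(timedelta(hours=-8), "PST")
EST = timezone(timedelta(hours=-5), "EST")

# Hypothetical placeholder date; the post does not give the calendar date.
PLACEHOLDER_DATE = (2000, 1, 1)


def pst_to_est(hour, minute):
    """Convert a PST wall-clock time on the placeholder date to EST."""
    local = datetime(*PLACEHOLDER_DATE, hour, minute, tzinfo=PST)
    return local.astimezone(EST)


# The two update times, plus the 6 AM PST restoration estimate.
for hour, minute in [(3, 0), (3, 44), (6, 0)]:
    converted = pst_to_est(hour, minute)
    print(f"{hour}:{minute:02d} PST is {converted.hour}:{converted.minute:02d} EST")
```

The first two conversions reproduce the pairs quoted in the updates (3:00 AM PST is 6:00 AM EST, and 3:44 AM PST is 6:44 AM EST); the 6 AM PST estimate corresponds to 9 AM EST.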
For the past 10 hours or so we have been handling an 820% surge in reboot requests for Microsoft servers that hung after the latest security patches were applied. Our managed network of Windows 2003 servers has not been affected, but a huge portion of our network apparently has. Please be advised.
If your Windows Server becomes inaccessible as a result of the latest patches please open a ticket request and mark it as urgent. You will not be charged for the support request and your reboot will be handled with the highest priority. We have an additional shift on hand in all data centers to help you through this network event.
At roughly noon Central time we completed the upgrade of our Dallas DC4 network. The bandwidth upgrade brings in another 100 Mbit of connectivity from Level3 and 100 Mbit of connectivity from Cogent, primarily for the offsite backup service, which has experienced tremendous growth over the year.
Our Los Angeles DC2 will be undergoing a similar update by Thanksgiving, along with plans to open the third data center in the Los Angeles area by the start of 2008.
We have received a number of support tickets inquiring about the stability of our San Jose data center (MAE West) following the 5.6 magnitude earthquake last night. While a 5.6 magnitude earthquake is significant, it has posed no issues to our data center or any infrastructure located there. All equipment in our West Coast data centers (Los Angeles, San Jose and Seattle) is rack mounted in four-post closed racks, and even a significant quake would not pose any immediate danger to any equipment inside the building.
Thank you for your concerns and your well wishes to our staff. Everyone is safe and sound and the network is as well. Your sympathies are appreciated nonetheless.