Please allow me to interrupt and offer my apology for the outage we experienced today. It was caused by an unusual sequence of events, a perfect storm of sorts.
- Around midnight CST, one of the primary app servers experienced a hardware issue (between the motherboard and hard drive), but in a way that our monitoring systems did not detect.
- The server was performing, the web server was up, Ruby and other critical services were responding, and the hard drives were all responding. Yet, the Blinksale app itself was not responsive. And to state the obvious, our monitoring systems didn’t catch this.
- So in addition to the hardware failure on a key app server, the monitoring systems, which were tracking a few dozen different aspects of our network, hardware, and software, failed to detect this specific (and unusual) type of hardware failure.
- As soon as the issues were discovered, we put two engineers on the problem: one for hardware and networking, and one for software development.
- Early this morning, we began communicating about the issues, primarily via Twitter, and posted a “noon – 2pm” target to be back online.
- We responded to almost all tweets and customer support requests, and most within an hour.
- We posted email addresses, our support phone number, and my direct phone number on Twitter.
- Today at about 2:00 pm, Blinksale went back online successfully.
Here is what we are working on, or considering, to prevent a repeat:
- Improved monitoring coverage and redundancy in monitoring systems.
- Additional app servers (both cloud-based and traditional).
- Improved fail-over support for each server.
- Consider off-site, DNS-based load balancing and fail-over setup.
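
As a rough illustration of the monitoring item above: the failure went unnoticed because the server, web server, and services all answered while the app itself did not. A check that catches this would request a page only the application can render and treat anything else, including a timeout, as a failure. This is a minimal sketch, not our actual monitoring code; the URL, the marker string, and the function names are all hypothetical.

```python
import urllib.request
from typing import Callable

# Hypothetical values for illustration; not Blinksale's real endpoint or response.
HEALTH_URL = "https://app.example.com/health/deep"
EXPECTED_MARKER = "invoices-ok"


def http_fetch(url: str, timeout: float) -> str:
    """Fetch a URL and return its body as text; raises on HTTP errors or timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def app_is_healthy(
    fetch: Callable[[str, float], str] = http_fetch,
    url: str = HEALTH_URL,
    marker: str = EXPECTED_MARKER,
    timeout: float = 5.0,
) -> bool:
    """Return True only if the app itself produced the expected marker in time."""
    try:
        body = fetch(url, timeout)
    except Exception:
        # Timeouts, refused connections, and HTTP errors all count as unhealthy.
        return False
    # A web server that answers without the app's marker is still a failure.
    return marker in body
```

The `fetch` parameter exists so the check can be exercised with a stand-in function, without touching the network. A response that arrives but lacks the marker, such as a generic error page from the web server, is reported as unhealthy.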
I am proud of our team, their response, and the end result today. However, I am very disappointed, especially in light of our infrastructure investments in the previous 6 months, that the outage happened in the first place.
We know that you count on us. I have talked to a number of you via phone today and I know how hard it is when you can’t get a critical invoice out so you can get paid.
On behalf of all of us here, please accept our sincere apology, along with our commitment to reducing downtime, continuing to improve the service, and building a better Blinksale than ever.