
Fail. And Recovery.

Friends,

Please allow me to interrupt and offer my apology for the outage we experienced today. It was an unusual sequence of events, a perfect storm of sorts, that caused the outage.

What Happened:

  • Around midnight CST, one of the primary app servers experienced a hardware issue (between the motherboard and hard drive), but in a way that our monitoring systems did not detect.
  • The server was running, the web server was up, Ruby and other critical services were responding, and the hard drives were all responding. Yet the Blinksale app itself was not responsive. And to state the obvious, our monitoring systems didn’t catch this.
  • So in addition to experiencing the hardware failure on a key app server, the monitoring systems that were tracking a few dozen different aspects of network, hardware, and software systems, also failed to detect the specific (and unusual) type of hardware failure.
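
As an illustration only, and not a description of the monitoring actually in place at Blinksale, the gap described above (every component check passing while the app itself does not answer) is the kind of thing an end-to-end check is meant to catch. The sketch below is in Python and assumes a hypothetical marker string, `EXPECTED_MARKER`, that a fully rendered app page would contain. The function names are invented for this example.

```python
from typing import Callable, Tuple

# Hypothetical: text we assume appears on a page only when the app rendered it.
EXPECTED_MARKER = "Sign in"


def app_is_responsive(fetch: Callable[[], Tuple[int, str]]) -> bool:
    """Return True only if the application itself answers a real request.

    `fetch` performs one request and returns (status_code, body).
    """
    try:
        status, body = fetch()
    except Exception:
        # Timeouts, refused connections, and HTTP errors all count as "down".
        return False
    # A 200 from the web server is not enough; the app's own content must be there.
    return status == 200 and EXPECTED_MARKER in body


def http_fetch(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a page with the standard library, giving up after `timeout` seconds."""
    import urllib.request

    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

A monitor could call `app_is_responsive(lambda: http_fetch(url))` on a schedule, with `url` pointing at a page the app must render, and alert when it returns `False`, regardless of what the lower-level checks report.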

Our Response:

  • As soon as the issues were discovered, we put two engineers on the problem: one for hardware/network and one for software development.
  • Early this morning, we began communicating about the issues, primarily via Twitter, and posted a “noon – 2pm” target to be back online.
  • We responded to almost all tweets and customer support requests, most of them within an hour.
  • We posted email addresses, our support phone number, and my direct phone number on Twitter.
  • Today at about 2:00 pm, Blinksale went back online successfully.

Next Steps:

  • Improved monitoring coverage and redundancy in monitoring systems.
  • Additional app servers (both cloud-based and traditional).
  • Improved fail-over support for each server.
  • Consider off-site, DNS-based load balancing and fail-over setup.

Closing:

I am proud of our team, their response, and the end result today. However, I am very disappointed, especially in light of our infrastructure investments in the previous six months, that the outage happened in the first place.

We know that you count on us. I have talked to a number of you via phone today and I know how hard it is when you can’t get a critical invoice out so you can get paid.

On behalf of all of us here, please accept our sincere apology, along with our commitment to reducing downtime, continuing to improve the service, and building a better Blinksale than ever.

— bc

7 Comments

  • On February 25, 2010, Callum MacKendrick said:

    Recovery was quick, and the response was good. One thing that would have reduced panic to some degree would be to have had the status page acknowledging the problem, with the downtime ETA, up sooner. I didn’t see any information on the website until at least 12 hours had passed (starting around 11 MST) and only found the Twitter notices through Google.

  • On February 25, 2010, W. Gene Powell said:

    You may want to promote that beefed up infrastructure somewhere on your site. Prospective subscribers need to know it exists and current customers should be reminded that you have it. Personally, knowing that my data is safe and accessible is the most important aspect of your service.

    Thanks for your transparency during this process. That must have been one lousy way to start a day.

    Be well.

  • On February 25, 2010, Andrew Scott said:

    You did what you could and you were transparent about it – can’t ask for much more.

    Agree that getting a ‘maintenance’ page up ASAP helps, but I soon found your Twitter updates.

  • On February 25, 2010, Tony Oravet said:

    The Blinksale folks had a very quick response and the engagement with its users on Twitter was amazing. Thanks for your transparency during this process, and thanks for keeping us in the loop on the future plans to prevent something like this from happening again. I agree that having a page up sooner would have been nice, but you did what you could and you did it very well. Congrats on getting everything back online and keep up the great work! We are looking forward to the new features coming as well!

  • On February 27, 2010, Tom Broad said:

    I was not fully using Blinksale during this time, but reading this blog post update really gets me ramped about what seems like a great product and the company that stands behind it. bc, thanks for your transparency.

  • On March 5, 2010, Steph said:

    You guys certainly responded well under stress, and your response time was great under the circumstances. What a bizarre glitch.

    I’ve been happily using Blinksale for several years now and remain a loyal customer.

  • On March 23, 2010, Thomas said:

    Not sure if you’re still taking suggestions, but if you’re not familiar with http://www.freshbooks.com/, there is a lot with this online invoice service that is crave-worthy.

    Specifically, the ability to send ESTIMATES. I’ve seen it asked before, multiple times, over the last few years, to no avail. I would switch to freshbooks except I don’t like the fact that it makes the client click on a link to view the invoice/estimate, has no integrated PayPal link, and offers a paucity of nice-looking templates.

    What I did like was the pricing structure, emailing of PDFs, converting estimates to open invoices, ability to have a list of commonly used services you can choose from a drop-down menu when creating an invoice/estimate, and a few other bells and whistles.

    I also hope you fix the fact that every month I literally have to wait days for my account to reset (I believe I’m on the $6 or $9 a month plan), even though it says it has. Very frustrating.

    So I hope you are listening because being a long-term user, I’ve seen few and far between changes that should have been implemented at least a year ago. Thank you.
