My good friend Moans (who posts to his blog about as often as I do) recently wrote up his current opinion on high availability here. His post resonated with me, as I’ve had a fair amount of experience with system availability over the years. While ruminating on his post, I dug up some excellent work on the subject from HP (one of the biggest proponents of "five nines" availability during the First Internet Gilded Age). This presentation by William Sawyer is particularly good, as is this paper by Tzvi Chumash.
I’m particularly proud of the reliability of the system I’m currently a part of. Thanks to a good design, talented engineers, and very good operations folks, we have a system that consistently achieves high availability. One thing that I like about the system is that it isn’t very complex, and its redundancy features are especially simple.
I have to agree with most of the authors above. High-availability features can be very costly and error-prone to design, build and maintain in their own right. A lot of ideas look good on paper, and sound good coming from the vendor, but in the end, your own people have to own and maintain them. And if they can’t, for whatever reason, you’ve just wasted a ton of time and money.
This doesn’t mean that you don’t need higher levels of availability. It just means that you need to make sure you capture all of the costs, including training and drills for the people responsible for the system.
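To put rough numbers on what each extra "nine" buys, here is a minimal sketch of the usual downtime arithmetic. The function name `downtime_minutes_per_year` is my own for illustration, and it assumes a 365-day year with downtime counted around the clock, so treat the output as a ballpark rather than an SLA definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines.

    For example, nines=3 means 99.9% availability.
    """
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: {minutes:8.1f} minutes/year")
```

Five nines works out to a little over five minutes of downtime a year, which is less time than it takes most people to answer a page. That is part of why the people costs matter as much as the hardware.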
High availability is achievable, but getting there and staying there is much easier if you start with a realistic assessment of what it takes.
(I also found this article interesting)