Tuesday, January 17, 2012

Availability and Fault-Tolerance

I got this discussion of the difference between availability and fault-tolerance from Quora...



Edmond Lau, Quora Engineer
While availability and fault tolerance are sometimes conflated, the two terms actually refer to different requirements. Designing for high availability is a stricter requirement than designing for high fault tolerance.

Availability is a measure of a system's uptime: the percentage of time that a system is actually operational and providing its intended service. Service companies, when offering service level agreements (SLAs) to their customers, usually quantify their availability in "nines" of availability. Carrier-grade telecommunication networks claim "five nines" of availability [1, 2], meaning that the network should be up 99.999% of the time and experience no more than 5.26 minutes of downtime per year. Amazon's S3 covers three nines of availability (99.9% uptime) in its SLA [3] and offers a service credit if it is down for more than 43.2 minutes in a 30-day month.
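The downtime figures above follow directly from the availability percentage. As a rough sketch, assuming a 365.25-day year and a 30-day month (the function and constant names are illustrative and not from the original answer), the arithmetic looks like this:

```python
# Period lengths in minutes. Both are assumptions made for this sketch.
MINUTES_PER_YEAR = 365.25 * 24 * 60       # averages in leap years
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200 minutes


def downtime_budget(availability_percent, period_minutes):
    """Return the minutes of downtime allowed in a period at a given availability."""
    return period_minutes * (1 - availability_percent / 100)


five_nines_per_year = downtime_budget(99.999, MINUTES_PER_YEAR)
three_nines_per_month = downtime_budget(99.9, MINUTES_PER_30_DAY_MONTH)

print(f"{five_nines_per_year:.2f} minutes per year")      # 5.26 minutes per year
print(f"{three_nines_per_month:.1f} minutes per month")   # 43.2 minutes per month
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why moving from three nines to five nines is such a large jump in practice.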

Fault tolerance refers to a system's ability to continue operating, perhaps with gracefully degraded performance, when components of the system fail. RAID 1, for example, mirrors data across multiple disks and so tolerates the failure of a disk [4]. Running a hot MySQL slave that can be promoted to master if the master fails, or eliminating Hadoop's NameNode as a single point of failure [5], are other examples of making a system more fault tolerant.
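The mirroring idea can be illustrated with a toy in-memory model. This is only a sketch of the concept, not how RAID 1 is actually implemented: the Disk and MirroredStore classes are hypothetical, and the sketch ignores rebuilding a replaced disk.

```python
class Disk:
    """Hypothetical in-memory stand-in for a physical disk."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, key, value):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[key] = value

    def read(self, key):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[key]


class MirroredStore:
    """Writes every block to all healthy disks; reads from the first healthy one."""

    def __init__(self, disks):
        self.disks = disks

    def write(self, key, value):
        written = 0
        for disk in self.disks:
            try:
                disk.write(key, value)
                written += 1
            except IOError:
                continue  # skip the failed mirror, keep serving writes
        if written == 0:
            raise IOError("all mirrors have failed")

    def read(self, key):
        for disk in self.disks:
            try:
                return disk.read(key)
            except IOError:
                continue  # fall through to the next mirror
        raise IOError("all mirrors have failed")


store = MirroredStore([Disk("disk0"), Disk("disk1")])
store.write("block-7", b"payload")
store.disks[0].failed = True          # simulate a disk failure
recovered = store.read("block-7")     # still served from disk1
print(recovered)
```

The store keeps answering reads after one disk fails, which is the fault-tolerance property. It says nothing about how long the system is down for other reasons, such as upgrades.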

Making individual components more reliable and more fault tolerant is a step toward making the overall system more highly available; however, a system can be fault tolerant and still not be highly available. An analytics system based on Cassandra, for example, with no single points of failure, might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.
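To put a number on that last point, here is a small sketch. It assumes, purely for illustration, a 30-day month in which the system suffers no downtime from component failures but takes one 60-minute outage for an upgrade; the variable names are made up for the example.

```python
def availability_percent(downtime_minutes, period_minutes):
    """Return uptime as a percentage of the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


period = 30 * 24 * 60          # assumed 30-day month, in minutes
fault_downtime = 0             # no outage caused by failed components
maintenance_downtime = 60      # assumed one-hour upgrade window

measured = availability_percent(fault_downtime + maintenance_downtime, period)
meets_three_nines = measured >= 99.9

print(f"{measured:.3f}% available; meets 99.9%: {meets_three_nines}")
```

Under these assumptions the system comes out at roughly 99.861%, so a single hour of planned downtime is enough to miss a three-nines target even though no fault ever took the system down.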

--------
[1] http://en.wikipedia.org/wiki/Car...
[2] http://www.windriver.com/announc...
[3] http://aws.amazon.com/s3-sla/
[4] http://en.wikipedia.org/wiki/RAID
[5] http://www.cloudera.com/blog/201...