Fault tolerance is about avoiding system-wide failures because of component-level faults

Last Updated on Jan 02, 2021

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?

Alan Kay in Dr. Dobb’s Journal (2012)

Systems that anticipate what can go wrong and are able to cope with it are said to be fault-tolerant (or resilient). But it is unrealistic to make a system tolerant of every possible kind of fault. Full fault tolerance may be necessary for life-critical systems or space travel, but for most systems it is far more pragmatic to tackle and alleviate only certain types of faults.

Fault-tolerant systems avert catastrophic failures when one or more parts of the system fail. A fault-tolerant design enables a system to continue operating, albeit at a reduced service level. A fault is therefore not a failure in itself: it becomes one only when it is left unchecked and causes the system as a whole to fail.
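To make "continue operations at a reduced service level" concrete, here is a minimal sketch of a fallback. The function name fetch_recommendations, the primary and fallback callables, and the "degraded" flag are all hypothetical names chosen for illustration; they do not come from any particular library.

```python
def fetch_recommendations(user_id, primary, fallback):
    """Try the full-quality source first; degrade instead of failing.

    `primary` and `fallback` are hypothetical callables supplied by the caller.
    """
    try:
        # Normal path: personalised results from the primary component.
        return {"items": primary(user_id), "degraded": False}
    except (TimeoutError, ConnectionError):
        # The component faulted, but the system still answers,
        # only with a generic (reduced) result.
        return {"items": fallback(), "degraded": True}


def broken_primary(user_id):
    # Stand-in for a dependency that is currently unreachable.
    raise ConnectionError("recommendation service unreachable")


def popular_items():
    # Stand-in for a cheap, always-available default.
    return ["bestseller-1", "bestseller-2"]


response = fetch_recommendations(42, broken_primary, popular_items)
```

Here the fault in the primary component is contained: the caller receives a generic list flagged as degraded rather than an error. Which exceptions count as tolerable faults is a design decision; this sketch assumes only timeouts and connection errors.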

Individual components tend to be designed and tested extensively for fault tolerance, though always under some assumptions about performance requirements and data volume. When special-purpose composite systems are built from smaller, general-purpose components, the responsibility shifts to the application to provide usage guarantees that may span more than one technology. Since each system is invariably a composite of many business functions and components, we need to think about and plan for fault tolerance in each system individually.

Applications today are more data-intensive than compute-intensive. Their challenges are primarily the amount and complexity of data and the speed at which it changes, as opposed to network speed or raw CPU power. Hardware faults are far less of a concern than they once were, now that redundancy is the default with most cloud providers. Most failures today cascade from faults at the software layer, specifically faults related to the volume, validity, or completeness of data.

Fault-tolerant systems are built to expect component failures. They isolate components by designing them to be as independent of each other as possible, and they isolate failures by handling timeouts, employing circuit breakers, and constructing bulkheads around each service.
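Of the techniques just named, the circuit breaker is the easiest to show in a few lines. What follows is a simplified sketch, not a production implementation: the class name CircuitBreaker, the exception CircuitOpenError, and the parameters max_failures, reset_after, and clock are assumptions made for this example. The breaker stops calling a dependency after repeated failures and allows a trial call once a cool-down period has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example breaker.call(fetch_profile, user_id) where fetch_profile is whatever function talks to the dependency, and treat CircuitOpenError as a signal to use a fallback. Timeouts belong inside the wrapped function itself, and a bulkhead would additionally cap how many calls to one dependency may be in flight at once; neither is shown here.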

© 2022 Ambitious Systems. All Rights Reserved.