Fault-Tolerant Design

Jul 24, 2022

The very first fault-tolerant computer was the Samočinný počítač, or SAPO, a Czechoslovak machine designed in the early 1950s and put into operation in 1957. It could tolerate the failure of an arithmetic unit: it had three parallel arithmetic logic units and decided on the correct result by voting [1].
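Voting among redundant units is easy to sketch in software. The snippet below is only an illustration of the idea, not a description of how SAPO worked internally; `vote`, `redundant_add`, and the two adder functions are hypothetical names made up for this example. Three "units" compute the same sum, and the result reported by a strict majority wins.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def vote(results: Sequence[T]) -> T:
    """Return the value reported by a strict majority of units, or raise."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was reported by more than half of the units.
        raise RuntimeError(f"no majority among results: {list(results)!r}")
    return value


def redundant_add(a: int, b: int, units: Sequence[Callable[[int, int], int]]) -> int:
    """Run the same addition on every unit and vote on the answer."""
    return vote([unit(a, b) for unit in units])


def healthy_adder(a: int, b: int) -> int:
    return a + b


def faulty_adder(a: int, b: int) -> int:
    # Simulated hardware fault: the result is off by one.
    return a + b + 1


if __name__ == "__main__":
    units = [healthy_adder, faulty_adder, healthy_adder]
    print(redundant_add(2, 3, units))
```

With one faulty unit out of three, the two healthy units outvote it. If all three units disagree, this sketch raises an error instead of guessing.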

The obvious use cases for fault-tolerant computers are systems where maintenance or repairs are extremely hard to do (spacecraft) or where failures are extremely costly (nuclear power plants).

But fault-tolerant design can be useful in everyday programs, too. Networks are unreliable. Nodes fail in Byzantine ways. Humans write bugs. And at scale, all sorts of black swans happen: cosmic rays flip bits, data centers catch fire, code gremlins show up, etc.

Fault tolerance can mean graceful degradation, failovers, replication, or automatic repairs. Fault-tolerant building blocks like Kubernetes or the Erlang Virtual Machine (BEAM) make it easier to write programs that behave more predictably and continue working through unexpected failures. When you have a fault-tolerant design, you can let an individual component fail and trust the system to recover.
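As a rough in-process analogy for letting a component fail and restarting it, here is a minimal supervisor sketch. It is not how Kubernetes or BEAM implement restarts; `supervise`, `make_flaky_worker`, and their parameters are hypothetical names invented for this example. The worker contains no error handling of its own. It simply raises, and the supervisor decides whether to start it again.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def supervise(task: Callable[[], T], max_restarts: int = 3,
              backoff_seconds: float = 0.0) -> T:
    """Run task; if it raises, restart it up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception:
            if restarts >= max_restarts:
                # Restart budget exhausted: escalate the failure to the caller.
                raise
            restarts += 1
            # Linear backoff before the next attempt (zero by default).
            time.sleep(backoff_seconds * restarts)


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a worker that fails a fixed number of times, then succeeds."""
    attempts = {"count": 0}

    def worker() -> str:
        attempts["count"] += 1
        if attempts["count"] <= failures_before_success:
            raise ConnectionError("simulated transient failure")
        return f"ok after {attempts['count']} attempts"

    return worker


if __name__ == "__main__":
    print(supervise(make_flaky_worker(2)))
```

The design choice here is that recovery policy lives in one place (the supervisor) rather than being scattered through the worker as try/except blocks.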

You also end up with simpler programs. Ironically, you have to write less error-handling code. In my experience, fault-tolerant systems are also easier to debug at the application layer. The cyclomatic complexity of these programs is often lower: with fewer error-handling branches, there are fewer paths through the code and fewer states that the program can be in.

[1] No computer can be completely fault-tolerant. The SAPO was destroyed in a fire in 1960.