Design for both component failure and the failure of the system as a whole. Rather than trying to avoid failure, focus on fast detection and response, supported by observability (see Software should be designed for observability from the ground up) and trustworthy on-call (‣).
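
One possible shape for "fast detection" is a heartbeat monitor: each component reports in periodically, and anything silent for longer than a timeout is flagged as failed. The sketch below is only an illustration; `HeartbeatMonitor`, `timeout_s`, and the component names are hypothetical and do not come from the referenced sources. Time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Flags a component as failed once its heartbeats stop arriving."""

    timeout_s: float  # hypothetical threshold; would be tuned per component
    last_seen: dict[str, float] = field(default_factory=dict)

    def beat(self, component: str, now: float) -> None:
        # Record the most recent time this component reported in.
        self.last_seen[component] = now

    def failed(self, now: float) -> list[str]:
        # A component counts as failed if it has been silent past the timeout.
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.beat("api", now=0.0)
monitor.beat("db", now=0.0)
monitor.beat("api", now=4.0)    # "db" goes silent after t=0
print(monitor.failed(now=6.0))  # ['db']
```

The monitor only detects; what happens next (restart, failover, paging on-call) is the "response" half and is deliberately left out here.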

References

http://www.v-wiki.net/design-for-failure/

‣ Good design doesn't rely on having too many highly accurate components in the system.

https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton.pdf

"Design for failure. This is a core concept when developing large services that comprise many cooperating components. Those components will fail and they will fail frequently. The components don’t always cooperate and fail independently either. Once the service has scaled beyond 10,000 servers and 50,000 disks, failures will occur multiple times a day. If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford [4, 5] has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed [7]."
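
A minimal sketch of the "never shut down normally, just hard-fail it" idea, assuming a toy service whose whole state is one counter stored in a JSON file. `CrashOnlyCounter`, `demo`, and the file layout are hypothetical names made up for this note, not anything from the paper. The point is that there is no clean-shutdown code at all: every start goes through the recovery path, so that path is exercised constantly.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Counter with no clean-shutdown path: startup always runs recovery."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.value = self._recover()

    def _recover(self) -> int:
        # The one and only startup path, so it is used on every start.
        try:
            with open(self.path, encoding="utf-8") as f:
                return int(json.load(f)["value"])
        except FileNotFoundError:
            return 0  # first start: nothing to recover

    def increment(self) -> int:
        new_value = self.value + 1
        # Write to a temp file, then atomically swap it in, so a crash at any
        # point leaves either the old state or the new one, never a torn file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"value": new_value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.value = new_value
        return new_value


def demo() -> int:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        svc = CrashOnlyCounter(path)
        svc.increment()
        svc.increment()
        del svc  # simulated hard failure: no shutdown hook, no final flush
        return CrashOnlyCounter(path).value  # restart goes through recovery


print(demo())  # 2
```

Because every update is durable before `increment` returns, a hard failure loses nothing that was acknowledged, and there is no separate "graceful" path that could drift out of sync with recovery.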