Part of Design for failure and Rasmussen's model of operational boundaries.

This idea is also reminiscent of overfitting in ML.

In the context of batteries: the onboard BMS (battery management system) should detect anomalies and report failures itself. A minimal sketch of the idea is below.
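
A minimal sketch of what "detect and report itself" could look like onboard. The class names, thresholds, and reporting channel are illustrative assumptions, not from any real BMS spec:

```python
from dataclasses import dataclass

@dataclass
class CellReading:
    voltage_v: float
    temp_c: float

def check_cell(reading: CellReading) -> list[str]:
    """Return a list of anomaly descriptions for a single cell."""
    faults = []
    if not 2.5 <= reading.voltage_v <= 4.2:  # illustrative Li-ion safe window
        faults.append(f"voltage out of range: {reading.voltage_v:.2f} V")
    if reading.temp_c > 60:                  # illustrative over-temperature limit
        faults.append(f"over-temperature: {reading.temp_c:.1f} C")
    return faults

def bms_self_report(cells: list[CellReading]) -> None:
    """Detect anomalies onboard and report them, rather than waiting
    for an external monitor to notice a failure after the fact."""
    for i, cell in enumerate(cells):
        for fault in check_cell(cell):
            print(f"FAULT cell {i}: {fault}")  # stand-in for a CAN/telemetry message

bms_self_report([CellReading(3.7, 25.0), CellReading(4.5, 63.0)])
```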

Related:

References

Niall Ferguson: for system robustness, it is better to be generally paranoid than very specifically prepared. (Examples: Taiwan's COVID-19 response, versus America's highly specific financial-regulation checks before the 2008 crisis.)

https://people.eecs.berkeley.edu/~brewer/papers/GiantScale-IEEE.pdf

The traditional metric for availability is uptime, which is the fraction of time a site is handling traffic. Uptime is typically measured in nines, and traditional infrastructure systems such as the phone system aim for four or five nines ("four nines" implies 0.9999 uptime, or less than 60 seconds of downtime per week). Two related metrics are mean-time-between-failure (MTBF) and mean-time-to-repair (MTTR). We can think of uptime as: uptime = (MTBF – MTTR)/MTBF.

Following this equation, we can improve uptime either by reducing the frequency of failures or reducing the time to fix them. Although the former is more pleasing aesthetically, the latter is much easier to accomplish with evolving systems. For example, to see if a component has an MTBF of one week requires well more than a week of testing under heavy realistic load. If the component fails, you have to start over, possibly repeating the process many times. Conversely, measuring the MTTR takes minutes or less, and achieving a 10-percent improvement takes orders of magnitude less total time because of the very fast debugging cycle. In addition, new features tend to reduce MTBF but have relatively little impact on MTTR, which makes it more stable. Thus, giant-scale systems should focus on improving MTTR and simply apply best effort to MTBF.
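
A quick sanity check on the arithmetic above (a minimal sketch; the function and parameter names are mine, not from the paper):

```python
# Brewer's availability formula: uptime = (MTBF - MTTR) / MTBF
def uptime(mtbf_hours: float, mttr_hours: float) -> float:
    return (mtbf_hours - mttr_hours) / mtbf_hours

week_hours = 7 * 24

# "Four nines" = 0.9999 uptime -> at most ~60 s of downtime per week:
print((1 - 0.9999) * week_hours * 3600)         # ≈ 60.5 seconds

# Cutting MTTR is as leveraged as stretching MTBF, and far easier to measure:
print(uptime(mtbf_hours=168, mttr_hours=1.0))   # ≈ 0.9940 (fail weekly, 1 h repair)
print(uptime(mtbf_hours=168, mttr_hours=0.1))   # ≈ 0.9994 (10x faster repair)
print(uptime(mtbf_hours=1680, mttr_hours=1.0))  # ≈ 0.9994 (10x rarer failures, same result)
```

The last two lines illustrate the paper's point: a 10x improvement in MTTR yields the same uptime as a 10x improvement in MTBF, but verifying the MTTR improvement takes minutes of debugging-cycle time rather than many weeks of load testing.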