In this post, I develop the ideas that I first verbalised in the summary of the paper “Digital twin for battery systems: Cloud BMS with online SoC and SoH estimation”. I also spoke with folks from cloud battery analytics companies such as Accure Battery Intelligence, Nortical, and Volytica.

Cloud-based battery intelligence systems typically do the following:

Li et al. proposed that estimates of batteries’ degradation parameters should be sent from the cloud to the onboard battery management system (BMS) because they could be more accurate than the estimates that the onboard BMS could compute on its own.

I think this might be a reasonable approach only for very small batteries, such as those of electric bikes, but not for the batteries of electric vehicles or energy storage systems. The onboard BMSs of the latter should estimate the degradation parameters on their own.

Estimating battery parameters in the cloud and making the onboard BMS rely on these estimates exposes the energy infrastructure to highly impactful risks

State-of-charge and degradation parameter estimates that a battery’s onboard BMS regularly receives from the cloud are a form of control. And centralising control in the cloud exposes the energy infrastructure to some highly impactful risks, considering that batteries are safety-critical systems: they can catch fire, explode, and harm people nearby.
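
To make this concrete, here is a minimal sketch in C of what a cloud-to-BMS estimate message might look like and how the onboard BMS might act on it; all field names, the assumed 100 Ah pack, and the derating policy are my own illustrative assumptions, not anything from Li et al. Whoever produces these numbers effectively decides the battery’s current limits.

```c
#include <stdint.h>

/* Hypothetical cloud-to-BMS message: the cloud's latest parameter estimates. */
typedef struct {
    uint64_t timestamp_ms;       /* when the cloud produced the estimate */
    float    soc;                /* state of charge, 0.0 .. 1.0 */
    float    capacity_ah;        /* estimated remaining capacity (degradation) */
    float    internal_res_mohm;  /* estimated internal resistance */
} cloud_estimate_t;

/* The BMS derives its charge/discharge limits from these estimates, so the
 * cloud effectively controls the battery. The derating policy below is a
 * made-up placeholder. */
void apply_cloud_estimate(const cloud_estimate_t *est,
                          float *max_charge_current_a,
                          float *max_discharge_current_a)
{
    const float nominal_capacity_ah = 100.0f;     /* assumed pack size */
    float health = est->capacity_ah / nominal_capacity_ah;

    *max_charge_current_a    = 50.0f  * health;   /* tighter limits for */
    *max_discharge_current_a = 150.0f * health;   /* a degraded battery */
}
```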

The most trivial risk: developers deploy a new version of the cloud software that estimates battery parameters, and this version has a bug due to which the software makes incorrect estimates, stops processing the telemetry from the batteries (which is required to make new estimates), or stops responding when batteries try to fetch new estimates from the cloud.

One can say that developers can mitigate this risk by testing the software in staging environments, rolling out new software versions carefully, applying the new logic only to small batches of batteries at a time, monitoring errors and quality metrics, etc. Yes, this is all possible, but in practice, developers tend to cut the “unnecessary” steps of the deployment process as long as everything goes well: see Rasmussen’s model of how accidents happen. The only way to combat this dynamic is to have hard boundaries that developers cannot circumvent. For example, engineers at Amazon AWS can’t make operational changes or deployments across Availability Zones, which I think is one of the reasons why AWS has had far fewer serious outages than Google Cloud Platform.

In the case of batteries, if parameter estimators are built into the onboard BMS and battery operators (rather than BMS developers) are responsible for deploying BMS software, then new BMS versions will arrive at batteries very slowly, over the course of months. If a new BMS version has bugs that cause problems with batteries, operators are likely to discover them and report them to developers while the new version is deployed on only a small portion of all batteries that use that particular BMS.

Even if developers do everything “right”, unless the battery parameter estimators are implemented on top of multi-datacenter, multi-cloud computing infrastructure, the cloud-centric estimation software remains exposed to datacenter and cloud provider outages.

Finally, there are two risk factors that are the least probable but cannot be mitigated even in principle when batteries depend on cloud software for their parameter estimates. Terrorists can hack into the cloud and send wrong estimates to batteries. And the internet can stop working due to a grid blackout, which is exactly the moment when batteries must remain operational and reliable.

Fallback from cloud to local parameter estimators is risky, expensive, and will become obsolete as embedded computing becomes more efficient

Li et al. suggest mitigating the risks that I described above by providing a fallback from the cloud parameter estimators to less accurate estimators that run on the onboard BMS: “The functions which are required at each time point during operation should also run locally, guaranteeing the system safety. An advanced version of these functions will run in the cloud with advanced algorithms, which provide higher accuracy while requiring high computation power.” However, I think there are multiple problems with this architecture.

First of all, it doesn’t address the main risk of cloud estimators: a faulty deployment that starts to send wrong parameter estimates to batteries. Tasking the onboard BMS with distinguishing between “good” and “bad” estimates is very fragile and implies that the onboard BMS is at least as “smart” as the estimation software in the cloud: but then, why wouldn’t it make good estimates itself?
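
A sketch of what such a “bad estimate” filter on the onboard BMS could look like (the function name and thresholds are my own placeholders) makes the circularity visible: to arbitrate, the BMS must already hold a local estimate of comparable quality.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical plausibility check the onboard BMS would run before trusting
 * a cloud estimate. The thresholds are placeholders. */
bool cloud_estimate_is_plausible(float cloud_soc, float cloud_capacity_ah,
                                 float local_soc, float local_capacity_ah)
{
    const float max_soc_gap         = 0.05f;  /* 5 percentage points */
    const float max_capacity_gap_ah = 5.0f;

    /* To reject a bad cloud estimate, the BMS has to compare it against its
     * own estimate, and the check is only as trustworthy as that local
     * estimate. If the local estimator is good enough to arbitrate, it is
     * good enough to be used directly. */
    if (fabsf(cloud_soc - local_soc) > max_soc_gap)
        return false;
    if (fabsf(cloud_capacity_ah - local_capacity_ah) > max_capacity_gap_ah)
        return false;
    return true;
}
```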

In general, implementing a fallback that works not just in theory but in practice can be very hard, and it is also subject to sociotechnical problems similar to the “lazy developers” problem that I described above: since the fallback logic is not exercised in normal operation, it will receive little attention and testing from developers, and this will go unnoticed until a big failure happens. Jacob Gabrielson (working for AWS) wrote a good article on this topic, “Avoid fallback in distributed systems”; I will not repeat his points here.

It also takes a lot more effort to develop, test, and maintain two separate sets of parameter estimators (one for the cloud and another for the onboard BMS) plus the fallback logic than a single set of algorithms running on the onboard BMS.

Finally, the assumption that cloud estimators could actually be noticeably more accurate than algorithms running on the onboard BMS is eroding. Battery parameter estimation algorithms take just a few time series as their inputs. They are not comparable with real-time video analysis or language-related algorithms that either process a lot of data or require large models that cannot fit into embedded computers (however, this too may soon change due to the very quick progress in machine learning model optimisation). Therefore, there should be algorithms that can run on embedded computers and estimate battery parameters with errors only marginally higher than the errors of algorithms that can only run on big servers in the cloud. Time and effort spent on improving the algorithms themselves and the battery models (rather than on setting up reliable cloud infrastructure for computing and delivering estimates to batteries and maintaining the fallback) will likely bring much higher gains.
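
For illustration, here is a sketch of a local state-of-charge estimator (Coulomb counting with a crude open-circuit-voltage correction; all constants and the OCV mapping are placeholders, not a validated cell model) whose inputs are just the current and voltage time series the BMS already measures and whose arithmetic is trivial for any embedded MCU.

```c
#include <stdbool.h>

#define NOMINAL_CAPACITY_AH 100.0f      /* assumed pack capacity */

typedef struct {
    float soc;                          /* state of charge, 0.0 .. 1.0 */
} soc_estimator_t;

/* Crude open-circuit-voltage to SoC mapping; a real BMS would use a
 * calibrated, temperature-dependent table for the specific cell chemistry. */
static float soc_from_ocv(float cell_voltage_v)
{
    float soc = (cell_voltage_v - 3.0f) / (4.2f - 3.0f);
    if (soc < 0.0f) soc = 0.0f;
    if (soc > 1.0f) soc = 1.0f;
    return soc;
}

/* One estimation step: Coulomb counting, corrected against the OCV curve
 * whenever the battery has rested long enough for the voltage to relax. */
void soc_update(soc_estimator_t *est,
                float current_a,        /* measured pack current, + = discharge */
                float cell_voltage_v,   /* measured cell voltage */
                float dt_s,             /* time since the previous sample */
                bool  at_rest)          /* true when current has been ~0 a while */
{
    /* Integrate the measured current over time. */
    est->soc -= (current_a * dt_s) / (NOMINAL_CAPACITY_AH * 3600.0f);

    /* Slowly pull the estimate towards the OCV-based value to correct drift. */
    if (at_rest) {
        est->soc = 0.9f * est->soc + 0.1f * soc_from_ocv(cell_voltage_v);
    }

    /* Clamp to the physical range. */
    if (est->soc < 0.0f) est->soc = 0.0f;
    if (est->soc > 1.0f) est->soc = 1.0f;
}
```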

On the other hand, as embedded computers become cheaper and more efficient, more and more advanced algorithms can be executed on them. This trend will not reverse, and I think it will make the architecture where battery parameters are estimated in the cloud mostly obsolete by 2025.