A decade of deploying machine learning in data centers and telecom switching centers throughout the world has taught us a thing or two about risk and reliability management.
In the context of reliability engineering, risk is often defined as the probability of failure times the consequence of the failure. The failure itself, therefore, is only half of the risk consideration. The resulting consequences are equally, and sometimes more, relevant. Data centers typically manage risk with redundancy, which reduces the chance that a failure causes a business interruption. It does so by reducing the consequence of a single component failure: if one component fails, a redundant component ensures continuity.
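As a rough illustration of that definition, here is a minimal Python sketch. The failure probability and outage cost are made-up numbers, and the redundant case assumes the two units fail independently, which real systems may not satisfy.

```python
def risk(probability: float, consequence: float) -> float:
    """Risk as probability of failure times consequence of failure."""
    return probability * consequence


# Hypothetical numbers, chosen only for illustration.
P_UNIT_FAILS = 0.05       # chance that one unit fails in a given period
OUTAGE_COST = 200_000.0   # cost of a business interruption

# No redundancy: a single failure is enough to interrupt the business.
risk_single = risk(P_UNIT_FAILS, OUTAGE_COST)

# One redundant unit: an interruption now requires both units to fail.
# Assumes independent failures (an assumption, not a given).
risk_redundant = risk(P_UNIT_FAILS ** 2, OUTAGE_COST)

print(f"without redundancy: {risk_single:.0f}")
print(f"with redundancy:    {risk_redundant:.0f}")
```

The probability that a given unit fails is the same in both cases. What changes is what a single failure costs the business.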
When people talk about the role of machine learning in risk and reliability management, most view machine learning from a similar perspective – as a tool for predicting the failure of single components.
But this focus falls short of the true capabilities of machine learning. Don’t get me wrong: predicting the probability of failure is useful, and difficult to do. But it only has value when the consequence of the predicted failure is significant.
When data centers and telecom switching centers perform and operate as designed, the consequences of most failures are typically small. But most data centers don’t operate as designed, and the gap tends to widen the longer they run.
Vigilent uses machine learning to predict the consequences of control actions. We use machine learning to train our Influence Map™ to accurately predict the effects of cooling control actions, including what will happen when a cooling unit is turned on or off. If the Influence Map predicts that turning a particular unit off would cause a rack to become too hot, the system won’t turn that cooling unit off.
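The decision rule described above can be sketched in a few lines of Python. This is not the actual Influence Map. It assumes a hypothetical learned table, `influence_c`, that gives the predicted temperature rise at each rack when a given unit is turned off. The unit names, rack names, and temperature limit are all invented for illustration.

```python
from typing import Dict

# Hypothetical stand-in for a learned model: influence_c[unit][rack] is the
# predicted temperature rise (deg C) at `rack` if `unit` is turned off.
Influence = Dict[str, Dict[str, float]]


def predict_temps_if_off(
    unit: str, rack_temps_c: Dict[str, float], influence_c: Influence
) -> Dict[str, float]:
    """Predict each rack's temperature after turning `unit` off."""
    rise = influence_c.get(unit, {})
    return {rack: temp + rise.get(rack, 0.0) for rack, temp in rack_temps_c.items()}


def safe_to_turn_off(
    unit: str, rack_temps_c: Dict[str, float], influence_c: Influence, limit_c: float
) -> bool:
    """Allow the action only if no rack is predicted to exceed the limit."""
    predicted = predict_temps_if_off(unit, rack_temps_c, influence_c)
    return all(temp <= limit_c for temp in predicted.values())


# Illustrative data only.
rack_temps_c = {"rack_a": 24.0, "rack_b": 26.0}
influence_c = {
    "crac_1": {"rack_a": 1.5, "rack_b": 0.5},
    "crac_2": {"rack_a": 0.5, "rack_b": 6.0},
}

for unit in influence_c:
    print(unit, safe_to_turn_off(unit, rack_temps_c, influence_c, limit_c=30.0))
```

With these numbers, turning off `crac_2` is predicted to push `rack_b` to 32 °C, so the guard blocks that action while still allowing `crac_1` to be turned off.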
The same process can be used to predict the consequence of a cooling unit failure. In other words, the Influence Map can predict the potential business impact of a particular cooling unit failure, such as whether a rack will get hot enough to impact business continuity. This kind of failure analysis simultaneously estimates the redundancy of the cooling system.
This redundancy calculation doesn’t merely compare the total cooling capacity with the total heat load of the equipment. Fully understanding the consequence of a failure requires both predictive modeling and machine learning. Together, these technologies accurately model actual, real-time system behavior to predict and manage the cost of that failure.
This is why the distinction between failures and consequences matters. Knowing the consequences of failure enables you to predict the cost of failure.
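To make the contrast concrete, here is a hedged Python sketch that compares a simple capacity check with a consequence-based check. Both functions, the `influence_c` table (predicted temperature rise at each rack if a unit fails), and all numbers are assumptions for illustration, not Vigilent’s actual calculation.

```python
from typing import Dict, List


def capacity_check(unit_capacity_kw: Dict[str, float], heat_load_kw: float) -> bool:
    """Naive check: does capacity still cover the load if the largest unit fails?"""
    remaining = sum(unit_capacity_kw.values()) - max(unit_capacity_kw.values())
    return remaining >= heat_load_kw


def failure_consequences(
    rack_temps_c: Dict[str, float],
    influence_c: Dict[str, Dict[str, float]],
    limit_c: float,
) -> Dict[str, List[str]]:
    """For each unit, list the racks predicted to exceed the limit if it fails."""
    return {
        unit: sorted(
            rack
            for rack, temp in rack_temps_c.items()
            if temp + rise.get(rack, 0.0) > limit_c
        )
        for unit, rise in influence_c.items()
    }


def consequence_check(
    rack_temps_c: Dict[str, float],
    influence_c: Dict[str, Dict[str, float]],
    limit_c: float,
) -> bool:
    """Redundant only if no single unit failure overheats any rack."""
    consequences = failure_consequences(rack_temps_c, influence_c, limit_c)
    return all(not racks for racks in consequences.values())


# Illustrative data only.
unit_capacity_kw = {"crac_1": 30.0, "crac_2": 30.0, "crac_3": 30.0}
rack_temps_c = {"rack_a": 24.0, "rack_b": 25.0, "rack_c": 25.0}
influence_c = {
    "crac_1": {"rack_a": 2.0},
    "crac_2": {"rack_b": 3.0},
    "crac_3": {"rack_c": 8.0},  # rack_c depends heavily on crac_3
}

print(capacity_check(unit_capacity_kw, heat_load_kw=50.0))
print(failure_consequences(rack_temps_c, influence_c, limit_c=30.0))
print(consequence_check(rack_temps_c, influence_c, limit_c=30.0))
```

In this example the capacity check passes, because 60 kW of remaining capacity covers a 50 kW load. The consequence check fails, because losing `crac_3` is predicted to overheat `rack_c`. Capacity on paper says nothing about where the cooling actually lands.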
Some predicted failures might not require a 3 a.m. dispatch. In my next blog post, I’ll outline the material advantages of understanding consequences and the resulting effect on redundancy planning and maintenance operations.