How Did I Live Without My AI-Driven Cooling?

Driving the other day, I decided to grab a quick bite to eat on the way home. I quickly launched my maps application, searched on the type of food I wanted, picked a local place, made sure they had decent reviews, checked their hours, and started the navigation guidance to tell me how to get there quickly.

When I did this, I was hungry. A few seconds later, I was on my way to solving that issue.

But I didn’t stop to think that I was using a pocket-sized computer to triangulate my position on the globe from satellites. I didn’t consciously overlay a series of restaurants from a back-end database on top of that map, or integrate it with a reviews database and location-specific information about the restaurant and its hours of operation. Nor did I evaluate the different routes from my current location to the restaurant and decide which one to take.

This was all on auto-pilot. I decided I wanted food, looked up restaurants, made sure the food was good, the place was open, and went. This took just seconds of my time.

We get so much information from simple swipes and glances that we forget what’s really guiding all of those interactions under the hood.

The ways we live, work, drive, and interact have all gone beyond the scope of what many of these technologies were originally designed to do.

And it only makes sense that this sort of distillation of technology to simplify our lives has also found its way into the data center, especially with the advancement of artificial intelligence for optimization and operation of cooling systems. Data Center Knowledge described recent advancements in an article on machine learning.

We’re not quite at the fully automated, human-to-computer interfaces seen in futuristic shows like Star Trek, but the day is rapidly approaching when you can “make it so.” Just like the technology above, you’ll wonder how you ever managed without AI-driven cooling™.

In an AI-driven data center, you can already:

  • Continually monitor conditions on your console or mobile device, from anywhere
  • Know which racks have redundant cooling so you can orchestrate variable workloads automatically
  • Identify the effects of “hands in the network” by viewing real-time or time-sequenced heat maps and data
  • See where the cooling is being delivered using Influence Maps™
  • See when floor panels haven’t been put back or blanked
  • Verify that work has been completed successfully using data and performance metrics (and hold vendors accountable)
  • Review anomalies that result from unexpected behavior even if they have already been mitigated by AI-driven cooling, and then review the data to see what and where you need to focus

This real-time information is immediately and continually visible from your dashboard. Walking the floor is only necessary for physical or configuration changes.

You can already see – and prove – whether you really need that new CRAC, or whether shifting IT load or cooling will net the same effect. You can see if your free cooling is operating as designed and have the data to troubleshoot it if not. AI-driven cooling automatically resolves issues and gives you the additional time – and critical data – to investigate further if need be.

AI-driven cooling enables autonomous, truly remote data centers to become even more cost-effective, as your best facility personnel can manage your most critical facilities – from miles or continents away.

Highly variable data centers, which house very high-density, high-heat-producing racks in proximity to others that aren’t, will be easier and less stressful to manage. Because AI-driven cooling understands the distinct cooling requirements of any situation, it can automatically manage airflow within the same room for optimum efficiency.

When Fortune Magazine forecasted the “25 Ways AI is Changing Business,” they said that “the reality is that no one knows or can know what’s ahead, not even approximately. The reason is that we can never foresee human ingenuity, all the ways in which millions of motivated entrepreneurs and managers worldwide will apply rapidly improving technology.” But just as you and I have already seen what AI and mobile phone technology have done for our lives, so will it be for data center infrastructure.

And, like the power available through our mobile phones, someday soon we’ll wonder how we ever managed without AI-driven data centers.

With Data Centers, What Can Happen Will Happen (Eventually).

Because data centers and telecom switching centers are designed to withstand failures without interrupting business operations, a 3 a.m. emergency due to a malfunctioning air conditioner should never occur – in theory. But Murphy’s Law says that if a single failure can create an emergency, it will. So, to date, operators have had to react to single-component failures as if they are business-critical. Because they might be.

In my previous blog, I pointed out the two components of risk: the probability of and the consequence of failure. While both of these components are important in failure analysis, it is the consequence of failure that’s most effective at helping decision-makers manage the cost of failure.

If you know there is a high probability of impending failure, but you don’t know the potential consequence, you have to act as though every threat has the potential for an expensive business interruption. Taking such actions is typically expensive. But if you know the consequence, even without knowing the probability of failure, you can react to inconsequential failures at your leisure and plan so that consequential failures are less likely.

In the past, the consequences of a failure weren’t knowable or predictable. The combination of Internet of Things (IoT) data and machine learning has changed all that. It’s now possible to predict the consequence of failure by analyzing large quantities of historical sensor data. These predictions can be performed on demand and without the need for geometrical data hall descriptions.

The advantage of machine learning-based systems is that predictive models are continually tuned to actual operating conditions. Even as things change and scale over time, the model remains accurate without manual intervention. The consequences of actions, in addition to equipment failures, become knowable and predictable.
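To give a feel for what “continually tuned” can mean, here is a deliberately small sketch. It uses a plain exponential moving average, which is a stand-in I chose for illustration and not the learning method of any particular product. The function name, the `alpha` weight, and the numbers are all assumptions.

```python
def update_estimate(old_estimate_c: float, observed_c: float, alpha: float = 0.25) -> float:
    """Blend a new observation into a running estimate.

    old_estimate_c: current estimate of how many degrees C a rack inlet rises
                    when a given cooling unit is off (hypothetical quantity).
    observed_c:     the rise actually measured the last time that unit was off.
    alpha:          weight given to the newest observation (assumed value).
    """
    return (1.0 - alpha) * old_estimate_c + alpha * observed_c


# The estimate drifts toward what the sensors report as the room changes.
estimate_c = 4.0
for observed_c in (6.0, 6.0, 6.0):
    estimate_c = update_estimate(estimate_c, observed_c)
```

The point is only that each new batch of sensor data nudges the model, so nobody has to re-survey the room by hand when racks are added or moved.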

This type of consequence analysis is particularly important for organizations that have a run-to-failure policy for mechanical equipment. Run-to-failure is common in organizations with severe capital constraints, but it only works, and avoids business interruptions, if the consequence of the next failure is predictable.

Predicting the consequence of failure allows an operations team to avoid over-reacting to failures that do not affect business continuity. Rather than dispatching a technician in the middle of the night, an operations team can address a predicted failure with minimal or no consequence during its next scheduled maintenance. If consequence analysis indicates that a cooling unit failure may put more significant assets at risk, the ability to predict how much time is available before a critical temperature is reached provides time for graceful shutdown – and mitigation.
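As a rough illustration of that decision, the sketch below estimates the time available before a critical temperature is reached and picks a response. It assumes a constant rate of temperature rise, which a real predictive model would not need to assume. The function names, the 60-minute dispatch lead time, and the response labels are hypothetical.

```python
import math


def minutes_until_critical(current_c: float, rise_c_per_min: float, critical_c: float) -> float:
    """Minutes before the inlet reaches critical_c, assuming a constant rise rate."""
    if current_c >= critical_c:
        return 0.0
    if rise_c_per_min <= 0:
        # Temperature is flat or falling: the limit is never reached.
        return math.inf
    return (critical_c - current_c) / rise_c_per_min


def choose_response(minutes_available: float, dispatch_lead_min: float = 60.0) -> str:
    """Pick a response from the time available (thresholds are assumptions)."""
    if math.isinf(minutes_available):
        return "defer to scheduled maintenance"
    if minutes_available > dispatch_lead_min:
        return "dispatch technician"
    return "begin graceful shutdown"


# A rack at 24 C warming by 0.5 C per minute with a 32 C limit.
available = minutes_until_critical(24.0, 0.5, 32.0)
action = choose_response(available)
```

With these example numbers there are 16 minutes to work with, which is less than the assumed time for a technician to arrive, so the sketch chooses a graceful shutdown.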

Preventative maintenance carries risk, but equipment still needs to be shut off at times for maintenance. Will it cause a problem? Predictive consequence analysis can provide the answer. If there’s an issue with shutting off a particular unit, you can know in advance and provide spot cooling to mitigate the risks.

The ability to predict the consequences of failure, or intentional action such as preventative maintenance, gives facility managers greater control over the reliability of their facilities, and the peace of mind that their operations are as safe as possible.

Consequence Planning Avoids Getting Trapped Between a Rack and a Hot Place

A decade of deploying machine learning in data centers and telecom switching centers throughout the world has taught us a thing or two about risk and reliability management.

In the context of reliability engineering, risk is often defined as the probability of failure times the consequence of the failure. The failure itself, therefore, is only half of the risk consideration. The resulting consequences are equally, and sometimes more, relevant. Data centers typically manage risk with redundancy to reduce the chances of failures that may cause a business interruption. This method reduces the consequence of a single-component failure. If failure occurs, a redundant component ensures continuity.
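The definition is easy to put into numbers. In the sketch below, the unit names, probabilities, and dollar figures are invented purely for illustration; only the formula (probability times consequence) comes from the definition above.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    probability: float  # chance of failure over the planning period, 0..1 (made-up value)
    consequence: float  # estimated cost in dollars if it happens (made-up value)

    @property
    def risk(self) -> float:
        # Risk = probability of failure x consequence of failure.
        return self.probability * self.consequence


modes = [
    FailureMode("CRAC-07 fan", probability=0.20, consequence=500.0),
    FailureMode("CRAC-12 compressor", probability=0.02, consequence=250_000.0),
]

# Rank by risk, highest first.
ranked = sorted(modes, key=lambda m: m.risk, reverse=True)
```

In this example, the fan is ten times more likely to fail, yet the compressor carries fifty times the risk because of what its failure would cost.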

When people talk about the role of machine learning in risk and reliability management, most view machine learning from a similar perspective – as a tool for predicting the failure of single components.

But this focus falls short of the true capabilities of machine learning. Don’t get me wrong: predicting the probability of failure is useful, and difficult to do. But it only has value when the consequence of the predicted failure is significant.

When data centers and telecom switching centers perform and operate as designed, the consequences of most failures are typically small. But most data centers don’t operate as designed, especially the longer they run.

Vigilent uses machine learning to predict the consequences of control actions. We use machine learning to train our Influence Map™ to make accurate predictions of cooling control actions, including what will happen when a cooling unit is turned on or off. If the Influence Map predicts that turning a particular unit off would cause a rack to become too hot, the system won’t turn that cooling unit off.
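To make the idea concrete, here is a toy version of that check. It is not the Influence Map itself: the lookup table, the temperatures, and the 27 °C limit are all hypothetical, and a simple additive temperature rise stands in for whatever the trained model actually predicts.

```python
# Hypothetical learned influence: degrees C each rack inlet is predicted to
# rise if the named cooling unit is switched off.
INFLUENCE_C = {
    "CRAC-1": {"rack-A": 1.5, "rack-B": 0.5},
    "CRAC-2": {"rack-A": 4.0, "rack-B": 6.0},
}

# Current inlet temperatures and the allowed maximum (assumed values).
INLET_C = {"rack-A": 24.0, "rack-B": 23.0}
LIMIT_C = 27.0


def predicted_inlets_if_off(unit: str) -> dict:
    """Predicted inlet temperature of every rack if `unit` is turned off."""
    return {rack: INLET_C[rack] + rise for rack, rise in INFLUENCE_C[unit].items()}


def safe_to_turn_off(unit: str) -> bool:
    """Allow the control action only if no rack is predicted to exceed the limit."""
    return all(temp <= LIMIT_C for temp in predicted_inlets_if_off(unit).values())


decisions = {unit: safe_to_turn_off(unit) for unit in INFLUENCE_C}
```

With these numbers, turning off CRAC-1 is allowed, while CRAC-2 stays on because both racks would be predicted to exceed the limit.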

The same process can be used to predict the consequence of a cooling unit failure. In other words, the Influence Map can predict the potential business impact of a particular cooling unit failure, such as whether a rack will get hot enough to impact business continuity. This kind of failure analysis simultaneously estimates the redundancy of the cooling system.

This redundancy calculation doesn’t merely compare the total cooling capacity with the total heat load of the equipment. Fully understanding the consequence of a failure requires both predictive modeling and machine learning. Together, these technologies accurately model actual, real-time system behavior in order to predict and manage the cost of that failure.
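A small, invented example shows why the capacity comparison alone can mislead. All unit names, capacities, predicted temperatures, and the 27 °C limit below are assumptions; the per-rack predictions stand in for what a trained model would supply.

```python
# Nameplate view: three units, 50 kW each, serving a 90 kW heat load.
CAPACITY_KW = {"CRAC-1": 50.0, "CRAC-2": 50.0, "CRAC-3": 50.0}
HEAT_LOAD_KW = 90.0

# Hypothetical model output: predicted rack inlet temperatures (C) after
# each single-unit failure.
PREDICTED_INLET_C = {
    "CRAC-1": {"rack-A": 25.0, "rack-B": 24.0},
    "CRAC-2": {"rack-A": 26.0, "rack-B": 25.5},
    "CRAC-3": {"rack-A": 24.5, "rack-B": 31.0},
}
LIMIT_C = 27.0


def capacity_says_redundant() -> bool:
    """True if the remaining capacity covers the load after any one unit fails."""
    total = sum(CAPACITY_KW.values())
    return all(total - cap >= HEAT_LOAD_KW for cap in CAPACITY_KW.values())


def racks_at_risk() -> dict:
    """Map each failing unit to the racks predicted to exceed the limit."""
    at_risk = {}
    for unit, temps in PREDICTED_INLET_C.items():
        hot = [rack for rack, temp in temps.items() if temp > LIMIT_C]
        if hot:
            at_risk[unit] = hot
    return at_risk
```

Here the capacity arithmetic reports that any single unit can fail safely, yet the per-rack prediction flags rack-B if CRAC-3 fails, because cooling has to reach the rack and not just exist somewhere in the room.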

This is why the distinction between failures and consequences matters. Knowing the consequences of failure enables you to predict the cost of failure.

Some predicted failures might not require a 3 a.m. dispatch. In my next blog, I’ll outline the material advantages of understanding consequences and the resulting effect on redundancy planning and maintenance operations.