Analytics in Action for Data Center Cooling

When a data center is first designed, everything is tightly controlled. Rack densities are all the same. The layout is precisely planned and very consistent. Power and space constraints are well-understood. The cooling system is modeled – sometimes even with CFD – and all of the cooling units operate at the same level.

But the original design is often a short-lived utopia. The realty of most data centers becomes much more complex as business needs and IT requirements change and equipment moves in and out.

As soon as physical infrastructure changes, cooling capacity and redundancy are affected.  Given the complexity of design versus operational reality, many organizations have not had the tools to understand what has changed or degraded, so cannot make informed decisions about their cooling infrastructure. Traditional DCIM products often focus on space, network and power.  They don’t provide detailed, measured data on the cooling system.  So, decisions about cooling are made without visibility into actual conditions.

Analytics can help. Contrary to prevailing views, analytics don’t necessarily take a lot of know-how or data analysis skills to be extremely helpful in day-to-day operations management. Analytics can be simple and actionable. Consider the following examples of how a daily morning glance at thermal analytics helped these data center managers quickly identify and resolve some otherwise tricky thermal issues.

In our first example, the manager of a legacy, urban colo data center with DX CRAC units was asked to determine the right place for some new IT equipment. There were several areas with space and power available, but determining which of these areas had sufficient cooling was more challenging. The manager used a cooling influence map to identify racks cooled by multiple CRACs. He then referenced a cooling capacity report to confirm that more than one of these CRACs had capacity to spare. By using these visual analytics, the manager was able to place the IT equipment in an area with sufficient, and redundant, cooling.

In a second facility, a mobile switching center for a major telco, the manager noticed a hot spot on the thermal map and sent a technician to investigate the location. The technician saw that some of the cooling coils had low delta T even though the valves were open, which implied a problem with the hydronics. Upon physical investigation of the area, he discovered that this was caused by trapped air in the coil, so he bled it off. The delta T quickly went from 3 to 8.5 – a capacity increase of more than 65 percent – as displayed on the following graph:

 

DeltaT

These examples are deceptively simple. But without analytics, the managers would not have been able to as easily identify the exact location of the problem, the cooling units involved, and have enough information to direct trouble-shooting action within the short time needed to resolve problems in a mission critical facility.

Analytics typically use the information already available in a properly monitored data center. They complement the experienced intuition of data center personnel with at-a-glance data that helps identify potential issues more quickly and bypasses much of the tedious, blood pressure-raising and time-consuming diagnostic activities of hotspot resolution.

Analytics are not the future. Analytics have arrived. Data centers that aren’t taking advantage of them are riskier and more expensive to operate, and place themselves at competitive disadvantage

Cooling Failures

The New York Times story “Power, Pollution, and the Internet” highlights a largely unacknowledged issue with data centers, cooling.  James Glanz starts with an anecdote describing an overheating problem at a Facebook data center in the early days. The article then goes on to quote: “Data center operators live in fear of losing their jobs on a daily basis, and that’s because the business won’t back them up if there’s a failure.”

It turns out that the issue the author describes is not an isolated incident. As data centers get hotter, denser and more fragile, cooling becomes increasingly critical to reliability. Here are examples of cooling-related failures which have made the headlines in recent years.

Facebook: A BMS programming error in the outside air economizer logic at Facebook’s Prineville data center caused the outdoor air dampers to close and the spray coolers to go to 100%, which caused condensate to form inside servers leading to power unit supply failure.

Wikipedia: A cooling failure caused servers at Wikimedia to go into automatic thermal shutdown, shutting off access to Wikipedia from European users.

Nokia: A cooling failure led to a lengthy service interruption and data loss for Nokia’s Contacts by Ovi service.

Yahoo: A single cooling unit failure resulted in locally high temperatures, which tripped the fire suppression system and shut down the remainder of the units.

Lloyds: Failure of a “server cooling system” brought down the wholesale banking division of the British financial services company Lloyds Banking Group for several hours.

Google: For their 1800-server clusters, Google estimates that “In each cluster’s first year, … there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.”

It is no surprise that data center operators live in fear.  What is surprising is that so few operators have mitigated risk through currently-available technology. It’s now possible to non-intrusively upgrade existing data centers with supervisory cooling management systems that compensate for and alert operators to cooling failures. Changes in IT load, environmental conditions, or even human error can quickly be addressed, avoiding what could quickly become an out-of-control incident that results in downtime, loss of availability, and something that’s anathema to colo operators: SLA penalties.

It’s incumbent on facilities operators and business management to evaluate and install the latest technology that puts not only operational visibility, but essential control, in their hands before the next avoidable incident occurs.

Unexpected Savings

Data Center Cooling Systems Return
Unexpected Maintenance Cost Savings

Advanced cooling management in critical facilities such as
data centers and telecom central offices can save tons of energy (pun
intended). Using advanced cooling management to achieve always-ready,
inlet-temperature-controlled operation, versus the typical always-on,
always-cold approach yields huge energy savings.

But energy savings isn’t the only benefit of advanced cooling management. NTT America recently took a hard look at some of the
direct, non-energy savings of an advanced cooling system. They quantified
savings from reduced maintenance costs, increased cooling capacity from
existing resources, improved thermal management and deferred capital
expenditures. Their analysis found that the non-energy benefits increased the total dollar savings by one-third.

Consider first the broader advantages of reduced maintenance costs. Advanced cooling management identifies when CRACs are operating
inefficiently. Turning off equipment that doesn’t need to be on reduces wear and tear. Equipment that isn’t running isn’t wearing out. Reducing wear and tear reduces the chance of an unexpected failure, which is always something to avoid in a mission-critical facility. One counter-intuitive result of turning off lightly provisioned CRACs is that inlet air temperatures are reduced by a few degrees. Reducing inlet air temperature also reduces the risk of IT equipment failure and increases the ride-through time in the event of a cooling system failure.

The maintenance and operations cost savings of advanced cooling
management is significant, but avoiding downtime is priceless.

Occam’s Razor

Data Center Energy Savings

The simplest approach to data center energy savings might suggest that a facility manager’s best option is to turn off a few air conditioners.  And there’s truth to this.  See the graph below, showing before and after energy usage, and the impact of turning off some of the cooling units.

Before & After Energy Management Software Started

But the simplicity suggested here is deceptive.

Which air conditioners?

How many?

How will this truly affect the temperature?

What’s the risk to uptime or ridethrough?

While turning things off or down is likely our greatest opportunity for significant, immediate savings, the science driving the decision of which device to turn off and when, is complex and dynamic.

Fortunately, a convergence of new technology – wireless sensors for continuous, real-time and location-specific data, along with predictive, adaptive software algorithms that take into account all immediate and known variables at any given moment – can predict the future impact of energy management decisions, taking on/off decision-making to a new level.  Now, for the first time, it’s possible – thanks to the latest AI technology – to automatically, constantly and dynamically manage cooling resources to reduce average temperatures across a facility and avoid hot and cold spots of localized temperature extremes. Simultaneously, overall cooling energy consumption is reduced by intelligently turning down, or off, the right CRACs at the right time. The result is continually optimized cooling with greater assurance that the overall integrity of the data center is preserved.