Cooling Doesn’t Manage Itself

Cooling Doesn’t Manage Itself

Of the primary components driving data center operations – IT assets, power, space and cooling – the first three command the lion’s share of attention.  Schneider Electric (StruxureWare), Panduit (PIM), ABB (Decathalon), Nlyte, Emerson (Trellis) and others have created superb asset and power tracking systems.   Using systems like these and others, companies can get a good idea as to where their assets are located, how to get power to them and even how to optimally manage them under changing conditions.

Less well understood and, I would argue, not understood at all, is how to get all the IT-generated heat out of the data center, and as efficiently as possible.

Some believe that efficient cooling can be “designed in,” as opposed to operationally managed, and that this is good enough.

On the day a new data center goes live the cooling will, no doubt, operate superbly.  That is, right up until something changes – which could happen the next day, weeks or months later.  Even the most efficiently designed data centers eventually operate inefficiently. At that point, your assets are at risk and you probably won’t even know it.  Changes and follow-in inefficiencies are inevitable.

As well, efficiency by design only applies to new data centers.  The vast majority of data centers operating today are aging. All of them have degraded with incremental cooling issues over time.   IT changes, infrastructure updates, failures, essentially any and all physical data center changes or incidents, affect cooling in ways that may not be detected through traditional operations or “walk around” management.

Data center managers must manage their cooling infrastructure as dynamically and closely as they do their IT assets.  The health of the cooling system directly impacts the health of those very same IT assets.

Further, cooling must be managed operationally.  Beyond the cost savings of continually optimized efficiency, cooling management systems provide clearer insight into where to add capacity, redundancy, potential thermal problems, and areas of risk.

Data centers have grown beyond the point where they can be managed manually.  It’s time stop treating cooling as the red-headed step-child of data centers.  Cooling requires the same attention and sophisticated management systems that are in common use for IT assets.  There’s no time to lose.

Machine Learning

Why Machine Learning-based DCIM Systems Are Becoming Best Practice.

Here’s a conundrum.  While data center IT equipment has a lifespan of about three years, data center cooling equipment will endure about 15 years. In other words,  your data center will likely  undergo five complete IT refreshes within the lifetime of your cooling equipment – at the very least.  In reality, refreshes happen much more frequently. Racks and servers come and go, floor tiles are moved, maintenance is performed, density is changed based on containment operations – any one of which will affect the ability of the cooling system to work efficiently and effectively.

If nothing is done to re-configure cooling operations as IT changes are made, and this is typically the case, the data center develops hot and cold spots, stranded cooling capacity and wasted energy consumption.  There is also risk with every equipment refresh – particularly if the work is done manually.

There’s a better way. The ubiquitous availability of low cost sensors, in tandem with the emerging availability of machine learning technology, is leading to development of new best practices for data center cooling management. Sensor-driven machine learning software enables the impact of IT changes on cooling performance to be anticipated and more safely managed.

Data centers instrumented with sensors gather real-time data which can inform software of minute-by-minute cooling capacity changes.  Machine learning software uses this information to understand the influence of each and every cooling unit, on each and every rack, in real-time as IT loads change.  And when loads or IT infrastructure changes, the software re-learns accordingly and updates itself, ensuring that the accuracy of its influence predictions remains current and accurate.   This ability to understand cooling influence at a granular level also enables the software to learn which cooling units are working effectively – and at expected performance levels  – and which aren’t.

This understanding also illuminates, in a data-supported way, the need for targeted corrective maintenance. With a clearer understanding and visualization of cooling unit health, operators can justify the right budget to maintain equipment effectively thereby improving the overall health and reducing risk in the data center.

In one recent experience at a large US data center, machine learning software revealed that 40% of the cooling units were consuming power but not cooling.  The data center operator was aware of the problem, but couldn’t convince senior management to expend budget because he couldn’t quantify the problem nor prove the value/need for a specific expenditure to resolve the issue.  With new and clear data in hand, the operator was able to identify the failed CRACs and present the appropriate budget required to fix and replace them accordingly.

This ability to more clearly see the impact of IT changes on cooling equipment enables personnel to keep up with cooling capacity adjustment and, in most cases, eliminate the need for manual control.  A reduction of the corresponding “on-the-fly, floor time corrections” also frees up operators to focus on problems that require more creativity and to more effectively manage physical changes such floor tile adjustments, etc.

There’s no replacement for experience-based human expertise. However, why not leverage your staff  to do what they do best, and eliminate those tasks which are better served by software control.  Data centers using machine learning software are undeniably more efficient and more robust.  Operators can more confidently future proof themselves against inefficiency or adverse capacity impact as conditions change.  For these reasons alone, use of machine learning-based software should be considered an emerging best practice.

Cooling Failures

The New York Times story “Power, Pollution, and the Internet” highlights a largely unacknowledged issue with data centers, cooling.  James Glanz starts with an anecdote describing an overheating problem at a Facebook data center in the early days. The article then goes on to quote: “Data center operators live in fear of losing their jobs on a daily basis, and that’s because the business won’t back them up if there’s a failure.”

It turns out that the issue the author describes is not an isolated incident. As data centers get hotter, denser and more fragile, cooling becomes increasingly critical to reliability. Here are examples of cooling-related failures which have made the headlines in recent years.

Facebook: A BMS programming error in the outside air economizer logic at Facebook’s Prineville data center caused the outdoor air dampers to close and the spray coolers to go to 100%, which caused condensate to form inside servers leading to power unit supply failure.

Wikipedia: A cooling failure caused servers at Wikimedia to go into automatic thermal shutdown, shutting off access to Wikipedia from European users.

Nokia: A cooling failure led to a lengthy service interruption and data loss for Nokia’s Contacts by Ovi service.

Yahoo: A single cooling unit failure resulted in locally high temperatures, which tripped the fire suppression system and shut down the remainder of the units.

Lloyds: Failure of a “server cooling system” brought down the wholesale banking division of the British financial services company Lloyds Banking Group for several hours.

Google: For their 1800-server clusters, Google estimates that “In each cluster’s first year, … there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.”

It is no surprise that data center operators live in fear.  What is surprising is that so few operators have mitigated risk through currently-available technology. It’s now possible to non-intrusively upgrade existing data centers with supervisory cooling management systems that compensate for and alert operators to cooling failures. Changes in IT load, environmental conditions, or even human error can quickly be addressed, avoiding what could quickly become an out-of-control incident that results in downtime, loss of availability, and something that’s anathema to colo operators: SLA penalties.

It’s incumbent on facilities operators and business management to evaluate and install the latest technology that puts not only operational visibility, but essential control, in their hands before the next avoidable incident occurs.

Data Center Risk

Surprising Areas of Data Center Risk and How to Proactively Manage Them

Mission critical facilities need a different level of scrutiny and control over cooling management.

It’s no surprise that cooling is critical to the security of these facilities.  With requirements for 99.999 uptime and multimillion dollar facilities at risk, cooling is often the thin blue line between data safety and disaster.

And yet, many mission critical facilities use cooling control systems that were designed for comfort cooling, versus the reliable operation of hugely valuable and sensitive equipment.

When people get warm, they become uncomfortable. When IT equipment overheats, it fails – often with catastrophically expensive results.

In one recent scenario, a 6-minute chiller plant failure resulted in lost revenue and penalties totaling $14 million.  In another scenario, the failure of a single CRAC unit caused temperatures to shoot up to over 100 degrees Fehrenheit in a particular zone, resulting in the failure of a storage array.

These failures result from a myriad of complex, and usually unrealized risk areas.  My recent talk at the i4Energy Seminar series hosted by the California Institute for Energy and Environment (CIEE) exposes some of these hidden risk areas and what you can do about them.

You can watch that talk here:


More Cooling

More Cooling With Less $$

My last post took a look at the maintenance savings possible through more efficient data center/facility cooling management.  You can gain further savings by increasing the capacity of your existing air handling/ air conditioning units.  It is even possible to add IT load without requiring new air conditioners or at the least, deferring those purchases.  Here’s how.

Data centers and buildings have naturally occurring air stratification.  Many facilities deliver cool air from an under floor plenum.  As the air heats and rises, cooling air is delivered low and moved about with low velocity.  Because server racks sit on the floor, they sit in a colder area on average.  The air conditioners however, draw from higher in the room – capturing the hot air from above and delivering it, once cooled down, to the under floor plenum. This vertical stratification creates an opportunity to deliver cooler air to servers and at the same time increase cooling capacity by drawing return air from higher in the room.

However, this isn’t easy to achieve.  The problem is that uncoordinated or decentralized control of air conditioners often causes some of the units to deliver uncooled air into the under floor plenum. There, the mixing of cooled and uncooled air results in higher inlet air temperatures of servers, and ultimately lower return-air temperatures, which reduces the capacity of the cooling equipment.

A cooling management system can establish a colder profile at the bottom of the rack and make sure that each air conditioner is actually having a cooling effect, versus working ineffectively and actually increasing heat through its operation. An intelligent cooling energy management system dynamically right-sizes air conditioning unit capacity loads, coordinating their combined operation so that all the units deliver cool air and don’t mix hot return air from some units with cold air from other units. This unit-by-unit but combined coordination squeezes the maximum efficiency out of all available units so that, even at full load, inefficiency due to mixing is avoided and significant capacity-improving benefits are gained.

Consider this example.  At one company, their 40,000 sq. foot data center appeared to be out of cooling capacity.  After deploying an intelligent energy management system, not only did energy usage drop, but the company was able to increase its data center IT load by 40% without adding additional air conditioners and, in fact,  after de-commissioning two existing units. As well, the energy management system maintained proper, desired inlet air temperatures under this higher load condition.

Consider going smarter before moving to an additional equipment purchase decision.  Savings become even larger if you consider avoided maintenance costs for new equipment, and energy reduction through more efficiently balanced capacity loads, year-over-year.