Cooling Failures

The New York Times story “Power, Pollution, and the Internet” highlights a largely unacknowledged issue with data centers: cooling. James Glanz opens with an anecdote describing an overheating problem at a Facebook data center in the early days. The article goes on to quote: “Data center operators live in fear of losing their jobs on a daily basis, and that’s because the business won’t back them up if there’s a failure.”

It turns out that the issue the author describes is not an isolated incident. As data centers get hotter, denser and more fragile, cooling becomes increasingly critical to reliability. Here are examples of cooling-related failures that have made headlines in recent years.

Facebook: A building management system (BMS) programming error in the outside-air economizer logic at Facebook’s Prineville data center caused the outdoor air dampers to close and the spray coolers to run at 100%, which caused condensate to form inside servers, leading to power supply unit failures.

Wikipedia: A cooling failure caused servers at Wikimedia to go into automatic thermal shutdown, cutting off access to Wikipedia for European users.

Nokia: A cooling failure led to a lengthy service interruption and data loss for Nokia’s Contacts by Ovi service.

Yahoo: A single cooling unit failure resulted in locally high temperatures, which tripped the fire suppression system and shut down the remainder of the units.

Lloyds: Failure of a “server cooling system” brought down the wholesale banking division of the British financial services company Lloyds Banking Group for several hours.

Google: For their 1800-server clusters, Google estimates that “In each cluster’s first year, … there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.”

It is no surprise that data center operators live in fear. What is surprising is that so few operators have mitigated that risk with currently available technology. It is now possible to non-intrusively upgrade existing data centers with supervisory cooling management systems that compensate for and alert operators to cooling failures. Changes in IT load, environmental conditions, or even human error can be addressed promptly, avoiding what could otherwise escalate into an out-of-control incident that results in downtime, loss of availability and, anathema to colo operators, SLA penalties.

It’s incumbent on facilities operators and business management to evaluate and install the latest technology that puts not only operational visibility, but essential control, in their hands before the next avoidable incident occurs.

Cleantech Evolves

Smart Loading for the Smart Grid – New Directions in Cleantech

I recently participated in a TiE Energy Panel (The Hottest Energy Startups: Companies Changing the Energy Landscape), with colleagues from Primus Power, Power Assure, Mooreland Partners and Gen110.

The panel concurred that the notion of Cleantech – and the investment money that follows it – has shifted from a focus on energy generation to a focus on energy management. To date, this is primarily because the cheaper energy sources hyped in the early Cleantech press haven’t materialized. It’s hard to compete with heavily subsidized incumbent energy sources, much less build a business around what’s perceived as a commodity. There are exceptions, like solar energy development, but other alternative sources have languished financially despite their promise.

The investment shift toward energy management is also a result of emerging efficiency-focused technology. Data Center Infrastructure Management, or DCIM, is all about smart management – with an emphasis on energy. Gartner believes there are some 60+ companies in this space, which is rapidly gaining acceptance as a data center requirement.

This shift is also supported by the convergence of other technology growth areas, such as big data and cloud computing, both of which play well with energy management.   As our increasingly sensor-driven environment creates more and more data – big data – its volume has surpassed the ability of humans to manage it.

And yet this data, accurate, collected in real time, and tagged with time and location, holds real promise. Analysis of this information within individual corporations, and perhaps shared more broadly via the cloud, will reveal continuous opportunities for improving efficiency and will likely point to entirely new means of larger-scale energy optimization through an integrated smart grid.

The days of facility operators running around with temperature guns and clipboards – although still surprisingly common today – are giving way to central computer screens with consolidated, scannable, actionable data.

This is an exciting time.  I’m all for new ideas and the creation of less expensive, less environmentally harmful ways to generate energy.  But as these alternative options evolve, I am equally excited by the strides industry has made for the smarter use of the resources we have.

The wave of next generation energy management is still rising.

More Cooling

More Cooling With Less $$

My last post took a look at the maintenance savings possible through more efficient data center/facility cooling management. You can gain further savings by increasing the capacity of your existing air handling/air conditioning units. It is even possible to add IT load without buying new air conditioners, or at least to defer those purchases. Here’s how.

Data centers and buildings have naturally occurring air stratification. Many facilities deliver cool air from an underfloor plenum: the cooling air is delivered low and moved about at low velocity, and as it heats it rises. Because server racks sit on the floor, they sit in a colder region on average. The air conditioners, however, draw return air from higher in the room, capturing the hot air from above and delivering it, once cooled, to the underfloor plenum. This vertical stratification creates an opportunity to deliver cooler air to servers and, at the same time, increase cooling capacity by drawing return air from higher in the room.
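
To see why warmer return air matters, here is a rough back-of-the-envelope sketch. The airflow and temperature figures are illustrative assumptions, not measurements from any particular facility; the point is simply that, for the same fans and supply temperature, sensible cooling capacity scales with the return-to-supply temperature difference.

```python
# Rough sketch: sensible cooling capacity of a CRAC/CRAH unit is proportional
# to the temperature difference between return air and supply air.
# All numbers below are illustrative assumptions.

AIR_DENSITY = 1.2          # kg/m^3, approximate for room-temperature air
AIR_SPECIFIC_HEAT = 1.006  # kJ/(kg*K)

def sensible_capacity_kw(airflow_m3_per_s, t_return_c, t_supply_c):
    """Sensible cooling capacity (kW) for a given airflow and delta-T."""
    return AIR_DENSITY * airflow_m3_per_s * AIR_SPECIFIC_HEAT * (t_return_c - t_supply_c)

airflow = 5.0  # m^3/s, a plausible flow for a large unit

# Return air drawn low in the room (mixed, cooler) vs. high (stratified, warmer)
low_return = sensible_capacity_kw(airflow, t_return_c=24.0, t_supply_c=16.0)
high_return = sensible_capacity_kw(airflow, t_return_c=30.0, t_supply_c=16.0)

print(f"Capacity with 24 C return air: {low_return:.0f} kW")
print(f"Capacity with 30 C return air: {high_return:.0f} kW")
# Same fans, same supply temperature, but the warmer return air yields
# roughly 75% more usable cooling capacity in this illustration.
```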

However, this isn’t easy to achieve. The problem is that uncoordinated or decentralized control of air conditioners often causes some of the units to deliver uncooled air into the underfloor plenum. There, the mixing of cooled and uncooled air raises server inlet air temperatures and ultimately lowers return-air temperatures, which reduces the capacity of the cooling equipment.
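
A small sketch of the mixing problem, again with made-up numbers: if even one of several units passes warm air straight through into the shared plenum, the blended supply temperature the servers actually see rises noticeably.

```python
# Sketch of how uncoordinated units degrade cooling: if one of four units
# pushes uncooled (return-temperature) air into the shared underfloor plenum,
# the blended air reaching server inlets gets warmer. Illustrative values only.

def blended_supply_temp(flows_and_temps):
    """Flow-weighted average temperature of air streams mixing in the plenum."""
    total_flow = sum(flow for flow, _ in flows_and_temps)
    return sum(flow * temp for flow, temp in flows_and_temps) / total_flow

# Three units supplying 16 C air, one "fighting" unit passing 28 C air through
units = [(5.0, 16.0), (5.0, 16.0), (5.0, 16.0), (5.0, 28.0)]
print(f"Blended plenum temperature: {blended_supply_temp(units):.1f} C")
# -> 19.0 C instead of 16.0 C: server inlets run warmer, the air returning to
#    the units is correspondingly cooler, and usable capacity drops.
```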

A cooling management system can establish a colder profile at the bottom of the rack and make sure that each air conditioner is actually having a cooling effect, rather than working ineffectively and adding heat through its operation. An intelligent cooling energy management system dynamically right-sizes air conditioning unit capacity loads, coordinating their combined operation so that all the units deliver cool air and hot return air from some units doesn’t mix with cold air from others. This unit-by-unit yet coordinated control squeezes the maximum efficiency out of all available units so that, even at full load, inefficiency due to mixing is avoided and significant capacity gains are realized.
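
For illustration only, here is a deliberately simplified, hypothetical sketch of what supervisory staging might look like: drive unit on/off decisions from measured rack inlet temperatures rather than letting each unit chase its own local return-air setpoint. The function name, thresholds and values are invented for this example and are not any vendor’s actual control logic.

```python
# Hypothetical, highly simplified sketch of supervisory CRAC coordination:
# stage units based on measured rack inlet temperatures instead of letting
# each unit chase its own local return-air setpoint. Names and thresholds
# are illustrative assumptions only.

INLET_TARGET_C = 24.0    # desired maximum rack inlet temperature
STAGE_UP_MARGIN = 1.0    # turn another unit on when inlets exceed target
STAGE_DOWN_MARGIN = 3.0  # turn a unit off when there is ample headroom

def plan_staging(inlet_temps_c, units_running, units_total):
    """Return the number of units that should run for the next interval."""
    hottest = max(inlet_temps_c)
    if hottest > INLET_TARGET_C + STAGE_UP_MARGIN and units_running < units_total:
        return units_running + 1   # losing ground: bring another unit online
    if hottest < INLET_TARGET_C - STAGE_DOWN_MARGIN and units_running > 1:
        return units_running - 1   # plenty of headroom: shed a unit, save energy
    return units_running

# Example: six installed units, four running, a warm spot forming
print(plan_staging([22.5, 23.1, 25.6, 24.2], units_running=4, units_total=6))  # -> 5
```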

Consider this example. One company’s 40,000-square-foot data center appeared to be out of cooling capacity. After deploying an intelligent energy management system, not only did energy usage drop, but the company was able to increase its data center IT load by 40% without adding air conditioners, and in fact after decommissioning two existing units. The energy management system also maintained the proper, desired inlet air temperatures under this higher load.

Consider going smarter before committing to additional equipment purchases. The savings become even larger when you factor in the avoided maintenance costs of new equipment and the year-over-year energy reductions from more efficiently balanced capacity loads.

2011 Reflections

There is a saying in the MEP consulting business: “no one ever gets sued for oversizing.” That fear-driven mentality also affects the operation of mechanical systems in data centers, and it explains why data centers are over-cooled at great expense. Few facility managers know by how much. The fact is that it has been easier, and to date safer, to over-cool a data center as the importance of the data it contains has increased, and with that importance, the pressure to protect it.

Last year that changed. With new technology, facility managers can know exactly how much cooling is required in a data center at any given time. Perhaps more importantly, the technology provides warning, and reaction time, in the rare instances when temperatures increase unexpectedly. With this technology, data center cooling can now be “dynamically right-sized,” and the risk of dynamic management can be made lower than that of manual operation, which is prone to human error.

In our own nod to the advantages of this technology, we renamed the company I co-founded in 2004 from Federspiel Corporation to Vigilent Corporation. As our technology increased in sophistication, we felt that our new name, denoting vigilance and intelligent oversight of facility heating and cooling operations, better reflected the new reality in data center cooling management. Last year, through smart, automated management of data center energy consumption, Vigilent reduced the carbon emissions and energy consumption of cooling systems by 30-40%. These savings will continue year after year, benefiting not only those companies’ bottom lines but also their corporate sustainability objectives. They have been achieved while maintaining the integrity and desired temperatures of data centers of all sizes and configurations across North America and Japan.

I’m proud of what we achieved last year. And I’m proud of the companies that have stepped up to embrace technology that can replace fear with certainty, and waste with efficiency.

Unexpected Savings

Data Center Cooling Systems Return Unexpected Maintenance Cost Savings

Advanced cooling management in critical facilities such as data centers and telecom central offices can save tons of energy (pun intended). Using advanced cooling management to achieve always-ready, inlet-temperature-controlled operation, versus the typical always-on, always-cold approach, yields huge energy savings.

But energy savings aren’t the only benefit of advanced cooling management. NTT America recently took a hard look at some of the direct, non-energy savings of an advanced cooling system. They quantified savings from reduced maintenance costs, increased cooling capacity from existing resources, improved thermal management and deferred capital expenditures. Their analysis found that the non-energy benefits increased the total dollar savings by one-third.

Consider first the broader advantages of reduced maintenance costs. Advanced cooling management identifies when CRACs are operating inefficiently. Turning off equipment that doesn’t need to be on reduces wear and tear. Equipment that isn’t running isn’t wearing out. Reducing wear and tear reduces the chance of an unexpected failure, which is always something to avoid in a mission-critical facility. One counter-intuitive result of turning off lightly provisioned CRACs is that inlet air temperatures are reduced by a few degrees. Reducing inlet air temperature also reduces the risk of IT equipment failure and increases the ride-through time in the event of a cooling system failure.

The maintenance and operations cost savings of advanced cooling management are significant, but avoiding downtime is priceless.

Cooling Tips

Ten Tips For Cooling Your Data Center

Even as data centers grow in size and complexity, there are still relatively simple and straightforward ways to reduce data center energy costs. And, if you are looking at an overall energy cost reduction plan, it makes sense to start with cooling costs as they likely comprise at least 50% of your data center energy spend.  Start with the assumption that your data center is over-cooled and consider the following:

Turn Off Redundant Cooling Units. You know you have them; figure out which are truly unnecessary and turn them off. Of course, this can be tricky. See my previous blog on Data Center Energy Savings.

Raise Your Temperature Setting. You can stay within ASHRAE limits and likely raise the temperature a degree or two; see the sketch after this list for a quick way to check your headroom.

Turn Off Your Humidity Controls. Unless you really need them, and most data centers do not.

Use Variable Speed Drives. These are one of the biggest energy efficiency drivers in a data center, but don’t run them all at 100%, which defeats their purpose.

Use Plug Fans for CRAH Units. They have twice the efficiency and they distribute air more effectively.

Use Economizers.  Take advantage of outside air when you can.

Use An Automated Cooling Management System. Remove the guesswork.

Use Hot and Cold Aisle Arrangements. Don’t blow hot exhaust air from some servers into the inlets of other servers.

Use Containment. Reduce air mixing within a single space.

Remove Obstructions. This sounds simple, but  a poorly placed cart can create a hot spot. Check every day.
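
As a companion to the temperature tip above, here is a small sketch for checking how much headroom exists before raising setpoints, assuming the ASHRAE recommended envelope of roughly 18-27 °C for rack inlet air. The sensor names and readings are made up for illustration.

```python
# Sketch: check measured rack inlet temperatures against the ASHRAE
# recommended envelope (about 18-27 C for most IT equipment classes) to see
# how much room there is to raise setpoints. Sensor names/values are made up.

ASHRAE_LOW_C = 18.0
ASHRAE_HIGH_C = 27.0

inlet_readings = {
    "rack-A01": 19.2,
    "rack-A02": 20.1,
    "rack-B07": 21.5,
    "rack-C11": 18.8,
}

hottest = max(inlet_readings.values())
headroom = ASHRAE_HIGH_C - hottest

for rack, temp in sorted(inlet_readings.items()):
    status = "below recommended range" if temp < ASHRAE_LOW_C else "OK"
    print(f"{rack}: {temp:.1f} C ({status})")

print(f"Hottest inlet is {hottest:.1f} C; roughly {headroom:.1f} C of headroom "
      "before the top of the recommended envelope.")
# If the hottest inlet sits several degrees below 27 C, setpoints can likely
# rise a degree or two, as suggested above, without leaving the envelope.
```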

Here’s an example of the effect an automated cooling management system can have.

The first section shows a benchmark of the data center’s energy consumption prior to automated cooling. The second section shows energy consumption after the automated cooling system was turned on. The third section shows consumption when the system was turned off and manual control resumed, and the fourth section shows consumption once fully automated control was restored. Notice that the energy savings were nearly completely eroded within a month of returning to manual control, but were restored immediately once automatic control resumed.