Machine Learning

Why Machine Learning-based DCIM Systems Are Becoming Best Practice

Here’s a conundrum. While data center IT equipment has a lifespan of about three years, data center cooling equipment endures about 15. In other words, your data center will likely undergo at least five complete IT refreshes within the lifetime of your cooling equipment. In reality, refreshes happen much more frequently. Racks and servers come and go, floor tiles are moved, maintenance is performed, density changes with containment operations – any one of which affects the ability of the cooling system to work efficiently and effectively.

If nothing is done to re-configure cooling operations as IT changes are made – and this is typically the case – the data center develops hot and cold spots, strands cooling capacity and wastes energy. There is also risk with every equipment refresh, particularly if the work is done manually.

There’s a better way. The ubiquitous availability of low-cost sensors, together with the emerging availability of machine learning technology, is leading to new best practices for data center cooling management. Sensor-driven machine learning software enables the impact of IT changes on cooling performance to be anticipated and more safely managed.

Data centers instrumented with sensors gather real-time data that informs software of minute-by-minute cooling capacity changes. Machine learning software uses this information to understand the influence of each cooling unit on each rack, in real time, as IT loads change. When loads or IT infrastructure change, the software re-learns and updates itself, so its influence predictions remain accurate. This granular understanding of cooling influence also enables the software to learn which cooling units are working effectively – at expected performance levels – and which aren’t.
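At its simplest, learning influence can be framed as a regression from each cooling unit's output to each rack's inlet temperature. Here is a minimal sketch on synthetic data – all names and numbers are hypothetical, and this is an illustration of the idea, not the actual algorithm used by any DCIM product:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 3 CRAC units, 4 racks, 200 sensor samples.
# true_influence[i, j] = effect of CRAC j's output on rack i's inlet temp
# (negative values: more cooling output lowers the inlet temperature).
true_influence = np.array([
    [-0.8, -0.1,  0.0],
    [-0.4, -0.5, -0.1],
    [-0.1, -0.6, -0.3],
    [ 0.0, -0.2, -0.9],
])
ambient = 35.0  # hypothetical inlet temp (deg C) with no cooling applied

crac_output = rng.uniform(0, 10, size=(200, 3))   # e.g. normalized fan speed
noise = rng.normal(0, 0.1, size=(200, 4))         # sensor noise
inlet_temp = ambient + crac_output @ true_influence.T + noise

# Learn each CRAC's influence on each rack with ordinary least squares;
# the intercept column absorbs the ambient baseline.
X = np.hstack([crac_output, np.ones((200, 1))])
coeffs, *_ = np.linalg.lstsq(X, inlet_temp, rcond=None)
learned_influence = coeffs[:3].T                  # racks x CRACs

print(np.round(learned_influence, 2))
```

Re-running the fit on a sliding window of recent samples is what lets such a model "re-learn" after an IT refresh: the influence matrix is simply re-estimated from the newest sensor data.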

This understanding also illuminates, in a data-supported way, the need for targeted corrective maintenance. With a clearer understanding and visualization of cooling unit health, operators can justify the budget needed to maintain equipment effectively, improving overall health and reducing risk in the data center.

In one recent experience at a large US data center, machine learning software revealed that 40% of the cooling units were consuming power but not cooling. The data center operator was aware of the problem, but couldn’t convince senior management to commit budget because he couldn’t quantify the problem or prove the value of a specific expenditure to resolve it. With clear data in hand, the operator was able to identify the failed CRACs and present the budget required to fix or replace them.

This ability to see the impact of IT changes on cooling equipment more clearly enables personnel to keep up with cooling capacity adjustments and, in most cases, eliminates the need for manual control. Reducing these on-the-fly floor-time corrections also frees operators to focus on problems that require more creativity and to manage physical changes, such as floor tile adjustments, more effectively.

There’s no replacement for experience-based human expertise. But why not let your staff do what they do best, and hand off the tasks that are better served by software control? Data centers using machine learning software are measurably more efficient and more robust, and operators can more confidently future-proof themselves against inefficiency or adverse capacity impact as conditions change. For these reasons alone, machine learning-based software should be considered an emerging best practice.

Cooling Failures

The New York Times story “Power, Pollution, and the Internet” highlights a largely unacknowledged issue with data centers: cooling. James Glanz opens with an anecdote describing an overheating problem at a Facebook data center in the early days. The article then quotes: “Data center operators live in fear of losing their jobs on a daily basis, and that’s because the business won’t back them up if there’s a failure.”

It turns out that the issue the author describes is not an isolated incident. As data centers get hotter, denser and more fragile, cooling becomes increasingly critical to reliability. Here are examples of cooling-related failures which have made the headlines in recent years.

Facebook: A BMS programming error in the outside air economizer logic at Facebook’s Prineville data center caused the outdoor air dampers to close and the spray coolers to go to 100%, which caused condensate to form inside servers, leading to power supply unit failures.

Wikipedia: A cooling failure caused servers at Wikimedia to go into automatic thermal shutdown, cutting off access to Wikipedia for European users.

Nokia: A cooling failure led to a lengthy service interruption and data loss for Nokia’s Contacts by Ovi service.

Yahoo: A single cooling unit failure resulted in locally high temperatures, which tripped the fire suppression system and shut down the remainder of the units.

Lloyds: Failure of a “server cooling system” brought down the wholesale banking division of the British financial services company Lloyds Banking Group for several hours.

Google: For their 1800-server clusters, Google estimates that “In each cluster’s first year, … there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.”

It is no surprise that data center operators live in fear. What is surprising is that so few operators have mitigated that risk with currently available technology. It’s now possible to non-intrusively upgrade existing data centers with supervisory cooling management systems that compensate for, and alert operators to, cooling failures. Changes in IT load, environmental conditions, or even human error can be addressed quickly, avoiding what could become an out-of-control incident resulting in downtime, loss of availability, and something anathema to colo operators: SLA penalties.

It’s incumbent on facilities operators and business management to evaluate and install the latest technology that puts not only operational visibility, but essential control, in their hands before the next avoidable incident occurs.

Data Center Risk

Surprising Areas of Data Center Risk and How to Proactively Manage Them

Mission critical facilities need a different level of scrutiny and control over cooling management.

It’s no surprise that cooling is critical to the security of these facilities. With requirements for 99.999% uptime and multimillion-dollar facilities at risk, cooling is often the thin line between data safety and disaster.

And yet, many mission critical facilities use cooling control systems that were designed for comfort cooling, not for the reliable operation of hugely valuable and sensitive equipment.

When people get warm, they become uncomfortable. When IT equipment overheats, it fails – often with catastrophically expensive results.

In one recent scenario, a six-minute chiller plant failure resulted in lost revenue and penalties totaling $14 million. In another, the failure of a single CRAC unit caused temperatures in one zone to shoot past 100 degrees Fahrenheit, resulting in the failure of a storage array.

These failures result from a myriad of complex and often unrecognized risk areas. My recent talk at the i4Energy Seminar series hosted by the California Institute for Energy and Environment (CIEE) exposes some of these hidden risk areas and what you can do about them.

You can watch that talk here:

2011 Reflections

There is a saying in the MEP consulting business: “no one ever gets sued for oversizing.” That fear-driven mentality also affects the operation of mechanical systems in data centers, which explains why data centers are over-cooled at great expense. But few facility managers know by how much. The fact is that it has been easier – and, to date, safer – to over-cool a data center as the importance of the data it contains has increased and, with that importance, the pressure to protect it.

Last year that changed. With new technology, facility managers can know exactly how much cooling a data center requires at any given time. Perhaps more importantly, the technology provides warning – and reaction time – in the rare instances when temperatures rise unexpectedly. With it, data center cooling can now be “dynamically right-sized,” and the risk of dynamic management can be made lower than that of manual operation, which is prone to human error.

In our own nod to the advantages of this technology, we renamed the company I co-founded in 2004 from Federspiel Corporation to Vigilent Corporation. As our technology grew in sophistication, we felt the new name – denoting vigilance and intelligent oversight of facility heating and cooling operations – better reflected the new reality of data center cooling management. Last year, through smart, automated management of data center energy consumption, Vigilent reduced the carbon emissions and energy consumption of cooling systems by 30-40%. These savings will continue year after year, benefiting not only those companies’ bottom lines but also their corporate sustainability objectives. And they have been accomplished while maintaining the integrity and desired temperatures of data centers of all sizes and configurations across North America and Japan.

I’m proud of what we achieved last year. And I’m proud of the companies that have stepped up to embrace technology that can replace fear with certainty, and waste with efficiency.

Unexpected Savings

Data Center Cooling Systems Return Unexpected Maintenance Cost Savings

Advanced cooling management in critical facilities such as data centers and telecom central offices can save tons of energy (pun intended). Using advanced cooling management to achieve always-ready, inlet-temperature-controlled operation, versus the typical always-on, always-cold approach, yields huge energy savings.

But energy savings isn’t the only benefit of advanced cooling management. NTT America recently took a hard look at some of the direct, non-energy savings of an advanced cooling system. They quantified savings from reduced maintenance costs, increased cooling capacity from existing resources, improved thermal management and deferred capital expenditures. Their analysis found that the non-energy benefits increased the total dollar savings by one-third.

Consider first the broader advantages of reduced maintenance costs. Advanced cooling management identifies when CRACs are operating inefficiently, and equipment that doesn’t need to be on can be turned off: equipment that isn’t running isn’t wearing out. Reducing wear and tear reduces the chance of an unexpected failure, which is always something to avoid in a mission-critical facility. One counter-intuitive result of turning off lightly provisioned CRACs is that inlet air temperatures actually drop by a few degrees. Lower inlet air temperature also reduces the risk of IT equipment failure and increases ride-through time in the event of a cooling system failure.

The maintenance and operations cost savings of advanced cooling management are significant, but avoiding downtime is priceless.

Cooling Tips

Ten Tips For Cooling Your Data Center

Even as data centers grow in size and complexity, there are still relatively simple and straightforward ways to reduce data center energy costs. And if you are looking at an overall energy cost reduction plan, it makes sense to start with cooling, which likely accounts for at least 50% of your data center energy spend. Start with the assumption that your data center is over-cooled and consider the following:

Turn Off Redundant Cooling Units. You know you have them; figure out which are truly unnecessary and turn them off. Of course, this can be tricky. See my previous blog on Data Center Energy Savings.

Raise Your Temperature Setting. You can stay within ASHRAE limits and likely raise the temperature a degree or two.

Turn Off Your Humidity Controls. Unless you really need them – and most data centers don’t.

Use Variable Speed Drives. These are among the biggest energy-efficiency drivers in a data center – but don’t run them all at 100%, which defeats their purpose.

Use Plug Fans for CRAH Units. They have twice the efficiency and distribute air more effectively.

Use Economizers.  Take advantage of outside air when you can.

Use An Automated Cooling Management System. Remove the guesswork.

Use Hot and Cold Aisle Arrangements. Don’t blow hot exhaust air from some servers into the inlets of other servers.

Use Containment. Reduce air mixing within a single space.

Remove Obstructions. This sounds simple, but a poorly placed cart can create a hot spot. Check every day.
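The "remove the guesswork" idea behind an automated cooling management system can be sketched as a tiny supervisory loop: keep the hottest rack inlet safely below a limit while shedding units that aren't needed. The names, thresholds, and on/off policy below are purely illustrative – a real system models unit influence rather than toggling blindly:

```python
# Hypothetical supervisory loop. INLET_LIMIT_C echoes an ASHRAE-style
# recommended upper bound; both constants are illustrative assumptions.
INLET_LIMIT_C = 27.0
MARGIN_C = 2.0

def adjust_cooling(inlet_temps_c, crac_on):
    """Return updated on/off states for the CRAC fleet.

    inlet_temps_c: rack inlet temperatures (deg C)
    crac_on: current CRAC states (True = running)
    """
    hottest = max(inlet_temps_c)
    crac_on = list(crac_on)
    if hottest > INLET_LIMIT_C - MARGIN_C:
        # Approaching the limit: bring one idle unit back online.
        if False in crac_on:
            crac_on[crac_on.index(False)] = True
    elif hottest < INLET_LIMIT_C - 2 * MARGIN_C and sum(crac_on) > 1:
        # Plenty of headroom: shed one unit to save energy,
        # always keeping at least one running.
        crac_on[crac_on.index(True)] = False
    return crac_on

# Over-cooled room: one unit can be shed.
print(adjust_cooling([21.0, 22.5, 20.8], [True, True, True]))
# -> [False, True, True]
```

Run on every sensor sample, a loop like this continuously right-sizes cooling instead of leaving every unit on "just in case."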

Here’s an example of the effect an automated cooling management system can provide.

The first section shows a benchmark of the data center’s energy consumption prior to automated cooling. The second shows consumption after the automated cooling system was turned on. The third shows consumption when the system was turned off and manual control resumed, and the fourth shows consumption with fully automated control restored. Notice that the energy savings were almost completely eroded within a month of manual control, but returned immediately once automatic control resumed.