The Real Cost of Cooling Configuration Errors

Hands in the network cause problems. A setting adjusted once, based on someone’s instinct about what needed to change at that moment, often remains unmodified years later.

This is configuration rot. If your data center has been running for a while, the chances are pretty high that your cooling configurations, to name one example, are wildly out of sync. It’s even more likely you don’t know about it.

Every air conditioner is controlled by an embedded computer, and each computer supports multiple configuration parameters. Any individual configuration can be perfectly acceptable on its own. But a roomful of air conditioners with individually sensible configurations can produce bad outcomes when their collective impact is considered.

I recently toured a new data center in which each air conditioner supported 17 configuration parameters affecting temperature and humidity. There was a lot of unexplainable variation in the configurations. Six of the 17 settings varied by more than 30% from unit to unit; only five were identical across units. Configuration variation at the outset, and entropy over time, waste energy and prevent the overall air conditioning system from producing an acceptable temperature and humidity distribution.
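
As a rough illustration, a simple script can surface this kind of unit-to-unit drift. The sketch below is hypothetical: the unit names, parameter names, and values are invented for illustration, not taken from any particular controller.

```python
# Hypothetical sketch: measure setting spread across a fleet of cooling units.
# Unit names, parameter names, and values are illustrative only.
from statistics import mean

unit_configs = {
    "crac-01": {"supply_air_setpoint_f": 65.0, "humidity_setpoint_pct": 45.0, "fan_speed_pct": 80.0},
    "crac-02": {"supply_air_setpoint_f": 72.0, "humidity_setpoint_pct": 50.0, "fan_speed_pct": 55.0},
    "crac-03": {"supply_air_setpoint_f": 68.0, "humidity_setpoint_pct": 45.0, "fan_speed_pct": 100.0},
}

for param in unit_configs["crac-01"]:
    values = [cfg[param] for cfg in unit_configs.values()]
    spread_pct = (max(values) - min(values)) / mean(values) * 100  # unit-to-unit spread
    flag = "  <-- review" if spread_pct > 30 else ""
    print(f"{param}: min={min(values)}, max={max(values)}, spread={spread_pct:.0f}%{flag}")
```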

Configuration errors contribute to accidental de-rating and loss of capacity. This wastes energy, and it’s costly from a capex perspective. Perhaps you don’t need a new air conditioner. Instead, perhaps you can optimize or synchronize the configurations of the air conditioners you already have and unlock the capacity you need. Another common configuration error is incompatible set points. If one air conditioner is trying to make a room colder and another is trying to make it warmer, the units will fight each other.

Configuration errors also contribute to poor free cooling performance. Misconfiguration can lock out free cooling in many ways.

The problem is significant. Large organizations use thousands of air conditioners, and manual management of individual configurations is impossible. Do the math: if you have 2,000 air conditioners, each with up to 17 configuration parameters, you have 34,000 individual settings to track, not to mention the additional external variables. How can you manage, much less optimize, these configurations over time?

Ideally, you need intelligent software that manages these configurations automatically. You need templates that prescribe optimized configurations. You need visibility to determine, on a regular basis, which configurations are necessary as conditions change. You need exception handling, so you can temporarily change configurations when you perform tasks such as maintenance, equipment swaps, and new customer additions, and then make sure the configurations return to their optimized state afterward. And you need a system that will alert you when someone tries to change a configuration, and/or enforce optimized configurations automatically.
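
A minimal sketch of what that could look like in software is below, assuming a prescribed template, a table of time-boxed exceptions, and a simple audit routine that flags drift for alerting. All names and values are hypothetical.

```python
# Hypothetical sketch of template-driven configuration management:
# compare each unit's live settings to a prescribed template, honor
# time-boxed maintenance exceptions, and flag anything else for alerting.
from datetime import datetime, timedelta

TEMPLATE = {"supply_air_setpoint_f": 68.0, "humidity_setpoint_pct": 45.0}

# Temporary exceptions, e.g. a unit under maintenance until a given time.
exceptions = {"crac-07": datetime.now() + timedelta(hours=4)}

def audit(unit_id, live_config, now=None):
    """Return a list of (parameter, live, expected) drift records."""
    now = now or datetime.now()
    if unit_id in exceptions and now < exceptions[unit_id]:
        return []  # exception window still open; skip enforcement
    return [(p, live_config.get(p), expected)
            for p, expected in TEMPLATE.items()
            if live_config.get(p) != expected]

drift = audit("crac-02", {"supply_air_setpoint_f": 72.0, "humidity_setpoint_pct": 45.0})
for param, live, expected in drift:
    print(f"ALERT crac-02: {param} is {live}, template prescribes {expected}")
```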

This concept isn’t new. It’s just rarely done. But if you aren’t aggressively managing configurations, you are losing money.

Why Don’t Data Centers Use Data?

Data analysis doesn’t readily fall into the typical data center operator’s job description.   That fact, and the traditional hands-on focus of those operators, isn’t likely to change soon.

But turning a blind eye to the flood of data now available to data centers through IoT technology, sensors and cloud-based analytics is no longer tenable.  While the data impact of IoT has yet to be fully realized, most data centers have already become too complex to be managed manually.

What’s needed is a new role entirely, one with dotted line/cross-functional responsibility to operations, energy, sustainability and planning teams.

Consider this.  The aircraft industry has historically been driven by design, mechanical and engineering teams.  Yet General Electric aircraft engines, as an example, throw off terabytes of data on every single flight.  This massive quantity of data isn’t managed by these traditional teams.  It’s managed by data analysts who continually monitor this information to assess safety and performance, and update the traditional teams who can take any necessary actions.

Like aircraft, data centers are complex systems.  Why aren’t they operated in the same data-driven way given that the data is available today?

Data center operators aren’t trained in data analysis, nor can they be expected to take it on.  The new data analyst role requires an understanding and mastery of an entirely different set of tools.  It requires domain-specific knowledge, so that incoming information can be intelligently monitored and triaged to determine what constitutes a red-flag event versus something that can be addressed during normal work hours to improve reliability or reduce energy costs.

It’s increasingly clear that managing solely through experience and physical oversight is no longer best practice and will no longer keep pace with the increasing complexity of modern data centers.  Planning or modeling based only on current conditions – or a moment in time –  is also not sufficient.  The rate of change, both planned and unplanned, is too great.  Data, like data centers, is fluid and multidimensional. 

Beyond the undeniable necessity of incorporating data into day-to-day operations to manage operational complexity, data analysis provides significant value-added benefit by revealing cost savings and revenue generating opportunities in energy use, capacity and risk avoidance.  It’s time to build this competency into data center operations.

A Look at 2014

In 2014 we leveraged the significant company, market and customer expansion we achieved in 2013 to focus on strategic partnerships.  Our goal was to significantly increase our global footprint with the considerable resources and vision of these industry leaders.  We have achieved that goal and more.

Together with our long-standing partner NTT Facilities, we continue to add power and agility to complementary data center product lines managed by NTT in pan-Asia deployments.  In partnership with Schneider Electric, we are proud to announce the integration of Vigilent dynamic cooling management technology into the Cooling Optimize module of Schneider Electric’s industry-leading DCIM suite, StruxureWare for Data Centers.

Beyond the technical StruxureWare integration, Vigilent has also worked closely with Schneider Electric to train hundreds of Schneider Electric sales and field operations professionals in preparation for the worldwide roll-out of Cooling Optimize.  Schneider Electric’s faith in us has already proven well-founded, with deployments underway across multiple continents.  With the reach of Schneider Electric’s global sales and marketing operations, their self-described “Big Green Machine,” and NTT Facilities’ expanding traction in and outside of Japan, we anticipate a banner year.

As an early adopter of machine learning, Vigilent has been recognized as a pioneer of the Internet of Things (IoT) for energy.  Data collected over seven years from hundreds of deployments continually informs and improves Vigilent system performance.  The analytics we have developed provide unprecedented visibility into data center operations and are driving the introduction of new Vigilent capabilities.

Business success aside, our positive impact on the world continues to grow.  In late 2014, we announced that Vigilent systems have reduced energy consumption by more than half a billion kilowatt hours and eliminated more than 351,000 tons of CO2 emissions.  These figures are persistent and grow with each new deployment.

We are proud to see our customers turn pilot projects into multiple deployments as the energy savings and data center operational benefits of the system prove themselves over and over again.  This organic growth is testimony to the consistency of the Vigilent product’s operation in widely varying mission critical environments.

Stay tuned to watch this process repeat itself as we add new Fortune 50 logos to our customer base in 2015.  We applaud the growing sophistication of the data center industry as it struggles with the dual challenges of explosive growth and environmental stewardship and remain thankful for our part in that process.

 

Data Center Capacity Planning – Why Keep Guessing?

Capacity management involves decisions about space, power, and cooling.

Space is the easiest. You can assess it by inspection.

Power is also fairly easy. The capacity of a circuit is knowable. It never changes. The load on a circuit is easy to measure.

Cooling is the hardest. The capacity of cooling equipment changes over time: it depends on how the equipment is operated, and it degrades with age. Harder still, cooling is distributed. Heat and air follow the paths of least resistance and don’t always go where you would expect. For these reasons and more, mission-critical facilities are designed and built with far more cooling capacity than they need. And yet many operators add even more cooling each time there is a move, add, or change to IT equipment, because that has been a safer bet than guessing wrong.

Here is a situation we frequently observe:

Operations will receive frequent requests to add or change IT loads as a normal course of business.  In large or multi-site facilities, these requests may occur daily.  Let’s say that operations receives a request to add 50 kW to a particular room.  Operations will typically add 70 kW of new cooling.

This provisioning is calculated assuming a full load for each server, with full load determined from server nameplate data.  In reality, it’s highly unlikely that all cabinets in a room will be fully loaded, and it’s equally unlikely that any server will ever draw its nameplate power.  And remember, the room was originally designed with excess cooling capacity.  When you add even more cooling to these rooms, you escalate over-provisioning.  Capital and energy are wasted.
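
To make the arithmetic concrete, here is an illustrative back-of-the-envelope comparison. The per-server numbers are assumptions chosen only to show how nameplate-based provisioning compounds the over-build; they are not measurements from any specific site.

```python
# Hypothetical numbers illustrating why nameplate-based provisioning overshoots.
nameplate_kw_per_server = 0.5     # faceplate rating (assumed)
measured_kw_per_server = 0.3      # typical measured draw (assumed)
servers_added = 100

requested_load_kw = servers_added * nameplate_kw_per_server      # 50 kW requested
likely_load_kw = servers_added * measured_kw_per_server          # ~30 kW in practice
cooling_added_kw = requested_load_kw * 1.4                       # 70 kW of new cooling

print(f"Cooling added: {cooling_added_kw:.0f} kW for a likely {likely_load_kw:.0f} kW of heat")
print(f"Over-provisioning factor: {cooling_added_kw / likely_load_kw:.1f}x")
```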

We find that cooling utilization is typically 35 to 40%, which leaves plenty of excess capacity for IT equipment expansions.  We also find that in 5-10% of situations, equipment performance and capacity have degraded to the point where cooling redundancy is compromised.  In these cases, maintenance becomes difficult and there is a greater risk of IT failure due to a thermal event. So it’s important to know how a room is running before adding cooling.  But it isn’t always easy to tell whether cooling units are performing as designed and specified.

How can operations managers make more cost-effective – and safer – planning decisions?  Analytics.

Analytics using real-time data provides managers with the insight to determine whether or not cooling infrastructure can handle a change or expansion to IT equipment, and to manage these changes while minimizing risk.  Specifically, analytics can quantify actual cooling capacity, expose equipment degradation, and reveal where there is more or less cooling reserve in a room for optimal placement of physical and virtual IT assets.
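
As a rough sketch of what such an analysis might compute, the example below estimates sensible load from per-unit sensor readings and compares it to rated capacity. The readings, ratings, and the simple 1.08 × CFM × ΔT approximation are illustrative, not the product’s actual method.

```python
# Minimal sketch, assuming each cooling unit reports airflow and
# return/supply air temperatures. All readings are illustrative.
units = [
    {"name": "crac-01", "rated_kw": 105, "airflow_cfm": 12000, "return_f": 74.0, "supply_f": 64.0},
    {"name": "crac-02", "rated_kw": 105, "airflow_cfm": 11500, "return_f": 73.0, "supply_f": 64.0},
]

def sensible_load_kw(airflow_cfm, return_f, supply_f):
    # Common approximation: BTU/hr = 1.08 * CFM * delta-T (F); 3412 BTU/hr per kW.
    return 1.08 * airflow_cfm * (return_f - supply_f) / 3412.0

total_rated = sum(u["rated_kw"] for u in units)
total_load = sum(sensible_load_kw(u["airflow_cfm"], u["return_f"], u["supply_f"]) for u in units)
print(f"Cooling utilization: {total_load / total_rated:.0%}")
print(f"Cooling reserve: {total_rated - total_load:.0f} kW")
```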

Consider the following analytics-driven capacity report.  Continually updated by a sensor network, the report displays exactly where capacity is available and where it is not.  With this data alone, you can determine where you can safely and immediately add IT load with no CapEx investment.  And, in those situations where you do need additional cooling, it will predict with high confidence what you need.

Cooling Capacity

Yet you can go deeper still.  By pairing the capacity report with a cooling reserve map (below), you can determine where you can safely place additional load in the desired room.  You can also see where you should locate your most critical assets and, when you do need that new air conditioner, where you should place it.

Cooling Reserve Map

Using these reports, operations can:

  • avoid the CapEx cost of more cooling every time IT equipment is added;
  • avoid the risk of cooling construction in production data rooms when it is often not needed;
  • avoid delaying time to revenue by adding cooling to a facility that doesn’t need it.

In addition, analytics used in this way avoids unnecessary energy and maintenance OpEx costs.

Stop guessing and start practicing the art of avoidance with analytics.

 

 

Maintenance is Risky

No real surprise here. Mission critical facilities that pride themselves on and/or are contractually obligated to provide the “five 9’s” of reliability know that sooner or later they must turn critical cooling equipment off to perform maintenance. And they know that they face risk each time they do so.

This is true even for the newest facilities. The minute a facility is turned up, or IT load is added, things start to change. The minute a brand new cooling unit is deployed, it starts to degrade – however incrementally. And that degree of degradation is different from unit to unit, even when those units are nominally identical.

In a risk and financial performance panel presentation at a recent data center event sponsored by Digital Realty, eBay’s Vice President of Global Foundation Services Dean Nelson stated that “touching equipment for maintenance increases Probability of Failure (PoF).” Nelson actively manages and focuses on reducing eBay’s PoF metric throughout the facilities he manages.

Performing maintenance puts most facility managers between the proverbial rock and a hard place. If equipment isn’t maintained, by definition you have a “run to failure” maintenance policy. If you do maintain equipment, you incur risk each time you turn something off. The telecom industry calls this “hands in the network” which they manage as a significant risk factor.

What if maintenance risks could be mitigated? What if you could predict what would happen to the thermal conditions of a room and, even more specifically, what racks or servers could be affected if you took a particular HVAC unit offline?

This ability is available today. It doesn’t require computational fluid dynamics (CFD) or other complicated tools that rely on physical models. It can be accomplished through data and analytics. That is, analytics continually updated by real-time data from sensors instrumented throughout a data center floor. Gartner Research says that hindsight based on historical data, followed by insight based on current trends, drives foresight.
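
A hedged sketch of the idea: if analytics have already quantified how much each cooling unit influences each rack, a what-if check before maintenance becomes a simple lookup. The influence values, rack names, and inlet limit below are invented for illustration.

```python
# Hypothetical what-if check: estimate rack inlet temperatures if a given
# CRAC is taken offline, using previously learned influence values.
influence = {            # influence[crac][rack] = inlet temp rise (F) if crac is off
    "crac-03": {"rack-A1": 4.5, "rack-A2": 2.0, "rack-B7": 0.3},
    "crac-04": {"rack-A1": 0.5, "rack-A2": 3.8, "rack-B7": 5.1},
}
current_inlet_f = {"rack-A1": 77.0, "rack-A2": 74.0, "rack-B7": 71.0}
INLET_LIMIT_F = 80.6     # upper end of the ASHRAE recommended range

def predict_offline(crac):
    for rack, rise in influence[crac].items():
        predicted = current_inlet_f[rack] + rise
        status = "AT RISK" if predicted > INLET_LIMIT_F else "ok"
        print(f"{crac} offline -> {rack}: {predicted:.1f} F ({status})")

predict_offline("crac-03")
```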

Using predictive analytics, facility managers can also determine exactly which units to maintain and when – in addition to understanding the potential thermal effect that each maintenance action will have on every location on the data center floor.

If this knowledge were easily available, what facility manager wouldn’t choose to take advantage of it before taking a maintenance action? My next blog post will provide a visual example of the analysis facility managers can perform to determine when and where to perform maintenance while simultaneously reducing risk to more critical assets and to the floor as a whole.

Machine Learning

Why Machine Learning-based DCIM Systems Are Becoming Best Practice.

Here’s a conundrum.  While data center IT equipment has a lifespan of about three years, data center cooling equipment endures for about 15.  In other words, your data center will likely undergo at least five complete IT refreshes within the lifetime of your cooling equipment.  In reality, refreshes happen much more frequently. Racks and servers come and go, floor tiles are moved, maintenance is performed, density is changed based on containment operations – any one of which will affect the ability of the cooling system to work efficiently and effectively.

If nothing is done to reconfigure cooling operations as IT changes are made, and this is typically the case, the data center develops hot and cold spots, strands cooling capacity, and wastes energy.  There is also risk with every equipment refresh – particularly if the work is done manually.

There’s a better way. The ubiquitous availability of low cost sensors, in tandem with the emerging availability of machine learning technology, is leading to development of new best practices for data center cooling management. Sensor-driven machine learning software enables the impact of IT changes on cooling performance to be anticipated and more safely managed.

Data centers instrumented with sensors gather real-time data that can inform software of minute-by-minute cooling capacity changes.  Machine learning software uses this information to understand the influence of each and every cooling unit on each and every rack, in real time, as IT loads change.  And when loads or IT infrastructure change, the software re-learns and updates itself, ensuring that its influence predictions remain accurate.   This ability to understand cooling influence at a granular level also enables the software to learn which cooling units are working effectively – and at expected performance levels – and which aren’t.
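
As a thought experiment, the sketch below shows one way such influence could be learned from history: regressing a rack’s inlet temperature against every unit’s cooling output and reading the fitted coefficients as per-unit influence. This is an assumed approach with synthetic data, not a description of Vigilent’s actual algorithm.

```python
# Minimal sketch: learn per-CRAC influence on one rack's inlet temperature
# from historical sensor data via least squares. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cracs = 500, 4
crac_output = rng.uniform(0, 100, size=(n_samples, n_cracs))   # % cooling output history
true_influence = np.array([-0.04, -0.01, -0.06, 0.0])          # F per % output (unknown in practice)
rack_inlet = 85 + crac_output @ true_influence + rng.normal(0, 0.3, n_samples)

# Least-squares fit recovers the influence of each unit on this rack.
X = np.column_stack([crac_output, np.ones(n_samples)])
coeffs, *_ = np.linalg.lstsq(X, rack_inlet, rcond=None)
for i, c in enumerate(coeffs[:-1]):
    print(f"crac-{i+1}: learned influence {c:+.3f} F per % output")
```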

This understanding also illuminates, in a data-supported way, the need for targeted corrective maintenance. With a clearer understanding and visualization of cooling unit health, operators can justify the budget needed to maintain equipment effectively, thereby improving overall health and reducing risk in the data center.

In one recent experience at a large US data center, machine learning software revealed that 40% of the cooling units were consuming power but not cooling.  The data center operator was aware of the problem, but couldn’t convince senior management to spend budget because he couldn’t quantify the problem or prove the value of a specific expenditure to resolve it.  With new and clear data in hand, the operator was able to identify the failed CRACs and present the budget required to repair or replace them.
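
A simple illustration of how that kind of condition can be flagged from data: compare each unit’s power draw with its return-to-supply delta-T. The readings and thresholds below are hypothetical.

```python
# Illustrative check for units drawing power but doing little or no cooling.
readings = [
    {"name": "crac-01", "power_kw": 6.2, "return_f": 74.0, "supply_f": 63.5},
    {"name": "crac-05", "power_kw": 5.8, "return_f": 70.2, "supply_f": 69.8},  # fan spinning, no cooling
]

MIN_DELTA_T_F = 2.0   # below this, the unit is effectively just moving air
MIN_POWER_KW = 1.0

for r in readings:
    delta_t = r["return_f"] - r["supply_f"]
    if r["power_kw"] > MIN_POWER_KW and delta_t < MIN_DELTA_T_F:
        print(f"{r['name']}: {r['power_kw']} kW drawn but delta-T only {delta_t:.1f} F - investigate")
```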

This ability to see more clearly the impact of IT changes on cooling equipment enables personnel to keep up with cooling capacity adjustments and, in most cases, eliminates the need for manual control.  Reducing these on-the-fly, floor-time corrections also frees operators to focus on problems that require more creativity and to manage physical changes, such as floor tile adjustments, more effectively.

There’s no replacement for experience-based human expertise. But why not leverage your staff to do what they do best, and eliminate those tasks that are better served by software control?  Data centers using machine learning software are undeniably more efficient and more robust.  Operators can more confidently future-proof themselves against inefficiency or adverse capacity impact as conditions change.  For these reasons alone, the use of machine learning-based software should be considered an emerging best practice.