Consequence Planning Avoids Getting Trapped Between a Rack and a Hot Place

A decade of deploying machine learning in data centers and telecom switching centers throughout the world has taught us a thing or two about risk and reliability management.

In the context of reliability engineering, risk is often defined as the probability of failure times the consequence of that failure. The failure itself, therefore, is only half of the risk consideration. The resulting consequences are equally, and sometimes more, relevant. Data centers typically manage risk with redundancy, which reduces the chance that a component failure causes a business interruption. In other words, redundancy reduces the consequence of a single component failure: if a failure occurs, a redundant component ensures continuity.
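
To make that arithmetic concrete, here is a minimal sketch in Python. The component names, failure probabilities, and consequence costs are invented for illustration; the point is that the same failure probability can carry very different risk depending on the consequence.

```python
# Minimal sketch of risk as probability times consequence.
# Component names, probabilities, and costs below are hypothetical.

components = {
    # name: (annual failure probability, consequence cost in dollars if it fails)
    "CRAC-07 (redundant pair)": (0.10, 5_000),     # a spare unit absorbs the load
    "CRAC-12 (no backup)":      (0.10, 250_000),   # racks overheat, outage likely
}

for name, (p_failure, consequence) in components.items():
    risk = p_failure * consequence  # expected annual loss
    print(f"{name}: risk = {p_failure:.2f} * ${consequence:,} = ${risk:,.0f}/yr")
```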

When people talk about the role of machine learning in risk and reliability management, most view machine learning from a similar perspective – as a tool for predicting the failure of single components.

But this focus falls short of the true capabilities of machine learning. Don’t get me wrong, predicting the probability of failure is useful – and difficult – to do. But it only has value when the consequence of the predicted failure is significant.

When data centers and telecom switching centers perform and operate as designed, the consequences of most failures are typically small. But most data centers don’t operate as designed, especially the longer they run.

Vigilent uses machine learning to predict the consequences of control actions. We use machine learning to train our Influence Map™ to make accurate predictions of the effects of cooling control actions, including what will happen when a cooling unit is turned on or off. If the Influence Map predicts that turning a particular unit off would cause a rack to become too hot, the system won’t turn that cooling unit off.
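
For readers who like to see the idea in code, the sketch below shows the gating logic in heavily simplified form. The predict_rack_temps function is a stand-in for a trained influence model, and the unit names and the 27 C inlet limit are illustrative assumptions, not our product’s actual API.

```python
# Hedged sketch: gate a control action on its predicted consequences.
# predict_rack_temps stands in for a learned influence model; names,
# numbers, and the 27 C limit are illustrative assumptions.

MAX_INLET_C = 27.0  # example upper limit for rack inlet temperature

def predict_rack_temps(unit_states):
    """Placeholder model: cooling-unit on/off states -> predicted rack inlet temps (C)."""
    temps = {"rack-A1": 23.0, "rack-A2": 24.5}
    if not unit_states.get("CRAC-03", True):   # toy effect: losing CRAC-03 warms rack-A2
        temps["rack-A2"] += 4.0
    return temps

def safe_to_turn_off(unit, current_states):
    """Allow the action only if no rack is predicted to exceed the inlet limit."""
    proposed = {**current_states, unit: False}
    return all(t <= MAX_INLET_C for t in predict_rack_temps(proposed).values())

states = {"CRAC-03": True, "CRAC-04": True}
print(safe_to_turn_off("CRAC-03", states))  # False -> keep CRAC-03 running
print(safe_to_turn_off("CRAC-04", states))  # True  -> safe to turn off in this toy model
```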

The same process can be used to predict the consequence of a cooling unit failure. In other words, the Influence Map can predict the potential business impact of a particular cooling unit failure, such as whether a rack will get hot enough to impact business continuity. This kind of failure analysis simultaneously estimates the redundancy of the cooling system.

This redundancy calculation doesn’t merely compare the total cooling capacity with the total heat load of the equipment. Fully understanding the consequence of a failure requires both predictive modeling and machine learning. Together, these technologies accurately model actual, real-time system behavior in order to predict and manage the cost of that failure.
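
The same kind of model can be swept across single-unit failures to estimate redundancy. The sketch below is illustrative only; it reuses the placeholder model and thermal limit assumed above.

```python
# Hedged sketch of an N-1 style sweep: predict the consequence of each
# single cooling-unit failure. All names and numbers are illustrative.

def predict_rack_temps(unit_states):
    """Same illustrative placeholder model as in the sketch above."""
    temps = {"rack-A1": 23.0, "rack-A2": 24.5}
    if not unit_states.get("CRAC-03", True):
        temps["rack-A2"] += 4.0
    return temps

def failure_consequences(units, limit_c=27.0):
    """For each single-unit failure, list the racks predicted to exceed the limit."""
    report = {}
    for failed in units:
        states = {u: (u != failed) for u in units}  # everything running except the failed unit
        report[failed] = [r for r, t in predict_rack_temps(states).items() if t > limit_c]
    return report

for unit, hot_racks in failure_consequences(["CRAC-03", "CRAC-04"]).items():
    print(f"Loss of {unit}: " + ("covered by redundancy" if not hot_racks
                                 else f"predicted to overheat {hot_racks}"))
```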

This is why the distinction between failures and consequences matters. Knowing the consequences of a failure enables you to predict its cost.

Some predicted failures might not require a 3 a.m. dispatch. In my next blog, I’ll outline the material advantages of understanding consequences and the resulting effect on redundancy planning and maintenance operations.

The Real Cost of Cooling Configuration Errors

Hands in the network cause problems. A setting adjusted once, based on someone’s instinct about what needed to change at one moment in time, often remains unmodified years later.

This is configuration rot. If your data center has been running for a while, the chances are pretty high that your cooling configurations, to name one example, are wildly out of sync. It’s even more likely you don’t know about it.

Every air conditioner is controlled by an embedded computer. Each computer supports multiple configuration parameters. Each of these different configurations can be perfectly acceptable. But a roomful of air conditioners with individually sensible configurations can produce bad outcomes when their collective impact is considered.

I recently toured a new data center in which each air conditioner supported 17 configuration parameters affecting temperature and humidity. There was a lot of unexplained variation in the configurations: six of the 17 settings varied by more than 30% from unit to unit, and only five were identical across units. Configuration variation at the outset, and configuration entropy over time, wastes energy and prevents the overall air conditioning system from producing an acceptable temperature and humidity distribution.
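
A simple audit script makes this kind of drift easy to spot. The sketch below flags any parameter whose unit-to-unit spread exceeds 30 percent; the unit names, parameters, and values are invented for illustration.

```python
# Hedged sketch of flagging configuration drift across air conditioners.
# Unit names, parameters, and values are invented.

unit_configs = {
    "CRAH-01": {"supply_air_setpoint_c": 18.0, "fan_speed_pct": 70, "humidity_setpoint_pct": 45},
    "CRAH-02": {"supply_air_setpoint_c": 22.0, "fan_speed_pct": 95, "humidity_setpoint_pct": 45},
    "CRAH-03": {"supply_air_setpoint_c": 18.5, "fan_speed_pct": 60, "humidity_setpoint_pct": 50},
}

def drifted_parameters(configs, spread_threshold=0.30):
    """Return parameters whose unit-to-unit spread exceeds the threshold (as a fraction of the minimum)."""
    flagged = []
    for param in next(iter(configs.values())):
        values = [cfg[param] for cfg in configs.values()]
        lo, hi = min(values), max(values)
        if lo and (hi - lo) / lo > spread_threshold:
            flagged.append(param)
    return flagged

print(drifted_parameters(unit_configs))  # ['fan_speed_pct'] -- a candidate for review
```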

Configuration errors contribute to accidental de-rating and loss of capacity. This wastes energy, and it’s costly from a capex perspective. Perhaps you don’t need a new air conditioner. Instead, perhaps you can optimize or synchronize the configurations of the air conditioners you already have and unlock the capacity you need. Another common misconfiguration is incompatible set points: if one air conditioner is trying to make a room colder while another is trying to make it warmer, the units will fight.
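
Spotting that kind of fight can be as simple as comparing set points across units serving the same space, as in this illustrative sketch (the unit names, set points, and two-degree conflict margin are assumptions):

```python
# Hedged sketch of detecting conflicting set points among co-located units.
# Names, set points, and the 2 C margin are invented.

return_air_setpoints_c = {"CRAC-A": 21.0, "CRAC-B": 25.5, "CRAC-C": 22.0}
CONFLICT_MARGIN_C = 2.0

ranked = sorted(return_air_setpoints_c.items(), key=lambda kv: kv[1])
(coldest_name, coldest), (warmest_name, warmest) = ranked[0], ranked[-1]

if warmest - coldest > CONFLICT_MARGIN_C:
    print(f"Potential fight: {coldest_name} targets {coldest} C while "
          f"{warmest_name} targets {warmest} C in the same space.")
```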

Configuration errors also contribute to poor free cooling performance. Misconfiguration can lock out free cooling in many ways.

The problem is significant. Large organizations use thousands of air conditioners, and manual management of individual configurations is impossible. Do the math: if you have 2,000 air conditioners, each with up to 17 configuration parameters, you have 34,000 individual settings to manage, not to mention the additional external variables. How can you manage, much less optimize, these configurations over time?

Ideally, you need intelligent software that manages these configurations automatically. You need templates that prescribe optimized configuration. You need visibility to determine, on a regular basis, which configurations are necessary as conditions change. You need exception handling, so you can temporarily change configurations when you perform tasks such as maintenance, equipment swaps, and new customer additions, and then make sure the configurations return to their optimized state afterward. And, you need a system that will alert you when someone tries to change a configuration, and/or enforce optimized configurations automatically.
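
In code terms, the core of such a system is an audit against a prescribed template. The sketch below is only illustrative; the template values, unit names, and the notion of an approved exception are assumptions, not a description of any particular product.

```python
# Hedged sketch of auditing unit configurations against a prescribed template.
# Template values, unit names, and the approved exception are illustrative.

template = {"supply_air_setpoint_c": 18.0, "fan_speed_pct": 70, "humidity_setpoint_pct": 45}

# Temporary, documented deviations (e.g. during maintenance) that should not raise alerts.
approved_exceptions = {("CRAH-02", "fan_speed_pct")}

def audit(configs):
    """Return an alert for every setting that deviates from the template without an approved exception."""
    alerts = []
    for unit, cfg in configs.items():
        for param, expected in template.items():
            actual = cfg.get(param)
            if actual != expected and (unit, param) not in approved_exceptions:
                alerts.append(f"{unit}: {param} = {actual}, expected {expected}")
    return alerts

current = {
    "CRAH-01": {"supply_air_setpoint_c": 18.0, "fan_speed_pct": 70, "humidity_setpoint_pct": 50},
    "CRAH-02": {"supply_air_setpoint_c": 18.0, "fan_speed_pct": 95, "humidity_setpoint_pct": 45},
}
for alert in audit(current):
    print(alert)  # CRAH-01: humidity_setpoint_pct = 50, expected 45
```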

This concept isn’t new. It’s just rarely done. But if you aren’t aggressively managing configurations, you are losing money.

2016 and Looking Forward

To date, Vigilent has saved more than 1 billion kilowatt hours of energy, delivering $100 million in savings to our customers.  This also means we reduced the amount of CO2 released into the atmosphere by over 700,000 metric tons, equivalent to not acquiring and burning almost 4000 railcars of coal.  This matters because climate change is real.

Earlier this year, Vigilent announced its support for the Low-Carbon USA initiative, a consortium of leading businesses across the United States that support the Paris Climate Accord and its goal of limiting global temperature rise to well below 2 degrees Celsius.  Conservation plays its part, but innovation driving efficiency and renewable power creation will make the real difference.  Vigilent and its employees are fiercely proud to be making a tangible difference every day with the work that we do.

Beyond this remarkable energy savings milestone, I am very proud of the market recognition Vigilent achieved this year.  Bloomberg recognized Vigilent as a “New Energy Pioneer.”  Fierce Innovation named Vigilent Best in Show: Green Application & Data Centers (telecom category).

Of equal significance, Vigilent has become broadly recognized as a leader in the emerging field of industrial IoT.  With our early start in this industry, integrating sensors and machine learning for measurable advantage long before they ever became a “thing,” Vigilent has demonstrated significant market traction with concrete results.  The industry has recognized Vigilent’s IoT achievements with the following awards this year:

TiE50: Top Startup, IoT
IoT Innovator: Best Product, Commercial and Industrial Software

We introduced Vigilent prescriptive analytics this summer with shocking results, and I say that in a good way.  Our customers have uniformly received insights that surprised them, ranging from unrealized capacity to failing equipment in critical areas.  The analytics are also helping customers meet SLA requirements with virtually no extra work and identify areas drifting out of compliance, enabling facility operators to resolve issues as soon as a temperature goes beyond a specified threshold.
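
The threshold-compliance piece of this is conceptually simple, as the illustrative sketch below shows; the sensor names, readings, and SLA limit are invented.

```python
# Hedged sketch of flagging temperature readings that breach an SLA threshold.
# Sensor names, readings, and the 27 C limit are invented.

SLA_LIMIT_C = 27.0

latest_readings = {
    "row-3 / rack-12 inlet": 24.1,
    "row-3 / rack-18 inlet": 28.4,
    "row-7 / rack-02 inlet": 26.9,
}

breaches = {sensor: temp for sensor, temp in latest_readings.items() if temp > SLA_LIMIT_C}
for sensor, temp in sorted(breaches.items(), key=lambda kv: -kv[1]):
    print(f"ALERT: {sensor} at {temp:.1f} C exceeds the SLA limit of {SLA_LIMIT_C:.0f} C")
```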

Vigilent dynamic cooling management systems are actively used in the world’s largest colos and telcos, and in Fortune 500 companies spanning the globe.  We have expanded relationships with long-term partners NTT Facilities and Schneider Electric, who have introduced Vigilent to new regions such as Latin America and Greater Asia.  We signed a North America-focused partnership with Siemens, which leverages Siemens Demand Flow and the Vigilent system to optimize efficiency and manage data center challenges across the white space and chiller plant.  We are very pleased that the world’s leading data center infrastructure and service vendors have chosen to include Vigilent in their solution portfolios.

We thank you, our friends, customers and partners, for your continued support and look forward to another breakout year as we help the businesses of the world manage energy use intelligently and combat climate change.


Analytics in Action for Data Center Cooling

When a data center is first designed, everything is tightly controlled. Rack densities are all the same. The layout is precisely planned and very consistent. Power and space constraints are well-understood. The cooling system is modeled – sometimes even with CFD – and all of the cooling units operate at the same level.

But the original design is often a short-lived utopia. The reality of most data centers becomes much more complex as business needs and IT requirements change and equipment moves in and out.

As soon as the physical infrastructure changes, cooling capacity and redundancy are affected.  Given the gap between design and operational reality, many organizations have not had the tools to understand what has changed or degraded, so they cannot make informed decisions about their cooling infrastructure.  Traditional DCIM products often focus on space, network and power.  They don’t provide detailed, measured data on the cooling system, so decisions about cooling are made without visibility into actual conditions.

Analytics can help. Contrary to prevailing views, analytics don’t necessarily take a lot of know-how or data analysis skills to be extremely helpful in day-to-day operations management. Analytics can be simple and actionable. Consider the following examples of how a daily morning glance at thermal analytics helped these data center managers quickly identify and resolve some otherwise tricky thermal issues.

In our first example, the manager of a legacy, urban colo data center with DX CRAC units was asked to determine the right place for some new IT equipment. There were several areas with space and power available, but determining which of these areas had sufficient cooling was more challenging. The manager used a cooling influence map to identify racks cooled by multiple CRACs. He then referenced a cooling capacity report to confirm that more than one of these CRACs had capacity to spare. By using these visual analytics, the manager was able to place the IT equipment in an area with sufficient, and redundant, cooling.
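
In code, that placement decision boils down to intersecting the influence map with spare-capacity data. The sketch below is illustrative only; the rack names, influence relationships, and capacity figures are invented.

```python
# Hedged sketch of the placement decision described above: shortlist racks
# cooled by at least two CRACs that each have spare capacity. Data is invented.

influence_map = {  # rack -> CRACs with meaningful cooling influence on it
    "rack-B04": ["CRAC-1", "CRAC-2"],
    "rack-C11": ["CRAC-2"],
    "rack-D07": ["CRAC-3", "CRAC-4"],
}
spare_capacity_kw = {"CRAC-1": 12.0, "CRAC-2": 3.0, "CRAC-3": 15.0, "CRAC-4": 9.0}

def candidate_locations(new_load_kw):
    """Racks where two or more influencing CRACs each have headroom for the new load."""
    candidates = []
    for rack, cracs in influence_map.items():
        with_headroom = [c for c in cracs if spare_capacity_kw[c] >= new_load_kw]
        if len(with_headroom) >= 2:
            candidates.append(rack)
    return candidates

print(candidate_locations(new_load_kw=8.0))  # ['rack-D07'] -- redundantly cooled with headroom
```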

In a second facility, a mobile switching center for a major telco, the manager noticed a hot spot on the thermal map and sent a technician to investigate the location. The technician saw that some of the cooling coils had low delta T even though the valves were open, which implied a problem with the hydronics. Upon physical investigation of the area, he discovered that this was caused by trapped air in the coil, so he bled it off. The delta T quickly went from 3 to 8.5 – a capacity increase of more than 65 percent – as displayed on the following graph:


[Graph: delta T]

These examples are deceptively simple. But without analytics, the managers could not have so easily identified the exact location of the problem and the cooling units involved, or gathered enough information to direct troubleshooting within the short time available to resolve problems in a mission-critical facility.

Analytics typically use the information already available in a properly monitored data center. They complement the experienced intuition of data center personnel with at-a-glance data that helps identify potential issues more quickly and bypasses much of the tedious, blood-pressure-raising, time-consuming diagnostic work of hotspot resolution.

Analytics are not the future. Analytics have arrived. Data centers that aren’t taking advantage of them are riskier and more expensive to operate, and place themselves at a competitive disadvantage.

DCIM & ERP

Yes, DCIM Systems ARE like ERP Systems, Critical for Both Cost and Risk Management

Technology and manufacturing companies nearly all use sophisticated ERP systems for oversight of the myriad functions that contribute to a company’s operation.  Service companies use SAP.

Data center managers more typically rely on their own experience.  With all due respect to that experience, the complexity of today’s data center has long surpassed the ability of any human, or even group of humans, to manage it for maximum safety and efficiency.

As data centers have come to acknowledge this fact, they are increasingly adopting DCIM, the data center’s answer to ERP.   The similarities between ERP systems and DCIM are striking.

Just as manufacturing and technology firms needed a system to manage the complexity of operations, data center operations have grown and matured to the state that such systems are now required as well.

Data Center Knowledge’s Jason Verge says that “… [DCIM] is being touted as the ERP for the data center; it is addressing a complicated challenge.  When a device is introduced, changes or fails, it changes the make-up of these complex facilities.”

Mark Harris of Nlyte said in a related Data Center Journal article: “DCIM was envisioned to become the ERP for IT.  It was to become the enabler for the IT organization to extend and manage their span of control, much like all other organizations (Sales, Engineering, manufacturing, Finance, etc.) had adopted over the years.”

Just like ERP systems, DCIM attempts to de-silo operations and shed light, along with management control, on cost and waste, while also addressing risk concerns.  In initial DCIM deployments the focus has understandably been on asset management: understanding what equipment you have, and whether it is appropriate for your challenges, was the right place to start.  However, DCIM vendors and users quickly realized that eliminating energy waste, particularly energy wasted by unused IT assets, was another useful area of focus.  Cooling, as a resource or even as an area of waste, was a tertiary concern.  Business managers no longer have this luxury: the cost of cooling, and the risk of a cooling/heating-related data center failure, is too high.  As Michelle Bailey, VP of 451 Datacenter Initiatives and Digital Infrastructure, said in a recent webinar on next-generation data centers, data centers have become too big to fail.  She also said that data centers still use imprecise measures of accountability that don’t match up to business goals, that processes must be made more transparent to business managers, and that metrics must be established that tie directly back to business goals.

Data center managers can and do make extremely expensive energy-related decisions in order to reduce risk.  These may not even be bad decisions.  But without the site visibility and transparency that Michelle describes, business managers don’t realize that these decisions are being made at all, or that there may be options which, with more analysis, make more sense from a business cost and risk trade-off perspective.  And while cost is one driver of the need for management oversight, waste (and its obvious effect on cost) is another.

As an example, a facility manager may turn the chiller plant down a degree to manage cost and a perception of risk.  This action has the cost equivalent of expensing a Tesla, but it is typically invisible to management.  Nor does the facility manager usually realize that less expensive and even less risky alternatives exist, because he or she has never had to consider them.  Facility managers are not traditionally accountable for energy savings; they are accountable for uptime.  This thinking is outdated.  The two are no longer mutually exclusive; in fact, they are inextricably tied.  Proactively and intelligently managed energy saves money and reduces downtime risk by reducing the possibility of cooling failures.  If DCIM, like an ERP system, is to be used to understand and manage where cost, and waste, is being generated, it must specifically address and incorporate the cooling infrastructure.

DCIM systems that offer granular data center information, aggregated and analyzed for business-case analysis, enable this kind of oversight and, with it, improved operational management.


Intelligent Efficiency

Intelligent Efficiency, The Next New Thing.

Greentech Media’s senior editor Stephen Lacey reported that the convergence of the internet and distributed energy is contributing to a new economic paradigm for the 21st century.

Intelligent efficiency is the next new thing enabled by that paradigm, he says, in a special report  of the same name.  He also notes that this isn’t the “stale, conservation-based energy efficiency Americans often think about.”  He says that the new thinking around energy efficiency is information-driven.  It is granular. And it empowers consumers and businesses to turn energy from a cost into an asset.

I couldn’t agree more.

Consider how this contrast in thinking alone generates possibilities for resources that have been hidden or economically unavailable until now.

Conservation-based thinking or, as I think about it in data centers, “efficiency by design or replacement,” is capital intensive.  To date, this thinking has been focused on new construction, physical infrastructure change, or equipment swap-outs.  These efforts are slow and can’t take advantage of operational variations such as the time-varying costs of energy.

Intelligent energy efficiency thinking, on the other hand, leverages newly available information enabled by networked devices and wireless sensors  to make changes primarily through software.  Intelligent energy management is non-disruptive and easier to implement.  It reduces risk by offering greater transparency.   And, most importantly, it is fast.  Obstacles to the speed of implementation – and the welcome results of improved efficiency – have been removed by technology.

Intelligence is the key factor here.  You can have an efficient system and an efficient design, but if it isn’t operated effectively, it is inefficient in practice.  For example, you may deploy one perfectly efficient machine right next to another perfectly efficient machine, believing that you have installed a state-of-the-art solution.  In reality, it’s more likely that these two machines are interacting and fighting with each other, at significant energy cost.  You also need to factor in and be able to track equipment degradation, as well as the risks incurred by equipment swap-outs.

You need the third element – intelligence – working in tandem with efficient equipment, to make sure that the whole system works at peak level and continues to work at peak level, regardless of the operating conditions.  This information flow must be constant.  Even the newest, most perfectly optimized data centers will inevitably change.

Kudos to Greentech Media for this outstanding white paper and for highlighting how this new thinking and the “blending of real-time communications with physical systems” are changing the game for energy efficiency.