With Data Centers, What Can Happen Will Happen (Eventually).

Because data centers and telecom switching centers are designed to withstand failures without interrupting business operations, a 3 a.m. emergency due to a malfunctioning air conditioner should never occur – in theory. But Murphy’s Law says that if a single failure can create an emergency, it will. So, to date, operators have had to react to single-component failures as if they are business-critical. Because they might be.

In my previous blog, I pointed out the two components of risk: the probability of and the consequence of failure. While both of these components are important in failure analysis, it is the consequence of failure that’s most effective at helping decision-makers manage the cost of failure.

If you know there is a high probability of impending failure, but you don’t know the potential consequence, you have to act as though every threat has the potential for an expensive business interruption. Taking such actions is typically expensive. But if you know the consequence, even without knowing the probability of failure, you can react to inconsequential failures at your leisure and plan so that consequential failures are less likely.

In the past, the consequences of a failure weren’t knowable or predictable. The combination of Internet of Things (IoT) data and machine learning has changed all that. It’s now possible to predict the consequence of failure by analyzing large quantities of historical sensor data. These predictions can be performed on demand and without the need for geometrical data hall descriptions.

The advantage of machine learning-based systems is that predictive models are continually tuned to actual operating conditions. Even as things change and scale over time, the model remains accurate without manual intervention. The consequences of actions, in addition to equipment failures, become knowable and predictable.

This type of consequence analysis is particularly important for organizations that have a run-to-failure policy for mechanical equipment. Run-to-failure is common in organizations with severe capital constraints, but it only works, and avoids business interruptions, if the consequence of the next failure is predictable.

Predicting the consequence of failure allows an operations team to avoid over-reacting to failures that do not affect business continuity. Rather than dispatching a technician in the middle of the night, an operations team can address a predicted failure with minimal or no consequence during its next scheduled maintenance. If consequence analysis indicates that a cooling unit failure may put more significant assets at risk, the ability to predict how much time is available before a critical temperature is reached provides time for graceful shutdown – and mitigation.
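As a rough illustration of that triage, the sketch below estimates how long a room has before it reaches a critical temperature and picks a response accordingly. The constant rise rate, the function names, and the example values are all assumptions made for illustration; a machine learning model trained on historical sensor data would supply the prediction in practice.

```python
from typing import Optional


def minutes_to_critical(current_c: float, critical_c: float,
                        rise_c_per_min: float) -> Optional[float]:
    """Estimate minutes until critical_c is reached at a constant rise rate.

    Returns None when no temperature rise is predicted
    (a simplifying assumption: the failure has no thermal consequence).
    """
    if rise_c_per_min <= 0:
        return None
    if current_c >= critical_c:
        return 0.0
    return (critical_c - current_c) / rise_c_per_min


def triage(current_c: float, critical_c: float, rise_c_per_min: float,
           hours_to_next_maintenance: float) -> str:
    """Choose a response from the predicted consequence, not from the failure itself."""
    remaining = minutes_to_critical(current_c, critical_c, rise_c_per_min)
    if remaining is None or remaining > hours_to_next_maintenance * 60:
        return "defer to scheduled maintenance"
    return f"respond now: about {remaining:.0f} minutes to critical temperature"
```

With a predicted rise of 0.5 °C per minute from 24 °C toward a 32 °C limit, `triage(24.0, 32.0, 0.5, 72)` reports about 16 minutes of margin; with no predicted rise, the same call defers to scheduled maintenance.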

Preventative maintenance carries risk, but equipment still needs to be shut off at times for maintenance. Will it cause a problem? Predictive consequence analysis can provide the answer. If there’s an issue with shutting off a particular unit, you can know in advance and provide spot cooling to mitigate the risks.

The ability to predict the consequences of failure, or of an intentional action such as preventative maintenance, gives facility managers greater control over the reliability of their facilities, and the peace of mind that their operations are as safe as possible.

2016 and Looking Forward

To date, Vigilent has saved more than 1 billion kilowatt hours of energy, delivering $100 million in savings to our customers.  This also means we reduced the amount of CO2 released into the atmosphere by over 700,000 metric tons, equivalent to not acquiring and burning almost 4000 railcars of coal.  This matters because climate change is real.

Earlier this year, Vigilent announced its support for the Low-Carbon USA initiative, a consortium of leading businesses across the United States that support the Paris Climate Accord with the goal of reducing global temperature rise to well below 2 degrees Celsius.  Conservation plays its part, but innovation driving efficiency and renewable power creation will make the real difference.  Vigilent and its employees are fiercely proud to be making a tangible difference every day with the work that we do.

Beyond this remarkable energy savings milestone, I am very proud of the market recognition Vigilent achieved this year.  Bloomberg recognized Vigilent as a “New Energy Pioneer.”  Fierce Innovation named Vigilent the Best in Show:  Green Application & Data Centers (telecom category).

Of equal significance, Vigilent has become broadly recognized as a leader in the emerging field of industrial IoT.  With our early start in this industry, integrating sensors and machine learning for measurable advantage long before they ever became a “thing,” Vigilent has demonstrated significant market traction with concrete results.  The industry has recognized Vigilent’s IoT achievements with the following awards this year:

  • TiE50 – Top Startup: IoT
  • IoT Innovator – Best Product: Commercial and Industrial Software

We introduced Vigilent prescriptive analytics this summer with shocking results, and I say that in a good way.  Our customers have uniformly received insights that surprised them.  These insights have ranged from unrealized capacity to failing equipment in critical areas.  The analytics are also helping customers meet SLA requirements with virtually no extra work and to identify areas ranging out of compliance, enabling facility operators to quickly resolve issues as soon as a temperature goes beyond a specified threshold.

Vigilent dynamic cooling management systems are actively used in the world’s largest colos and telcos, and in Fortune 500 companies spanning the globe.  We have expanded relationships with long-term partners NTT Facilities and Schneider Electric, who have introduced Vigilent to new regions such as Latin America and Greater Asia.  We signed a North America-focused partnership with Siemens, which leverages Siemens Demand Flow and the Vigilent system to optimize efficiency and manage data center challenges across the white space and chiller plant. We are very pleased that the world’s leading data center infrastructure and service vendors have chosen to include Vigilent in their solution portfolios.

We thank you, our friends, customers and partners, for your continued support and look forward to another breakout year as we help the businesses of the world manage energy use intelligently and combat climate change.

 

The Fastest Route to Using Data Analysis in Data Center Operations

The transition to data-driven operations within data centers is inevitable.  In fact, it has already begun.

With this in mind, my last blog questioned why data centers still resist data use, surmising that because data use doesn’t fall within traditional roles and training, third parties – and new tools – will be needed to help with the transition. “Retrofitting” existing personnel, at least in the short term, is unrealistic.  And time matters.

Consider the example of my Chevy Volt.  The Volt illustrates just how quickly a traditional industry can be caught flat-footed in a time of transition, opening opportunities for others to seize market share. The Volt is as much a rolling mass of interconnected computers as it is a car. It has 10 million lines of code. 10 million!  That’s more than an F-22 Raptor, the most advanced fighter plane on earth.

The Volt, of course, needs regular service just like any car.  While car manufacturers were clearly pivoting toward complex software-driven engines, car dealerships were still staffed with engine mechanics, albeit highly skilled mechanics.  During my service experience, the dealership had one guy trained and equipped to diagnose and tune the Volt.  One guy.  Volts were and are selling like crazy.  And when that guy was on vacation, I had to wait.

So, the inevitable happened.  Third party service shops, which were fully staffed with digitally-savvy technicians specifically trained in electric vehicle maintenance, quickly gained business.  Those shops employed mechanics, but the car diagnostics were performed by technology experts who could provide the mechanics with very specific guidance from the car’s data.  In addition, I had direct access to detail about the operation of my car from monthly reports delivered by OnStar, enabling me to make more informed driving, maintenance and purchase decisions.

Most dealerships weren’t prepared for the rapid shift from servicing mechanical systems to servicing computerized systems.  In my own experience, the independent service shop that had been servicing my other, older car very quickly transitioned to servicing all kinds of electric vehicles.  Their agility in adjusting to new market conditions brought them a whole new set of service opportunities.  The Chevy dealership, on the other hand, created a service vacuum that opened business for others.

The lesson here is to transition rapidly to new market conditions.  Oftentimes, using external resources is the fastest way to transition to a new skillset without taking your eye off operations, without making a giant investment, and while creating a path to incorporating these skills into your standard operating procedures over time. 

During transitions, and as your facility faces learning curve challenges, it makes sense to turn to resources that have the expertise and the tools at hand.  Because external expert resources work with multiple companies, they also bring the benefit of collective perspective, which can be brought to bear on many different types of situations.

In an outsourced model, and specifically in the case of data analytics services, highly experienced and focused data specialists can be responsible for collecting, reviewing and regularly reporting back to facility managers on trends, exceptions, actions to take and potentially developing issues.  These specialists augment the facility manager’s ability to steer his or her data centers through a transition to more software and data intensive systems, without the time hit or distraction of engaging a new set of skills.  Also, as familiarity with using data evolves, the third party can train data center personnel, providing operators with direct access to data and indicative metrics in the short term, while creating a foundation for the eventual onboarding of data analysis operations.  

Data analysis won’t displace existing data center personnel.  It is an additional and critical function that can be supported internally or externally.  Avoiding the use of data to improve data center operations is career-limiting.  Until data analysis skills and tools are embedded within day-to-day operations, hiring a data analysis service can provide immediate relief and help your team transition to adopt these skills over time.  

Analytics in Action for Data Center Cooling

When a data center is first designed, everything is tightly controlled. Rack densities are all the same. The layout is precisely planned and very consistent. Power and space constraints are well-understood. The cooling system is modeled – sometimes even with CFD – and all of the cooling units operate at the same level.

But the original design is often a short-lived utopia. The reality of most data centers becomes much more complex as business needs and IT requirements change and equipment moves in and out.

As soon as physical infrastructure changes, cooling capacity and redundancy are affected.  Given the complexity of design versus operational reality, many organizations have not had the tools to understand what has changed or degraded, so cannot make informed decisions about their cooling infrastructure. Traditional DCIM products often focus on space, network and power.  They don’t provide detailed, measured data on the cooling system.  So, decisions about cooling are made without visibility into actual conditions.

Analytics can help. Contrary to prevailing views, analytics don’t necessarily take a lot of know-how or data analysis skills to be extremely helpful in day-to-day operations management. Analytics can be simple and actionable. Consider the following examples of how a daily morning glance at thermal analytics helped these data center managers quickly identify and resolve some otherwise tricky thermal issues.

In our first example, the manager of a legacy, urban colo data center with DX CRAC units was asked to determine the right place for some new IT equipment. There were several areas with space and power available, but determining which of these areas had sufficient cooling was more challenging. The manager used a cooling influence map to identify racks cooled by multiple CRACs. He then referenced a cooling capacity report to confirm that more than one of these CRACs had capacity to spare. By using these visual analytics, the manager was able to place the IT equipment in an area with sufficient, and redundant, cooling.
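The manager’s two-step check can be sketched in a few lines. This is a minimal illustration, assuming the cooling influence map reduces to a mapping from floor area to the CRACs that cool it, and the capacity report to spare kW per CRAC; the data structures, names, and numbers are hypothetical and are not the format of any actual report.

```python
from typing import Dict, List


def redundant_placements(influence: Dict[str, List[str]],
                         spare_kw: Dict[str, float],
                         load_kw: float) -> List[str]:
    """Return areas where at least two influencing CRACs can each absorb load_kw."""
    candidates = []
    for area, cracs in influence.items():
        # Keep only the CRACs that cool this area and have enough spare capacity.
        able = [c for c in cracs if spare_kw.get(c, 0.0) >= load_kw]
        if len(able) >= 2:
            candidates.append(area)
    return sorted(candidates)


# Hypothetical example data.
influence = {
    "row A": ["CRAC-1", "CRAC-2"],
    "row B": ["CRAC-3"],
    "row C": ["CRAC-2", "CRAC-4"],
}
spare_kw = {"CRAC-1": 12.0, "CRAC-2": 9.0, "CRAC-3": 20.0, "CRAC-4": 2.0}

print(redundant_placements(influence, spare_kw, 8.0))  # ['row A']
```

Row B has plenty of spare capacity but only one CRAC, and row C has two CRACs but only one with room to spare, so only row A offers cooling that is both sufficient and redundant.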

In a second facility, a mobile switching center for a major telco, the manager noticed a hot spot on the thermal map and sent a technician to investigate the location. The technician saw that some of the cooling coils had low delta T even though the valves were open, which implied a problem with the hydronics. Upon physical investigation of the area, he discovered that this was caused by trapped air in the coil, so he bled it off. The delta T quickly went from 3 to 8.5 – a capacity increase of more than 65 percent – as displayed on the following graph:

 

[Graph: delta T]

These examples are deceptively simple. But without analytics, the managers would not have been able to as easily identify the exact location of the problem, the cooling units involved, and have enough information to direct trouble-shooting action within the short time needed to resolve problems in a mission critical facility.

Analytics typically use the information already available in a properly monitored data center. They complement the experienced intuition of data center personnel with at-a-glance data that helps identify potential issues more quickly and bypasses much of the tedious, blood pressure-raising and time-consuming diagnostic activities of hotspot resolution.

Analytics are not the future. Analytics have arrived. Data centers that aren’t taking advantage of them are riskier and more expensive to operate, and place themselves at a competitive disadvantage.

Data Center Capacity Planning – Why Keep Guessing?

Capacity management involves decisions about space, power, and cooling.

Space is the easiest. You can assess it by inspection.

Power is also fairly easy. The capacity of a circuit is knowable. It never changes. The load on a circuit is easy to measure.

Cooling is the hardest. The capacity of cooling equipment changes with time. Capacity depends on how the equipment is operated, and it degrades over time. Even harder is the fact that cooling is distributed. Heat and air follow the paths of least resistance and don’t always go where you would expect. For these reasons and more, mission-critical facilities are designed for and built with far more cooling capacity than they need. And yet many operators add even more cooling each time there is a move, add, or change to IT equipment, because that’s been a safer bet than guessing wrong.

Here is a situation we frequently observe:

Operations will receive frequent requests to add or change IT loads as a normal course of business.  In large or multi-site facilities, these requests may occur daily.  Let’s say that operations receives a request to add 50 kW to a particular room.  Operations will typically add 70 kW of new cooling.

This provisioning is calculated assuming a full load for each server, with the full load being determined from server nameplate data.  In reality, it’s highly unlikely that all cabinets in a room will be fully loaded, and it is equally unlikely that the server will ever require its nameplate power.  And remember, the room was originally designed with excess cooling capacity.  When you add even more cooling to these rooms, you have escalated over-provisioning.  Capital and energy are wasted.

We find that cooling utilization is typically 35 to 40%, which leaves plenty of excess capacity for IT equipment expansions.  We also find that in 5-10% of situations, equipment performance and capacity have degraded to the point where cooling redundancy is compromised.  In these cases, maintenance becomes difficult and there is a greater risk of IT failure due to a thermal event. So, it’s important to know how a room is running before adding cooling.  But it isn’t always easy to tell if cooling units are not performing as designed and specified.

How can operations managers make more cost effective – and safe – planning decisions?  Analytics.

Analytics using real-time data provides managers with the insight to determine whether or not cooling infrastructure can handle a change or expansion to IT equipment, and to manage these changes while minimizing risk.  Specifically, analytics can quantify actual cooling capacity, expose equipment degradation, and reveal where there is more or less cooling reserve in a room for optimal placement of physical and virtual IT assets.
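To make the idea concrete, here is a minimal sketch of the kind of check that measured data makes possible: can the existing units absorb a new IT load and still ride through the loss of the largest unit? The N+1 rule, the function names, and the example figures are assumptions for illustration. The per-unit capacities are meant to be measured values rather than nameplate ratings.

```python
from typing import List


def utilization(unit_capacities_kw: List[float], current_load_kw: float) -> float:
    """Fraction of total measured cooling capacity currently in use."""
    return current_load_kw / sum(unit_capacities_kw)


def can_absorb(unit_capacities_kw: List[float], current_load_kw: float,
               added_load_kw: float) -> bool:
    """Check that the load still fits with the largest unit out of service (N+1)."""
    if not unit_capacities_kw:
        return False
    available = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return current_load_kw + added_load_kw <= available


# Hypothetical room: five units delivering 100 kW each, 190 kW of load.
healthy = [100.0, 100.0, 100.0, 100.0, 100.0]
print(round(utilization(healthy, 190.0), 2))   # 0.38
print(can_absorb(healthy, 190.0, 50.0))        # True: no new cooling needed

# Same room after four units have degraded to 55 kW each.
degraded = [100.0, 55.0, 55.0, 55.0, 55.0]
print(can_absorb(degraded, 190.0, 50.0))       # False: redundancy is compromised
```

The first case is the over-provisioned room where a 50 kW request needs no new cooling; the second is the degraded room where redundancy has quietly eroded.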

Consider the following analytics-driven capacity report.  Continually updated by a sensor network, the report clearly displays exactly where capacity is available and where it is not.  With this data alone, you can determine where capacity exists and where you can safely and immediately add capacity with no CapEx investment.  And, in those situations where you do need to add additional cooling, it will predict with high confidence what you need.

[Image: cooling capacity report]

Yet you can go deeper still.  By pairing the capacity report with a cooling reserve map (below), you can determine where you can safely place additional load in the desired room.  You can also see where you should locate your most critical assets and, when you need that new air conditioner, where you should place it.

[Image: cooling reserve map]

Using these reports, operations can:

  • avoid the CapEx cost of more cooling every time IT equipment is added;
  • avoid the risk of cooling construction in production data rooms when it is often not needed;
  • avoid the delayed time to revenue from adding cooling to a facility that doesn’t need it.

In addition, analytics used in this way avoids unnecessary energy and maintenance OpEx costs.

Stop guessing and start practicing the art of avoidance with analytics.

 

 

Maintenance is Risky

No real surprise here. Mission critical facilities that pride themselves on and/or are contractually obligated to provide the “five 9’s” of reliability know that sooner or later they must turn critical cooling equipment off to perform maintenance. And they know that they face risk each time they do so.

This is true even for the newest facilities. The minute a facility is turned up, or IT load is added, things start to change. The minute a brand new cooling unit is deployed, it starts to degrade – however incrementally. And that degree of degradation is different from unit to unit, even when those units are nominally identical.

In a risk and financial performance panel presentation at a recent data center event sponsored by Digital Realty, eBay’s Vice President of Global Foundation Services, Dean Nelson, stated that “touching equipment for maintenance increases Probability of Failure (PoF).” Nelson actively tracks and works to reduce eBay’s PoF metric across the facilities he manages.

Performing maintenance puts most facility managers between the proverbial rock and a hard place. If equipment isn’t maintained, by definition you have a “run to failure” maintenance policy. If you do maintain equipment, you incur risk each time you turn something off. The telecom industry calls this “hands in the network,” which it manages as a significant risk factor.

What if maintenance risks could be mitigated? What if you could predict what would happen to the thermal conditions of a room and, even more specifically, what racks or servers could be affected if you took a particular HVAC unit offline?

This ability is available today. It doesn’t require computational fluid dynamics (CFD) or other complicated tools that rely on physical models. It can be accomplished through data and analytics. That is, analytics continually updated by real-time data from sensors instrumented throughout a data center floor. Gartner Research says that hindsight based on historical data, followed by insight based on current trends, drives foresight.

Using predictive analytics, facility managers can also determine exactly which units to maintain and when – in addition to understanding the potential thermal effect that each maintenance action will have on every location on the data center floor.

If this knowledge were easily available, what facility manager wouldn’t choose to take advantage of it before taking a maintenance action? My next blog post will provide a visual example of the analysis facility managers can perform to determine when and where to perform maintenance while simultaneously reducing risk to more critical assets and the floor as a whole.