With Data Centers, What Can Happen Will Happen (Eventually).

Because data centers and telecom switching centers are designed to withstand failures without interrupting business operations, a 3 a.m. emergency due to a malfunctioning air conditioner should never occur – in theory. But Murphy’s Law says that if a single failure can create an emergency, it will. So, to date, operators have had to react to single-component failures as if they are business-critical. Because they might be.

In my previous blog, I pointed out the two components of risk: the probability of and the consequence of failure. While both of these components are important in failure analysis, it is the consequence of failure that’s most effective at helping decision-makers manage the cost of failure.

If you know there is a high probability of impending failure, but you don’t know the potential consequence, you have to act as though every threat has the potential for an expensive business interruption. Taking such actions is typically expensive. But if you know the consequence, even without knowing the probability of failure, you can react to inconsequential failures at your leisure and plan so that consequential failures are less likely.

In the past, the consequences of a failure weren’t knowable or predictable. The combination of Internet of Things (IoT) data and machine learning has changed all that. It’s now possible to predict the consequence of failure by analyzing large quantities of historical sensor data. These predictions can be performed on demand and without the need for geometrical data hall descriptions.

The advantage of machine learning-based systems is that predictive models are continually tuned to actual operating conditions. Even as things change and scale over time, the model remains accurate without manual intervention. The consequences of actions, in addition to equipment failures, become knowable and predictable.

This type of consequence analysis is particularly important for organizations that have a run-to-failure policy for mechanical equipment. Run-to-failure is common in organizations with severe capital constraints, but it only works, and avoids business interruptions, if the consequence of the next failure is predictable.

Predicting the consequence of failure allows an operations team to avoid over-reacting to failures that do not affect business continuity. Rather than dispatching a technician in the middle of the night, an operations team can address a predicted failure with minimal or no consequence during its next scheduled maintenance. If consequence analysis indicates that a cooling unit failure may put more significant assets at risk, the ability to predict how much time is available before a critical temperature is reached provides time for graceful shutdown – and mitigation.
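To make the "how much time is available" idea concrete, here is a minimal sketch. It assumes the rack inlet temperature rises at a constant rate after a cooling unit fails, which is a deliberate simplification. The function names, the temperatures, and the 60-minute dispatch lead time are all hypothetical illustration values, not figures from any real system.

```python
import math


def minutes_until_critical(current_c, critical_c, rise_c_per_min):
    """Estimate minutes before a rack inlet reaches its critical temperature.

    Assumes a constant rate of rise, which is a simplification.
    """
    if current_c >= critical_c:
        return 0.0
    if rise_c_per_min <= 0:
        return math.inf  # temperature is not rising, so there is no deadline
    return (critical_c - current_c) / rise_c_per_min


def response(minutes_available, dispatch_lead_min=60.0):
    """Pick a response; the 60-minute dispatch lead time is an assumed value."""
    if math.isinf(minutes_available):
        return "defer to scheduled maintenance"
    if minutes_available > dispatch_lead_min:
        return "dispatch technician"
    return "begin graceful shutdown"


# Example: inlet at 24 C, critical at 32 C, predicted rise of 0.5 C per minute.
available = minutes_until_critical(24.0, 32.0, 0.5)
action = response(available)
```

With these example numbers there is less time than a technician would need to arrive, so the sketch chooses a graceful shutdown. A failure that produces no temperature rise is simply deferred.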

Preventative maintenance carries risk, but equipment still needs to be shut off at times for maintenance. Will it cause a problem? Predictive consequence analysis can provide the answer. If there’s an issue with shutting off a particular unit, you can know in advance and provide spot cooling to mitigate the risks.

The ability to predict the consequences of failure, or intentional action such as preventative maintenance, gives facility managers greater control over the reliability of their facilities, and the peace of mind that their operations are as safe as possible.

Consequence Planning Avoids Getting Trapped Between a Rack and a Hot Place

A decade of deploying machine learning in data centers and telecom switching centers throughout the world has taught us a thing or two about risk and reliability management.

In the context of reliability engineering, risk is often defined as the probability of failure times the consequence of the failure. The failure itself, therefore, is only half of the risk consideration. The resulting consequences are equally, and sometimes more, relevant. Data centers typically manage risk with redundancy to reduce the chances of failures that may cause a business interruption. This method reduces the consequence of single component failure. If failure occurs, a redundant component ensures continuity.
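As a small illustration of that definition, the sketch below ranks two failure modes by risk, computed as probability times consequence. The component names, probabilities, and dollar figures are made up for the example and are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str                  # hypothetical component label
    annual_probability: float  # assumed chance of failure in a year (0..1)
    consequence_cost: float    # assumed cost in dollars if the failure occurs

    @property
    def risk(self) -> float:
        # Risk = probability of failure x consequence of failure
        return self.annual_probability * self.consequence_cost


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


modes = [
    FailureMode("CRAC-1 fan", 0.20, 1_000.0),           # likely, but cheap
    FailureMode("CRAC-7 compressor", 0.02, 500_000.0),  # unlikely, but costly
]
ranked = rank_by_risk(modes)
```

Here the unlikely compressor failure outranks the likely fan failure, because its consequence dominates the product.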

When people talk about the role of machine learning in risk and reliability management, most view machine learning from a similar perspective – as a tool for predicting the failure of single components.

But this focus falls short of the true capabilities of machine learning. Don’t get me wrong, predicting the probability of failure is useful, and difficult to do. But it only has value when the consequence of the predicted failure is significant.

When data centers and telecom switching centers perform and operate as designed, the consequences of most failures are typically small. But most data centers don’t operate as designed, especially the longer they run.

Vigilent uses machine learning to predict the consequences of control actions. We use machine learning to train our Influence Map™ to make accurate predictions of cooling control actions, including what will happen when a cooling unit is turned on or off. If the Influence Map predicts that turning a particular unit off would cause a rack to become too hot, the system won’t turn that cooling unit off.

The same process can be used to predict the consequence of a cooling unit failure. In other words, the Influence Map can predict the potential business impact of a particular cooling unit failure, such as whether a rack will get hot enough to impact business continuity. This kind of failure analysis simultaneously estimates the redundancy of the cooling system.

This redundancy calculation doesn’t merely compare the total cooling capacity with the total heat load of the equipment. Fully understanding the consequence of a failure requires both predictive modeling and machine learning. Together, these technologies accurately model actual, real time system behavior in order to predict and manage the cost of that failure.
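To show why a per-rack prediction says more than a capacity total, here is a toy sketch. It is not the Influence Map. It is a simple linear stand-in in which each entry of a hypothetical `influence` table is the assumed temperature rise at one rack if one cooling unit stops. All temperatures, the limit, and the table values are illustrative assumptions.

```python
# Hypothetical linear stand-in: influence[r][u] is the assumed temperature
# rise (deg C) at rack r if cooling unit u stops. Values are illustrative.
influence = [
    [0.5, 4.0, 0.2],   # rack 0
    [3.5, 0.3, 0.4],   # rack 1
    [0.1, 0.2, 0.6],   # rack 2
]
rack_temps_c = [24.0, 27.0, 23.0]   # assumed current rack inlet temperatures
limit_c = 30.0                      # assumed allowable inlet temperature


def racks_at_risk(unit, temps, influence, limit):
    """Return indices of racks predicted to exceed the limit if `unit` fails."""
    return [r for r, t in enumerate(temps) if t + influence[r][unit] > limit]


def single_failure_report(temps, influence, limit):
    """Map each unit to the racks its failure would push past the limit."""
    n_units = len(influence[0])
    return {u: racks_at_risk(u, temps, influence, limit) for u in range(n_units)}


report = single_failure_report(rack_temps_c, influence, limit_c)
# Cooling is redundant to any single failure only if no unit endangers a rack.
tolerates_any_single_failure = all(not racks for racks in report.values())
```

In this example, losing unit 0 pushes rack 1 past the limit while losing either of the other units is harmless. Total capacity could look adequate and the room would still not be redundant for that one rack.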

This is why the distinction between failures and consequences matters. Knowing the consequences of failure enables you to predict the cost of failure.

Some predicted failures might not require a 3 a.m. dispatch. In my next blog, I’ll outline the material advantages of understanding consequences and the resulting effect on redundancy planning and maintenance operations.

The Real Cost of Cooling Configuration Errors

Hands in the network cause problems. A setting adjusted once, based on someone’s instinct of what needed to be changed at one moment in time, is often unmodified years later.

This is configuration rot. If your data center has been running for a while, the chances are pretty high that your cooling configurations, to name one example, are wildly out of sync. It’s even more likely you don’t know about it.

Every air conditioner is controlled by an embedded computer. Each computer supports multiple configuration parameters. Each of these different configurations can be perfectly acceptable. But a roomful of air conditioners with individually sensible configurations can produce bad outcomes when their collective impact is considered.

I recently toured a new data center in which each air conditioner supported 17 configuration parameters affecting temperature and humidity. There was a lot of unexplained variation in the configurations. Six of the 17 configuration settings varied by more than 30%, unit to unit. Only five settings were the same. Initial configuration variation, compounded by entropy over time, wastes energy and prevents the overall air conditioning system from producing an acceptable temperature and humidity distribution.
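A quick way to spot this kind of unit-to-unit variation is to measure the spread of each parameter across all units. The sketch below assumes one possible definition of "varied by more than 30%", namely (max - min) divided by the mean. The parameter names and values are hypothetical.

```python
from statistics import mean


def relative_spread(values):
    """Return (max - min) / mean for one parameter across all units."""
    avg = mean(values)
    if avg == 0:
        return 0.0 if max(values) == min(values) else float("inf")
    return (max(values) - min(values)) / abs(avg)


def parameters_over(settings_by_param, threshold=0.30):
    """List parameters whose unit-to-unit spread exceeds the threshold."""
    return sorted(
        p for p, v in settings_by_param.items() if relative_spread(v) > threshold
    )


# Illustrative values for three hypothetical parameters across four units.
observed = {
    "return_setpoint_c": [22.0, 22.0, 22.0, 22.0],      # identical everywhere
    "humidity_setpoint_pct": [40.0, 45.0, 50.0, 45.0],  # spread of about 22%
    "fan_min_pct": [30.0, 60.0, 45.0, 45.0],            # spread of about 67%
}
flagged = parameters_over(observed)
```

Only the fan minimum is flagged in this example. A different definition of spread would flag a different set, so the threshold and the formula are both choices to agree on up front.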

Configuration errors contribute to accidental de-rating and loss of capacity. This wastes energy, and it’s costly from a capex perspective. Perhaps you don’t need a new air conditioner. Instead, perhaps you can optimize or synchronize the configurations for the air conditioners you already have and unlock the capacity you need. Another common misconfiguration is incompatible set points. If one air conditioner is trying to make a room cold and another is trying to make it warmer, the units will fight.

Configuration errors also contribute to poor free cooling performance. Misconfiguration can lock out free cooling in many ways.

The problem is significant. Large organizations use thousands of air conditioners. Manual management of individual configurations is impossible. Do the math. If you have 2,000 air conditioners, each of which has up to 17 configuration parameters, you have 34,000 individual settings to track, not to mention the additional external variables. How can you manage, much less optimize, configurations over time?

Ideally, you need intelligent software that manages these configurations automatically. You need templates that prescribe optimized configuration. You need visibility to determine, on a regular basis, which configurations are necessary as conditions change. You need exception handling, so you can temporarily change configurations when you perform tasks such as maintenance, equipment swaps, and new customer additions, and then make sure the configurations return to their optimized state afterward. And, you need a system that will alert you when someone tries to change a configuration, and/or enforce optimized configurations automatically.
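The template and exception-handling ideas above can be sketched in a few lines. This is a minimal illustration, not a description of any particular product. The template, the parameter names, the unit names, and the 5% tolerance are all assumptions made for the example.

```python
# Hypothetical template; parameter names and values are illustrative.
TEMPLATE = {
    "supply_setpoint_c": 20.0,
    "humidity_setpoint_pct": 45.0,
    "fan_min_pct": 40.0,
}


def find_drift(units, template, tolerance=0.05, exempt=()):
    """Return {unit: {param: (actual, expected)}} for out-of-template settings.

    A setting drifts when it is missing or differs from the template by more
    than `tolerance` as a fraction of the template value. Units listed in
    `exempt` (for example, units under maintenance) are skipped.
    """
    drift = {}
    for unit, settings in units.items():
        if unit in exempt:
            continue
        bad = {}
        for param, expected in template.items():
            actual = settings.get(param)
            if actual is None or abs(actual - expected) > tolerance * abs(expected):
                bad[param] = (actual, expected)
        if bad:
            drift[unit] = bad
    return drift


units = {
    "AC-01": {"supply_setpoint_c": 20.0, "humidity_setpoint_pct": 45.0, "fan_min_pct": 40.0},
    "AC-02": {"supply_setpoint_c": 26.0, "humidity_setpoint_pct": 45.0, "fan_min_pct": 40.0},
    "AC-03": {"supply_setpoint_c": 18.0, "humidity_setpoint_pct": 60.0, "fan_min_pct": 40.0},
}
# AC-03 is temporarily exempt, as it would be during maintenance.
drift = find_drift(units, TEMPLATE, exempt={"AC-03"})
```

Run on a schedule, a check like this provides the visibility and alerting described above. Removing a unit from the exempt set after maintenance brings it back under the template.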

This concept isn’t new. It’s just rarely done. But if you aren’t aggressively managing configurations, you are losing money.

The Fastest Route to Using Data Analysis in Data Center Operations

The transition to data-driven operations within data centers is inevitable.  In fact, it has already begun.

With this in mind, my last blog questioned why data centers still resist data use, surmising that because data use doesn’t fall within traditional roles and training, third parties – and new tools – will be needed to help with the transition. “Retrofitting” existing personnel, at least in the short term, is unrealistic.  And time matters.

Consider the example of my Chevy Volt.  The Volt illustrates just how quickly a traditional industry can be caught flat-footed in a time of transition, opening opportunities for others to seize market share. The Volt is as much a rolling mass of interconnected computers as it is a car. It has 10 million lines of code. 10 million!  That’s more than an F-22 Raptor, the most advanced fighter plane on earth.

The Volt, of course, needs regular service just like any car.  While car manufacturers were clearly pivoting toward complex software-driven engines, car dealerships were still staffed with engine mechanics, albeit highly skilled mechanics.  During my service experience, the dealership had one guy trained and equipped to diagnose and tune the Volt.  One guy.  Volts were and are selling like crazy.  And when that guy was on vacation, I had to wait.

So, the inevitable happened.  Third party service shops, which were fully staffed with digitally-savvy technicians specifically trained in electric vehicle maintenance, quickly gained business.  Those shops employed mechanics, but the car diagnostics were performed by technology experts who could provide the mechanics with very specific guidance from the car’s data.  In addition, I had direct access to detail about the operation of my car from monthly reports delivered by OnStar, enabling me to make more informed driving, maintenance and purchase decisions.

Most dealerships weren’t prepared for the rapid shift from servicing mechanical systems to servicing computerized systems.  In my own experience, the independent service shop that had been servicing my other, older car very quickly transitioned to servicing all kinds of electric vehicles.  Their agility in adjusting to new market conditions brought them a whole new set of service opportunities.  The Chevy dealership, on the other hand, created a service vacuum that opened business for others.

The lesson here is to transition rapidly to new market conditions.  Oftentimes, using external resources is the fastest way to transition to a new skillset without taking your eye off operations, without making a giant investment, and while creating a path to incorporating these skills into your standard operating procedures over time. 

During transitions, and as your facility faces learning curve challenges, it makes sense to turn to resources that have the expertise and the tools at hand.  Because external expert resources work with multiple companies, they also bring the benefit of collective perspective, which can be brought to bear on many different types of situations.

In an outsourced model, and specifically in the case of data analytics services, highly experienced and focused data specialists can be responsible for collecting, reviewing and regularly reporting back to facility managers on trends, exceptions, actions to take and potentially developing issues.  These specialists augment the facility manager’s ability to steer his or her data centers through a transition to more software and data intensive systems, without the time hit or distraction of engaging a new set of skills.  Also, as familiarity with using data evolves, the third party can train data center personnel, providing operators with direct access to data and indicative metrics in the short term, while creating a foundation for the eventual onboarding of data analysis operations.  

Data analysis won’t displace existing data center personnel.  It is an additional and critical function that can be supported internally or externally.  Avoiding the use of data to improve data center operations is career-limiting.  Until data analysis skills and tools are embedded within day-to-day operations, hiring a data analysis service can provide immediate relief and help your team transition to adopt these skills over time.  

Does Efficiency Matter?

Currently, it seems that lots of things matter more than energy efficiency. Investments in reliability, capacity expansion and revenue protection all receive higher priority in data centers than any investment focusing on cutting operating expenses through greater efficiency.

So does this mean that efficiency really doesn’t matter? Of course efficiency matters. Lawrence Berkeley National Laboratory just issued a data center energy report showing just how much efficiency improvements have slowed the data center industry’s energy consumption, saving a projected 620 billion kWh between 2010 and 2020.

The investment priority disconnect occurs when people view efficiency from the too narrow perspective of cutting back.

Efficiency, in fact, has transformational power – when viewed through a different lens.

Productivity is an area ripe for improvements specifically enabled by IoT and automation. Automation’s impact on productivity often gets downplayed by employees who believe automation is the first step toward job reductions. And sure, this happens. Automation will replace some jobs. But if you have experienced and talented people working on tasks that could be automated, your operational productivity is suffering. Those employees can and should be repurposed for work that’s more valuable. And, as most data centers run with very lean staffing, your employees are already working under enormous pressure to keep operations working perfectly and without downtime. Productivity matters here as well. Making sure your employees are working on the right, highest-impact activities generates direct returns in cost, facility reliability and job satisfaction.

Outsourcing is another target. Outsourcing maintenance operations has become common practice. Yet how often are third party services monitored for efficiency? Viewing the before and after performance of a room or a piece of equipment following maintenance is telling. These details, in context with operational data, can identify where you are over-spending on maintenance contracts or where dollars can be allocated elsewhere for higher benefit.

And then there is time. Bain & Company, in a 2014 Harvard Business Review article, called time “your scarcest resource,” and as such it is a logical target for efficiency improvement.  Here’s an example. Quite often data center staff will automatically add cooling equipment to facilities to support new or additional IT load. A quick but deeper look into the right data often reveals that the facilities can handle the additional load immediately and without new equipment. A quick data dive can save months of procurement and deployment time, while simultaneously accelerating your time to the revenue generated by the additional IT load.

Every time employees can stop or reduce time spent on a low-value activity, they can achieve results in a different area, faster. Likewise, every time you free up employee time for more creative or innovative endeavors, you have an opportunity to capture competitive advantage. According to a report by KPMG as cited by the Silicon Valley Beat, the tech sector is already focused on this concept, leveraging automation and machine learning for new revenue advantages as well as efficiency improvements.

“Tech CEOs see the benefits of digital labor augmenting workforce capabilities,” said Gary Matuszak, global and U.S. chair of KPMG’s Technology, Media and Telecommunications practice.

“The increased automation and machine learning could enable new ways for tech companies to conduct business so they can add customer value, become more efficient and slash costs.”

Investments in efficiency when viewed through the lens of “cutting back” will continue to receive low priority. However, efficiency projects focusing on productivity or time to revenue will pay off with immediate top line effect. They will uncover ways to simultaneously increase return on capital, improve workforce productivity, and accelerate new sources of revenue. And that’s where you need to put your money.

IoT: A Unifying Force for the Data Center

A recent McKinsey Global Institute report states that factories, including industrial facilities and data centers, will receive the lion’s share of value enabled by IoT.  That’s up to $3.7 trillion of incremental value over the next ten years.   Within that focus, McKinsey states that the areas of greatest potential are optimization and predictive maintenance – things that every data center facility manager addresses on a daily basis. The report also states that Industrial IoT (combining the strength of both industry and the Internet) will accelerate global GDP per capita to a pace never seen before during the industrial and Internet revolutions.

The McKinsey study described the key enablers required for the success of Industrial IoT as: software and hardware technology, interoperability, security and privacy, business organization and cultural support.  Translated into the requirements for a data center, these are: low-power, inexpensive sensors, mesh connectivity, smart software to analyze and act on the data (analytics), standardization and APIs across technology stacks, interoperability across vendors, and ways to share data that retain security and privacy.

Many of these enabling factors are readily available today.  Data centers must have telemetry and communications.  If you don’t have it, you can add it in the form of mesh network sensors.  Newer data centers and equipment will have this telemetry embedded.  The data center industry already has standards that can be used to share data.  Smart software capable of aggregating, analyzing and acting on this data is also available. Security isn’t as well evolved, or understood.  As more data becomes available through the Internet of Things, the network must be secure, private and locked down.

Transitions always involve change, and sometimes challenge the tried and true ways of doing things.  In the case of industrial IoT, I really think that change is good.  Telemetry and analytics reveal previously hidden information and patterns that will help facility professionals develop even more efficient processes.  Alternatively, they may help these same professionals prove to their executive management that existing processes are working very well.  The point is that to date, no one has known for sure, because the data just hasn’t been available.

The emergence of IoT in the data center is inevitable, and facility managers who embrace this change and use it to their operational advantage can turn their attention to more strategic projects.

My next blog will address how telemetry and IoT can break down the traditional conflicts between facilities, IT and sustainability managers.

Stay tuned.