Why Don’t Data Centers Use Data?

Data analysis doesn’t readily fall into the typical data center operator’s job description.   That fact, and the traditional hands-on focus of those operators, isn’t likely to change soon.

But turning a blind eye to the flood of data now available to data centers through IoT technology, sensors and cloud-based analytics is no longer tenable. While the full impact of IoT-generated data has yet to be realized, most data centers have already become too complex to be managed manually.

What’s needed is an entirely new role, one with dotted-line, cross-functional responsibility to the operations, energy, sustainability and planning teams.

Consider this.  The aircraft industry has historically been driven by design, mechanical and engineering teams.  Yet General Electric aircraft engines, as an example, throw off terabytes of data on every single flight.  This massive quantity of data isn’t managed by these traditional teams.  It’s managed by data analysts who continually monitor this information to assess safety and performance, and update the traditional teams who can take any necessary actions.

Like aircraft, data centers are complex systems.  Why aren’t they operated in the same data-driven way given that the data is available today?

Data center operators aren’t trained in data analysis, nor can they be expected to take it on. The new data analyst role requires mastery of an entirely different set of tools. It also requires domain-specific knowledge, so that incoming information can be intelligently monitored and triaged to determine what constitutes a red-flag event versus something that can be addressed during normal work hours to improve reliability or reduce energy costs.
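As a rough illustration of that triage step, here is a minimal Python sketch. The sensor fields, thresholds and severity labels are hypothetical, chosen only to show the shape of the logic; real limits would come from each site’s own operating envelope.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    rack: str
    inlet_temp_c: float         # measured rack inlet temperature
    delta_from_baseline: float  # deviation from this rack's normal band

def triage(reading: Reading) -> str:
    """Classify a reading: act now, follow up in normal hours, or ignore.

    Thresholds are illustrative only; they are not a recommendation.
    """
    if reading.inlet_temp_c >= 32 or reading.delta_from_baseline >= 8:
        return "red flag: dispatch now"
    if reading.inlet_temp_c >= 27 or reading.delta_from_baseline >= 4:
        return "investigate during normal work hours"
    return "within normal operating band"

if __name__ == "__main__":
    for r in (Reading("A-12", 24.5, 1.0),
              Reading("B-03", 29.0, 5.5),
              Reading("C-07", 33.2, 9.0)):
        print(r.rack, "->", triage(r))
```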

It’s increasingly clear that managing solely through experience and physical oversight is no longer best practice, and cannot keep pace with the increasing complexity of modern data centers. Planning or modeling based only on current conditions – a single moment in time – is also insufficient. The rate of change, both planned and unplanned, is too great. Data, like data centers, is fluid and multidimensional.

Beyond the undeniable necessity of incorporating data into day-to-day operations to manage operational complexity, data analysis provides significant added value by revealing cost-saving and revenue-generating opportunities in energy use, capacity and risk avoidance. It’s time to build this competency into data center operations.

Breaking Down Communication Barriers with IoT

The Internet of Things holds an unprecedented opportunity to ease the long-standing conflict between facilities, IT and sustainability managers. Traditionally, these three groups operate in silos, with orthogonal priorities.

Data generated from more granular sensing in data centers reveals information that has traditionally been difficult to access and not easily shared between groups. This data can provide both an incentive and a means to work together by establishing a common source for business discussions. The concept is becoming increasingly important. As Bill Kleyman said in a Data Center Knowledge article projecting Data Center and Cloud Considerations for 2016: “The days of resources locked in silos are quickly coming to an end.” We agree. While Kleyman was referring to architecture convergence, we believe his forecast applies just as forcefully to data. Multi-group access to more comprehensive data has collaborative power. IoT contributes both to the generation of such data and to the ability to act on it, instantaneously.

Consider the following examples of how IoT operations can accelerate decision-making and collaboration between IT and Facilities.

IT Expansion Deployments

As services shift to the network edge, or higher traffic is needed in a particular geographic region, IT is usually tasked with identifying the most desirable sites for these expansions. In larger companies, the candidate sites can number 50 or more. IT and Facilities need to quickly narrow them to a short list.

A highly granular view of the actual (versus designed) operating cooling capacity available at each candidate site would greatly speed and simplify this selection. With operating cooling capacity information in hand, Facilities can quickly make the case for the most attractive sites from a cost and time perspective, or build a business case for the upgrades necessary to support IT’s expansion deployments.
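As a simple sketch of what that shortlisting could look like with measured data, consider the following Python fragment. The site names, fields and numbers are hypothetical; real inputs would come from each site’s telemetry.

```python
# Shortlist candidate sites by measured (not nameplate) cooling headroom.
candidate_sites = [
    {"site": "DAL-2", "operating_cooling_kw": 900,  "measured_heat_load_kw": 610},
    {"site": "CHI-1", "operating_cooling_kw": 750,  "measured_heat_load_kw": 700},
    {"site": "NYC-3", "operating_cooling_kw": 1200, "measured_heat_load_kw": 820},
]

requested_it_load_kw = 150  # the expansion IT wants to place

def headroom_kw(site: dict) -> float:
    """Spare cooling actually available right now, according to telemetry."""
    return site["operating_cooling_kw"] - site["measured_heat_load_kw"]

shortlist = sorted(
    (s for s in candidate_sites if headroom_kw(s) >= requested_it_load_kw),
    key=headroom_kw,
    reverse=True,
)

for s in shortlist:
    print(f'{s["site"]}: {headroom_kw(s):.0f} kW of measured cooling headroom')
```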

Data can expose previously hidden or unknowable information.  Capacity planners are provided with the right information for asset deployment in the right places, faster and with less expense.  Everyone gets what they want.

Repurposing Capital Assets

After airflow is balanced, and redundant or unnecessary cooling is put into standby through automated control, IT and facilities can view the real-time amount of cooling actually available in a particular area.  It becomes easy to identify rooms that have way more cooling than needed.  The surplus cooling units can be moved to a different part of the facility, or to a different site as needed.

IoT powered by smart software can thus expose inefficient capital asset allocation. Rather than spending money on new capital assets, existing capital can be moved from one place to another. This has huge and nearly instant financial benefits. It also establishes a basis for cooperation between the facilities team that maintains the cooling system and the IT team that needs to deploy additional assets and is tasked with paying for additional cooling.
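A minimal sketch of that surplus check, with made-up unit sizes and loads, might look like the following. The N+1 margin and the numbers are assumptions for illustration only.

```python
# Flag rooms whose running cooling exceeds the measured heat load plus an
# assumed N+1 redundancy margin, so surplus units can be redeployed.
rooms = {
    "Room 101": {"unit_capacity_kw": 70, "units_running": 6, "measured_load_kw": 210},
    "Room 102": {"unit_capacity_kw": 70, "units_running": 4, "measured_load_kw": 250},
}

for name, r in rooms.items():
    # Units needed to carry the measured load, rounded up, plus one spare (N+1).
    needed_units = -(-r["measured_load_kw"] // r["unit_capacity_kw"]) + 1
    surplus = r["units_running"] - needed_units
    if surplus > 0:
        print(f"{name}: {surplus} unit(s) could go to standby or move elsewhere")
    else:
        print(f"{name}: no surplus; running units match load plus redundancy")
```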

In both situations, data produced by IoT becomes the arbiter and the language on which the business cases can be focused.

Data essentially becomes the “neutral party.”

All stakeholders can benefit from IoT-produced data to make rational and mutually understood decisions.  As more IoT-based data becomes available, stakeholders who use it to augment their intuition will find that data’s collaborative power is profitable as well as insightful.

IoT: A Unifying Force for the Data Center

A recent McKinsey Global Institute report states that factories, including industrial facilities and data centers, will receive the lion’s share of the value enabled by IoT – up to $3.7 trillion of incremental value over the next ten years. Within that focus, McKinsey identifies optimization and predictive maintenance as the areas of greatest potential – things that every data center facility manager addresses on a daily basis. The report also states that Industrial IoT (combining the strengths of industry and the Internet) will accelerate global GDP per capita growth at a pace beyond even that of the industrial and Internet revolutions.

The McKinsey study described the key enablers required for the success of Industrial IoT as software and hardware technology, interoperability, security and privacy, and business organization and cultural support. Translated into requirements for a data center, these are: low-power, inexpensive sensors; mesh connectivity; smart software to analyze and act on the data (analytics); standardization and APIs across technology stacks; interoperability across vendors; and ways to share data that preserve security and privacy.

Many of these enabling factors are readily available today. Data centers must have telemetry and communications; if a facility doesn’t have them, they can be added in the form of mesh network sensors. Newer data centers and equipment have this telemetry embedded. The data center industry already has standards that can be used to share data. Smart software capable of aggregating, analyzing and acting on this data is also available. Security is less evolved, and less well understood. As more data becomes available through the Internet of Things, the network must be secure, private and locked down.

Transitions always involve change, and sometimes challenge the tried and true ways of doing things. In the case of industrial IoT, I really think that change is good. Telemetry and analytics reveal previously hidden information and patterns that will help facility professionals develop even more efficient processes. Alternatively, they may help these same professionals prove to their executive management that existing processes are working very well. The point is that, to date, no one has known for sure, because the data just hasn’t been available.

The emergence of IoT in the data center is inevitable, and facility managers who embrace this change and use it to their operational advantage can turn their attention to more strategic projects.

My next blog will address how telemetry and IoT can break down the traditional conflicts between facilities, IT and sustainability managers.

Stay tuned.

Analytics in Action for Data Center Cooling

When a data center is first designed, everything is tightly controlled. Rack densities are all the same. The layout is precisely planned and very consistent. Power and space constraints are well-understood. The cooling system is modeled – sometimes even with CFD – and all of the cooling units operate at the same level.

But the original design is often a short-lived utopia. The reality of most data centers becomes much more complex as business needs and IT requirements change and equipment moves in and out.

As soon as physical infrastructure changes, cooling capacity and redundancy are affected. Given the gap between design and operational reality, many organizations have not had the tools to understand what has changed or degraded, and so cannot make informed decisions about their cooling infrastructure. Traditional DCIM products often focus on space, network and power. They don’t provide detailed, measured data on the cooling system, so decisions about cooling are made without visibility into actual conditions.

Analytics can help. Contrary to prevailing views, analytics don’t necessarily take a lot of know-how or data analysis skills to be extremely helpful in day-to-day operations management. Analytics can be simple and actionable. Consider the following examples of how a daily morning glance at thermal analytics helped these data center managers quickly identify and resolve some otherwise tricky thermal issues.

In our first example, the manager of a legacy, urban colo data center with DX CRAC units was asked to determine the right place for some new IT equipment. There were several areas with space and power available, but determining which of these areas had sufficient cooling was more challenging. The manager used a cooling influence map to identify racks cooled by multiple CRACs. He then referenced a cooling capacity report to confirm that more than one of these CRACs had capacity to spare. By using these visual analytics, the manager was able to place the IT equipment in an area with sufficient, and redundant, cooling.
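A toy version of that placement check might look like the following Python sketch. The influence map, spare-capacity figures and rack names are invented for illustration; in practice they would come from the kind of sensor-driven analytics described above.

```python
# Find racks cooled by at least two CRACs that each have capacity to spare,
# i.e. locations where a new load would still have redundant cooling.
influence = {            # which CRACs measurably cool each rack (hypothetical)
    "Rack A1": ["CRAC-1", "CRAC-2"],
    "Rack B4": ["CRAC-2"],
    "Rack C7": ["CRAC-3", "CRAC-4"],
}
spare_kw = {"CRAC-1": 18, "CRAC-2": 25, "CRAC-3": 4, "CRAC-4": 15}

new_load_kw = 10

candidates = [
    rack for rack, cracs in influence.items()
    if sum(1 for c in cracs if spare_kw[c] >= new_load_kw) >= 2
]
print("Racks with redundant spare cooling for a 10 kW addition:", candidates)
```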

In a second facility, a mobile switching center for a major telco, the manager noticed a hot spot on the thermal map and sent a technician to investigate the location. The technician saw that some of the cooling coils had low delta T even though the valves were open, which implied a problem with the hydronics. Upon physical investigation of the area, he discovered that the cause was trapped air in the coil, so he bled it off. The delta T quickly went from 3 to 8.5 degrees – a substantial increase in the coil’s effective capacity – as displayed on the following graph:

 

[Graph: Delta T]
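As a rough sanity check on that improvement: at constant water flow, a coil’s heat removal scales linearly with its delta T, so the measured change implies roughly the gain below. Constant flow is an assumption here; bleeding the air may also have changed the flow.

```python
delta_t_before = 3.0   # degrees, from the graph
delta_t_after = 8.5

# At constant flow, heat removal is proportional to delta T.
relative_gain = (delta_t_after - delta_t_before) / delta_t_before
print(f"Effective capacity gain at constant flow: ~{relative_gain:.0%}")  # ~183%
```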

These examples are deceptively simple. But without analytics, the managers could not have identified the exact location of the problem and the cooling units involved, or gathered enough information to direct troubleshooting, within the short time allowed for resolving problems in a mission-critical facility.

Analytics typically use the information already available in a properly monitored data center. They complement the experienced intuition of data center personnel with at-a-glance data that helps identify potential issues more quickly and bypasses much of the tedious, blood pressure-raising and time-consuming diagnostic activities of hotspot resolution.

Analytics are not the future. Analytics have arrived. Data centers that aren’t taking advantage of them are riskier and more expensive to operate, and place themselves at a competitive disadvantage.

Data Center Capacity Planning – Why Keep Guessing?

Capacity management involves decisions about space, power, and cooling.

Space is the easiest. You can assess it by inspection.

Power is also fairly easy. The capacity of a circuit is knowable. It never changes. The load on a circuit is easy to measure.

Cooling is the hardest. The capacity of cooling equipment changes with time. Capacity depends on how the equipment is operated, and it degrades over time. Even harder is the fact that cooling is distributed. Heat and air follow the paths of least resistance and don’t always go where you would expect. For these reasons and more, mission-critical facilities are designed for and built with far more cooling capacity than they need. And yet many operators add even more cooling each time there is a move, add, or change to IT equipment, because that’s been a safer bet than guessing wrong.

Here is a situation we frequently observe:

Operations teams receive frequent requests to add or change IT loads in the normal course of business. In large or multi-site facilities, these requests may occur daily. Let’s say that operations receives a request to add 50 kW to a particular room. Operations will typically add 70 kW of new cooling.

This provisioning is calculated assuming a full load for each server, with the full load determined from server nameplate data. In reality, it’s highly unlikely that all cabinets in a room will be fully loaded, and equally unlikely that any server will ever draw its nameplate power. And remember, the room was originally designed with excess cooling capacity. Adding even more cooling to such rooms compounds the over-provisioning. Capital and energy are wasted.
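A quick worked example of that arithmetic, with illustrative numbers (the 60% utilization figure is an assumption, not a measurement):

```python
requested_it_kw_nameplate = 50   # what the change request says, from nameplate data
assumed_utilization = 0.6        # assumed fraction of nameplate actually drawn
cooling_added_kw = 70            # the reflexive response: nameplate plus margin

expected_heat_load_kw = requested_it_kw_nameplate * assumed_utilization
print(f"Likely real heat load: ~{expected_heat_load_kw:.0f} kW")
print(f"Cooling added: {cooling_added_kw} kW "
      f"({cooling_added_kw / expected_heat_load_kw:.1f}x the likely load, "
      f"before counting the room's existing spare capacity)")
```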

We find that cooling utilization is typically 35 to 40%, which leaves plenty of excess capacity for IT equipment expansions. We also find that in 5 to 10% of situations, equipment performance and capacity have degraded to the point where cooling redundancy is compromised. In these cases, maintenance becomes difficult and there is a greater risk of IT failure due to a thermal event. So it’s important to know how a room is running before adding cooling. But it isn’t always easy to tell whether cooling units are performing as designed and specified.

How can operations managers make more cost effective – and safe – planning decisions?  Analytics.

Analytics using real-time data provide managers with the insight to determine whether the cooling infrastructure can handle a change or expansion to IT equipment, and to manage those changes while minimizing risk. Specifically, analytics can quantify actual cooling capacity, expose equipment degradation, and reveal where there is more or less cooling reserve in a room for optimal placement of physical and virtual IT assets.

Consider the following analytics-driven capacity report. Continually updated by a sensor network, it displays exactly where capacity is available and where it is not. With this data alone, you can determine where you can safely and immediately add load with no CapEx investment. And in those situations where you do need additional cooling, it will predict with high confidence what you need.

[Report: Cooling Capacity]

Yet you can go deeper still. By pairing the capacity report with a cooling reserve map (below), you can determine where you can safely place additional load in the desired room. You can also see where to locate your most critical assets and, when you do need that new air conditioner, where to place it.

[Image: cooling reserve map]

Using these reports, operations can:

  • avoid the CapEx cost of more cooling every time IT equipment is added;
  • avoid the risk of cooling construction in production data rooms when it is often not needed;
  • avoid the delay to revenue caused by adding cooling that a facility doesn’t need.

In addition, analytics used in this way avoid unnecessary energy and maintenance OpEx costs.

Stop guessing and start practicing the art of avoidance with analytics.

 

 

Maintenance is Risky

No real surprise here. Mission critical facilities that pride themselves on and/or are contractually obligated to provide the “five 9’s” of reliability know that sooner or later they must turn critical cooling equipment off to perform maintenance. And they know that they face risk each time they do so.

This is true even for the newest facilities. The minute a facility is turned up, or IT load is added, things start to change. The minute a brand new cooling unit is deployed, it starts to degrade – however incrementally. And that degree of degradation is different from unit to unit, even when those units are nominally identical.

In a risk and financial performance panel presentation at a recent data center event sponsored by Digital Realty, eBay’s Vice President of Global Foundation Services Dean Nelson stated that “touching equipment for maintenance increases Probability of Failure (PoF).” Nelson actively manages and focuses on reducing eBay’s PoF metric throughout the facilities he manages.

Performing maintenance puts most facility managers between the proverbial rock and a hard place. If equipment isn’t maintained, by definition you have a “run to failure” maintenance policy. If you do maintain equipment, you incur risk each time you turn something off. The telecom industry calls this “hands in the network” which they manage as a significant risk factor.

What if maintenance risks could be mitigated? What if you could predict what would happen to the thermal conditions of a room and, even more specifically, what racks or servers could be affected if you took a particular HVAC unit offline?

This ability is available today. It doesn’t require computational fluid dynamics (CFD) or other complicated tools that rely on physical models. It can be accomplished through data and analytics. That is, analytics continually updated by real-time data from sensors instrumented throughout a data center floor. Gartner Research says that hindsight based on historical data, followed by insight based on current trends, drives foresight.

Using predictive analytics, facility managers can also determine exactly which units to maintain and when – in addition to understanding the potential thermal effect that each maintenance action will have on every location on the data center floor.
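A bare-bones sketch of that kind of check, using a hypothetical rack-by-CRAC influence table (the fraction of each rack’s cooling attributable to each unit, as inferred from sensor data); in a real deployment the table would be driven by live telemetry rather than hard-coded values:

```python
# rack -> {CRAC unit -> share of that rack's cooling attributed to it}
influence = {
    "Rack A1": {"CRAC-1": 0.7, "CRAC-2": 0.3},
    "Rack B2": {"CRAC-2": 0.9, "CRAC-3": 0.1},
    "Rack C3": {"CRAC-2": 0.4, "CRAC-3": 0.6},
}

def racks_at_risk(unit_offline, threshold=0.5):
    """Racks that lose more than `threshold` of their cooling if the unit stops."""
    return [rack for rack, shares in influence.items()
            if shares.get(unit_offline, 0.0) > threshold]

for unit in ("CRAC-1", "CRAC-2", "CRAC-3"):
    at_risk = racks_at_risk(unit) or ["none"]
    print(f"Taking {unit} offline puts at risk: {', '.join(at_risk)}")
```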

If this knowledge was easily available, what facility manager wouldn’t choose to take advantage of it before taking a maintenance action? My next blog post will provide a visual example of the analysis facility managers can perform to determine when and where to perform maintenance while simultaneously reducing risk to more critical assets and the floor as a whole.