Climate Change Is Real: What to do now and what to plan

This article first appeared in the November 2019 edition of Data Economy magazine.

No matter what you believe is causing climate change, temperatures are rising and extreme weather events are becoming more frequent.

NOAA, Princeton University, and the European Academies’ Science Advisory Council have all found that the likelihood of “100-year” weather events has increased, with a significant uptick in probability in just the past five years. Data centers in remote areas may become more difficult to reach physically after a catastrophic event. Water and electricity costs are increasing. And the world’s exploding reliance on digitization makes regulations around uptime reliability more likely.

Any one of these factors can impact data center operations. But together, their impact means change and challenges are ahead.

With few exceptions, existing data centers were not designed with climate change in mind, particularly with regard to thermal reliability. There are more days with extreme high temperatures, and cooling capacity degrades as it gets hotter outdoors.

For a data center running near full load, heat events will increase in both frequency and severity with climate change. Extreme weather events are increasing in frequency, strength, and duration. More rainfall during hotter weather adds humidity as another cause for concern.

What can data centers do to prepare for these eventualities? There are short-term and long-term actions worth considering.

Load-leveling

Most data centers do not operate at full load, which so far has prevented climate change-related temperature effects from becoming an issue. Data centers can turn this partial-load margin into an intentional strategy for protection.

If you have some data halls that are fully loaded and others that are partially loaded, then you can either move some of the load into the partially loaded halls or move some of the cooling equipment into the fully loaded halls. However, this would require data centers to know the actual operating capacity of their facility, and not just the designed capacity. To safely do this, you will need metrics to understand real-time airflow conditions and equipment health.

Of course, data centers typically run less efficiently at partial load and experience poorer PUE, which means this method will also require an automated control system that can avoid excessive cooling in partially loaded halls.
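To make the load-leveling idea concrete, here is a minimal sketch that computes how much IT load would need to move between halls so that every hall runs at the same fraction of its measured cooling capacity. The `Hall` class, the field names, and the example figures are all hypothetical, and equal utilization is only one possible target; a real plan would also account for airflow, redundancy, and equipment health.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hall:
    name: str
    it_load_kw: float
    measured_cooling_kw: float  # actual operating capacity, not design capacity


def level_loads(halls: List[Hall]) -> Dict[str, float]:
    """Return the load change (kW) per hall that equalizes utilization.

    Positive values mean the hall can take on load; negative values mean
    load should move out of that hall.
    """
    total_load = sum(h.it_load_kw for h in halls)
    total_capacity = sum(h.measured_cooling_kw for h in halls)
    target_utilization = total_load / total_capacity
    return {
        h.name: h.measured_cooling_kw * target_utilization - h.it_load_kw
        for h in halls
    }


# Illustrative numbers only: one nearly full hall and one partially loaded hall.
halls = [Hall("A", 480.0, 500.0), Hall("B", 200.0, 500.0)]
shifts = level_loads(halls)
for name, delta in shifts.items():
    print(f"Hall {name}: {delta:+.0f} kW")
```

In this example both halls end up at 68 percent utilization after moving 140 kW from hall A to hall B. The same arithmetic applies if you move cooling equipment instead of load; only the capacity figures change.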

More free cooling

While free cooling capacity is highly dependent on the weather, it can still help even when it’s hot outdoors. During an extreme weather event, warm air is still better than no air. This means data centers will need to ensure power is available to operate free cooling in extreme conditions.

The fans used in free cooling systems use less power than the pumps and compressors of mechanical cooling systems, so putting them on UPS could be a good backstop that increases the resilience of your data center.

More capacity

Longer term, new data centers should be designed to make it easier to add capacity over time to accommodate future concerns such as higher loads, denser equipment, and more frequent extreme weather events. For example, data centers serving 5G networking equipment will see significantly higher heat density than the telecom data centers of previous generations. As this transition unfolds it will be important for the cooling infrastructure to keep up.

Whether a facility operator retrofits their cooling infrastructure or adds cooling capacity, the end result is that the make, model, and design of the cooling system will become increasingly heterogeneous.

Having a control system that can handle a heterogeneous mixture of cooling equipment from different vendors with different designs will become increasingly important. Flexible technology will help data centers adapt to ongoing change.

Changing the Climate Inside Your Data Center

Climate change is not an external force that data centers need to protect themselves against, but rather a market force with which they need to keep up. As temperatures rise outside, or density increases temperatures, cooling must keep pace.

As hot weather persists, maintaining airflow of any temperature is better than not circulating air. And as heat causes more outages, increased capacity will give you the buffer to stay up and running.

For a data center operator, the key takeaway about climate change is that no one is debating that it’s happening, so making sure your business has the resiliency and redundancy to weather any storm is the best possible plan of action.

You can also read this full article in the online version of Data Economy, on page 71.

DATA CENTRE DYNAMICS: OUR MISGUIDED FAITH IN THE FIVE NINES

Deep down, everyone knows that the five nines (as we see in 99.999 percent uptime promises everywhere) is merely a concept for reliability. The ‘five nines’ mean that there is only a 0.001 percent probability of failure in an interval of time. From a time perspective, it means that a given service will never be down for more than 0.001 percent of the time, which translates to just over five minutes per year.
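The downtime arithmetic is easy to check. The sketch below converts an availability percentage into a yearly downtime budget, assuming a 365-day year; the function name is mine, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    print(f"{label}: {allowed_downtime_minutes(pct):,.1f} minutes per year")
```

Five nines works out to about 5.3 minutes per year, while two nines allows more than 5,000 minutes, or roughly three and a half days.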

This type of “high nine” reliability metric is commonly applied to components in technical buildings, such as line cards in a switch or power supplies. But a data center is a complex interconnected system of components. Its overall reliability will be driven by its least reliable components. Since all components are not equally reliable, this means the concept of high-nines reliability, even if correct for some components, is more of a marketing statement than an accurate assessment of overall data center reliability.
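One way to see why the weakest component dominates is a simple series model. The sketch below assumes that every component is required for service and that failures are independent, which is a simplification I am adding for illustration; real facilities include redundancy and correlated failures. The component names and availability figures are made up.

```python
from math import prod
from typing import Iterable


def series_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every component must be working."""
    return prod(component_availabilities)


# Illustrative figures only: two five-nines components and one two-nines component.
components = {"power supply": 0.99999, "line card": 0.99999, "cooling": 0.99}
overall = series_availability(components.values())
print(f"Overall availability: {overall:.5f}")
print(f"Weakest component:    {min(components.values()):.5f}")
```

Under these assumptions the overall figure can never be better than the weakest component, no matter how many nines the others carry.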

“The five nines, in the majority of cases, is a marketing figure that doesn’t stand up to practice, isn’t supported by evidence, and doesn’t show forward-looking risk,” according to Andy Lawrence, VP of Research at Uptime. In addition, a 2018 survey by 451 Research found that 48 percent of respondents experienced a major IT/data center outage within the last three years, with two of those failures involving 911 Emergency switching gear that had been moved into data centers.

Clearly data centers don’t deliver five nines of service reliability. So, what are the weak links in the reliability chain? Take a look at cooling.

Data center cooling is typically designed to withstand a 50-year or a 100-year weather event, which sounds like very high reliability. But a 100-year design means that there is a one-percent probability of such an event occurring in any given year, which is just two nines, not five! If the life of the data center is 20 years, then designing it for a 100-year weather event translates to an 18 percent chance that the weather will exceed the design condition at some time during the life of the data center.
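The 18 percent figure comes from compounding the annual probability over the life of the facility. Here is a short sketch of that calculation, assuming each year is independent and the annual probability stays constant (which, given climate change, is likely optimistic). The function name is illustrative.

```python
def exceedance_probability(return_period_years: float, lifetime_years: int) -> float:
    """Chance of at least one design-exceeding event during the facility's life."""
    annual_probability = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_probability) ** lifetime_years


risk = exceedance_probability(return_period_years=100, lifetime_years=20)
print(f"100-year design over 20 years: {risk:.1%}")
```

Calling the same function with a 50-year return period shows how quickly the risk grows as the design margin erodes.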

What has made this risk factor more tenable is that most data centers don’t run at full load. But this doesn’t make it an acceptable business strategy. Everyone in the data center business is pushing for higher loads in existing facilities. So if your design is only two nines, and the only thing saving you from failure is a sales guy who can’t make quota, you have a business problem.

Another consideration is that cooling capacity degrades with time due to wear and tear. So, if a data center started off able to withstand a 100-year weather event at full load, it may only be able to withstand a 50-year weather event after a number of years.

Cooling system reliability becomes an even bigger concern with climate change. Average temperatures around the globe are increasing while extreme weather events are getting even more extreme. 100-year weather events are becoming much more frequent. Both the imminent and probable impacts of these changing climate conditions are well documented by 451 Research and others.

What users really care about

I agree with Andy Lawrence’s opinion about five nines. Even if five nines is a reasonable reliability standard for the internal components of a data center, it has no bearing on the overall reliability of a data center.

Ultimately, consumers of data center services don’t care why service outages happen. They just care that they might happen, and did happen. I think it is time for more focus on the weak links in the reliability chain and less reliance on five nines statements. The impact of both natural wear and tear along with climate change makes the cooling system one of the weakest links in the entire data center reliability chain and worthy of reliability and optimization focus.

How Did I Live Without My AI-Driven Cooling?

Driving the other day, I decided to grab a quick bite to eat on the way home. I quickly launched my maps application, searched on the type of food I wanted, picked a local place, made sure they had decent reviews, checked their hours, and started the navigation guidance to tell me how to get there quickly.

When I did this, I was hungry. A few seconds later, I was on my way to solving that issue.

But I didn’t break it down that I was using a mobile-sized computer to triangulate my position on the globe from satellites. I didn’t then overlay a series of restaurants from a back-end database on top of that map, which was then integrated with a reviews database as well as location-specific information about that restaurant and its hours of operation. I didn’t follow that up by evaluating different routes from my current location to the restaurant, and deciding which one to take.

This was all on auto-pilot. I decided I wanted food, looked up restaurants, made sure the food was good, the place was open, and went. This took just seconds of my time.

We get so much information from simple swipes and glances that we forget what’s really guiding all of those interactions under the hood.

The ways we live, work, drive, and interact have all gone beyond the scope of what many of these technologies were originally designed to do.

And it only makes sense that this sort of distillation of technology to simplify our lives has also found its way into the data center, especially with the advancement of artificial intelligence for optimization and operation of cooling systems. Data Center Knowledge described recent advancements in an article on machine learning.

We’re not quite at the fully automated, human-to-computer interfaces seen in futuristic shows like Star Trek, but the day is rapidly approaching when you can “make it so.” Just like the technology above, you’ll wonder how you ever managed without AI-driven cooling™.

In an AI-driven data center, you can already:

  • Continually monitor conditions on your console or mobile device, from anywhere
  • Know which racks have redundant cooling so you can orchestrate variable workloads automatically
  • Identify the effects of “hands in the network” by viewing real-time or time-sequenced heat maps and data
  • See where the cooling is being delivered using Influence Maps™
  • See when floor panels haven’t been put back or blanked
  • Verify that work has been completed successfully using data and performance metrics (and hold vendors accountable)
  • Review anomalies that result from unexpected behavior even if they have already been mitigated by AI-driven cooling, and then review the data to see what and where you need to focus

This real-time information is immediately and continually visible from your dashboard. Walking the floor is only necessary for physical or configuration changes.

You can already see – and prove – whether you really need that new CRAC, or whether shifting IT load or cooling will net the same effect. You can see if your free cooling is operating as designed and have the data to troubleshoot it if not. AI-driven cooling automatically resolves issues and gives you the additional time – and critical data – to investigate further if need be.

AI-driven cooling enables autonomous, truly remote data centers to become even more cost effective as your best facility personnel can manage your most critical facilities – from miles or continents away.

Highly variable data centers, which house very high-density, high-heat-producing racks in the proximity of others that don’t, will be easier to manage with less stress. Because AI-driven cooling understands the distinct cooling requirements of any situation, it can automatically manage airflow within the same room for optimum efficiency.

When Fortune Magazine forecasted the “25 Ways AI is Changing Business,” they said that “the reality is that no one knows or can know what’s ahead, not even approximately. The reason is that we can never foresee human ingenuity, all the ways in which millions of motivated entrepreneurs and managers worldwide will apply rapidly improving technology.” But just as you and I have already seen what AI and mobile phone technology have done for our lives, so will it be for data center infrastructure.

And, like the power available through our mobile phones, someday soon we’ll wonder how we ever managed without AI-driven data centers.

With Data Centers, What Can Happen Will Happen (Eventually).

Because data centers and telecom switching centers are designed to withstand failures without interrupting business operations, a 3 a.m. emergency due to a malfunctioning air conditioner should never occur – in theory. But Murphy’s Law says that if a single failure can create an emergency, it will. So, to date, operators have had to react to single-component failures as if they are business-critical. Because they might be.

In my previous blog, I pointed out the two components of risk: the probability of and the consequence of failure. While both of these components are important in failure analysis, it is the consequence of failure that’s most effective at helping decision-makers manage the cost of failure.

If you know there is a high probability of impending failure, but you don’t know the potential consequence, you have to act as though every threat has the potential for an expensive business interruption. Taking such actions is typically expensive. But if you know the consequence, even without knowing the probability of failure, you can react to inconsequential failures at your leisure and plan so that consequential failures are less likely.

In the past, the consequences of a failure weren’t knowable or predictable. The combination of Internet of Things (IoT) data and machine learning has changed all that. It’s now possible to predict the consequence of failure by analyzing large quantities of historical sensor data. These predictions can be performed on demand and without the need for geometrical data hall descriptions.

The advantage of machine learning-based systems is that predictive models are continually tuned to actual operating conditions. Even as things change and scale over time, the model remains accurate without manual intervention. The consequences of actions, in addition to equipment failures, become knowable and predictable.

This type of consequence analysis is particularly important for organizations that have a run-to-failure policy for mechanical equipment. Run-to-failure is common in organizations with severe capital constraints, but it only works, and avoids business interruptions, if the consequence of the next failure is predictable.

Predicting the consequence of failure allows an operations team to avoid over-reacting to failures that do not affect business continuity. Rather than dispatching a technician in the middle of the night, an operations team can address a predicted failure with minimal or no consequence during its next scheduled maintenance. If consequence analysis indicates that a cooling unit failure may put more significant assets at risk, the ability to predict how much time is available before a critical temperature is reached provides time for graceful shutdown – and mitigation.
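As a rough illustration of the "how much time is available" question, the sketch below extrapolates a temperature trend linearly to estimate the minutes remaining before a critical threshold is reached. This is a deliberately simple stand-in that I am assuming for illustration, not the machine learning approach described above; the function name, the reading format, and the numbers are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

Reading = Tuple[float, float]  # (minutes since failure, temperature in deg C)


def minutes_until_critical(readings: Sequence[Reading], critical_c: float) -> Optional[float]:
    """Estimate minutes until critical_c, using the first and last reading.

    Returns 0.0 if the threshold is already reached, and None if the
    temperature is flat or falling (no predicted consequence).
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings to estimate a trend")
    (t0, temp0), (t1, temp1) = readings[0], readings[-1]
    if t1 <= t0:
        raise ValueError("readings must be in increasing time order")
    if temp1 >= critical_c:
        return 0.0
    rate_c_per_minute = (temp1 - temp0) / (t1 - t0)
    if rate_c_per_minute <= 0:
        return None
    return (critical_c - temp1) / rate_c_per_minute


# Illustrative rack inlet readings after a cooling unit stops.
readings = [(0.0, 24.0), (5.0, 25.0), (10.0, 26.0)]
remaining = minutes_until_critical(readings, critical_c=32.0)
print(f"Estimated time before critical temperature: {remaining:.0f} minutes")
```

A `None` result corresponds to a failure with no predicted consequence, the kind that can wait for scheduled maintenance. A short estimate is the signal to begin graceful shutdown and mitigation.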

Preventative maintenance carries risk, but equipment still needs to be shut off at times for maintenance. Will it cause a problem? Predictive consequence analysis can provide the answer. If there’s an issue with shutting off a particular unit, you can know in advance and provide spot cooling to mitigate the risks.

The ability to predict the consequences of failure, or intentional action such as preventative maintenance, gives facility managers greater control over the reliability of their facilities, and the peace of mind that their operations are as safe as possible.

The Real Cost of Cooling Configuration Errors

Hands in the network cause problems. A setting adjusted once, based on someone’s instinct of what needed to be changed at one moment in time, is often unmodified years later.

This is configuration rot. If your data center has been running for a while, the chances are pretty high that your cooling configurations, to name one example, are wildly out of sync. It’s even more likely you don’t know about it.

Every air conditioner is controlled by an embedded computer. Each computer supports multiple configuration parameters. Each of these different configurations can be perfectly acceptable. But a roomful of air conditioners with individually sensible configurations can produce bad outcomes when their collective impact is considered.

I recently toured a new data center in which each air conditioner supported 17 configuration parameters affecting temperature and humidity. There was a lot of unexplainable variation in the configurations. Six of the 17 configuration settings varied by more than 30%, unit to unit. Only five configurations were the same. Initial configuration variation and entropy over time waste energy and prevent the overall air conditioning system from producing an acceptable temperature and humidity distribution.
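A check like the one I did on that tour can be sketched in a few lines. The example below flags parameters whose values vary across units by more than a threshold. I am assuming that "varied by more than 30%" means the range divided by the mean, which is only one reasonable definition. The unit names, parameter names, and values are hypothetical.

```python
from statistics import mean
from typing import Dict, List

Config = Dict[str, float]


def relative_spread(values: List[float]) -> float:
    """Range of the values divided by the magnitude of their mean."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    avg = abs(mean(values))
    return (hi - lo) / avg if avg else float("inf")


def inconsistent_parameters(units: Dict[str, Config], threshold: float = 0.30) -> List[str]:
    """Return the parameter names whose unit-to-unit spread exceeds threshold."""
    parameter_names = sorted({name for config in units.values() for name in config})
    flagged = []
    for name in parameter_names:
        values = [config[name] for config in units.values() if name in config]
        if relative_spread(values) > threshold:
            flagged.append(name)
    return flagged


# Illustrative configurations for three units.
units = {
    "CRAC-1": {"return_temp_setpoint_c": 22.0, "humidity_setpoint_pct": 40.0},
    "CRAC-2": {"return_temp_setpoint_c": 22.0, "humidity_setpoint_pct": 50.0},
    "CRAC-3": {"return_temp_setpoint_c": 23.0, "humidity_setpoint_pct": 60.0},
}
print(inconsistent_parameters(units))
```

Here only the humidity set point is flagged, because its spread is 40 percent of the mean while the temperature set point varies by under 5 percent.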

Configuration errors contribute to accidental de-rating and loss of capacity. This wastes energy, and it’s costly from a capex perspective. Perhaps you don’t need a new air conditioner. Instead, perhaps you can optimize or synchronize the configurations for the air conditioners you already have and unlock the capacity you need. Another common configuration error is incompatible set points. If one air conditioner is trying to make a room colder and another is trying to make it warmer, the units will fight.

Configuration errors also contribute to poor free cooling performance. Misconfiguration can lock out free cooling in many ways.

The problem is significant. Large organizations use thousands of air conditioners. Manual management of individual configurations is impossible. Do the math. If you have 2,000 air conditioners, each of which has up to 17 configuration parameters, you have up to 34,000 individual settings to track, not to mention the additional external variables. How can you manage, much less optimize, configurations over time?

Ideally, you need intelligent software that manages these configurations automatically. You need templates that prescribe optimized configuration. You need visibility to determine, on a regular basis, which configurations are necessary as conditions change. You need exception handling, so you can temporarily change configurations when you perform tasks such as maintenance, equipment swaps, and new customer additions, and then make sure the configurations return to their optimized state afterward. And, you need a system that will alert you when someone tries to change a configuration, and/or enforce optimized configurations automatically.
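The template and exception-handling ideas can be sketched as a simple drift check. The example below compares each unit against a template and skips units that have an active exception, such as maintenance. Every name and value here is hypothetical, and a production system would read settings from the units' controllers rather than from a dictionary.

```python
from typing import Any, Dict, FrozenSet, Tuple

Config = Dict[str, Any]


def find_drift(
    template: Config,
    unit_configs: Dict[str, Config],
    exempt_units: FrozenSet[str] = frozenset(),
) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Map each drifted unit to {parameter: (actual value, template value)}."""
    drift = {}
    for unit, config in unit_configs.items():
        if unit in exempt_units:
            continue  # temporary exception, e.g. unit under maintenance
        differences = {
            name: (config.get(name), wanted)
            for name, wanted in template.items()
            if config.get(name) != wanted
        }
        if differences:
            drift[unit] = differences
    return drift


# Illustrative template and fleet.
template = {"return_temp_setpoint_c": 22.0, "free_cooling_enabled": True}
unit_configs = {
    "CRAC-1": {"return_temp_setpoint_c": 22.0, "free_cooling_enabled": True},
    "CRAC-2": {"return_temp_setpoint_c": 22.0, "free_cooling_enabled": False},
    "CRAC-3": {"return_temp_setpoint_c": 18.0, "free_cooling_enabled": True},
}
drift = find_drift(template, unit_configs, exempt_units=frozenset({"CRAC-3"}))
print(drift)
```

Run on a schedule, a check like this turns 34,000 settings into a short list of exceptions to review, and the same comparison can drive alerts or automatic enforcement.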

This concept isn’t new. It’s just rarely done. But if you aren’t aggressively managing configurations, you are losing money.

When Free Cooling Isn’t Free

Published in Data Center Dynamics.

The use of free cooling systems is quickly becoming common practice – particularly in new mission critical facility builds. Using outside air, either directly or indirectly, to cool ICT equipment is undeniably compelling, both logically and financially.

But is free air really free? Not always. Free cooling systems add considerable complexity to the operation and maintenance of mechanical equipment. If this complexity isn’t recognized or managed well, free cooling will add to energy costs and increase operational risk.

Watch the weather

Weather is the most obvious variable. Free cooling capacity declines in hot weather, requiring a design that either allows for elevated indoor temperatures or combines free cooling with conventional mechanical cooling to ensure that indoor temperatures remain within an acceptable range.

Multiple operating modes are another complicating factor. For example, the free cooling system at Facebook’s Prineville data center uses eight distinct operating conditions to optimize the use of direct outside air and direct evaporative cooling under different weather conditions. Free cooling systems that use direct outside air augmented by compressorized cooling have at least three distinct operating conditions.

Maintenance also becomes more complex. Free cooling adds to the number of moving mechanical components (e.g. air dampers and actuators) that are in direct contact with outdoor air. Outdoor air is corrosive, which can cause the dampers and actuators to get stuck, and either fail to provide cooling or cause the system to bring in hot outdoor air when it should not. Free cooling systems with evaporative cooling have the added maintenance of cooling water, which requires chemical treatment and periodic flushing.

This complexity can significantly impact the energy reduction that free cooling can deliver, while creating real thermal management problems.

High failure rates

Accordingly, the high failure rates of free cooling systems are well documented in energy efficiency and building technology literature. A particularly good and practical paper entitled Free Cooling, At What Cost was written by Kristen Heinemeier and presented at the 2014 ACEEE Summer Study on Energy Efficiency in Buildings. My direct experience with free cooling systems throughout the US and Europe is completely consistent with Heinemeier’s paper. Specifically, I have seen even higher failure rates in mission critical facilities than in the commercial buildings referenced in Heinemeier’s paper.

Heinemeier examined the prevalence and impact of air-side economizer (direct free cooling) failure. She found that although economizers are an excellent energy saving technology, they do not perform well in practice. In California alone, she cites that in surveyed facilities, the economizer is disabled and outside air dampers are closed 30 – 40 percent of the time. She states: “This type of failure means that the economizer is not providing any savings, and that the building may not be bringing in any outside air. Other studies have found that the high-limit setpoints, set by technicians, are incorrect on the majority of RTUs in California, resulting in very few hours in the ‘free cooling’ range.”

I recently toured five sites in two countries, owned by different multinational companies, using cooling equipment from three different manufacturers.

Among the dozens of free cooling units that I observed on this trip, nearly all either had a problem that limited capacity and function or weren’t working at all. Problems included controller configuration, sensor failure, installation faults, and mechanical failures.

Some examples:

  • In one site, the outdoor air was cool but the outside air dampers were fully closed and the unit was recirculating indoor air. The temperature remained within an acceptable range; however, this was because the DX compressors were running unnecessarily – at massive cost. The operators knew that the free cooling should be operating, but didn’t know why it wasn’t. The facility had been operating that way since the free cooling units had been installed – about a year prior. Inspection of the units revealed that the controls weren’t configured properly, and that misconfigured control logic was preventing the free cooling from operating. I saw a similar scenario in a second site.
  • At another site I observed that the controls were working and appeared to be pulling in outside air. However, the discharge air on one particular unit wasn’t as cold as I would have expected. Inspection of the unit revealed that BOTH the outside air dampers and the return air dampers were closed. The damper actuator clamp on the outside air damper had either fallen off or been removed, leaving that damper stuck in the fully closed position. This problem was identified by analyzing data from the cooling optimization sensor network.
  • At yet another site, I saw that the controls were working, the dampers were working and that cold air was produced – just not very much. We measured a large temperature difference in the outdoor air intake across the outside wall. The outside air duct was installed with a flanged connection to the wall. At a nearby site with the same free cooling equipment, the outside air duct penetrated the wall. The flanged installation caused the cooling units to draw air from the hollow wall construction, reducing the capacity of the free cooling by up to 40 percent. This problem was also identified by analyzing sensor network data.

What’s important to note is that while in each case the free cooling system had problems, they were all fixable problems – often with little or no investment. More significantly, operators didn’t always recognize that their free cooling was compromised, nor how it could be fixed. Besides the additional energy costs and potential thermal risk incurred by this lack of visibility, these facilities were on the verge of spending a lot of money in pursuit of a solution, when in fact their existing equipment would achieve the desired operation.

Monitor your cooling system

Because free cooling systems are highly efficient when they do work as intended, best practice would suggest that risk mitigation and visibility through a monitoring system are required to realize the safe operation and full benefit of free cooling. In California, Title 24 requires diagnostics for use with free cooling systems. Dynamic monitoring, analytics, and diagnostics in conjunction with visual inspection will reveal issues and help ensure the ongoing and proper operation of free cooling within a complex cooling infrastructure. In mission critical facilities that are operated lights-out, use of remote monitoring and analytics combined with intelligent alerting is the only way to ensure reliable operation of free cooling.
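Two of the faults described in the site examples lend themselves to simple rule-based checks. The sketch below is a minimal illustration, not a description of any particular monitoring product or of the Title 24 diagnostics. The field names and thresholds are assumptions, and the damper values are assumed to be measured positions rather than commanded ones, since a stuck damper can look fine from the controller's side.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitSnapshot:
    """One reading from a free cooling unit (all field names are hypothetical)."""
    outdoor_temp_c: float
    return_temp_c: float
    outside_damper_pct: float  # measured position: 0 = closed, 100 = open
    return_damper_pct: float
    compressor_on: bool


def diagnose(s: UnitSnapshot, min_delta_c: float = 3.0, closed_pct: float = 5.0) -> List[str]:
    """Return a list of suspected free cooling faults for one snapshot."""
    findings = []
    free_cooling_available = s.return_temp_c - s.outdoor_temp_c >= min_delta_c
    outside_closed = s.outside_damper_pct <= closed_pct
    return_closed = s.return_damper_pct <= closed_pct

    if free_cooling_available and outside_closed and s.compressor_on:
        findings.append("free cooling available but unit is recirculating on compressors")
    if outside_closed and return_closed:
        findings.append("outside and return dampers both closed: check actuators")
    return findings


# Cool outdoor air, outside damper shut, compressors running.
snapshot = UnitSnapshot(10.0, 30.0, 0.0, 100.0, True)
for finding in diagnose(snapshot):
    print(finding)
```

Rules this simple will not catch everything, such as the flanged duct drawing air from the wall cavity, but they would have flagged the year-long lockout at the first site on day one.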

As free cooling becomes a standard means of cooling mission critical facilities, consideration of the risk and complexity it adds is critical. Data-driven oversight of cooling operations, in combination with a layer of smart analytics and control, is the best-practice way to ensure your thermal environment continually operates in the most efficient way possible. This oversight also ensures that you continue to optimize your capital investment, even as conditions, weather and physical changes occur over time.