Avoiding Risk in Data Centers Sometimes Means Counter-Intuitive Thinking
Sound data center risk mitigation practices can also lead to energy cost savings. But sometimes the route there is counter-intuitive.
Always-on, always-cold is still a commonly used strategy for data center cooling operations – and for good reason. This type of operation is fairly easy to implement and monitor, and running all the CRACs all the time would seem to reduce the risk of downtime should a unit fail. In this operating strategy, the CRACs are run at a low setpoint. They operate at a lower temperature than required, either to mitigate the risk of hot spots or to add ride-through time in the event of a cooling system failure.
While this seems a logical and prudent practice, if you dig a little deeper, you’ll see that it’s not quite as risk-averse as it initially appears – and, more importantly, it misses a larger opportunity for significant energy cost savings. Let’s examine each practice individually.
Continuous operation of all CRACs, including redundant (backup) CRACs, wears all units out prematurely. Increased runtime for any piece of equipment that wears out with use naturally shortens its service life.
Leveling CRAC runtimes, in which each CRAC is set to run approximately the same number of hours, has the same issue. This practice might extend the time to first failure; however, because all units then wear at about the same rate, it also increases the risk of catastrophic failure (i.e., simultaneous failure of all units).
And then there’s the issue of low setpoint thresholds. Common thinking regarding cold operations is that an overall cooler temperature will use the thermal mass of the infrastructure to provide extra time to react in the event of a cooling system failure. However, when all CRACs operate equally, each CRAC runs at a lower (less efficient) utilization, meaning that the discharge air temperature from each CRAC will be higher. Some CRACs, in effect, may not be cooling at all, which means that in a raised floor data center, those units are simply blowing return air into the underfloor plenum. Since the largest source of thermal mass in a data center is the slab floor, the “always-on” and/or low setpoint approach to CRAC operation may not yield the best utilization of thermal mass.
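The mixing effect described above can be illustrated with a rough calculation. This is only a sketch under simplifying assumptions: the underfloor air is treated as a perfect flow-weighted mix of each unit's discharge air, and the airflow and temperature figures are invented for illustration, not measurements from any real facility. The function name `mixed_supply_temp_c` is hypothetical.

```python
def mixed_supply_temp_c(units):
    """Flow-weighted average discharge temperature of the running units.

    `units` is a list of (airflow, discharge_temp_c) pairs. Airflow can be
    in any unit as long as it is the same for every CRAC. This assumes the
    plenum air is perfectly mixed, which is a simplification.
    """
    total_flow = sum(flow for flow, _ in units)
    if total_flow <= 0:
        raise ValueError("at least one unit must be moving air")
    return sum(flow * temp for flow, temp in units) / total_flow


# Illustrative numbers only (assumed, not measured).
# Always-on: two units actually cooling (15 C discharge) and two units
# running but not cooling, passing 27 C return air into the plenum.
always_on = [(1.0, 15.0), (1.0, 15.0), (1.0, 27.0), (1.0, 27.0)]

# Just-needed: the same two cooling units running, the other two switched off.
just_needed = [(1.0, 15.0), (1.0, 15.0)]

print(mixed_supply_temp_c(always_on))
print(mixed_supply_temp_c(just_needed))
```

Under these assumed figures, the units that are running without cooling pull the average underfloor air temperature well above what the cooling units alone would deliver, which is the sense in which the slab's thermal mass is not being put to best use.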
A “just-needed” operation policy is preferable in terms of both catastrophic risk mitigation and energy efficiency. In this case, the most efficient CRACs are operated most of the time, and the less efficient CRACs are kept off most of the time – but held in ready standby. Even when CRACs are nominally the same, there can be significant differences in their cooling efficiency due to manufacturing variability. These differences, if measured or characterized, can be utilized to further optimize efficiency and mitigate the risk of catastrophic failure.
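One way to picture a “just-needed” policy is as a simple selection rule: rank the units by measured efficiency, run the best ones until the cooling load plus a safety margin is covered, and hold the rest in standby. The sketch below is a minimal illustration, not a description of any particular control product. The `Crac` class, the `plan_just_needed` function, the efficiency metric (electrical kW per kW of cooling) and the 20 percent margin are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Crac:
    name: str
    capacity_kw: float        # cooling capacity (assumed known)
    kw_per_kw_cooling: float  # measured electrical input per kW of cooling; lower is better


def plan_just_needed(cracs, load_kw, margin=0.2):
    """Split units into (running, standby).

    Units are taken in order of efficiency until their combined capacity
    covers the load plus a safety margin. The rest stay off but ready.
    """
    target_kw = load_kw * (1 + margin)
    running, standby = [], []
    covered_kw = 0.0
    for unit in sorted(cracs, key=lambda c: c.kw_per_kw_cooling):
        if covered_kw < target_kw:
            running.append(unit)
            covered_kw += unit.capacity_kw
        else:
            standby.append(unit)
    return running, standby


# Four nominally identical units whose measured efficiency differs
# (hypothetical values).
fleet = [
    Crac("CRAC-A", 100.0, 0.30),
    Crac("CRAC-B", 100.0, 0.25),
    Crac("CRAC-C", 100.0, 0.40),
    Crac("CRAC-D", 100.0, 0.35),
]

running, standby = plan_just_needed(fleet, load_kw=150.0)
print("running:", [u.name for u in running])
print("standby:", [u.name for u in standby])
```

A real policy would also need to respond to a unit failure or a hot spot by bringing standby units online, which this sketch does not attempt.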
Sometimes the obvious or even most commonly used cooling strategy isn’t the best strategy, particularly as rising energy costs become more of a concern. An operating strategy that recognizes and anticipates the possibilities of “little failures,” while focusing on the avoidance of catastrophic failure and reducing energy costs, is not only forward looking but also represents best practice.