Machine Learning

Why Machine Learning-based DCIM Systems Are Becoming Best Practice.

Here’s a conundrum.  While data center IT equipment has a lifespan of about three years, data center cooling equipment will endure about 15 years. In other words, your data center will likely undergo five complete IT refreshes within the lifetime of your cooling equipment – at the very least.  In reality, refreshes happen much more frequently. Racks and servers come and go, floor tiles are moved, maintenance is performed, density is changed based on containment operations – any one of which will affect the ability of the cooling system to work efficiently and effectively.

If nothing is done to re-configure cooling operations as IT changes are made – and this is typically the case – the data center develops hot and cold spots, stranded cooling capacity, and wasted energy.  There is also risk with every equipment refresh – particularly if the work is done manually.

There’s a better way. The ubiquitous availability of low cost sensors, in tandem with the emerging availability of machine learning technology, is leading to development of new best practices for data center cooling management. Sensor-driven machine learning software enables the impact of IT changes on cooling performance to be anticipated and more safely managed.

Data centers instrumented with sensors gather real-time data which can inform software of minute-by-minute cooling capacity changes.  Machine learning software uses this information to understand the influence of each and every cooling unit, on each and every rack, in real time as IT loads change.  And when loads or IT infrastructure change, the software re-learns and updates itself, ensuring that its influence predictions remain current and accurate.  This ability to understand cooling influence at a granular level also enables the software to learn which cooling units are working effectively – and at expected performance levels – and which aren’t.
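
To make the idea concrete, here is a minimal sketch of influence learning – a simplified stand-in, not Vigilent’s actual algorithm – that fits a linear model mapping cooling-unit output levels to rack inlet temperatures; the fitted coefficients approximate each unit’s influence on each rack. All names and numbers are illustrative:

```python
import numpy as np

def learn_influence(unit_levels, rack_temps):
    """Estimate how much each cooling unit influences each rack.

    unit_levels: (samples, units) matrix of cooling-unit output levels (0-1).
    rack_temps:  (samples, racks) matrix of rack inlet temperatures (deg C).
    Returns a (units, racks) influence matrix: fitted temperature change at
    each rack per unit increase in each cooling unit's output.
    """
    # Add an intercept column so the baseline temperature is absorbed separately.
    X = np.hstack([unit_levels, np.ones((unit_levels.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, rack_temps, rcond=None)
    return coef[:-1]  # drop the intercept row; negative = cooling effect

# Synthetic example: unit 0 strongly cools rack 0, unit 1 strongly cools rack 1.
rng = np.random.default_rng(0)
levels = rng.uniform(0, 1, size=(200, 2))
true_influence = np.array([[-8.0, -1.0],
                           [-1.5, -6.0]])
temps = 35.0 + levels @ true_influence + rng.normal(0, 0.1, size=(200, 2))

influence = learn_influence(levels, temps)
```

Re-learning after an IT change is then just refitting on the newest window of sensor data, which is how the model’s predictions stay current as the floor evolves.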

This understanding also illuminates, in a data-supported way, the need for targeted corrective maintenance.  With a clearer understanding and visualization of cooling unit health, operators can justify the budget needed to maintain equipment effectively, thereby improving the overall health of the data center and reducing risk.

In one recent experience at a large US data center, machine learning software revealed that 40% of the cooling units were consuming power but not cooling.  The data center operator was aware of the problem, but couldn’t convince senior management to expend budget because he couldn’t quantify the problem nor prove the value/need for a specific expenditure to resolve the issue.  With new and clear data in hand, the operator was able to identify the failed CRACs and present the appropriate budget required to fix and replace them accordingly.

This ability to more clearly see the impact of IT changes on cooling equipment enables personnel to keep up with cooling capacity adjustments and, in most cases, eliminates the need for manual control.  Reducing these “on-the-fly, floor time corrections” also frees operators to focus on problems that require more creativity and to more effectively manage physical changes such as floor tile adjustments.

There’s no replacement for experience-based human expertise. But why not leverage your staff to do what they do best, and eliminate those tasks which are better handled by software control?  Data centers using machine learning software are undeniably more efficient and more robust.  Operators can more confidently future-proof themselves against inefficiency or adverse capacity impact as conditions change.  For these reasons alone, use of machine learning-based software should be considered an emerging best practice.

2012 Retrospective

It’s getting better all the time.

Despite our relentless drive to consume more and more data, driven by ever more interesting and arguably useful multimedia applications, energy consumption of data centers is growing slower than would be predicted from historical trends.

For that success, we should be proud, while remaining focused on even greater efficiency innovation.

Large companies have stepped up with powerful sustainability initiatives which impact energy use throughout their enterprises. We’ve gotten better at leveraging natural resources, like outside air, to moderate data center temperatures.  We are using denser, smarter racks for space and other efficiencies. Data center cooling units are built with variable-speed devices, improving energy efficiency machine by machine. Utility companies are increasingly offering sophisticated and results-generating incentives to jump-start efficiency programs.

These and other contributing factors are making a difference, as shown in Jonathan Koomey’s 2011 report, Growth in Data Center Electricity Use, which documented a flattening of energy usage rather than a lockstep correlation with data center growth. Koomey and other analysts had projected a doubling of world data center energy usage from 2005 to 2010.  Actual growth was closer to 56%, a reduction that Koomey attributes both to fewer-than-expected server installations and to reduced electricity use per server.

I am proud of what our industry – and what our company – has achieved.  Consider some of this year’s highlights.

The New York Times raised the profile – and the ire  – of the data center industry calling attention to the massive energy consumed by, well, consumers.  Data center facilities and analysts alike responded with criticism, saying that the article ignored the many and significant sustainability and energy use reductions now actively in use.

Vigilent received an astounding eight industry awards this year – recognizing our technology innovation, business success and workplace values. I’m very proud that several of these awards were presented by, or achieved in partnership with, our customers.  For example, Vigilent and NTT won the prestigious Uptime GEIT 2012 award in the Facility Product Deployment category.  NTT Facilities with NTT Communications received the 2012 Green Grid Grand Prix award, recognizing NTT’s innovative efforts in raising energy-efficiency levels in Japan using Vigilent and contributing DCIM tools.  And Verizon, in recognition of our support for their commitment to continuing quality and service, presented us with their Supplier Recognition award in the green and sustainability category.

We moved strongly into the Japanese and Canadian markets with the help of NTT Facilities and Telus, both of whom made strategic investments in Vigilent following highly successful deployments.  Premier Silicon Valley venture firm Accel Partners became an investor early in the year.

We launched Version 5 of our intelligent energy management system adding enhanced cooling system control with Intelligent Analytics-driven trending and visualization, along with a new alarm and notification product to further reduce downtime risk.

And, perhaps most satisfyingly of all, we helped our customers avert more than a few data center failures through real-time monitoring and intercession, along with early notification of possible issues.

This year, we will reduce energy consumption by more than 72 million kWh in the US alone.  And this figure grows with each new deployment.  We do this profitably, and with direct contribution to our customers’ bottom lines through energy cost savings.

Things are getting better. And we’re just getting started.

Cooling Failures

The New York Times story “Power, Pollution, and the Internet” highlights a largely unacknowledged issue with data centers: cooling.  James Glanz starts with an anecdote describing an overheating problem at a Facebook data center in its early days. The article goes on to quote: “Data center operators live in fear of losing their jobs on a daily basis, and that’s because the business won’t back them up if there’s a failure.”

It turns out that the issue the author describes is not an isolated incident. As data centers get hotter, denser and more fragile, cooling becomes increasingly critical to reliability. Here are examples of cooling-related failures which have made the headlines in recent years.

Facebook: A BMS programming error in the outside-air economizer logic at Facebook’s Prineville data center caused the outdoor air dampers to close and the spray coolers to run at 100%, which caused condensate to form inside servers, leading to power supply unit failures.

Wikipedia: A cooling failure caused servers at Wikimedia to go into automatic thermal shutdown, shutting off access to Wikipedia from European users.

Nokia: A cooling failure led to a lengthy service interruption and data loss for Nokia’s Contacts by Ovi service.

Yahoo: A single cooling unit failure resulted in locally high temperatures, which tripped the fire suppression system and shut down the remainder of the units.

Lloyds: Failure of a “server cooling system” brought down the wholesale banking division of the British financial services company Lloyds Banking Group for several hours.

Google: For their 1800-server clusters, Google estimates that “In each cluster’s first year, … there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.”

It is no surprise that data center operators live in fear.  What is surprising is that so few operators have mitigated risk through currently-available technology. It’s now possible to non-intrusively upgrade existing data centers with supervisory cooling management systems that compensate for and alert operators to cooling failures. Changes in IT load, environmental conditions, or even human error can quickly be addressed, avoiding what could quickly become an out-of-control incident that results in downtime, loss of availability, and something that’s anathema to colo operators: SLA penalties.
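
As a rough illustration of the supervisory pattern – with hypothetical zone names and thresholds, not any vendor’s actual product – a monitoring loop can compare each zone’s sensor reading against a setpoint band and escalate the alert as temperatures drift:

```python
def check_zones(zone_temps, setpoint=24.0, band=3.0):
    """Flag zones whose temperature drifts outside setpoint +/- band.

    zone_temps: dict mapping zone name -> latest sensor reading (deg C).
    Returns a list of (zone, temp, severity) alerts, hottest first.
    """
    alerts = []
    for zone, temp in zone_temps.items():
        deviation = temp - setpoint
        if abs(deviation) > band:
            # Escalate when a zone has drifted more than twice the band hot:
            # a likely sign a cooling unit is running but no longer cooling.
            severity = "critical" if deviation > 2 * band else "warning"
            alerts.append((zone, temp, severity))
    return sorted(alerts, key=lambda a: -a[1])

alerts = check_zones({"row-A": 23.5, "row-B": 28.1, "row-C": 31.2})
```

A real supervisory system layers on learned influence models and automatic compensation, but even this simple threshold loop turns a silent cooling failure into an actionable alert before it becomes a thermal event.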

It’s incumbent on facilities operators and business management to evaluate and install the latest technology that puts not only operational visibility, but essential control, in their hands before the next avoidable incident occurs.

Data Center Risk

Surprising Areas of Data Center Risk and How to Proactively Manage Them

Mission critical facilities need a different level of scrutiny and control over cooling management.

It’s no surprise that cooling is critical to the security of these facilities.  With requirements for 99.999% uptime and multimillion-dollar facilities at risk, cooling is often the thin blue line between data safety and disaster.

And yet, many mission critical facilities use cooling control systems that were designed for comfort cooling, versus the reliable operation of hugely valuable and sensitive equipment.

When people get warm, they become uncomfortable. When IT equipment overheats, it fails – often with catastrophically expensive results.

In one recent scenario, a 6-minute chiller plant failure resulted in lost revenue and penalties totaling $14 million.  In another, the failure of a single CRAC unit caused temperatures in one zone to shoot up to over 100 degrees Fahrenheit, resulting in the failure of a storage array.

These failures result from a myriad of complex and often unrecognized risk areas.  My recent talk at the i4Energy Seminar series hosted by the California Institute for Energy and Environment (CIEE) exposes some of these hidden risk areas and what you can do about them.

Cleantech Evolves

Smart Loading for the Smart Grid – New Directions in Cleantech

I recently participated in a TiE Energy Panel (The Hottest Energy Startups: Companies Changing the Energy Landscape), with colleagues from Primus Power, Power Assure, Mooreland Partners and Gen110.

The panel concurred that the notion of Cleantech – and the investment money that follows it – has shifted from a focus on energy generation to a focus on energy management.  This is primarily because the cheaper energy sources hyped in early Cleantech press haven’t materialized.  It’s hard to compete with heavily subsidized incumbent energy sources, much less build a company around what’s perceived as a commodity.  There are exceptions, like solar energy development, but other alternative sources have languished financially despite their promise.

The investment shift toward energy management is also a result of emerging efficiency-focused technology.  Data Center Infrastructure Management, or DCIM, is all about smart management – with an emphasis on energy.  Gartner believes that there are some 60+ companies in this space, which is rapidly gaining acceptance as a data center requirement.

This shift is also supported by the convergence of other technology growth areas, such as big data and cloud computing, both of which play well with energy management.   As our increasingly sensor-driven environment creates more and more data – big data – its volume has surpassed the ability of humans to manage it.

And yet this data – accurate, collected in real time, and inclusive of the dimensions of time and location – represents real promise.  Availability and analysis of this information within individual corporations, and perhaps shared more broadly via the cloud, will reveal continuous options for improving efficiency and will likely point to entirely new means of larger-scale energy optimization through an integrated smart grid.

The days of facility operators running around with temperature guns and clipboards – although still surprisingly common today – are giving way to central computer screens with consolidated, scannable, actionable data.

This is an exciting time.  I’m all for new ideas and the creation of less expensive, less environmentally harmful ways to generate energy.  But as these alternative options evolve, I am equally excited by the strides industry has made for the smarter use of the resources we have.

The wave of next generation energy management is still rising.

Data Center Brains

If I Only Had a Brain… said the Data Center

Maintenance…

My recent blog talked about the fact that an intelligent cooling management system reduces wear and tear on cooling equipment.  It does this in part by avoiding short-cycling.  Additionally, intelligent cooling improves thermal stability, reducing wear and tear on IT equipment as well.

Beyond shortening the life of equipment, undue wear and tear causes catastrophic failures, which are always unbudgeted and expensive.  Intelligent cooling management extends the life of equipment and reveals potential equipment issues before they can cause problems.
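
Short-cycling protection itself is conceptually simple: a supervisory layer can refuse on/off commands that arrive before a unit has run (or rested) long enough. A toy sketch, with illustrative timing constants rather than real product settings:

```python
class ShortCycleGuard:
    """Block on/off commands that would cycle a cooling unit too quickly."""

    def __init__(self, min_on=600, min_off=300):
        self.min_on = min_on              # seconds a unit must run before stopping
        self.min_off = min_off            # seconds a unit must rest before restarting
        self.is_on = False
        self.last_change = float("-inf")  # no prior transition

    def request(self, turn_on, now):
        """Apply a requested state at time `now` (seconds); return the actual state."""
        if turn_on == self.is_on:
            return self.is_on             # no change requested
        required = self.min_on if self.is_on else self.min_off
        if now - self.last_change < required:
            return self.is_on             # too soon: hold the current state
        self.is_on = turn_on
        self.last_change = now
        return self.is_on

guard = ShortCycleGuard()
started = guard.request(True, now=0)    # unit starts
early = guard.request(False, now=120)   # denied: unit has run only 120 s
later = guard.request(False, now=700)   # allowed: unit has run 700 s
```

Holding a compressor through its minimum run and rest times is what spares it the repeated start-up stress that shortens its life.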

Capacity Boost…

I’ve also described how intelligent cooling management allows you to do more with less.  When equipment is managed just right, and efficiency is managed moment by moment, the mixing of hot and cold air is avoided, return air temperatures are higher, and the capacity of the cooling equipment increases.  This capacity boost allows you to add more IT equipment, avoid buying more cooling equipment, and ultimately avoid or postpone co-locating or building a new data center as your IT needs expand.
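
The capacity effect follows from basic sensible-heat physics: a cooling unit’s sensible capacity scales with airflow times the return-to-supply temperature difference, so raising the return air temperature (by avoiding hot/cold mixing) raises usable capacity. A back-of-the-envelope illustration with nominal numbers, not any specific unit’s ratings:

```python
def sensible_capacity_kw(airflow_m3s, return_c, supply_c,
                         air_density=1.2, cp_kj_per_kg_k=1.005):
    """Approximate sensible cooling capacity of a CRAC in kW.

    Uses q = mass flow x specific heat x (return - supply) temperature
    difference, with nominal air density and specific heat.
    """
    mass_flow = airflow_m3s * air_density  # kg/s
    return mass_flow * cp_kj_per_kg_k * (return_c - supply_c)

# Same unit, same 13 C supply air: a higher return temperature
# (less hot/cold mixing) yields more usable capacity.
mixed = sensible_capacity_kw(10.0, return_c=24.0, supply_c=13.0)
contained = sensible_capacity_kw(10.0, return_c=32.0, supply_c=13.0)
```

In this sketch, lifting the return temperature from 24 C to 32 C boosts the same unit’s sensible capacity by roughly 70% – cooling headroom recovered without buying anything.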

Adding a Smart Layer…

Intelligent cooling management can be added as a lightweight overlay to legacy cooling infrastructure.  The benefits are immediate.  You gain system-level coordinated control, new insights through visualization of data center floor cooling operations, and sophisticated cooling control diagnostics – without buying a single piece of new cooling equipment or hiring professional service oversight.  And these benefits are equal opportunity – they can be gained in old, new and multi-vendor data centers.

Every data center has untapped potential to work better and deliver more.   By giving your data center a brain, you can increase its brawn as well as its endurance.
