Cooling The Data Center

There’s no perfect solution to data center cooling, but multiple approaches are being developed.


Since British mathematician and entrepreneur Clive Humby coined the rallying cry, “Data is the new oil,” some 20 years ago, it has been an upbeat phrase at data science conferences. In engineering circles, however, that enthusiasm comes with a daily grind of hardware challenges, chief among them how to cool the places where all that data is processed and stored.

An estimated 65 zettabytes of data has already been generated [1], and that number will keep growing, incurring enormous environmental costs worldwide. At the level of individual companies, tremendous capital and operating expenditures already are allocated to cooling and maintenance, including staffing and equipment replacement.

This data center challenge has led to a range of approaches to cooling. While no one can claim the problem is solved, there are refinements to older approaches, along with a few novel ideas that give some hope of better balancing demands.

“We’ve come a long way in data center cooling,” observed Rita Horner, senior technical marketing manager at Synopsys. “We used to just be blowing air, which meant conditioning the entire room or the building structure. We were, in effect, throwing money at the problem randomly. As time went on, we realized we were wasting a lot of money and energy, and became smarter in identifying where the issues are.”

Data center layout
On a basic level, there have been changes to data center floor plans. The contemporary approach starts with a well-designed layout in which power consumption is budgeted by floor area, usually calculated per standard 2’x2’ tile. Because servers are the most power-hungry devices in a data center, the server racks are deliberately arranged in “hot” and “cold” aisles, with intakes facing intakes across the cold aisles and exhausts facing exhausts across the hot aisles.

Racks in the same row all face the same direction, with the hot air blown out the back. “In addition, these data centers are usually on an elevated surface, where cold air is blown from the bottom to take advantage of heat rising, so you only are flowing the air in one place and effectively flowing the cold air on top of the heated surfaces,” Horner said. “Then, the hot air on the back is sucked up.”

One variation of this floor plan uses containment aisles, in which the hot or cold aisles are enclosed so the two air streams cannot mix.
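
To make the per-tile budgeting concrete, the back-of-envelope sketch below estimates how much heat each 2’x2’ floor tile must handle and roughly how much cold air it must deliver. The rack power, tiles per rack, and temperature rise are assumed values for illustration, not figures from any specific facility.

```python
# Back-of-envelope sizing for a hot/cold aisle layout (illustrative only).
# All numbers below are assumptions, not figures from the article.

RACK_POWER_W = 10_000   # assumed IT load per rack
TILES_PER_RACK = 2      # assumed 2'x2' perforated tiles feeding each rack
DELTA_T_F = 20.0        # assumed cold-aisle to hot-aisle temperature rise, in °F

# Heat load each floor tile's cold air must absorb
watts_per_tile = RACK_POWER_W / TILES_PER_RACK

# Standard HVAC rule of thumb: CFM = BTU/hr / (1.08 * deltaT_F)
btu_per_hr_per_tile = watts_per_tile * 3.412
cfm_per_tile = btu_per_hr_per_tile / (1.08 * DELTA_T_F)

print(f"Heat load per tile: {watts_per_tile:.0f} W")
print(f"Cold air required per tile: {cfm_per_tile:.0f} CFM")
```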

Alternatives to air
Cooling issues are exacerbated by the increasing demands of ML/AI, as well as the growing power of next-generation hardware. Top-end accelerators derived from GPUs can draw as much as 500W to 600W each at peak, according to the Uptime Institute [2]. But the problem isn’t just at the extremes. The Institute also showed that the thermal power rating of mainstream server processors more than doubled in 10 years. Using AWS as an example, the top-end figure jumped from 120W in 2010 to 280W in 2020. Some next-generation mainstream server processors will move into the 350W to 400W range by 2023, which means some high-volume server configurations will approach 1kW of power at full load.
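
A rough power budget shows how quickly a mainstream configuration can approach that 1kW figure. The sketch below adds up assumed component counts and wattages for a hypothetical dual-socket server; the specific numbers are illustrative, not taken from the Uptime Institute report.

```python
# Rough full-load power budget for a hypothetical dual-socket server.
# Component counts and wattages are assumptions for illustration only.

components = {
    "CPUs (2 x 400 W)":        2 * 400,
    "DIMMs (16 x 5 W)":        16 * 5,
    "NVMe drives (4 x 10 W)":  4 * 10,
    "NICs and fans (est.)":    60,
}

it_load_w = sum(components.values())
# Assume roughly 90% power-supply efficiency at full load
wall_power_w = it_load_w / 0.9

for name, watts in components.items():
    print(f"{name:26s} {watts:5d} W")
print(f"{'IT load':26s} {it_load_w:5d} W")
print(f"Estimated wall power: ~{wall_power_w:.0f} W")
```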

This led the Institute to note that the high thermal power and lower temperature limits of next-generation server processors will challenge the practicality of air cooling and frustrate efficiency and sustainability drives, with thermal power levels fast approaching the practical limits of air cooling in servers.

Liquid cooling
That’s hardly the end of effective cooling inside data centers, however. Until the advent of the minicomputer in the mid-1960s, and the PC in the 1980s, nearly all computers were liquid-cooled. In 2005, IBM once again began offering liquid cooling as an option for its increasingly thin blade servers. As compute density and clock frequencies climbed, temperatures rose to the point where some servers needed to be throttled, particularly those at the top of the server racks. Since then, liquid cooling has been slowly making its way back into high-performance computers, starting with gaming PCs, and for good reason.

“The thermal conductivity of water is about 25 times greater than that of air,” said Steven Woo, fellow and distinguished inventor at Rambus. “And if the volumetric flow rate is high enough, the heat carrying capacity of water can be orders of magnitude higher than air.”
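
A quick calculation using standard room-temperature fluid properties illustrates the point: per unit volume of coolant and for the same temperature rise, water absorbs thousands of times more heat than air. The 10°C rise assumed below is arbitrary.

```python
# Compare how much heat water and air can carry per unit volume of coolant.
# Property values are standard room-temperature approximations.

fluids = {
    #         density (kg/m^3), specific heat (J/kg*K)
    "air":   (1.2,   1005.0),
    "water": (997.0, 4186.0),
}

DELTA_T_K = 10.0  # assumed coolant temperature rise through the server

for name, (rho, cp) in fluids.items():
    # Heat absorbed per cubic meter of coolant: Q = rho * cp * deltaT
    q_per_m3 = rho * cp * DELTA_T_K
    print(f"{name:5s}: {q_per_m3 / 1000:10.1f} kJ per m^3 of coolant")

ratio = (997.0 * 4186.0) / (1.2 * 1005.0)
print(f"Water carries roughly {ratio:.0f}x more heat per unit volume than air")
```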

Physics also helps with efficient designs. “Normally, with air cooling, there’s a minimum gap you need between modules, which is required to blow air through,” Woo said. “By using liquid in pipes, you could space the modules closer than you would normally be able to do with air, which increases the compute density of the whole system. One benefit of water liquid cooling is that now I can do more compute per cubic foot.”

Given that, water-cooling — and even misting — have joined air-cooling in the list of data center cooling options.

Injecting mist into the data center can be much less costly than blowing cold air. And as anyone who’s ever sat on an Arizona patio knows, it’s highly efficient. With or without misting, humidity in the data center always needs to be monitored. Too little and there’s a risk of electrostatic discharge. Too much and there’s a risk of corrosion.

There also has been some smart rethinking of how cold a data center needs to be. Rather than cooling to the point that the humans in the data center were miserable, some data center operators realized they could save costs by using warmer air (or water) to maintain optimal temperatures in their equipment.

In fact, the Class A2 equipment thermal guidelines from ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers) allow for operating temperatures up to 35°C (95°F) as standard, and even higher under specific circumstances. However, a new H1 (high density) class recommends a lower air supply temperature band of 18°C (64.4°F) to 22°C (71.6°F), as reported by the Uptime Institute.
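
For operators translating those guidelines into monitoring rules, a simple threshold check is often the starting point. The sketch below encodes the bands mentioned above; the A2 lower bound shown is an assumption, and a real deployment would follow the full ASHRAE tables rather than this simplified check.

```python
# Illustrative check of supply-air temperature against the bands mentioned above.
# The A2 lower bound is an assumption; the other limits are taken from the text.

ASHRAE_BANDS_C = {
    "A2 allowable":   (10.0, 35.0),  # lower bound assumed; 35°C upper bound per text
    "H1 recommended": (18.0, 22.0),  # per text
}

def check_supply_temp(temp_c: float) -> None:
    """Report whether a supply-air temperature falls inside each band."""
    for band, (low, high) in ASHRAE_BANDS_C.items():
        status = "within" if low <= temp_c <= high else "outside"
        print(f"{temp_c:.1f}°C is {status} the {band} band ({low}-{high}°C)")

check_supply_temp(24.0)
```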

A multiphysics simulation can help model whether a water-based or air-based cooling system would be most effective.

“Computational fluid dynamics applies to any fluid,” noted Marc Swinnen, director of product marketing at Ansys. “So the same equations should apply to both air and water, because in principle they’re both fluids, although the heat capacities are different due to differences in viscosity and other characteristics.”

These results should provide additional data points for comparing the total cost of ownership of a water-based system to an air-based system.
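
As a simplified illustration of how such a comparison might be framed, the sketch below combines assumed capital costs with assumed power usage effectiveness (PUE) values for an air-cooled and a liquid-cooled build. Every figure is a placeholder; a real comparison would draw on simulation results and vendor quotes.

```python
# Toy total-cost-of-ownership comparison for air vs. liquid cooling.
# Every number below is a placeholder assumption, not vendor or survey data.

YEARS = 5
IT_LOAD_KW = 500
ELECTRICITY_USD_PER_KWH = 0.12
HOURS_PER_YEAR = 8760

options = {
    #          capex (USD), assumed PUE
    "air":    (1_500_000, 1.6),
    "liquid": (2_200_000, 1.2),
}

for name, (capex, pue) in options.items():
    facility_kw = IT_LOAD_KW * pue  # total facility power, including cooling overhead
    energy_cost = facility_kw * HOURS_PER_YEAR * ELECTRICITY_USD_PER_KWH * YEARS
    print(f"{name:6s}: capex ${capex:,.0f} + {YEARS}-yr energy ${energy_cost:,.0f}"
          f" = ${capex + energy_cost:,.0f}")
```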

Environmental alternatives
One noted downside is that the energy to keep systems at optimal temperatures can undo even the best environmental and CapEx goals. “You can rent a rack in a data center for $2,000 a month, but have a $1,000 electricity bill, even if it’s only 2’x2’,” Horner noted.

Environmentally friendly cooling approaches range from building data centers underground, or at least keeping them in basements, to locating them in colder climates. With the use of solar panels, even the desert can be a reasonable location for a data center, as the panels absorb energy during clear days. And at night, without water vapor to trap heat, the temperature can drop substantially, enough to provide outside cooling. In fact, according to NASA, while daytime desert temperatures can average 38°C (100°F), nighttime temperatures can drop to -4°C (25°F).

One of the most basic considerations for data center location may be what local utility charges are in the planned location, and whether it may be particularly susceptible to the effects of climate change.

Enabled by recent photonics advances in both speed and carrying distance, it’s also possible to shift workloads between data centers with little detectable latency. In what is sometimes called the “follow-the-sun/follow-the-moon” approach, workloads are handed off between data centers in different time zones as night falls in each location, enhancing natural cooling.
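
In its simplest form, such a scheduler just prefers whichever site is deepest into its local night. The minimal sketch below picks the data center whose local time is closest to 2 a.m.; the site names and UTC offsets are hypothetical, and a production scheduler would also weigh capacity, latency, and energy prices.

```python
from datetime import datetime, timedelta, timezone

# Minimal sketch of "follow-the-moon" placement: prefer the data center that is
# deepest into its local night, when outside-air cooling is cheapest.
# Site names and UTC offsets are hypothetical.

SITES = {
    "us-west":  -8,   # UTC offset, in hours
    "eu-north":  1,
    "ap-east":   8,
}

def pick_night_site(now_utc: datetime) -> str:
    """Choose the site whose local time is closest to roughly 2 a.m."""
    def hours_from_2am(offset_h: int) -> float:
        local_hour = (now_utc + timedelta(hours=offset_h)).hour
        diff = abs(local_hour - 2)
        return min(diff, 24 - diff)   # wrap around midnight
    return min(SITES, key=lambda site: hours_from_2am(SITES[site]))

print(pick_night_site(datetime.now(timezone.utc)))
```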

The boldest geographic solution is the ocean, as Microsoft demonstrated two years ago, when the company hauled its underwater data center experiment out of the water off Scotland’s Orkney Islands. Its Project Natick not only showed cooling advantages, but allowed for extreme internal environmental control, without risk of typical problems like contamination, humidity, or even just being jostled by human workers. “It gives a near constant boundary. The temperature outside of the container stayed fairly constant because the ocean has so much water that its heat capacity is enormous,” said Woo.

It’s a fair assumption that underwater data centers are likely to become a regulatory nightmare. Nevertheless, Microsoft’s experiment still may point the way to lasting changes on dry land. Its successful execution required designing a system that could run maintenance-free for almost five years.

At the time of Natick’s emergence from the depths, Norman Whitaker, Microsoft’s managing director for special projects, told the New York Times [3] that with such low maintenance requirements, it could be possible to strip data centers of parts that exist only for ease of human use. If such a design could be commoditized, the cost of initial and replacement hardware, as well as labor, could all be reduced. It might be possible to place streamlined servers in optimally cool environments and simply let them run without physical intervention for years.

Microsoft doesn’t have a data capsule in the water now. “We will continue to use Project Natick as a research platform to explore, test, and validate new concepts around data center reliability and sustainability, for example with liquid immersion,” said a company spokesperson.

Finally, putting a new spin on the concept of “the cloud,” startup Lonestar has raised $5 million for lunar data centers. Its first launch is scheduled for June 2023.

Immersion cooling
There are other options, as well. Immersion cooling has been used, or at least discussed, for decades. The basic concept is that the computing hardware is submerged in a non-conductive dielectric liquid that carries away the heat, and the approach is again looking like a leading contender to solve contemporary thermal problems. Back in the 1980s, it seemed as if immersion cooling would migrate from supercomputing sites to corporate data centers, but the formerly advanced idea became passé when CMOS transistors became popular and helped contain thermal budgets.

That turned out to be merely a temporary truce with physics. In today’s world of finFETs, advanced packages, and data measured in zettabytes, immersion cooling has regained popularity, with an entire subsection of the industry devoted to providing solutions. According to Technavio, the global data center liquid immersion cooling market size is estimated to grow by $537.54 million from 2023 to 2027.

At the exascale, there’s even a hybrid of liquid and foam, which HPE Cray uses in its Shasta supercomputer. Woo got a look at the interior of a Shasta at the ACM/IEEE Supercomputing Conference in 2018. What was most notable, he said, was the combination of liquid cooling and foam placed near memory.

“As you look at the blades, you can see tubes that are parallel to each other,” said Woo. “Those are actually memory modules that they’re running between. The tube is wrapped with a compliant pink foam that touches the DIMM modules, transferring the heat into the tubes where liquid is flowing. It’s all one continuous circuit of liquid that’s going through everything to wick the heat away.”

Fig. 1: Cray (now HPE Cray) Shasta computer blade with cooling system. Source: Steve Woo

Different approaches
In addition to trying to anticipate problems through modeling, Horner suggested that AI can pre-empt problems in the data center. “By adding more and more intelligence where you cool it, you can reduce your cooling costs and be more focused in where you need it. A lot of high-end data centers are using AI for monitoring where they need the cooling and when they need it. By identifying the device, the equipment, and the time that the cooling is needed, they can channel the cooling to that location instead of cooling the entire system or entire building or entire rack or the room.”
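
A production system would rely on machine-learning forecasts, but the underlying idea can be shown with a toy rule-based allocator that directs a fixed cooling budget toward the hottest racks. The rack names, readings, target temperature, and budget below are hypothetical.

```python
# Simplified sketch of the idea Horner describes: direct cooling to the hottest
# equipment rather than the whole room. A production system would use ML
# forecasts; this toy version just ranks live sensor readings.
# Rack names, readings, and thresholds are hypothetical.

rack_temps_c = {
    "rack-A01": 27.5,
    "rack-A02": 34.8,
    "rack-B01": 31.2,
    "rack-B02": 24.9,
}

TARGET_C = 30.0          # desired inlet temperature
COOLING_BUDGET_KW = 40   # total cooling capacity to distribute

# Allocate the budget in proportion to how far each rack exceeds the target
overshoot = {rack: max(0.0, t - TARGET_C) for rack, t in rack_temps_c.items()}
total_overshoot = sum(overshoot.values()) or 1.0

for rack, over in sorted(overshoot.items(), key=lambda kv: -kv[1]):
    share_kw = COOLING_BUDGET_KW * over / total_overshoot
    print(f"{rack}: {rack_temps_c[rack]:.1f}°C -> allocate {share_kw:.1f} kW of cooling")
```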

There’s general agreement that thermal problems won’t be fixed merely by novel cooling systems. Answers must start back at design. “There’s only two ways to make this work,” said Mark Seymour, distinguished engineer at Cadence. “One is standardization of materials and components. The other way is for people to be more cognizant of what they’re integrating with when they design a system.”

It’s a problem of scale and interacting domains. On one side, chip and package designers are concentrating on their set of thermal issues. On the other, data center designers are concentrating on theirs. As a result, because no one’s thought about the big picture, an individual server heating up could kick off a cascade within a data center.

“We’ve got a coupling between two systems that are independently developed, that nobody’s actually got control over,” said Seymour. “When they want to get rid of heat, designers need to start thinking about whether the mechanisms they are designing will be compatible with the mechanisms that are likely to be there in the data center. When somebody designs that data center, they’ve got to be asking themselves what sorts of things they are going to be housing, and what they are expecting.”

Conclusion
Reducing energy usage is critical. There is a limit to how much energy can be generated, and data centers already consume at least several percent of that total, with no end in sight for how much computing will need to be done in the future.

“We’re really doing this because we’re trying to make a better planet, so it’s not an option,” said Seymour. “We have to get better at this, especially given the growth that’s going to occur. We have to work out how to not make it a dramatic problem for the world. We have to make that step. The challenge for designers is how to make that step. Is it greener to just switch to new data centers that can be equipped with the latest cooling systems? Or is it greener to run the building stock for as long as possible because there’s embodied carbon in new buildings? These are the sorts of questions that we as an industry have to grapple with.”

Others agree. “We’ve depleted all the low hanging fruit in terms of reducing power on a chip level, and are getting to the point that shrinking technologies are not really helping us to improve power or improve performance,” said Horner. “This is why it’s challenging to design in a smaller technology now, and why it’s harder to get the power gains out of this smaller technology node. Still, we need to do something about it because we cannot afford for these data centers or our typical compute to increase in power consumption. We now are facing the reality that everything counts.”

References

    1. Statista. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025. https://www.statista.com/statistics/871513/worldwide-data-created/
    2. Bizo, D. Silicon heatwave: the looming change in data center climates. UI Intelligence Report 74. Q3 2022. Uptime Institute Intelligence. https://uptimeinstitute.com/uptime_assets/4cf0d2135dc460d5e9d22f028f7236f7b5c3dd2f75672c3d2b8dfd4df3a3eea6-silicon-heatwave-the-looming-change-in-data-center-climates.pdf
    3. Markoff, J. Microsoft Plumbs Ocean’s Depths to Test Underwater Data Center. New York Times. 1/31/2016

