Liquid Cooling: Meeting The Demands Of AI Data Centers

Adopting liquid cooling technology could significantly reduce electricity costs across the data center.


Many Porsche “purists” reflect forlornly on the 996, the fifth-generation version of the iconic 911 sports car introduced in 1997. It was the first 911 built around a water-cooled engine; the model had relied on air-cooled engines since it entered the market in 1964, as had its popular predecessor, the 356. For over three decades, Porsche’s flagship 911 was an air-cooled car. The two main reasons usually given for the shift from air cooling to water cooling were 1) environmental (emission standards) and 2) performance (in part, cylinder head cooling). The writing was on the wall: if Porsche was going to remain competitive in the sports car market and in racing, the move to water-cooled engines was unavoidable.

Fast forward to today’s data centers trying to meet the demands of AI computing. For similar reasons, we’re seeing a shift toward liquid cooling. Machines relying on something other than air for cooling date back at least to the Cray-1 supercomputer, which used a Freon-based system, and the Cray-2, which immersed its boards in Fluorinert, a non-conductive liquid. The Cray-1 was rated at about 115kW and the Cray-2 at 195kW, both a far cry from the tens of megawatts drawn by today’s most powerful supercomputers. Another distinction is that these were “supercomputers,” not just data center servers. Data centers have largely run on air-cooled processors, but with the incredible demand for computing created by the explosive growth of AI applications, data centers are being called on to provide supercomputing-like capabilities.

At this year’s Hot Chips 2024 conference, Tom Garvens, VP Hardware Solutions at Supermicro, gave a presentation, “Thermal Techniques for Data Center Compute Density.” Figure 1 below shows the huge increase in thermal design power (TDP) ratings for GPUs.

Fig. 1: Data center power challenges.

The slide focuses on GPUs, given their rise in AI applications. Note that before 2015, <200W was the standard, and this was largely true for CPUs as well. As the slide points out, CPUs are now running at up to 500W, while GPUs are at 1kW and climbing. The same holds for other processors, such as TPUs. In a recent conversation with a Google engineer, I mentioned 1kW chips, and she replied that they were already building 1.2kW implementations. Roman Kaplan, principal AI performance architect at Intel, mentioned in his Gaudi 3 AI accelerator presentation that Gaudi 3 is rated at 900W (air-cooled) and 1,200W (liquid-cooled). He also mentioned that customer demand was currently higher for the air-cooled version.

The slide in figure 1 also shows <15kW/rack as a common budget. With 200W CPUs, you can fit a fair number of servers in such a rack, though remember that plenty of components besides the CPU also draw power. Start placing 1kW GPUs into that rack, however, and the budget is exhausted after only a handful of them, leaving most of the rack as empty (wasted) space. That is why 120kW racks are now available in some data centers, which is roughly the same power as the entire original Cray-1 supercomputer.
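To make the density arithmetic concrete, here is a minimal sketch with assumed, illustrative numbers (the ~10kW figure for an 8-GPU server is my assumption, not from the presentation) showing how quickly a legacy 15kW rack runs out of power compared with a 120kW rack:

    def servers_per_rack(rack_budget_kw, server_kw):
        """How many servers fit within a rack's power budget (power-limited, not space-limited)."""
        return int(rack_budget_kw // server_kw)

    # Assumed figure for illustration: an 8-GPU AI server with 1kW GPUs plus
    # CPUs, memory, NICs, and fans lands in the ~10kW range.
    SERVER_KW = 10

    for rack_kw in (15, 120):
        n = servers_per_rack(rack_kw, SERVER_KW)
        print(f"{rack_kw}kW rack -> {n} server(s), i.e. {n * 8} GPUs; "
              f"the rest of the rack is stranded capacity")

Under these assumptions, the 15kW rack holds a single AI server before hitting its power limit, while the 120kW rack holds a dozen.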

From a total cost of ownership standpoint, the power bill covers not only the power supplied to the IT equipment but also the power spent cooling the data center. Figure 2 below shows how data centers have been working to improve their power usage effectiveness (PUE).

Fig. 2: Data center power efficiency.

Whether a trend is sustainable is largely driven by economics. For the data centers shown here running at PUE > 1.5, there is still a good chunk of efficiency left on the table. Garvens even mentioned scenarios where the heat being pumped out of the data center could be captured to warm office space in cold weather or to run hydroponic agriculture and grow crops quickly. In theory, crediting that reused energy could push the effective figure below 1.0.
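For reference, PUE is total facility power divided by the power delivered to the IT equipment, so a PUE of 1.5 means that for every watt of compute you pay for another half watt of cooling and power-distribution overhead. A minimal sketch, with made-up numbers purely for illustration:

    def pue(total_facility_kw, it_equipment_kw):
        """Power usage effectiveness: total facility power / IT equipment power."""
        return total_facility_kw / it_equipment_kw

    # Illustrative numbers only: a 10MW IT load with 5MW of cooling and
    # distribution overhead gives PUE 1.5; trimming the overhead to 1MW gives 1.1.
    it_load_kw = 10_000
    for overhead_kw in (5_000, 1_000):
        total_kw = it_load_kw + overhead_kw
        print(f"overhead {overhead_kw / 1000:.0f}MW -> PUE = {pue(total_kw, it_load_kw):.2f}")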

Figures 3 and 4 below show the opportunities and benefits of moving to liquid-cooled data centers.

Fig. 3: Outcomes.

Fig. 4: DLC data center benefits.

There’s the potential for a significant reduction, on the order of 40%, in electricity costs across the data center. The accompanying reduction in carbon emissions is also better for the environment. And the increased density allows more compute capability in the same floor area while leaving headroom for future, higher-TDP solutions.
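As a rough illustration of where a number like 40% can come from (the inputs below are my assumptions, not figures from the slides): direct liquid cooling both lowers the facility overhead captured by PUE and removes much of the server fan power, which otherwise counts as IT load.

    def annual_cost_usd(it_kw, pue, price_per_kwh=0.10, hours_per_year=8760):
        """Annual electricity cost for a given IT load and PUE (flat tariff assumed)."""
        return it_kw * pue * hours_per_year * price_per_kwh

    # Assumed scenario: 10MW of air-cooled IT load at PUE 1.6, vs. the same work
    # done with DLC, where removing most server fans trims the IT load ~10%
    # and the facility PUE drops to ~1.1. All numbers are illustrative.
    air = annual_cost_usd(it_kw=10_000, pue=1.6)
    dlc = annual_cost_usd(it_kw=9_000, pue=1.1)
    print(f"air-cooled:    ${air / 1e6:.1f}M per year")
    print(f"liquid-cooled: ${dlc / 1e6:.1f}M per year")
    print(f"savings:       {100 * (1 - dlc / air):.0f}%")

Under these assumed inputs the sketch lands near the 40% mark, but real savings depend heavily on local climate, electricity prices, and how well the air-cooled baseline was already optimized.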

Fig. 5: Economic advantage of DLC vs air cooling.

Figure 5 above sums up the economic advantage of direct liquid cooling vs. air cooling. These numbers strongly support liquid cooling, especially for AI-targeted data centers. Much like our sports car example, the future of AI data centers is liquid-cooled.


