Data Center Thermal Management Improves

CFD, multiphysics, and digital twins play increasing role in addressing heat within and between server racks.

Thermal issues are plaguing semiconductor design at every level, from chips developed with single-digit nanometer processes to 100,000-square-foot data centers. The underlying cause is too many devices or services that require increasing amounts of power, and too few opportunities for the resulting heat to dissipate.

“Everybody wants to try to do more in a small volume of space,” said Steven Woo, fellow and distinguished inventor at Rambus. “Take a server, for example. They’re so much more capable today than they were 10 years ago. The issue is that power hasn’t scaled like it used to in the past, so now if you want to do lots more work in your server, you have to burn more power to do it. Twenty years ago, a server might dissipate a couple hundred watts. These days, for the latest ones NVIDIA just announced around Grace Blackwell, the whole rack is 120 kW, and the individual servers are many kilowatts.”

And Blackwell is just the beginning of this trend. “With things like Blackwell, you’re going to see a series of what we call ‘super chips,’” said Shankar Krishnamoorthy, general manager of Synopsys’ EDA Group. “These super chips are going to quickly get to 1 trillion transistors by 2030. Blackwell is 200 billion. Intel announced Clearwater Forest, which is 300 billion. You’ll see that number quickly get up to a trillion. The reason that’s possible is the ability to keep integrating multiple dies, either side-by-side or stacked, along with memory, AI accelerators, compute arrays, all together in a single package. That process of bringing so many dies together is going to create a whole bunch of electrical issues, power issues, thermal issues, stress issues, mechanical issues, and all these things depend on each other. If there’s a section of your package that’s giving off too much heat, the timing properties of that section will be impacted. If there is stress or warpage in another part of your package, that directly impacts the extraction or the signal integrity.”

All of this will have a big impact on data center design, which has come a long way from the days when a few desks were pulled out of an office space and replaced by some servers. “Like chip design itself, data centers now must rely on sophisticated computer-aided multi-physics and digital twins for thermally optimal layouts that stay within ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) guidelines and don’t risk overheating components, such as DRAMs, which can fail under too much heat,” said Mark Fenton, senior product engineering manager at Cadence.

In order to keep thermal conditions under control, operators traditionally over-provision the cooling by putting excess cold air into the data center. “A decade ago, people cared about risk more than anything else, with energy being a side issue,” said Fenton. “You didn’t want any compute to go down. Operators would say, ‘Let’s just make it freezing cold and it will be fine.’ All they cared about was making sure everything stayed online.”

That approach has been made obsolete by the skyrocketing cost of energy, which has created both financial and legislative pressure to abandon over-provisioning. “Previously, the data center industry had been pretty much unregulated,” said Fenton. “There were a few metrics like PUE, but no one had to report on energy usage, or water and carbon emissions. That’s now starting to happen, with things like the European energy directive coming out of the E.U., as well as upcoming U.S. legislation, particularly in California, which is leading the way with trying to cap the energy used by the data center industry.”

The rapid advancement of AI also is disrupting the industry, pushing the limits of current processing technologies and significantly increasing power consumption. Data centers are now the fastest-growing industry in terms of CO2 footprint. To address this challenge, simulation, particularly computational fluid dynamics (CFD), is becoming an indispensable tool for optimizing cooling power consumption. Cooling currently accounts for 40% of total power consumption, according to Antonio Caruso, electronics portfolio manager for Simcenter at Siemens.

“By using CFD early in the design process, engineers can avoid over-engineering and design more efficient cooling systems, such as liquid cooling, which are required to support the increased compute performances,” said Caruso. “The integration of 1D and 3D simulation, enabled by technologies like reduced-order modeling, can significantly speed up the verification and validation (V&V) chain. By leveraging these simulation tools and techniques, engineers can design and optimize data center systems that meet the demands of a rapidly changing industry while minimizing environmental impact.”
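A back-of-the-envelope calculation suggests why liquid cooling comes up for racks of this density. The sketch below applies the standard sensible-heat relation (heat removed equals density times specific heat times volumetric flow times temperature rise) to the 120 kW rack figure quoted by Woo. The fluid properties are approximate room-temperature values, and the temperature rises of 15 K for air and 10 K for water are assumptions chosen only for illustration. It is no substitute for CFD.

```python
def coolant_flow_m3_per_s(heat_w, density_kg_m3, cp_j_per_kg_k, delta_t_k):
    """Volumetric flow needed to carry heat_w away: V = P / (rho * cp * dT)."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_w / (density_kg_m3 * cp_j_per_kg_k * delta_t_k)


RACK_HEAT_W = 120_000  # 120 kW rack figure quoted in the article

# Approximate fluid properties near room temperature.
# The temperature rises (15 K for air, 10 K for water) are assumed values.
air = coolant_flow_m3_per_s(RACK_HEAT_W, 1.2, 1005.0, 15.0)
water = coolant_flow_m3_per_s(RACK_HEAT_W, 997.0, 4186.0, 10.0)

print(f"air:   {air:.2f} m^3/s")
print(f"water: {water * 1000:.2f} L/s")
```

With these assumed numbers, a single rack needs several cubic meters of air per second, but only a few liters of water per second, to carry away the same heat.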


Fig. 1: HVAC current flow in the data center. Source: Siemens 

Multiphysics’ role in data center cooling
To preemptively understand where heat is likely to arise, and reduce the need for expensive cooling solutions, data center designers now use multiphysics modeling to understand thermal airflows. This approach to thermal dynamics in the data center is much the same as, but also crucially different from, the multiphysics used in chip design, explained Marc Swinnen, director of product marketing at Ansys.


Fig. 2: Thermal mapping of heat and mechanical stress in 2.5D and 3D-ICs. Source: Ansys

“One of the underappreciated aspects of multi-physics is the multi-scale of it,” said Swinnen. “It isn’t just the same thing, but bigger. There are emergent effects. The whole approach involves system tools. The questions you ask the data, the answers you want, all those change as you go up through the scale levels. It’s like six orders of magnitude difference between the board and chip level. Now you want to scale that up to an entire building and simulate the entire thing with huge air conditioners. In principle, it’s the same physics, but it isn’t really when you scale it up to that level.”

The data center industry, where rooms are built with the assumption that they’ll be in place for a decade or more, is inherently risk-averse. For example, although liquid cooling techniques have been used for decades in HPC installations, many data center administrators still worry about leaks and are unwilling to take on the potential risks of mixing electronics with liquids. That makes multi-physics simulations, as well as digital twins, especially appealing, because they allow operators to safely test how effective proposed cooling solutions would be and to run dozens of trials of proposed layouts.


Fig. 3: Digital twin model depicting airflow between server racks. Source: Cadence

“It’s always this perfect storm where they’re having to balance this very high-powered equipment that’s being placed in rooms that weren’t designed for it,” said Fenton. “It’s a very difficult challenge, which is why simulation is so well suited to be able to test these implementations ahead of time. Then they can proactively see, if it doesn’t work, what can be done from an engineering standpoint to make sure that it can. That’s where the digital twin aspect comes in because the digital twin has models of all the different elements, giving you a virtual area that’s risk free. For example, you can push the temperatures up and see what you’re saving on cooling, without putting critical infrastructure at risk.”

Lowering heat at the chip level
On the other side of the equation, design engineering teams also are looking at ways to keep data centers cooler by lowering temperatures in the components that go into them as they try to navigate the increasing thermal complexities of designing at advanced nodes.

“Usually on a single chip, especially a digital chip, you don’t have to do much because the design rules have taken heat into consideration,” said John Ferguson, product management director for Calibre nmDRC applications at Siemens EDA. “You’ve got safeguards that prevent anything from getting too hot and everything is right on the silicon, so you’ve got a path to let the heat out across the substrate itself. When you get into multi-die in a package, things get more complicated, because now your power supply chain has gotten larger. That means it’s more resistive. You’re generating more heat. You also may have more than one chip, possibly stacked in three dimensions. It may not be just wires coupled together and getting hotter, but active chips too close to each other that are getting too hot and heating each other up. So you get a coupling effect.”

Backside power delivery is likely to help with this because it reduces the distance the current has to travel to bring the power. “However, that backside power is only for the first chiplet,” Ferguson noted. “If you’ve got something stacked on top of it, you need to connect to that somehow. If you use something like a TSV, that can have other issues, such as additional stress. You want to be careful how high you stack, and consider staggering the structure a bit so that everything doesn’t crumble around the edges.”

All of these complexities underscore the need to shift left, wherever possible, and work with digital twins to reduce thermal issues and other potential complications.

“You can design your package and the placements of the chips, and put everything together to get a very accurate view of the thermal problems and where they are,” Ferguson said. “But if you go down that path and you find an error, it’s too late. You’ve just spent the last two years designing this device, and now you’ve discovered you have a problem. You have to have some predictive ability early on in the design flow to help you. You’re going forward to find potential problems as soon as you can in the process so that you have time to go back and make corrective decisions.”

Other thermal reduction techniques
In addition to the incorporation of digital twins, Ferguson said some thermal reduction is also possible by switching from electronics to photonics materials. “Many of the data centers are moving to photonic interconnects,” Ferguson said. “The big benefit of photonics as a carrier is that light doesn’t exhibit the same resistive issues that electrons in a wire do, so you don’t get a lot of heat generated. The downside is that light is very reactive to both temperature changes and mechanical stresses, even more than electricity. The same kinds of analysis that can be done on an IC or package can largely be done on the photonic side as well. Adding this kind of analysis to the elements in data cabling can help ensure that the data center at large is going to be operating successfully.”

For the photonics industry, the need to reduce heat in the data center is currently the killer app. “New optical component technologies are emerging to improve the power consumption of optical interconnects in data centers, particularly in back-end networking for AI training clusters (scale-out), which use a higher ratio of optical content than front-end networks in data centers due to the high radix requirements of AI networking,” said Manish Mehta, vice president of marketing and operations in Broadcom’s Optical Systems Division. “For example, a 51.2T Ethernet switch that facilitates connectivity between many GPU servers will consume 1kW of power using optical transceivers available today. Next generation DSP technology and linear pluggable optics (LPO) will reduce the power consumption into the 600W-750W range.”
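The per-switch saving implied by Mehta’s figures is simple arithmetic, sketched below. Only the 1 kW baseline and the 600W-750W range come from the quote. The helper name is hypothetical, and nothing here models a real switch.

```python
def power_saving(baseline_w, new_w):
    """Return (watts saved, fraction saved) relative to the baseline."""
    saved = baseline_w - new_w
    return saved, saved / baseline_w


BASELINE_W = 1000.0                # 1 kW with optical transceivers available today
NEXT_GEN_RANGE_W = (600.0, 750.0)  # quoted range for next-generation DSP and LPO

for new_w in NEXT_GEN_RANGE_W:
    saved_w, fraction = power_saving(BASELINE_W, new_w)
    print(f"{new_w:.0f} W -> saves {saved_w:.0f} W ({fraction:.0%})")
```

On those numbers, the optics for each 51.2T switch would draw 25% to 40% less power, and every watt not consumed is also a watt of heat that does not have to be removed from the room.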

Tying these choices together with digital twins allows for a world in which thermally optimal decisions can be made in a way that complements the enormous sophistication of leading-edge data centers.


Fig. 4: Cadence Reality DC Digital Twin model depicting airflow patterns. Source: Cadence

“You have to model your data center as a system,” Ansys’ Swinnen said. “You have to gather all the heat flows, all the geometries in heat production, including that a chip’s thermal conductivity is anisotropic. These elements don’t conduct heat in the x and y direction the same way. There’s a very strong directional component. You can simulate all of that with CFD software.”
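The directional behavior Swinnen describes can be illustrated with Fourier’s law, in which heat flux equals the negative of conductivity times the temperature gradient, applied with a separate conductivity per axis. This is a minimal sketch, not a CFD model. The conductivity values are hypothetical, chosen only to show a strong directional difference, and the function name is an assumption.

```python
def heat_flux(conductivity_w_mk, temp_gradient_k_m):
    """Fourier's law for a diagonal conductivity tensor: q_i = -k_i * dT/dx_i."""
    return tuple(-k * g for k, g in zip(conductivity_w_mk, temp_gradient_k_m))


# Hypothetical conductivities along (x, y, z) in W/(m*K), for illustration only.
K_STACK = (120.0, 80.0, 2.0)
# The same temperature gradient applied along every axis, in K/m.
GRADIENT = (1000.0, 1000.0, 1000.0)

qx, qy, qz = heat_flux(K_STACK, GRADIENT)
print(f"qx={qx:.0f} W/m^2, qy={qy:.0f} W/m^2, qz={qz:.0f} W/m^2")
```

Even with an identical gradient in every direction, the flux differs by the ratio of the conductivities, which is why a single isotropic conductivity value can misplace hot spots.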

Conclusion
In the end, the best argument for simulation is reality. “The biggest challenge for most data centers is that the data center consultants will produce a perfect design, where all of the compute is sitting in a good environment, the airflow is well balanced, and there are no thermal problems,” said Cadence’s Fenton. “That design is then handed over to the owner-operator. On day one, as soon as they are given the keys, they will break the design almost immediately by putting in some new hardware they hadn’t thought of when originally spec’ing the center, or buying compute that breathes in a slightly different way, or suddenly liquid cooling comes along and they want to try some AI compute that needs liquid cooling. Immediately, that ‘perfect’ design has been broken. That leads to the thermal issues because they start to use the room in a way that it wasn’t designed for. Simulation is the only real way that people can proactively manage that, rather than putting something in, turning it on, and hoping it’s okay.”



