Balancing Power And Heat In Advanced Chip Designs

Both need to be dealt with at all stages of the design flow. Why it’s now everyone’s problem.


Power and heat used to be someone else’s problem. That’s no longer the case, and the issues are spreading as more designs migrate to more advanced process nodes and different types of advanced packaging.

There are a number of reasons for this shift. To begin with, wires, dielectrics, and substrates are all getting thinner. Narrower wires have higher resistance, so driving signals through them takes more energy and generates more heat. That heat, in turn, can have a significant impact on other components in a chip, because thinner substrates provide less insulation and less ability to dissipate it.

One of the most common ways to address heat is to reduce the operating voltage in various blocks. But that approach only goes so far, because below a certain voltage level, memories begin to lose data. In addition, tolerances become so tight that various types of noise become much more problematic. So heat needs to be addressed differently, and that now affects every step of the design process.

“Thermal effects matter a lot more than they used to, but none of these effects is actually new,” said Rob Aitken, distinguished architect at Synopsys. “Somebody, somewhere always had to worry about some of them. What’s different now is that everybody has to worry about all of them.”

Others agree. “There’s been a decade-long trend to push the power supply lower to reduce the chip power,” said Marc Swinnen, director of product marketing at Ansys. “But that means that now we have near ultra-low voltage and near-threshold operation, so what used to be minor issues are now major.”

And those issues are significantly more difficult to solve. In the past, power and thermal issues could be dealt with by adding margin into the design as a thermal buffer. But at advanced nodes, extra circuitry decreases performance and increases the distance that signals need to travel, which in turn generates more heat and requires more power.

“Next-level technology nodes, as well as advanced packaging technologies, come at a price of much higher power densities within die and package, which need to be considered right from the specification and design phase,” said Christoph Sohrmann, group manager virtual system development at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “This means that designers will have to create designs that are much closer to the critical temperature corners than ever.”

In 3D-ICs, thermal maps for various use cases need to be integrated with floor-planning and choices of different materials.

“3D-ICs have multiple die stacked on top of each other because you cannot grow your phone bigger than what it is,” said Melika Roshandell, product management director at Cadence. “But you can stack the die, and that brings a lot of thermal issues. Imagine you have multiple die on top of each other. Depending on how they are put in, different thermal problems can occur. For example, if one IP is put in one die and another IP is placed on the other die — let’s say the CPU is in die number one and the GPU is in die number two — then a CPU-intensive benchmark heats up the first die, while a GPU-intensive benchmark heats up the second. In the past, much of the time you would dice your chip into multiple parts, import each into the software, and then do the analysis. But for 3D-IC, you cannot do that, because thermal is a global problem. This means you have to have all the components — the 3D-IC, plus its package, plus its PCB — to do the thermal analysis. A lot of designers are facing this challenge right now.”

When the voltage roadmap stopped scaling
In 1974, IBM engineer Robert Dennard demonstrated how to use scaling relationships to reduce the size of MOSFETs. This was a significant observation at the time, when the smallest MOSFETs were about 5µm, but it failed to take into account what would happen at the nanometer scale when threshold voltages and leakage current would significantly change.

“There’s become this framing that Dennard didn’t consider threshold voltage, but it’s actually right there in the paper as a defined term,” said Aitken. “The key thing that broke Dennard scaling is when voltage stopped scaling. If you followed the theory, when you scaled the device down, you would also scale everything by 0.7, so if you followed the original math, our devices should be operating at about 20 millivolts. When voltage stopped scaling, then the nice cancellation property that produced the equal power piece out of the equation broke.”
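The arithmetic behind Aitken’s 20-millivolt figure is easy to check. The sketch below is illustrative only — the ~5 V starting supply and strict 0.7x-per-node factor are assumptions based on classic Dennard scaling, not figures from the article:

```python
# Illustrative arithmetic: if supply voltage had kept scaling by ~0.7x
# per node, as classic Dennard scaling prescribes, it would have
# collapsed to tens of millivolts after a dozen-plus node transitions.
SCALE = 0.7          # assumed per-node shrink factor for dimensions and voltage
v = 5.0              # assumed ~5 V supply in the 5µm era

nodes = 0
while v > 0.02:      # ~20 mV, the figure Aitken cites
    v *= SCALE
    nodes += 1

print(f"{nodes} scaling steps take 5 V down to {v * 1000:.1f} mV")
```

Roughly sixteen 0.7x steps — on the order of the node transitions since 1974 — are enough to land in the tens-of-millivolts range, which is why real supplies had to stop following the theoretical curve.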

To meet the challenge, the industry has worked on different approaches to manage power and keep it under control. Turning down or turning off portions of an SoC when they are not in use is one such method, first popularized by Arm. It is still widely used for powering down smartphone displays, one of the most power-intensive parts of a mobile device, when the phone detects no one is looking at it.

“At the chip level, dark silicon is one way to manage chip power,” said Steven Woo, fellow and distinguished inventor at Rambus. “Moore’s Law continued to give more transistors, but power limits mean that you can’t have them all on at the same time. So managing which parts of the silicon are on, and which parts are dark, keeps the chip within the acceptable power budget. Dynamic voltage and frequency scaling is another way to deal with increased power, by running circuits at different voltages and frequencies as needed by applications. Reducing voltage and frequency to the level that’s needed, instead of the maximum possible, helps reduce power so that other circuits can be used at the same time.”
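The payoff Woo describes comes from the classic switching-power approximation, P ≈ α·C·V²·f: because voltage enters squared, dropping voltage and frequency together saves power roughly cubically. A minimal sketch, with all constants (capacitance, activity factor, operating points) chosen purely for illustration:

```python
# Rough sketch of why DVFS saves power. All constants here are
# illustrative assumptions, not measurements from any real chip.
def dynamic_power(c_eff, v_dd, freq, activity=0.5):
    """Classic CMOS switching-power approximation: alpha * C * V^2 * f."""
    return activity * c_eff * v_dd**2 * freq

C_EFF = 1e-9                             # assumed effective switched capacitance, farads
full = dynamic_power(C_EFF, 1.0, 2e9)    # run flat out: 1.0 V at 2 GHz
slow = dynamic_power(C_EFF, 0.7, 1e9)    # scale back:   0.7 V at 1 GHz

print(f"full speed: {full:.2f} W, scaled back: {slow:.3f} W")
print(f"power saved: {1 - slow / full:.0%}")
```

Halving frequency alone would only halve power; lowering the voltage at the same time roughly quarters the remainder, which is the headroom that lets other circuits run concurrently within the same budget.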

Minimizing data movement is another approach. “Data movement is also a big consumer of power, and we’ve seen domain-specific architectures that are optimized for certain tasks like AI, that also optimize data movement to save power,” Woo said. “Only moving data when you have to, and then re-using it as much as possible, is one strategy to reduce overall power.”

In addition to design and architecture, there are traditional and advanced cooling approaches to manage heat. “There are more examples of systems that use liquid cooling, usually by piping a liquid over a heatsink, flowing the liquid away from the chip, and then radiating that heat someplace else. In the future, immersion cooling, where boards are immersed in inert liquids, may see broader adoption,” Woo explained.

Multi-core designs are another option for managing power and heat. The tradeoff with this approach is the need to manage software partitioning, as well as competing power needs on chips.

“As you divide tasks between different cores, different processing engines, you realize that not all the steps are homogenous,” said Julien Ryckaert, vice president, logic technologies at imec. “Some actually require very fast operation. The critical path in the algorithm needs to operate at the fastest speed, but most of the other tasks running in other cores are going to operate and stop because they’ve got to wait for that first CPU to finish its task. As a result, engineering teams have started to make their processing engines heterogeneous. That’s why today mobile phones have what we call the big.LITTLE architecture. They have a combination of high-speed cores and low-power cores. If you look at an Apple phone, it has four high-speed cores, one of which is dimensioned for operating at the highest supply voltage.”

Nearing threshold voltages
Dropping the voltage has been a useful tool for lowering power, as well. But like many techniques, that too is running out of steam.

“You wind up with ultra-low voltages that are barely scraping about half a volt,” said Ansys’ Swinnen. “At the same time, transistors have been switching faster. When you’ve combined very low voltage and very high speed switching, you can’t afford to lose anything if the transistor is going to switch.”

What makes the reliance on lowering voltage even more problematic is the increasing resistance caused by thinner metal layers and narrower wires. More power has to be squeezed through longer, thinner wires, so IR drop has become more acute at exactly the point where the design has less voltage headroom to absorb it.

“The traditional answer was: If the voltage drops, I just add more straps or make the wires wider,” Swinnen said. “But now, so much of the metal resource is dedicated to power distribution, every time more straps are added, that’s an area that cannot be used for routing.”

Even worse, the geometries of the wires themselves have also reached a limit. “You can only make a copper wire so thin before there’s no more copper,” said Joseph Davis, senior director of product management for Siemens EDA. “The copper wires are cladded, which keeps them from diffusing out into the silicon dioxide and turning your silicon into just a piece of sand, because transistors don’t work if they have copper ions in them. But you can only make that cladding so big, so the size of the wires is quite limited, and that’s why the RC time constants on advanced technologies only go up — because you’re limited by the back end.”
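Davis’ point about cladding can be made concrete with a back-of-the-envelope model. The barrier layer has a roughly fixed thickness, so as the drawn wire shrinks, the conducting copper cross-section shrinks even faster, and resistance per unit length (R = ρL/A) climbs superlinearly. The dimensions and the 2 nm barrier below are illustrative assumptions, not foundry data, and the model ignores the resistivity increase from surface and grain-boundary scattering that makes real narrow wires even worse:

```python
# Back-of-the-envelope wire-resistance model. All dimensions are
# illustrative assumptions, not foundry data.
RHO_CU = 1.7e-8      # bulk copper resistivity, ohm*m (ignores size effects)
BARRIER = 2e-9       # assumed ~2 nm cladding/barrier on each face

def resistance_per_um(width_m, height_m):
    """Resistance per micron of wire length, after subtracting cladding."""
    cu_w = width_m - 2 * BARRIER      # copper left after cladding, width
    cu_h = height_m - 2 * BARRIER     # copper left after cladding, height
    area = cu_w * cu_h
    return RHO_CU * 1e-6 / area       # ohms per micron

# Halving the drawn width more than quadruples resistance per micron,
# because the fixed barrier eats a growing fraction of the cross-section.
for w_nm in (40, 20, 12):
    r = resistance_per_um(w_nm * 1e-9, 2 * w_nm * 1e-9)
    print(f"{w_nm} nm wide wire: {r:.1f} ohm/µm")
```

Under these assumptions, shrinking the drawn width from 40 nm to 12 nm raises resistance per micron by well over an order of magnitude, which is the mechanism behind Davis’ observation that RC time constants on advanced nodes only go up.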

EDA developments underway
For all these reasons, there’s an increasing reliance on EDA tools and multi-physics simulations to manage the complexity.

Cadence’s Roshandell explained that one problem traditional solvers have with 3D-ICs is capacity. “There is a need for tools with the capacity to read and simulate all of the components in a 3D-IC without too much simplification. Once it’s simplified, it loses accuracy. And when you lose accuracy, you only realize later that there is a thermal problem in die number two. You simplified it to the bone to be able to do the analysis, so you cannot catch that thermal behavior. That is one of the biggest issues in EDA as we move to 3D-IC. That is the problem that everybody needs to address.”

The EDA industry is well aware of these needs, and it’s actively developing solutions to address what designers need for 3D-IC, which includes working closely with foundries. There is also a tremendous amount of development on the analysis front. “Before you tape out a chip, you want to have some sort of idea about what’s going to happen when your design goes into a particular environment,” Roshandell said.

All of this is not without cost, as these complexities become a tax on design time, Aitken said. “You can hide it to some extent, but not entirely. The tool now has to do this detailed thermal analysis that it didn’t have to do before, and that’s going to take some time. It means that either you’re going to have to throw more compute at it, or you’re going to have to spend more wall clock/calendar time to deal with this problem. Or you’re going to have fewer iterations. Something has to give. The tools are working to keep up with that, but you can only hide some of these things to a certain extent. Eventually, people will know it’s there. You won’t be able to hide it forever.”

In order to do this safely and to keep the design within the specification, new EDA tools for thermal are needed. “In terms of EDA tools, field-solver-based simulators may need to be replaced by more scalable ones, such as those based on Monte-Carlo algorithms,” said Fraunhofer’s Sohrmann. “Such algorithms are scalable and easy to run in parallel. Moreover, thermal sensitivity analysis might find its way into thermal simulation, giving the designer feedback on the reliability of the simulation results. The same goes for model-order reduction (MOR) techniques that will be needed in the future.”

Still, there may be a potential solution in silicon photonics. “IBM’s done it, and there have been chips where they place mirrors on top of the packages and they do some really fancy stuff,” said Siemens’ Davis. “But it’s where 3D stacking was 10 years ago, which means even silicon photonics may run into limits.”

Optical signals are fast and low power, and they were once seen as the next generation of computing. But photonics has its own set of issues. “If you’re trying to have waveguides for passing optic signals, they also have size limitations,” said John Ferguson, director of product management at Siemens. “On top of that, they’re big. You can have electrical components with them or stacked on top of them, but either way, you have to do work to convert the signals, and you’ve got to worry about some of the impacts like heat and stress. They have a much more significant impact on optical behavior than they do even on the electrical. For example, just by putting a die on top, you could be changing the entire signal that you’re trying to process. It’s very tricky. We’ve got a lot more work to do in that space.”

While the chip industry continues to scale logic well into the angstrom range, thermal- and power-related issues continue to grow. This is why all of the major foundries are developing a slew of new options involving new materials and new packaging approaches.

“At the new nodes, if you took an existing finFET transistor and scaled it, it would be too leaky,” Aitken said. “A gate-all-around transistor is better, but it’s not magically better. It does look like it’ll scale a little bit going forward. All of these things seem to be caught in a loop of ‘everything looks okay for this node, it looks plausible for the next node, and there’s work to be done for the node after that.’ But that’s been the case for a long time, and it still continues to carry on.”

And in case it doesn’t, there are plenty of other options available. But heat and power will remain problematic, no matter what comes next.
