Solving Thermal Coupling Issues In Complex Chips

Challenges mount, especially in 3D-ICs and chips developed at leading-edge nodes.

Rising chip and packaging complexity is causing a proportionate increase in thermal couplings, which can reduce performance, shorten the lifespan of chips, and impact overall reliability of chips and systems.

Thermal coupling is essentially a junction between two devices, such as a chip and a package, or a transistor and a substrate, in which heat is transferred from one to the other. If not managed properly, that heat can cause a variety of problems, including accelerated aging effects such as electromigration or faster breakdown in dielectrics. In general, the higher the voltage and the thinner the wires and substrates, the more heat that is transferred. And while these effects are well understood, they become much more difficult to deal with in leading-edge designs and 3D-ICs, where thermal couplings need to be modeled, monitored, and managed.

“Thermal effects are becoming first-order effects, like timing,” said Sutirtha Kabir, R&D director at Synopsys. “If you don’t take that into account, your timing and your classical PPA are also going to pay. This is so critical that now, in the sign-off stage, we are including thermal effects in timing analysis. If you don’t take those into account, the timing at room temperature, for example, is not going to give you the sign-off that you’re looking for.”

This is especially important where reliability is critical, such as in automotive, aerospace, or collaborative systems. For these applications, models are needed to consider such thermal couplings as die-to-board, die-to-die, device-to-interconnect, and device-to-device.

“There are accompanying thermal models needed for each level due to the large differences in the geometry dimensions,” said Ron Martin, working group virtual system development at Fraunhofer IIS’s Engineering of Adaptive Systems. “So a thermal model may relate to the device-to-interconnect and device-to-device level thermal coupling.”

What is thermal coupling?
Thermal coupling occurs when current flows through devices or interconnects and heat is generated. “This heat is transferred via conduction throughout the silicon substrate,” said Calvin Chow, director of application engineering at Ansys. “Multiple aspects of physics are involved in the performance and reliability of electronics, such as electromagnetics, structural mechanics, and thermodynamics.”


Fig. 1: Simulating heat dissipation at thermal couples. Source: Ansys

A significant portion of the power consumed in electronics is converted to heat. As the temperature changes, so do the material properties, which in turn affects the physics. “Hence, the temperature is coupled to the various physics involved in reliability and performance through the temperature dependence of material properties,” said Chris Ortiz, senior principal application engineer at Ansys.

Fig. 2: Thermal coupling interrelationships. Source: Ansys

Simply put, thermal coupling boils down to the multiple ways that heat is dissipated via convection, conduction, and radiation, said John Ferguson, director of product management at Siemens Digital Industries Software. “For a given situation, the question becomes how these effects are impacting each other, to figure out what the total heat transfer is,” he said. “Equation-wise, it’s not trivial. It’s not like you plug in a few numbers and you’re done. It’s much more sophisticated. You really have to do it through simulation, because they are impacting each other, and that’s exactly the challenge of it. One of them causes something to get hotter, which then causes something to get cooler in another spot via one of the other transfer mechanisms.”
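
To make that concrete, the sketch below is a minimal, self-contained illustration (in Python, with made-up power, resistance, and emissivity values) of a single lumped node whose conduction, convection, and radiation losses all depend on the same unknown temperature, so the balance has to be found iteratively rather than read off a formula.

```python
# A minimal, self-contained sketch (not any vendor's solver): steady-state heat
# balance for one lumped node, where conduction, convection, and radiation all
# depend on the same unknown temperature, so the balance is found iteratively.
# All numerical values are illustrative assumptions, not measured data.

SIGMA = 5.670e-8        # Stefan-Boltzmann constant, W/(m^2*K^4)

P_DISSIPATED = 5.0      # heat generated in the node, W (assumed)
T_AMB = 298.0           # ambient temperature, K
R_COND = 2.0            # conduction resistance to the board, K/W (assumed)
H_CONV = 15.0           # convective coefficient, W/(m^2*K) (assumed)
AREA = 4e-4             # exposed surface area, m^2 (assumed)
EMISSIVITY = 0.8        # surface emissivity (assumed)

def heat_out(t_node: float) -> float:
    """Total heat leaving the node at temperature t_node via all three paths."""
    conduction = (t_node - T_AMB) / R_COND
    convection = H_CONV * AREA * (t_node - T_AMB)
    radiation = EMISSIVITY * SIGMA * AREA * (t_node**4 - T_AMB**4)
    return conduction + convection + radiation

# Bisection on the energy balance: heat generated minus heat removed crosses
# zero at the steady-state temperature.
lo, hi = T_AMB, T_AMB + 500.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if heat_out(mid) < P_DISSIPATED:
        lo = mid
    else:
        hi = mid

print(f"steady-state node temperature ~ {0.5 * (lo + hi) - 273.15:.1f} C")
```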

Usually, parameterized thermal compact models of the transistor are given in a foundry’s process design kit (PDK), and these models provide rough estimations of the hotspot temperatures, neglecting layout effects and coupling between transistors. “In most industrial design flows, one single temperature of the hottest transistor is assumed for the whole die, which could lead to overdesign of the interconnects, and potentially to unnecessary performance losses,” said Fraunhofer’s Martin.

To avoid over-design, a linear thermal coupling model of the interconnect layers is introduced as the product of a coupling factor (αlayer) and the transistor temperature (Ttransistor):

Tlayer = αlayer × Ttransistor

The coupling factors for the interconnect layers are technology-dependent and can be provided by the foundry, as shown in figure 3.

Fig. 3: An IC-stack showing different modeling approaches. Source: Fraunhofer IIS EAS

“The realistic temperature distribution on different interconnect layers is shown in blue, the single temperature assumption in orange, and the temperature from the coupling model in green,” Martin explained. “The figure shows how using the single temperature assumption leads to a significant over-estimation of the interconnect temperatures, resulting in strong design pessimism. Given a coupling model of the die, designers can either avoid wiring in hotspot areas or modify the interconnect size according to the local temperature.”
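
For illustration, a minimal sketch of that linear coupling model might look like the following, with placeholder coupling factors standing in for foundry-supplied values.

```python
# A minimal sketch of the linear coupling model described above:
# Tlayer = αlayer × Ttransistor, evaluated per interconnect layer.
# The coupling factors below are placeholders, not real foundry data.

T_TRANSISTOR = 380.0  # hottest-transistor temperature, K (assumed)

# Hypothetical, technology-dependent coupling factors per interconnect layer.
# In practice these would come from the foundry.
ALPHA = {"M1": 0.98, "M2": 0.96, "M4": 0.92, "M8": 0.85, "top_metal": 0.80}

def layer_temperatures(t_transistor: float, alpha: dict[str, float]) -> dict[str, float]:
    """Apply Tlayer = αlayer × Ttransistor for every layer with a known factor."""
    return {layer: a * t_transistor for layer, a in alpha.items()}

for layer, temp in layer_temperatures(T_TRANSISTOR, ALPHA).items():
    print(f"{layer}: {temp - 273.15:.1f} C")
```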

The interconnect coupling model still neglects coupling between different transistors on the device layer. This can only be modeled with a post-layout, full-chip thermal analysis that yields a temperature distribution on all layers, ideally like the blue curve in the figure above. Usually this is done by grid-based solvers, such as finite element method (FEM) or finite volume method (FVM) solvers.
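
As a rough illustration of what such a grid-based solve does, the toy sketch below relaxes a 2D finite-difference grid with an assumed power map; it is a stand-in for production FEM/FVM tools, not a substitute for them.

```python
# A toy grid-based thermal solve: 2D finite-difference relaxation as a stand-in
# for the FEM/FVM solvers mentioned above. Grid size, power map, and cell
# conductance are illustrative assumptions, not real layout data.

import numpy as np

N = 64                      # grid cells per side (real full-chip grids are far larger)
T_BOUNDARY = 300.0          # fixed die-edge temperature, K (assumed)
G = 1.0                     # thermal conductance between neighboring cells, W/K (assumed)

power = np.zeros((N, N))
power[20:28, 20:28] = 0.02  # a hotspot block, W per cell (assumed)
power[40:44, 50:54] = 0.05  # a second, denser hotspot (assumed)

T = np.full((N, N), T_BOUNDARY)

# Jacobi relaxation: each interior cell settles where heat in equals heat out,
# i.e. T = average(neighbors) + P / (4*G).
for _ in range(5000):
    T_new = T.copy()
    T_new[1:-1, 1:-1] = (
        T[:-2, 1:-1] + T[2:, 1:-1] + T[1:-1, :-2] + T[1:-1, 2:]
    ) / 4.0 + power[1:-1, 1:-1] / (4.0 * G)
    T = T_new

peak_cell = np.unravel_index(T.argmax(), T.shape)
print(f"peak temperature ~ {T.max() - 273.15:.1f} C at cell {peak_cell}")
```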

“Due to the increasing layout complexity and the growing number of transistors, the grid size is reaching a computational limit, even on HPC systems. Therefore, either model-order reduction (MOR) techniques must be applied to the thermal solvers, or grid-based solvers must be replaced by more scalable ones, such as Monte-Carlo-based algorithms. These algorithms are easy to run in parallel and therefore perfectly suited for modern GPU-clusters,” Martin added.
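
The Monte-Carlo idea can be sketched on the same toy problem: the temperature of a single cell is estimated by averaging many independent random walks, and that independence is what makes this class of algorithm easy to spread across parallel hardware. The values below mirror the illustrative grid example above.

```python
# A sketch of a Monte-Carlo alternative for the same toy problem as the grid
# solver above: estimate the temperature of one cell with independent random
# walks that accumulate local heating until they reach the fixed-temperature
# boundary. All values are illustrative assumptions.

import random
import numpy as np

N = 64
T_BOUNDARY = 300.0
G = 1.0
power = np.zeros((N, N))
power[20:28, 20:28] = 0.02
power[40:44, 50:54] = 0.05

def walk_estimate(start: tuple[int, int], n_walks: int = 2000) -> float:
    """Average over walks: add P/(4G) for each interior cell visited, stop at the boundary."""
    total = 0.0
    for _ in range(n_walks):
        x, y = start
        acc = 0.0
        while 0 < x < N - 1 and 0 < y < N - 1:
            acc += power[x, y] / (4.0 * G)
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += acc + T_BOUNDARY
    return total / n_walks

print(f"estimated temperature at hotspot center ~ {walk_estimate((24, 24)) - 273.15:.1f} C")
```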

Thermal considerations are magnified in 3D-IC approaches compared to their monolithic equivalents, and they can vary by the type of 3D-IC utilized. “For example, in conventional 3D with micro-pillars between the chiplet layers, the existence of the thermally insulating underfill between the chiplets causes a blockage for the thermal energy,” said Javier DeLaCruz, distinguished engineer and senior director of system integration at Arm. “This will block the lower die in a heat-up approach, which is energy through the top of the package to a heatsink, or a heat-down approach where the heat is dissipated through the PCB.”

The chiplets in these cases tend to be thicker, in the 50µm range, DeLaCruz explained, so better lateral spreading occurs relative to the hybrid bonded version. “Then, for hybrid bonded 3D-IC devices, the chiplets tend to be considerably thinner, which complicates lateral spreading and exacerbates the creation of areas with poor thermal paths, creating hotter hotspots — especially when there is no continuous silicon above it. This would occur in the gaps between multiple upper chiplets on a larger lower chiplet.”

Thermal silicon may be used in these cases to reduce this impact, but not fully eliminate it.

“The use of hybrid bonding also may improve the thermal path for the stack, but consequently increase the thermal coupling between the die, which adds another level of complexity to the partitioning/floor-planning processes through to the simulation stages and beyond,” DeLaCruz said.

The goal, of course, is to avoid reliability and performance problems from thermal effects, but it is complicated by the numerous approaches that can be taken to achieve this.

At least the “when” is clear, said Ansys’ Chow. “These thermal considerations can impact performance and reliability of the design. Therefore, it’s important to do this analysis at an early stage in the design cycle when working on the floor-planning and power distribution topologies.”

Spreading power over a larger area will help with thermal dissipation, but it also adds to the cost.

“Adding extra power grid will more evenly distribute the current and help with reliability, but now signal routing will be more congested, and it may take more time to close timing,” Ansys’ Ortiz said. “Also, deadlines will limit the amount of time you have to complete a chip. In the end, it will hurt more than help to overdesign a chip.”

The better approach is to simulate the temperature-dependent physics accurately for a specific design and address the problem areas that get highlighted.

Avoiding problems also means the simulations must give accurate results, but Siemens’ Ferguson noted this requires a lot of data. “You need to know the detailed metallization of each die,” he said. “You need to know the power going in. Potentially, you need to know the switching frequencies of the transistors. You need to know the stack and everything associated with it. That part takes a lot of information upfront, and by the time you have all that, if you find a mistake, it’s way too late in the game to go back and make fixes — especially if you’ve already got your die fully defined.”

However, some things can be done early, including making simple assumptions on the power of the die. “You can treat it all as if every point in the die is uniform power,” said Ferguson. “You can treat the metallization as if it’s uniform across the die. And as you’re doing your planning, you can start stacking things together to see obvious issues. This is a good starting point. You also can do tradeoff analysis, such that, ‘If I put A on top of B, or if I put B on top of A, does it make a difference?’ As you do that, and things mature, you just keep adding the information in, so you start to get more accurate power analysis. You can plug that in and start seeing there may be high power in certain regions, indicating it’s going to get hotter faster. You can put in the metallization, if you have it, to start figuring out certain areas are denser than others. Certain areas have more oxide and less metallization, which is much more insulation. The heat is going to move less quickly through that, and you just keep on chugging it along.”
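
A back-of-the-envelope version of that tradeoff analysis might treat each die as a uniform heat source and the stack as a simple resistance ladder to a top-side heatsink, as in the hypothetical sketch below.

```python
# A back-of-the-envelope early-planning sketch: treat each die as a uniform heat
# source, model the stack as a 1D thermal resistance ladder to a top-side
# heatsink, and compare the two stacking orders. Powers and resistances are
# hypothetical placeholders, not real chiplet data.

T_AMB = 45.0       # ambient / heatsink reference temperature, C (assumed)
R_HEATSINK = 0.5   # heatsink plus TIM resistance, K/W (assumed)
R_BOND = 0.8       # die-to-die bond/underfill resistance, K/W (assumed)

DIE_POWER = {"compute": 12.0, "io": 3.0}  # hypothetical chiplets, W

def junction_temps(top: str, bottom: str) -> dict[str, float]:
    """Heat-up path only: all power exits through the heatsink; the bottom die's
    power must additionally cross the die-to-die bond layer."""
    p_top, p_bot = DIE_POWER[top], DIE_POWER[bottom]
    t_top = T_AMB + (p_top + p_bot) * R_HEATSINK
    t_bot = t_top + p_bot * R_BOND
    return {top: t_top, bottom: t_bot}

for top, bottom in [("compute", "io"), ("io", "compute")]:
    temps = junction_temps(top, bottom)
    summary = ", ".join(f"{die} = {temp:.1f} C" for die, temp in temps.items())
    print(f"{top} on top of {bottom}: {summary}")
```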

Additional complications enter the picture because heat will affect stress. “Stress will have an impact on your device behaviors, as well as temperature, and that ultimately means what you were estimating for power initially may not be 100% accurate,” said Ferguson. “So you need to go through the power loop again to pull all the information back in. It’s an iterative approach, which currently makes it hard to get to closure. That’s the biggest challenge today.”
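
That loop can be pictured as a simple fixed-point iteration between leakage power and junction temperature, as in the illustrative sketch below (the coefficients are assumptions, not characterized silicon).

```python
# A minimal sketch of the iterative closure loop described above: leakage power
# rises with temperature, temperature rises with power, so the two are re-solved
# until the estimate stops moving. All coefficients are illustrative assumptions.

import math

P_DYNAMIC = 8.0    # switching power, W (treated as temperature-independent here)
P_LEAK_REF = 1.5   # leakage at the reference temperature, W (assumed)
T_REF = 25.0       # reference temperature, C
LEAK_EXP = 0.02    # exponential leakage sensitivity, 1/K (assumed)
R_THETA = 3.0      # junction-to-ambient thermal resistance, K/W (assumed)
T_AMB = 40.0       # ambient temperature, C

t_junction = T_AMB
for iteration in range(1, 51):
    p_leak = P_LEAK_REF * math.exp(LEAK_EXP * (t_junction - T_REF))
    t_new = T_AMB + (P_DYNAMIC + p_leak) * R_THETA
    converged = abs(t_new - t_junction) < 0.01   # within 0.01 C
    t_junction = t_new
    if converged:
        break

print(f"converged after {iteration} iterations: Tj ~ {t_junction:.1f} C, leakage ~ {p_leak:.2f} W")
```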

At the same time, because each design situation is unique, it informs the type of analysis that should be done. “Think about a chip that has multiple functions, such as in a cell phone,” said Synopsys’ Kabir. “Different parts of the chip get used when you’re calling somebody versus when you’re watching a YouTube video or doing something else. The work function of the phone or the application is not going to be uniform all the time, which means certain parts of the whole chip are going to get hot at one time, but not all the time, and some parts may get hot simultaneously. This is why only doing static analysis is not enough. You have to run transient analysis and look at the time distribution of the heat, because even a spike at some point may actually fry some part of your chip.”

As a result, the entire spread of activity over a timescale has to be taken into account, noted Shekhar Kapoor, senior director of marketing at Synopsys. “All the way from the emulation to the timing analysis, and all simulations, have to be taken into account.”
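
A minimal transient sketch makes the point: a single thermal RC node driven by a hypothetical activity trace shows a short burst pushing the peak temperature well above the eventual steady value.

```python
# A minimal transient sketch: one thermal RC node driven by a time-varying power
# trace (idle, a short burst, then a steady workload). The brief burst produces a
# peak temperature well above the final steady value, which is what a purely
# static analysis would miss. All values are illustrative assumptions.

R_THETA = 4.0   # thermal resistance to ambient, K/W (assumed)
C_THETA = 0.5   # thermal capacitance, J/K (assumed)
T_AMB = 35.0    # ambient temperature, C
DT = 0.001      # time step, s

def power_at(t: float) -> float:
    """Hypothetical activity profile: idle, a 0.3 s burst, then steady playback."""
    if 1.0 <= t < 1.3:
        return 20.0   # burst (e.g. app launch)
    if t >= 1.3:
        return 1.5    # steady workload (e.g. video playback)
    return 0.5        # idle

t_node, peak = T_AMB + 2.0, T_AMB   # start near the idle steady state
for i in range(int(5.0 / DT)):
    p = power_at(i * DT)
    # Explicit Euler step of C * dT/dt = P - (T - T_amb) / R
    t_node += DT * (p - (t_node - T_AMB) / R_THETA) / C_THETA
    peak = max(peak, t_node)

print(f"final temperature ~ {t_node:.1f} C, transient peak ~ {peak:.1f} C")
```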

To put this in perspective, more innovation is needed to improve the performance of chip designs, and EDA companies are working on it. As technologies mature, the answers about thermal are expected to come from both the software side, with better simulations, and the hardware side, with better physical cooling techniques.

“In the earliest stages of a design, it’s going to come from your software analysis, where you keep simulating different scenarios to come up with how you can best design this product,” said Melika Roshandell, product management director at Cadence. “After it’s designed and you cannot change anything, you have to rely on your hardware. ‘Do I want to add more liquid cooling in here? Do I want to have a bigger heatsink? The heatsink that we were thinking of is not working.’ All those things come after the design is complete. So in the earliest stages of the design, it’s software, definitely. But after the design is fully completed and it’s going to go to the customer, it’s the measurements and the hardware that give you the answer.”

But this does not explain to the designer who is using the tools why things still go wrong, and why the chip is still overheating.

“When you do simulations, keep in mind that you have a lot of assumptions, and some of those assumptions can go very differently from what you thought,” said Roshandell. “For example, you’re thinking that this IP is going to behave a certain way because you rely on a foundry to give you the leakage data. Then, you rely on your power team to figure out exactly at what voltage this power is going to be, and none of those things goes as planned 99% of the time. All of those assumptions can play a role in your simulations, which is why sometimes the simulation is not predicting exactly what is going to happen in the real world. So it’s not a tool problem. The tool is only as good as the input you gave it. All of the assumptions are very important in the simulation to get accurate results.”

Deciding on the necessary level of accuracy is also no small feat. “What you’ll be seeing is two levels,” Ferguson said. “In one you’ll have constraints, almost like you’re doing a DRC rule, in that if you have a region with such and such temperature, I want it flagged. Who comes up with that value I don’t know for sure. Today it’s the individual users with maybe some help from the foundry. In the longer term, the foundries will be doing that to an extent. But it’s like a lot of constraints for something that’s a sign-off. Generally, they will pad it. So you may think, ‘I’ve got 10%, I’m going to force you to be 5%. That way you’re not as likely to have a problem.’ Over time, if they find out that was wrong because they’re getting failures, they adjust.”
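
In code, such a constraint could be expressed much like a DRC check, flagging any region whose simulated peak temperature exceeds a padded sign-off limit, as in the hypothetical sketch below.

```python
# A sketch of a DRC-like thermal constraint: flag any region whose simulated peak
# temperature exceeds a sign-off limit that is padded below the physical limit.
# Region names, temperatures, and the margin are illustrative assumptions.

PHYSICAL_LIMIT_C = 110.0   # e.g. a junction-temperature ceiling (assumed)
PADDING_FRACTION = 0.05    # sign off 5% below the physical limit (assumed)
SIGNOFF_LIMIT_C = PHYSICAL_LIMIT_C * (1.0 - PADDING_FRACTION)

# Hypothetical per-region peak temperatures from a thermal simulation, in C.
region_peak_temps = {
    "cpu_cluster": 108.2,
    "gpu": 104.9,
    "modem": 88.5,
    "sram_banks": 97.3,
}

violations = {r: t for r, t in region_peak_temps.items() if t > SIGNOFF_LIMIT_C}
for region, temp in sorted(violations.items(), key=lambda kv: -kv[1]):
    print(f"FLAG: {region} peaks at {temp:.1f} C, above sign-off limit {SIGNOFF_LIMIT_C:.1f} C")
```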

Conclusion
Thermal analysis will be a function of the power distribution, which includes heat source locations and the thermal resistances throughout the die, said Ansys’ Chow. Design decisions will need to be based on knowledge about how the part/design will perform.

Ansys’ Ortiz agreed that accurate material property information will be needed for simulation, some of which will come from the foundry, because material properties may be a product of their IP and not generally available. “The material properties can then be used to simulate the physics and show how the physical system will behave. A design tool can then utilize this information to make changes to the layout or modify the cooling system.”

And with 3D-ICs, design will remain challenging because designers are still growing accustomed to new three-dimensional structures like bumps, TSVs, and die-to-die interface bonding, said Synopsys’ Kabir. “Many of these will also be used to carry heat. Sometimes designers also will place dummy bumps to carry excess heat away. Taking all of this into account very early in your design flow is extremely critical. Even in the very nascent stage where you’re doing architecture exploration and floor-planning, if you don’t take these effects into account, and you believe, ‘each of the individual IC owners is going to design the IC, and I’m going to bring it all back together and put the 3D-IC stack together,’ you will find thermal challenges and issues. And at that point, you cannot ECO your way out. It’s too late in the game. Then, the 3D stack will sit on a package, and the packages will sit on a PCB.”

So a silicon-centric thermal analysis becomes imperative. What kind of thermal insulation material are you going to use? Are you going to put in just a dielectric material to make sure that the heat goes out? What kind of heatsink are you going to bring in? “The traditional way of looking from a PCB inward toward the system and doing thermal analysis, which was done a lot later and maybe disconnected from IC design — that paradigm doesn’t fit anymore for the 3D-IC concept,” Kabir said.


