
Designing For Thermal

Solutions are needed early as thermal becomes a systems issue.


Heat has emerged as a major concern for semiconductors in every form factor, from digital watches to data centers, and it is becoming more of a problem at advanced nodes and in advanced packages where that heat is especially difficult to dissipate.

Temperatures at the base of finFETs and GAA FETs can differ from those at the top of the transistor structures. They also can vary depending on how devices are used, how often and where they are used, and by the diameter of the wires used in a particular design, or even a particular area of a chip or package. It’s not unusual for systems to throttle back performance because some circuits are running too hot.

In addition, heat can cause premature aging in circuits and data loss in overheated DRAM. It can warp thinned wafers and interposers. And it can create mechanical stress from disparate expansion and contraction between different materials, causing problems that range from cracked solder balls at the corners of chips to collapsed structures within those chips.

Beyond what’s obvious to the naked eye, there are also disturbances at the quantum level, said Christoph Sohrmann, group manager of virtual system development in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “One observes an increase of thermal resistance through thin, nanoscale material layers due to phonon scattering. When going down even further to atomistic or quantum level, high temperature usually impairs the functionality of the nanosystem, such as by thermal broadening or thermal relaxation.”

What this means is that thermal considerations need to be addressed much earlier in the design flow, where they can be identified, understood, and appropriately dealt with.

“Thermal issues have branched out from being a component issue to a system issue,” said Steve Woo, fellow and distinguished inventor at Rambus. “In the past, system thermals were thought about later in the design cycle, but you can’t do that anymore. Thermal constraints have to be treated as a first-class design parameter that is fully accounted for at the earliest stages of design. If you wait until the end, you run the risk of having to rip up everything in your system design and start over.”

John Parry, director of Simcenter for Electronics & Semiconductor at Siemens Digital Industries Software, agreed. “The earlier you can use simulation the better, because you’re often faced with many different choices as to how you could do something,” he said. “Quite quickly in early design, you have to explore, evaluate, and discount approaches or configurations that aren’t going to work. Simulation is very much the key to being able to do that efficiently. It also has the benefit that if you’re able to make good choices early on, you come up with a design that you’re confident you can make work. The shift left message to move simulation up the design flow is especially relevant when it comes to advanced packaging because of the intense thermal and mechanical challenges.”

Just putting a giant heatsink on a chip or package doesn’t solve the problem, particularly in complex designs. Before those cooling devices even get a chance to work, chips already may have been damaged.

“A lot of people start by focusing on performance benchmarks, but that’s where they wind up in trouble,” said Melika Roshandell, product marketing director at Cadence. “In the earliest stages of a design they decide on what technology, what architecture, where they want to put different IPs, and what frequency those IPs are going to be. And they think they can solve thermal issues later on by incorporating a fan and a heat sink. That kind of planning can lead to missing benchmarks because the temperature sensor on the IP may throttle it back.”

And while temperature sensors can reduce the risk of thermal runaway and melted chips, they can create problems of their own. They need to be carefully placed where heat is most likely to accumulate, which may not be obvious because other parts of a chip or package may act as a conduit to shift that heat.

“If your sensor is in the wrong place, you can think you’re at 80 degrees, but the real hotspot may be 100 degrees,” said Calvin Chow, area technical manager at Ansys. “In that case, you’re not throttling at all and could have major damage on your chip as a result. You want to make sure you have the sensors in locations where you’re going to be generating that heat.”

Fig. 1: Multiphysics simulation, including the effect of heat on mechanical stress and heat maps for different areas of a package and board, and CFD (computational fluid dynamics) to establish correct junction temperatures. Source: Ansys

The good news is that temperature sensors are now being used in more sophisticated ways, said James Myers, system/technology program director at imec. “If you keep chips hot, then aging accelerates, so then you may end up running the chips slower, as well. One way people are coping with this is to use sensors to track temperature history. So rather than margin for the worst case aging scenario, for example, where you’ve got the chip always running at 90 degrees for 10 years, you can use your sensors to establish a time threshold. Instead of throttling at a threshold temperature, you can throttle after it has held that temperature for a certain time.”
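Myers’ time-threshold idea can be sketched in a few lines of Python. This is a minimal illustration of the concept, not any vendor’s firmware; the class name, temperature limit, and hold time below are all hypothetical values chosen for the example.

```python
# Sketch of time-threshold throttling: rather than throttling the moment a
# sensor crosses a temperature limit, throttle only after the die has *held*
# that temperature for a sustained period. All names/values are illustrative.

class TimeThresholdThrottle:
    def __init__(self, temp_limit_c, hold_limit_s):
        self.temp_limit = temp_limit_c
        self.hold_limit = hold_limit_s
        self.hot_since = None  # timestamp when temp first exceeded the limit

    def update(self, temp_c, now_s):
        """Return True if the chip should throttle at this sample."""
        if temp_c < self.temp_limit:
            self.hot_since = None          # cooled off; reset the clock
            return False
        if self.hot_since is None:
            self.hot_since = now_s         # just crossed the limit
        return (now_s - self.hot_since) >= self.hold_limit

# Example: 90°C limit, but throttle only after 30 s of sustained heat.
throttle = TimeThresholdThrottle(temp_limit_c=90.0, hold_limit_s=30.0)
```

A real implementation would also need to fold in the aging-history bookkeeping Myers describes, but the core difference from simple threshold throttling is just the timer reset on cool-down.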

2.5D, 3D chips
Thermal challenges are now an inherent aspect of advanced packaging, once-novel structures that are becoming common features in chip design.

“You’ve got multiple challenges that all interact with one another,” said Parry. “What might be a good solution from the point of view of one die, actually makes the situation on the next die worse. It’s almost like Whac-A-Mole. You solve one problem and it gives you a problem in another area. For example, thermal and mechanical challenges often require a tradeoff. We see that typically in bonding layers. These can help alleviate the relative expansion of materials on either side, such as between a chiplet die and silicon interposer. Making the layer, including the interconnect, thicker alleviates mechanical stress, but increases the resistance to heat flow through that layer, which in turn can make the mismatch worse, and make material selection and design decisions tricky.”

Others agree. “Thermal always has been a problem, but it’s exacerbated by 3D and 2.5D IC designs,” said Chow. “This is because when you have stacked dies, heat does not have a means of escaping as easily or dissipating. On top of that, you have thermal coupling between dies, which then can cause reliability problems that impact performance. Fundamentally, a designer has to make thermal decisions at an earlier phase, because of all the ramifications downstream from a reliability and performance perspective.”

This is particularly true as digital logic continues to scale. So even though GAA FETs can help with static leakage, dynamic power density continues to increase. That, in turn, generates more heat.

“If you compare thermal conductivity of silicon, which is fairly decent — approximately 150 watts per meter-kelvin — to a tiny piece of silicon, it slows down by about 30X,” said Synopsys fellow Victor Moroz. “So, the heat movement is 30X slower inside these tiny pieces compared to the usual big silicon wafer. The side effect is that the peak temperature inside a GAA channel gets higher than the finFET peak temperature. The local peak temperatures are higher, which speeds up aging and degrades the performance.”
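Moroz’s numbers can be sanity-checked with a back-of-envelope one-dimensional conduction estimate, ΔT = P·L/(k·A). Only the ~150 W/(m·K) bulk figure and the ~30X penalty come from the quote; the power, path length, and cross-section below are illustrative assumptions.

```python
# 1-D steady-state conduction estimate: with conductivity cut ~30X by
# nanoscale confinement, the temperature rise across the same heat path
# grows ~30X for the same power. All geometry/power numbers illustrative.

def delta_t(power_w, length_m, k_w_per_mk, area_m2):
    """Temperature rise across a slab for a given heat flow."""
    return power_w * length_m / (k_w_per_mk * area_m2)

K_BULK = 150.0          # W/(m*K), bulk silicon (from the quote)
K_NANO = K_BULK / 30.0  # ~5 W/(m*K), nanoscale channel (illustrative)

P = 1e-3                # 1 mW through the path (assumed)
L = 100e-9              # 100 nm path length (assumed)
A = 1e-12               # 1 um^2 cross-section (assumed)

rise_bulk = delta_t(P, L, K_BULK, A)
rise_nano = delta_t(P, L, K_NANO, A)
# the ratio of the two rises is exactly the 30X conductivity penalty
```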

Thinking ahead
There is no single best solution for all designs. There are many ways to solve thermal issues, but they vary in complexity, cost, and performance. This is why thermal has to be dealt with early in the design cycle, and it needs to be simulated in the context of how a device will be used: which components will be generating heat, how much, and how often.

The challenge is that heat builds over time, something that normally isn’t analyzed at the chip level. “For the purposes of thermal, these time constants need to be much longer to capture what the thermal behavior is,” said Chow. “This means we have to capture power over a very long time, take that emulation data, get power profiles, and apply them to the die to do the thermal analysis correctly, early on. Only then can the engineer say, ‘I have my power number, my power information, I can do my thermal analysis. Is this floorplan optimized? Do I have enough TSVs and microbumps to distribute the power?’”
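Chow’s point that heat integrates over long time constants can be illustrated with a minimal lumped thermal-RC model stepped with forward Euler. The R_th and C_th values and the power profile below are illustrative stand-ins, not a substitute for the emulation-driven power profiling he describes.

```python
# Lumped thermal-RC sketch: junction temperature depends on the whole power
# profile over time, not on an instantaneous power number. Illustrative values.

def junction_temps(power_profile_w, dt_s, r_th=2.0, c_th=0.5, t_amb=25.0):
    """Integrate dT/dt = (P - (T - T_amb)/R_th) / C_th with forward Euler."""
    temps = []
    t = t_amb
    for p in power_profile_w:
        t += dt_s * (p - (t - t_amb) / r_th) / c_th
        temps.append(t)
    return temps

# A 5-second burst followed by near-idle, sampled every 100 ms: the peak
# lags the power step, and the die stays warm well after the burst ends.
profile = [10.0] * 50 + [0.5] * 50      # watts
temps = junction_temps(profile, dt_s=0.1)
```

Real thermal analysis couples many such nodes across dies, TSVs, and microbumps, but even this one-node model shows why a long power capture window is needed: the temperature at any instant reflects the history of dissipation, not the current draw alone.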

Fig. 2: System-level thermal mapping. Source: Ansys

At the same time, engineers tend to underestimate the complexity of self-heating and temperature distributions within a nanosystem, Fraunhofer’s Sohrmann said. “Thermal conductivities within a microsystem vary by orders of magnitude less than electrical conductivity, which makes accurate predictions of the temperature distribution much more challenging. The complexity of thermal analysis leads to engineers neglecting accurate simulations altogether and using worst-case values. This may lead to wasted design margins, which can become expensive in advanced nodes. More effort is thus needed to address this complex topic and create simplified models that are valid over a wide range of parameters, layouts or boundary conditions.”

Dan Doyle, director of NAND component marketing at Micron, offered some suggestions and examples to improve the system from a thermal perspective. “If there are several form factors to be produced, analyze the worst-case form factor in modeling and early tests. Ambient temperature is critical, and the worst-case scenario should be modeled. The workload also should be worst-case, and for a client system, data cache empty and data cache full power will typically differ significantly. Be sure to evaluate performance with a TIM (thermal interface material) if initial results are not satisfactory. And leverage real-world customer data whenever possible, as it can be instructive.”

Further, Tony Veches, director of product architecture at Micron, said it’s important to focus on the customer. “Because of the interrelation between process, design, and packaging, close collaboration of engineers across those areas is essential to find innovative solutions for customers. It is imperative to have a detailed understanding of customer workloads and ensure tight collaboration between customer and supplier architecture teams to co-simulate and then jointly optimize the combined system.”

Hot and cold placements
Floor-planning becomes critical in thermal planning. “The key thing is conscious use of power, to make sure that every joule, every watt, is where you want it to be,” said Myers.

This is particularly important at advanced nodes, but it’s also important at mature nodes, particularly where AI accelerators and architectures may generate high temperatures, and in heterogeneous advanced packaging, where thermal effects will vary by different combinations and placements.

“You not only have to think about your component, but you have to think about where your component is placed,” noted Rambus’s Woo. “Are you getting clean air to cool it, or are you getting dirty air? Clean air is when it first comes into the chassis, and is always preferred as it tends to be the coolest temperature. Dirty air is more difficult to plan for, as it’s already floated over hot components. We routinely ask those questions from the beginning now. We didn’t necessarily always have to do that in the past.”

The ideal situation, according to Woo, is to keep the airflow as channeled as possible because undisturbed airflow has the best ability to wick away the heat. “Thermals are integral to chassis design. The hottest components should see the air coming in the chassis first, because that’s the lowest air temperature and will heat up as it goes back through the chassis. Thermal design has extended beyond the chassis to the data center with the use of hot and cold aisles. The front of servers face the cold aisles and they suck in the cool air. As the air travels through the server, it warms up and gets exhausted out the back of the server into the hot aisles.”

Bespoke silicon
The challenge to placement is that as a cost- and time-savings measure, chips often are ordered off-the-shelf. In such cases, the chip designer has no knowledge of where a chip will ultimately be placed, and may not have a system designer’s particular configuration in mind.

The answer, at least for some of the bigger players, is to create custom ICs. Dubbed “bespoke silicon,” these proprietary designs are created by in-house chip and systems teams working together, so everyone can design within the limits of the same thermal budget.

But for smaller companies, and for those with tighter budgets, the problems still remain. Chips may have been rated to a certain temperature, but not necessarily tested in all situations. “This underscores the importance of simulation,” said Parry. “Smaller companies are better able to embrace new approaches using simulation tools that are integrated into their product design suites.”

Dark silicon
One way of handling thermal issues is “dark silicon,” turning down or off circuits when they are not being used in order to save power. Arm pioneered the commercial application of this concept, and Myers was one of its proponents before moving to imec.

“It is very design-specific,” Myers explained. “In mobile, there’s a lot of heterogeneity, so there can be a new process that includes some new logic devices and more transistors. You could add another accelerator, RNNs, CNNs, and a specialized video codec. Those blocks will then power up for a while, but they’re not permanently hot.”

However, dark silicon is not an answer that can be applied in all situations. “In other kinds of design, like the massive GPUs for AI training, those will be on all the time,” said Myers. “Thus, you’ve got to spread the power evenly across them because it’s a parallel workload. If you just go in and add power to get performance, you’ll find when you get to packaging that you may have to really back off the frequency to live within the cooling constraints you have. So performance is only there if you can live within your thermal headroom.”

Liquid cooling
Heat sinks and fans aren’t the only ways to reduce heat. “There also is immersion cooling, where you literally take your circuit board and put it in some inert liquid,” said Woo. “They have no electrical capacity to charge, so they can’t short out the board.”

According to Woo, immersion cooling was first patented by Cray almost a half-century ago, but the expense kept it exclusively in the domain of supercomputers. Now, it’s being considered for smaller systems, along with microfluidic cooling. In the latter approach, a cooling liquid travels through sealed internal channels. Microfluidic cooling is mostly in the experimental stage, but someday it could offer an immersion cooling-type solution for smaller, mobile devices.

There also are new twists being added to traditional approaches. “The top hyperscale data center operators have done lots of work on their cooling systems, including non-intuitive things like using warm water instead of cold water,” said Myers. “They don’t chill it before sending it back. They send it back slightly warm. That makes it faster and more cost-effective than chilling it back down.”

New issues—backside power delivery
One of the latest architectural innovations is backside power delivery, in which the backside of the wafer is used for power delivery, rather than simply being a passive carrier. So instead of building the electronics on one side of a piece of silicon, the power delivery is routed on the other side. That makes handling more difficult, but it eases the congestion significantly — at least in theory.

While the goal is clear, power density and scale still may create thermal problems, which will need to be accounted for in planning, Synopsys’ Moroz said. “We rely on silicon dies because they’re thick, as in hundreds of microns, which means they can move heat away vertically and laterally. If there are a bunch of hotspots sprinkled around, you know that the heat will move laterally and make the temperature more uniform. So that a hotspot becomes less hot, while its neighbors get some of the heat. But silicon die with backside power delivery is super thin, as in hundreds of nanometers, which means that the lateral heat transfer is really bad. There is just not enough room to move heat laterally. This can be mitigated by using high pattern density for the backside copper wires so that copper helps the heat to escape.”
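The thinning argument is easy to quantify with a rough lateral-resistance estimate: heat flowing sideways through a slab of thickness t and width w over a distance L sees a thermal resistance of roughly R = L/(k·t·w). The dimensions below are illustrative, chosen only to match the “hundreds of microns” versus “hundreds of nanometers” scales in the quote.

```python
# Back-of-envelope lateral spreading estimate: shrinking die thickness from
# ~300 um to ~300 nm raises lateral thermal resistance ~1000X, leaving far
# less spreading to even out hotspots. Dimensions are illustrative.

K_SI = 150.0  # W/(m*K), bulk silicon

def lateral_resistance(length_m, thickness_m, width_m, k=K_SI):
    """Thermal resistance, in K/W, for lateral conduction through a slab."""
    return length_m / (k * thickness_m * width_m)

r_thick = lateral_resistance(1e-3, 300e-6, 1e-3)  # conventional ~300 um die
r_thin = lateral_resistance(1e-3, 300e-9, 1e-3)   # backside-power ~300 nm die
# r_thin / r_thick scales inversely with thickness: here, 1000X worse
```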

Moroz also noted there are techniques for delegating thermal management to special on-chip circuitry that detects such problems and manages them on the fly by re-directing circuit activity away from hot spots, or by slowing things down whenever re-direction is not possible.

Even the best planning can’t eliminate the need for thermal power management, which may still require deliberately throttling performance. “It’s either slow or it melts,” one source said. The only way to get better control over the tradeoffs is to shift thermal considerations to earlier in the design process.

“You can say, ‘It’s going to get hot at a few hotspots, so let’s introduce a design margin that slows down everything by 20%,’” said Moroz. “Or you can have a model that reflects all your materials, components, and the configuration, and then it more precisely tells you where the hotspots are. That way, you can design a circuit that comes with 20% degradation in just certain locations, but performs better elsewhere.”

But that’s also assuming this is done early enough in the design cycle. “There’s no fixing thermal runaway later in production,” warned Cadence’s Roshandell. “The only solution is a new tapeout.”

Related Reading
Keeping IC Packages Cool
Engineers are finding ways to effectively thermally dissipate heat from complex modules.
DRAM Thermal Issues Reach Crisis Point
Increased transistor density and utilization are creating memory performance issues.
Future Challenges For Advanced Packaging
OSATs are wrestling with a slew of issues, including warpage, thermal mismatch, heterogeneous integration, and thinner lines and spaces.
Thermal Floorplanning For Chips
Many factors influence how hot a die or IP will get, but if thermal analysis is not done, it can result in dead or under-performing systems.
Mapping Heat Across A System
Addressing heat issues requires a combination of more tools, strategies for removing that heat, and more accurate thermal analysis early in the design flow.
