Semiconductors have become limited by heat. Good design can reduce it, and help dissipate it.
Power consumed by semiconductors creates heat, which must be removed from the device, but how to do this efficiently is a growing challenge.
Heat is the waste product of semiconductors. It is produced when power is dissipated in devices and along wires. Power is consumed when devices switch, which makes it dependent on activity, and it is also constantly wasted by imperfect devices and wires. Designs are rarely perfect, and some of the heat comes from activity that performs no wanted function. But at some point, a design team has to work out how to get rid of the heat, because if they don’t, the lifetime of the product will be very short.
Only three processes govern the transfer of heat — conduction, convection, and radiation. In simple terms, conduction applies to solids, convection to liquids and gases, and radiation to vacuums, of which there are very few in a semiconductor.
“There are three steps associated with heat,” says Marc Swinnen, director of marketing for Ansys’ Semiconductor Division. “There’s production, conduction, and dissipation. You produce the heat, you conduct it out to somewhere, and dissipate it. Power analysis tells you where the heat is being produced. Conduction and dissipation is a physics analysis that includes fluidics. All three have to be included in a system analysis because there is feedback between them.”
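The feedback Swinnen mentions can be illustrated with a toy fixed-point loop: power sets temperature, and temperature changes power. This is a minimal sketch rather than any vendor's method. The single lumped thermal resistance, the leakage model that doubles every fixed number of degrees, and all numeric values are assumptions chosen for illustration.

```python
def solve_operating_point(p_dynamic_w, p_leak_ref_w, r_th_k_per_w,
                          t_ambient_c=25.0, t_ref_c=25.0,
                          leak_doubling_c=20.0, tol=1e-6, max_iter=200):
    """Iterate power -> temperature -> power until the temperature settles.

    Assumed (hypothetical) model:
      leakage(T) = p_leak_ref_w * 2 ** ((T - t_ref_c) / leak_doubling_c)
      T          = t_ambient_c + r_th_k_per_w * (p_dynamic_w + leakage(T))

    Returns (temperature_c, total_power_w), or None if the loop does not
    settle, which this toy model treats as thermal runaway.
    """
    temp_c = t_ambient_c
    for _ in range(max_iter):
        leak_w = p_leak_ref_w * 2 ** ((temp_c - t_ref_c) / leak_doubling_c)
        total_w = p_dynamic_w + leak_w
        new_temp_c = t_ambient_c + r_th_k_per_w * total_w
        if new_temp_c > 1000.0:          # clearly unphysical: give up
            return None
        if abs(new_temp_c - temp_c) < tol:
            return new_temp_c, total_w
        temp_c = new_temp_c
    return None


# A well-cooled part converges; the same model with a poor heat path does not.
stable = solve_operating_point(p_dynamic_w=10.0, p_leak_ref_w=1.0, r_th_k_per_w=2.0)
runaway = solve_operating_point(p_dynamic_w=10.0, p_leak_ref_w=5.0, r_th_k_per_w=10.0)
```

In this sketch the stable case settles a few degrees above what the dynamic power alone would produce, because the extra leakage feeds back into the temperature. A real analysis would replace the lumped resistance with a spatial thermal solve.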
That becomes much harder as transistor density increases. “Most people have access to changing the conductive path,” says Karthick Gopalakrishnan, product engineer for the Celsius Thermal Solver in the Multiphysics System Analysis group at Cadence. “There is potential to improve the materials and the design itself to take more heat away through conduction from your heat dissipating device. There’s a challenge in that the thermal real estate around these devices, unless we’re working with massive servers, is really small. You have to think about material improvement, intelligent use of the thermal real estate around your chips or packages or PCBs. What you really want to do is improve your conductive heat transfer rate.”
Just putting a large heatsink on the device can cause additional problems if proper analysis is not performed. Getting that right requires considering the airflow and the mechanical design of the enclosure in which the heatsink sits, so that the impact on neighboring devices is taken into account.
Even heatsinks have limits. “There are many ways to remove the heat from the system, such as forced liquid cooling,” says William Ruby, director of product management for the Synopsys EDA Group. “We are seeing many advances in some of the more advanced packaging. With 3D-IC designs, forced air flow and liquid cooling can be brought to bear. There are some newer concepts about being able to mitigate the heat through special vias to help spread it.”
Unlike electrical conductivity, where there are orders-of-magnitude differences between a conductor and an insulator, thermal conductivity spans a much narrower range. “Silicon has a conductivity of 100 to 120 watts per meter-kelvin (W/(m·K)), which isn’t bad as a material for conducting heat,” says John Parry, industry director for Electronics & Semiconductor of the Simcenter portfolio within Siemens Digital Industries Software. “Copper is only 400, and copper is typically what’s used as the best conductor of heat that is financially economic.”
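To put those conductivity figures in context, steady one-dimensional conduction through a slab follows Fourier's law, q = k·A·ΔT/L. The sketch below uses 110 W/(m·K) for silicon (the midpoint of Parry's range) and 400 W/(m·K) for copper. The slab geometry and temperature drop are hypothetical values picked only to make the comparison concrete.

```python
def conducted_power_w(conductivity_w_per_mk, area_m2, thickness_m, delta_t_k):
    """Fourier's law for steady 1D conduction through a uniform slab."""
    return conductivity_w_per_mk * area_m2 * delta_t_k / thickness_m


K_SILICON = 110.0   # W/(m*K), midpoint of the 100-120 range quoted above
K_COPPER = 400.0    # W/(m*K), as quoted above

# Hypothetical slab: 1 cm x 1 cm, 1 mm thick, 10 K drop across it.
area_m2 = 1e-4
thickness_m = 1e-3
delta_t_k = 10.0

q_silicon_w = conducted_power_w(K_SILICON, area_m2, thickness_m, delta_t_k)
q_copper_w = conducted_power_w(K_COPPER, area_m2, thickness_m, delta_t_k)
copper_advantage = q_copper_w / q_silicon_w   # equals K_COPPER / K_SILICON
```

For identical geometry the copper slab carries a little under four times the heat of the silicon one, which is the whole extent of the difference between the two materials.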
Fig. 1: Various large to small heatsinks and cold plates, as seen at Semicon West 2023 at the Malico Inc. booth. Source: Semiconductor Engineering / Susan Rambo
There are other economic considerations. “The main cost driver for a datacenter is not the cost of the heatsinking method, but instead the operating costs that manage heat transfer at the datacenter level,” says Javier DeLaCruz, fellow and senior director for System Integration & Development at Arm. “There is a finite amount of electricity coming into the datacenter, which is shared between feeding the compute system and extracting the heat. Performance per watt must therefore be the metric of interest, not performance alone.”
Heat can have a significant impact on performance. “Even when optimal heat dissipation strategies are followed, each die will heat up differently during circuit operation, degrading performance,” says How-Siang Yap, product manager at Keysight EDA. “The dynamic temperature can change a device’s electrical characteristics, such as gain, impedance, and load-pull mismatch, as well as higher-level waveform characteristics such as error vector magnitude (EVM) and adjacent channel leakage ratio (ACLR) in RF circuits for digitally modulated signals. The impact penalties can be higher in analog systems.”
Analysis is not easy. “Today’s chips are so complex that it becomes difficult to define how activity can be created that will show the worst-case conditions,” says Ansys’ Swinnen. “When you are looking at timing errors caused by temperature, you’re looking at nanoseconds, a few microseconds at most. Secondly, the time constants for electrical parameters and for thermal parameters are very different, like two orders of magnitude at least. When you get heat that blossoms, it slowly dissipates through the chip, and next door, so you’re going to see heat increasing because of what happened two seconds ago in the block next door.”
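The gap in time scales can be illustrated with a first-order lumped thermal model, in which a block warms toward its steady-state temperature with time constant τ = R_th·C_th. This is a deliberately crude sketch: real dies have many time constants, and the resistance, capacitance, and power values below are hypothetical.

```python
import math


def temperature_rise_k(power_w, r_th_k_per_w, c_th_j_per_k, elapsed_s):
    """Temperature rise of a single lumped thermal RC node after a power step.

    rise(t) = P * R_th * (1 - exp(-t / tau)), with tau = R_th * C_th.
    """
    tau_s = r_th_k_per_w * c_th_j_per_k
    # expm1 keeps precision when elapsed_s is tiny compared with tau_s.
    return power_w * r_th_k_per_w * (-math.expm1(-elapsed_s / tau_s))


# Hypothetical block: 10 W step, 2 K/W to ambient, 0.5 J/K heat capacity,
# which gives a one-second time constant.
POWER_W, R_TH, C_TH = 10.0, 2.0, 0.5

rise_after_1us = temperature_rise_k(POWER_W, R_TH, C_TH, 1e-6)  # timing-analysis window
rise_after_2s = temperature_rise_k(POWER_W, R_TH, C_TH, 2.0)    # "two seconds ago"
steady_state_rise = POWER_W * R_TH
```

Under these assumptions the temperature barely moves within a microsecond-scale timing window, yet most of the eventual 20 K rise has arrived after two seconds. That is why a short electrical simulation cannot capture the thermal state on its own.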
Heat distribution within a chip
Heat has a tendency to go in all directions. “You can’t really stop the heat going anywhere,” says Siemens’ Parry. “You can coax it, but it’s very unlike the electrical world where the difference between a conductor and an insulator is maybe 20, 21 orders of magnitude in electrical conductivity. Electrically you can make current go where you want it, but thermally you really can’t.”
Because heat is dependent upon activity, the surface of a chip is not at a constant, evenly distributed temperature. “You may have a hotspot created by a very computationally intensive portion of the design, like a hardware accelerator,” says Synopsys’ Ruby. “Another portion of the chip could be less active or only used in a particular mode of operations. The temperature gradients on a chip are workload or activity dependent.”
Dispersing heat is simple enough in theory, but much harder in practice. “You want to minimize hotspots by spreading the heat as much as possible on any layer,” says Cadence’s Gopalakrishnan. “You have to consider where things are placed. Moving something to the edge of the die is not always possible, because there you don’t get heat spreading in one direction.”
And while you may not be able to control heat, you can understand how it spreads. “If you model the current flowing through wires on a chip and look at the heat flux that comes from that, it doesn’t get very far before it all just merges together,” says Parry. “You can look at the temperature profile, and that doesn’t really show anything like the difference you get between the traces and the insulator between them. If you look at the temperature profile, you will barely be able to detect where the metal traces are. But if you look at the heat flux, it’s orders of magnitude higher in the metal than it is in the insulator.”
That makes things a little easier. “It makes things easier when modeling a lot of this stuff,” adds Parry. “You can get quite accurate results by not modeling the individual wires on the die surface, the layers of metallization, but just using average material properties, which is a very common thing to do.”
One effective technique is to utilize thermally aware floor-planning and cell placement. “The fundamental idea is to do placement to minimize both the peak temperature as well as the temperature gradients,” says Ruby. “With a physically aware RTL power analysis tool, you can analyze the initial placement and then feed that power profile data into thermal analysis. That is shift left from doing analysis based on the final sign-off, or completed physical implementation, which may be too late to start changing the macro floor plan. We can also look at things like via density, bump density, and different metal densities.”
For 3D-ICs, TSVs have been talked about as a way to create heat corridors. “Better TSV placement can help,” says Gopalakrishnan. “But there’s a limit to that because they do take up valuable real-estate on the die. There is a lot of potential to move things around when it comes to floor planning, either at the chip level when you’re talking about tiles or power blocks or functional units, or at the routing level where you’re trying to add TSVs. One of the biggest advantages for them is that you can target hotspots when you’re working that close to the die or the power source.”
But the impact is limited. “They are to some extent being used as heat corridors, but if you think of them as being copper, they’re only four times as conductive as the silicon that they’re going through,” says Parry. “Consider a unit cell, 10-by-10, and you’ve got a TSV in every corner. That’s 4 in 100. Because the TSVs have only four times the conductivity of the silicon they go through, you’ve added maybe 16% to the effective conductivity of the die. It doesn’t have a big effect thermally, and while they do help, it’s not a silver bullet.”
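Parry's unit-cell estimate can be reproduced with an area-weighted (parallel-path) average of conductivities. This is a sketch under stated assumptions: heat flows straight through the die thickness, the TSVs are pure copper, and liners and interface resistances are ignored.

```python
def effective_conductivity_ratio(tsv_area_fraction, tsv_to_silicon_ratio):
    """Through-thickness conductivity of silicon plus TSVs, relative to plain silicon.

    Parallel-path rule of mixtures:
      k_eff / k_si = (1 - f) + f * (k_tsv / k_si)
    """
    return (1.0 - tsv_area_fraction) + tsv_area_fraction * tsv_to_silicon_ratio


# Numbers from the quote: 4 TSV cells in a 10-by-10 unit cell, copper ~4x silicon.
tsv_area_fraction = 4 / 100
tsv_to_silicon_ratio = 4.0

ratio = effective_conductivity_ratio(tsv_area_fraction, tsv_to_silicon_ratio)
net_gain = ratio - 1.0                                         # after subtracting displaced silicon
copper_contribution = tsv_area_fraction * tsv_to_silicon_ratio # 4% x 4, before subtracting it
```

With these inputs the copper contributes 0.16 of the silicon's conductivity, matching the rough 16% in the quote. Once the silicon displaced by the TSVs is subtracted, the net gain is closer to 12%. Either figure supports the same conclusion: the improvement is modest.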
Another emerging technique is backside power delivery. “Backside power helps with power delivery, but makes heat dissipation more of a challenge,” says DeLaCruz. “The bulk silicon, which was previously a great mechanism for locally spreading heat, has evolved from approximately 800 microns thickness to just a single micron, making the local hotspots more difficult to manage. TSVs do not make thermal management easier, they just make it different, as TSVs help in a very localized way and only in the axis perpendicular to the transistors. The oxide liner around TSVs also impedes the lateral thermal energy dissipation.”
3D adds new thermal problems. “If you think of the glue layers between die, which is very common to find, they serve the purpose of mechanically fixing the die together,” says Parry. “You need a certain thickness. Otherwise, the shear is too high on the interconnects between the die, and you get electrical breakages. Unfortunately, those glue layers are a relatively soft material compared to the silicon die, and also tend to have a relatively low thermal conductivity. You have this tradeoff between thermal and mechanical. Thermally, you’d like to have that layer as thin as possible, to make heat conduction through that layer as efficient as possible. Mechanically, you’d like to have a thick layer because that allows you to take up the mismatch in the displacement between the two die with relatively little shearing of the material in between.”
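The thermal and mechanical tradeoff Parry describes can be made concrete with two first-order formulas: conduction resistance R = t/(k·A), which grows with layer thickness t, and shear strain γ ≈ δ/t, which shrinks with it. The sketch below is illustrative only. The adhesive conductivity, die area, and mismatch displacement δ are hypothetical values, and the strain formula is a small-displacement approximation.

```python
def bond_layer_thermal_resistance(thickness_m, conductivity_w_per_mk, area_m2):
    """1D conduction resistance of the bond layer in K/W: R = t / (k * A)."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def bond_layer_shear_strain(mismatch_displacement_m, thickness_m):
    """Approximate shear strain in the layer: gamma = delta / t."""
    return mismatch_displacement_m / thickness_m


# Hypothetical values, for illustration only.
K_ADHESIVE = 0.5        # W/(m*K)
DIE_AREA_M2 = 1e-4      # 1 cm x 1 cm
MISMATCH_M = 1e-6       # 1 micron relative displacement between the two die

# Sweep the layer thickness and tabulate both quantities.
sweep = [
    (t_m,
     bond_layer_thermal_resistance(t_m, K_ADHESIVE, DIE_AREA_M2),
     bond_layer_shear_strain(MISMATCH_M, t_m))
    for t_m in (10e-6, 20e-6, 50e-6)
]

for t_m, r_th, strain in sweep:
    print(f"t = {t_m * 1e6:4.0f} um  R_th = {r_th:.2f} K/W  shear strain = {strain:.3f}")
```

Doubling the thickness doubles the thermal resistance and halves the shear strain, so any chosen thickness is a compromise between the two.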
Heat distribution outside of the chip
Heat can escape either through the top of the package, and then possibly into a heatsink, or out through the bottom and the PCB it is connected to. “If you’ve got a plastic over-molded BGA, then you’ll be putting the vast majority (80% to 90%) of the heat into the board,” says Parry. “If you’ve got a package with a really good conduction path to the lid, you could probably arrange for a good 80% to 90% of the heat to go that way. You can control it, depending on the packaging approach that you’ve taken, but not completely. Some always goes the other way.”
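One simple way to reason about this split is to treat the top and bottom paths as two thermal resistances in parallel, so that heat divides in inverse proportion to resistance. The sketch assumes both paths end at the same ambient temperature. The resistance values are hypothetical, chosen only to land in the ranges Parry quotes.

```python
def heat_split(r_top_k_per_w, r_bottom_k_per_w):
    """Fractions of total heat leaving via the top and bottom paths.

    For two parallel paths to a common ambient, each path's share is
    proportional to the *other* path's resistance.
    Returns (fraction_top, fraction_bottom).
    """
    total = r_top_k_per_w + r_bottom_k_per_w
    return r_bottom_k_per_w / total, r_top_k_per_w / total


# Hypothetical over-molded BGA: poor path through the plastic, good path to the board.
overmolded_top, overmolded_board = heat_split(r_top_k_per_w=45.0, r_bottom_k_per_w=5.0)

# Hypothetical lidded package with a heatsink: good path to the lid.
lidded_top, lidded_board = heat_split(r_top_k_per_w=2.0, r_bottom_k_per_w=18.0)
```

In both hypothetical cases about 10% of the heat still takes the less favorable path, consistent with the remark that some always goes the other way.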
Where you want the heat to go is application-specific. “In servers, there’s a lot of space around the package that you can utilize,” says Gopalakrishnan. “You tend to fill that up with active or passive heatsinks, and fans that help to dissipate a lot of heat away. The PCB itself is not going to play a major role in dissipating heat away. When you go to mobile devices, that is not a solution because maybe about half of the heat goes through the bottom and the remaining half goes to the top. In that case, the PCB is going to play a major role in dissipating heat away from the chip.”
When space is limited, it becomes a lot more difficult. “There are different ways this can be achieved depending on the specific market,” says Arm’s DeLaCruz. “For example, in smartphones, the use of highly conductive films such as graphite or graphene films are prevalent due to the minimal volume and effective heat spreading in the system. In the infrastructure space, the use of active and passive 3D vapor chambers are enabling operation into the many hundreds of watts range.”
Liquid cooling is another possibility. “Convection is where we are seeing a lot of progress recently,” says Gopalakrishnan. “You have fans, liquid cooling, and two-phase systems. We also have advanced systems like immersion cooling at the data center level. You see a lot of roadmaps for design engineers and companies that make equipment and systems, who are adding liquid cooling as part of the roadmap. This is because if you just add a heatsink on a device and expect it to cool, it hits that limit when your heat dissipation is more than 1 kilowatt per meter squared. With a fan, that is around 10 kilowatts per meter squared. But we have 1 megawatt per meter squared these days with advanced server equipment chips. You really have to explore these strategies.”
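The heat-flux figures in that quote can be turned into a rough rule of thumb. The sketch below treats the quoted numbers as hard thresholds, which is a simplification: in practice the limits depend on the allowed temperature rise, heatsink size, and airflow. The function names and category labels are hypothetical.

```python
PASSIVE_LIMIT_W_PER_M2 = 1_000.0      # ~1 kW/m^2, heatsink alone (from the quote)
FORCED_AIR_LIMIT_W_PER_M2 = 10_000.0  # ~10 kW/m^2, heatsink plus fan (from the quote)


def suggest_cooling(heat_flux_w_per_m2):
    """Map a heat flux to the simplest cooling class that the quoted limits allow."""
    if heat_flux_w_per_m2 <= PASSIVE_LIMIT_W_PER_M2:
        return "passive heatsink"
    if heat_flux_w_per_m2 <= FORCED_AIR_LIMIT_W_PER_M2:
        return "forced air (fan)"
    return "liquid or two-phase cooling"


def w_per_m2_to_w_per_cm2(heat_flux_w_per_m2):
    """Convert W/m^2 to W/cm^2 (1 m^2 = 10,000 cm^2)."""
    return heat_flux_w_per_m2 / 1e4


server_chip_flux = 1_000_000.0                              # 1 MW/m^2, from the quote
server_chip_cooling = suggest_cooling(server_chip_flux)
server_chip_flux_cm2 = w_per_m2_to_w_per_cm2(server_chip_flux)
```

Expressed per square centimeter, 1 MW/m² is 100 W/cm², two orders of magnitude beyond the forced-air figure in the quote.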
Not everyone thinks it will find quick adoption. “While we expect liquid cooling to happen in specialized deployments such as supercomputing clusters, it is less likely to take root broadly,” says Madhu Rangarajan, vice president of products at Ampere Computing. “It is important for silicon designers to take practical infrastructure limitations into account while creating new technologies and work in concert with system designers and data center designers to drive them into broad deployment. We expect that most CPUs deployed for the next five years will still need to be air-cooled in a TCO effective manner.”
Models and analysis
Thermal may be one of the stumbling blocks for a third-party chiplet market, because chiplets will require a thermal model. “The individual chiplets can’t actually be designed in isolation from one another,” says Parry. “Each one needs to know about heat sources on its neighbor die. There’s a lot more collaboration needed in the development of these high-density advanced package designs. The way these things are developed has to change in order to make the design tractable.”
Creating models is not simple. “There are a lot of things you really wouldn’t want to disclose in a chip thermal model,” says Gopalakrishnan. “There are efforts to add the self-heating effects, the thermal resistance characteristics of the chip in the form of a reduced-order model, or some kind of approximation that would not necessarily involve someone knowing every single geometric detail that exists in the chip. Currently, that’s how some of these chip models are being generated.”
Tools need to change, as well. “The 3D-IC world is the world of comprehensive models, and model-based analysis needs to take place,” says Ruby. “You can’t afford to do everything flat as we do today. On a single chip, we do timing sign-off and power sign-off flat at the netlist level. In the 3D-IC context, it may become impractical, so we need to start looking at modeling various components.”
And ultimately it brings together design and packaging. “You need to combine the chip design workflow with the package design workflow,” says Parry. “You can’t treat them as one occurring before the other, where a chip is given to a packaging group, particularly in 3D-ICs. But it applies, to some extent, in 2.5D. The challenge is taking the type of simulation technology that we have, that’s traditionally used by packaging engineers, perhaps from a mechanical background, and making that available to people doing IC verification as part of the IC design flow. They may not be comfortable using the tool sets that mechanical engineers use. It’s a case of taking the technology and repackaging it so that it’s available to the people who need to use it higher up the design flows.”
Conclusion
Many chips face thermal barriers, and solutions to the problem are not easy. “The unfortunate fact is that thermal is the limiting factor on integration density,” says Swinnen. “We can design and manufacture incredible chips, except they’ll melt. It’s not a manufacturing limitation, not a design limitation. It is a physics limitation that we can’t get more heat out.”
While exotic solutions are available in some applications, the majority of markets have to find ways to do more with less, and that means more functionality per watt. The costs associated with this are much larger than those of past solutions.