Thermal Damage To Chips Widens

Heat issues resurface at advanced nodes, raising questions about how well semiconductors will perform over time for a variety of applications.


Heat is becoming a much bigger problem for semiconductor and system design, fueled by higher density and the increasing use of complex chips in markets such as automotive, where reliability is measured in decade-long increments.

In the past, heat typically was handled by mechanical engineers, who figured out where to put heat sinks, fans, or holes to funnel heat out of a chassis. But as more functionality is added onto a PCB or into an SoC, heat is becoming a much more important consideration at the silicon level, one that is difficult to predict and manage, and risky to ignore.

“Thermal has always been a problem, but it’s gotten worse as the chip, board and enclosures have gotten smaller,” said Greg Caswell, senior member of the technical staff at DfR Solutions, a reliability engineering services firm. He noted this problem has become noticeably worse in the past year. “We’re finding solder fatigue, plated hole fatigue, parts being mixed with different coefficients for expansion. If it needs underfill, that underfill doesn’t necessarily match up with the other parts. There are about 700 laminates to keep track of, and it all can change depending upon the type of board material. People say they’re using FR-4 boards, but there are 400 materials catalogued as FR-4. Over a period of 10 years, you start finding diurnal stresses, shock and vibration problems, weird temperature variations, and lots of different voltage levels. All of this plays into the ability of a product to survive.”

Survival is a relative term. Many designs now have to be fully functional for much longer periods than in the past because of demands for extended reliability in end markets such as automotive, aerospace, medical, and industrial. Even chips in an automobile’s infotainment system need to last 10 to 15 years because of possible interaction with safety-critical systems.

“In a mobile device, the typical active life is 5,000 hours,” said Ron Moore, vice president of marketing for ARM’s Physical IP Division. “For a server, it’s 100,000 hours. You need to do extra EM analysis, more analysis on flip-flops. So the physical IP is changing according to the physical requirements.”

This isn’t exactly a new topic of discussion for semiconductor engineers. In 2001, Pat Gelsinger, then Intel’s CTO, predicted that within a decade the energy density of chips would be equivalent to the surface of the sun if nothing was done. The solutions came in the form of multiple cores, dark silicon methodologies, new materials, and a number of very good engineering and design techniques. But the problem never went away, and it has come back with a vengeance in finFET-based designs, particularly at the next couple of process nodes, forcing companies to consider 2.5D and fan-out packaging, new architectures and microarchitectures, and raising questions about the long-term impacts of even modestly higher temperatures.

“Thermal adds a whole bunch of unknowns,” said Aveek Sarkar, vice president of product engineering and support at Ansys. “You need to assess the thermal impact at the chip-package and system levels, at the chip level or the interconnect level, and if it’s a finFET you have to deal with localized heating. At 10nm and 7nm this is going to get worse. You have to predict what will happen with power, and then create temperature profiles for different power scenarios.”

Temperature is relatively steady state compared with, for instance, spikes in voltage drop. That makes it deceptively hard to deal with effectively. It would seem logical that, given the thermal conductivity of silicon, heat should dissipate across and out of a chip. But in a densely packed SoC, not all of that heat can escape. Wherever its path gets blocked, components can overheat, sometimes on the opposite side of the chip entirely.

“What’s changed is that now you need to consider thermal management closer to silicon,” said Robin Bornoff, FloTherm and FloVent marketing manager at Mentor Graphics. “If you look at the infotainment systems in automobiles, the environment is quite extreme. There is heat in the dashboard, and it’s hard for that heat to leave. There are not many cooling channels. That can cause the IGBT to experience radical changes and make it unreliable under certain driving profiles. It also can have an effect on digital displays, where the brightness changes or the color changes. We’re talking about large temperature gradients. For bond wires, which handle a large amount of power, there is a thermomechanical risk of bond wire failures.”

Predicting problems
Figuring out when and where thermal issues will crop up requires a combination of tools, history, and a healthy dose of luck.

“Everything might look okay, but 35 seconds into a simulation you find a power problem that’s generating thermal issues,” said Alan Gibbons, power architect at Synopsys. “You need a very accurate model with more details of what’s going on. But you don’t want to have to run it for the whole 35 seconds. So you swap in a more accurate functional model that is cycle-accurate, find the power hotspot, and then back out and move on.”

Things don’t always work out that well, though. “You may find a thermal issue in the core due to a software task on the wrong process or something is being done in software when it should be done in hardware,” Gibbons said. “This is a big challenge for the EDA community. We normally think about reliability in terms of power and performance, but it can be affected by power densities. If you have processors running at 2 to 3 GHz they dissipate a lot of power. Thermal considerations become more acute.”

This becomes even more problematic at advanced nodes, because margining costs power and/or performance. With less of a buffer, designs need to be more exact. But one of the goals of SoCs is fitting more functionality into a given space, so there are more variables in terms of use models.

“Modeling and simulation scenarios are different,” said Ansys’ Sarkar. “You have to understand under which conditions a function is operating. And you have to put that in context of the whole chip. So the chip may show 80° C, but it is no longer uniform so you have to recalculate the power for the temperature profile. An ARM block might be 85° and an instruction cache might be 75°. Calculating temperature is an iterative process between temperature and power. Once you get the temperature profile, then you have to figure out whether it’s too pessimistic or optimistic based upon the lifetime of the chip. If you look at foundry electromigration rules, they say you get a 10-year life if you follow the rules, which have a certain temperature. If the temperature increases from 110° to 125°, the chip will fail faster.”
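
The loop between temperature and power that Sarkar describes can be illustrated with a small fixed-point iteration. This is only a sketch under stated assumptions, not how any vendor tool works: it treats the chip as a single block with one lumped thermal resistance (`theta_ja`), and it assumes leakage power doubles every 20 degrees. The function names and every numeric value are hypothetical.

```python
def leakage_power(temp_c, p_leak_ref=0.1, t_ref_c=25.0, doubling_c=20.0):
    """Leakage power in watts. Assumed to double every `doubling_c` degrees."""
    return p_leak_ref * 2.0 ** ((temp_c - t_ref_c) / doubling_c)


def solve_temperature(p_dynamic, t_ambient_c=45.0, theta_ja=5.0,
                      tol=1e-6, max_iter=200, t_limit_c=250.0):
    """Alternate between power and temperature until the temperature settles.

    theta_ja is a lumped junction-to-ambient thermal resistance (degrees C per watt).
    Returns (temperature in degrees C, total power in watts).
    """
    temp = t_ambient_c
    for _ in range(max_iter):
        power = p_dynamic + leakage_power(temp)    # power at the current temperature
        new_temp = t_ambient_c + theta_ja * power  # temperature that power produces
        if new_temp > t_limit_c:
            raise RuntimeError("temperature diverged: possible thermal runaway")
        if abs(new_temp - temp) < tol:
            return new_temp, power
        temp = new_temp
    raise RuntimeError("no convergence within max_iter")


temp_c, power_w = solve_temperature(p_dynamic=3.0)
print(f"settled at {temp_c:.1f} C, {power_w:.2f} W")
```

With a large enough dynamic load the loop never settles, because each temperature rise adds more leakage than the package can remove. The sketch reports that case as an error rather than returning a number.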

But with uneven temperatures across a die, it’s much harder to calculate the impact on reliability.
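
One common way to reason about this is to scale a rated electromigration lifetime by temperature, block by block, using the Arrhenius term of Black's equation. The sketch below assumes the current density is the same in every block, so only temperature changes, and it assumes an activation energy of 0.9 eV. Both are illustrative choices, not foundry data. The 10-year rating at 110° and the 85°, 75° and 125° temperatures are taken from the figures quoted above, while the block names are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV per kelvin


def em_lifetime_years(temp_c, rated_temp_c=110.0, rated_life_years=10.0, ea_ev=0.9):
    """Scale a rated lifetime to another temperature (Arrhenius term only).

    ea_ev is an assumed activation energy; real values depend on the metal stack.
    """
    t = temp_c + 273.15
    t_rated = rated_temp_c + 273.15
    return rated_life_years * math.exp(ea_ev / BOLTZMANN_EV * (1.0 / t - 1.0 / t_rated))


# Hypothetical per-block temperatures from a non-uniform thermal profile.
block_temps_c = {"cpu_block": 85.0, "instruction_cache": 75.0, "hotspot": 125.0}
lifetimes = {name: em_lifetime_years(t) for name, t in block_temps_c.items()}

# The hottest block sets the lifetime of the whole chip in this simple model.
limiting_block = min(lifetimes, key=lifetimes.get)
for name, years in lifetimes.items():
    print(f"{name}: {years:.1f} years")
print("limiting block:", limiting_block)
```

A single chip-wide temperature would hide the spread between blocks, which is why a per-block profile matters for the reliability estimate.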

All of the major EDA companies are now working on this problem. “Traditionally, analysis tools have focused on package temperature,” said CT Kao, product engineering architect at Cadence. “But with a 10nm finFET, you don’t have the granularity you need to go from the PCB to the transistor. What’s needed is a physics analysis simulation. At the chip level, we can resolve place and route inside a chip and combine that with thermal. So we have granularity in that direction, but not directly for the PCB.”

What’s difficult to nail down is exactly what different engineers need at different times, even for the same design. Some need a detailed thermal analysis of transistors or groups of transistors, while others only need a system-level analysis. “And all of this has to be combined with experiments and good engineering judgment,” said Kao. “You don’t necessarily need to know the temperature of individual transistors if they’re next to each other, but you do need to know how chips behave under different functional requirements and how hot they are over time.”

FinFETs have provided a respite from leakage current at 16/14nm, which has been increasingly difficult to manage since 65nm. But the problem begins to grow again, starting at 10nm, and that drives up heat.

“Leakage didn’t go away and semiconductor physics is not changing,” said Drew Wingard, CTO at Sonics. “For one node it has become less important. What we’re seeing now is a lot more emphasis on clock control for power management. But the reality is that a large chunk of power is still in the clock tree. Another challenge is dynamic power management. There is no automation, so you need to work at the micro-architectural if not the architectural level.”

All of this has a direct impact on heat. The more things that are in the ‘on’ state, and the longer they remain on, the more heat they generate and the greater the thermal effects. Wingard said one solution is better clock control, because clocks can be turned off and start up in one clock cycle, which is very quick. “You can arrange power management in groups, so you turn them on in sequence. You also can turn on smaller ones first, so the inrush current is spread over a longer period of time. Then when you turn on the fat transistors, there is lower resistance.”

Advanced packaging is another option, and one that has gained more attention over the past year as high-bandwidth memory solutions began hitting the market. But there are plenty more options there, including how the individual die are packaged together.

“One of the key issues is thermal dissipation,” said Craig Mitchell, president of the Invensas business unit of Tessera. “That changes depending on the thickness of the die. If you reduce the thickness you can reduce the resistance to pull more heat out.”
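
The effect of thickness follows from the one-dimensional conduction formula R = t / (k · A). The sketch below applies it to a die treated as a uniform slab of silicon. The conductivity of about 150 W/(m·K), the 10 mm by 10 mm die area, and the two thicknesses are assumptions chosen for illustration, not figures from Tessera.

```python
SILICON_K_W_PER_M_K = 150.0  # approximate room-temperature value (assumption)


def conduction_resistance(thickness_m, area_m2, k=SILICON_K_W_PER_M_K):
    """Thermal resistance in kelvin per watt of a uniform slab, heat flowing through its thickness."""
    return thickness_m / (k * area_m2)


die_area_m2 = 10e-3 * 10e-3  # hypothetical 10 mm x 10 mm die
r_full = conduction_resistance(775e-6, die_area_m2)  # unthinned wafer thickness
r_thin = conduction_resistance(100e-6, die_area_m2)  # thinned die

print(f"full thickness: {r_full:.4f} K/W")
print(f"thinned:        {r_thin:.4f} K/W")
```

Resistance scales linearly with thickness in this model, so thinning the die reduces it by the same ratio. A real stack adds interface materials and lateral spreading that this sketch ignores.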

Tessera has begun developing a different way of stacking DRAM, as well, staggering the die the way bricks are staggered, so that on every stack a portion of the DRAM is open. That approach allows more cooling, along with shorter interconnects and faster memory access.

Also on the memory front, companies such as Kilopass have been working on one-time programmable memory as an alternative to other types of non-volatile memory because of its resistance to heat. “Unlike embedded Flash memory, OTP handles extreme heat well,” said Jen-Tai Hsu, Kilopass’ vice president of engineering. “Both do okay with cold temperatures as low as -40° Celsius. But OTP works to 125° Celsius, while embedded Flash memory typically only supports to about 85° Celsius. With mechanical areas in the car reaching extreme temperatures and a need for memory that doesn’t fail, OTP is a better choice.”

There also has been a significant push to eliminate the problem in the first place. Mentor’s Bornoff said there has been research into new areas such as thermal through-silicon vias, as well, which act like chimneys out of a package. “The challenge is that if you experience any bottleneck, it backs up all the way to the heat source. The best way to deal with that is to transfer heat close to the source. The use of thermal vias is well established, but dedicated thermal TSVs is an area of active research. We still need to understand how many are needed and how those factor into design. But it could have a massive impact on the rest of a design.”

Bornoff said liquid channels etched on the underside of a die are another area of active research. So are new thermal interface materials. “We’re seeing new ones coming into play that use small parts of metal suspended in a substrate. Material science is helping here. We can do thermal simulation based on the thickness of materials and their different properties. Temperature is always a good leading indicator of failure mechanisms.”

Other issues
Heat has other effects that are just beginning to be understood in the semiconductor world as it crosses into the world of deep physics.

“One strong impact of higher temperature, when combined with high voltage, is an increased risk of latch up, and this is a serious reliability issue,” said Olivier Lauzeral, president and general manager of iROC Technologies, another reliability engineering services company. “Another impact from temperature is the actual flux of thermal neutrons in the room. These neutrons interact with Boron 10 dopants in silicon to produce alpha particles and lithium ions. The cross section (or probability of interaction) of thermal neutrons with Boron 10 varies as 1/√E, with E being the energy of the neutron which is positively correlated with temperature (hence the term thermal neutrons). So the higher the temperature, the higher their energy, the lower the probability of interaction with ¹⁰B, the lower the flux of alpha or lithium ions.”
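
The 1/√E scaling Lauzeral cites can be put into numbers with a short calculation. The sketch assumes the neutrons are in thermal equilibrium with their surroundings, so their typical energy is proportional to absolute temperature (E ≈ k·T). The reference temperature of 293.15 K and the function name are illustrative choices. It covers only the capture cross section and says nothing about latch-up, which he describes as a separate effect.

```python
import math


def relative_capture_cross_section(temp_k, ref_temp_k=293.15):
    """Cross section relative to its value at ref_temp_k, assuming sigma ~ 1/sqrt(E) and E ~ k*T."""
    return math.sqrt(ref_temp_k / temp_k)


for temp_c in (20.0, 85.0, 125.0):
    ratio = relative_capture_cross_section(temp_c + 273.15)
    print(f"{temp_c:5.1f} C: cross section x {ratio:.3f}")
```

Under these assumptions the interaction probability falls as temperature rises, and it falls slowly: the absolute temperature has to quadruple to cut the cross section in half.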

Heat has been responsible over the past year for more bit slips and data retention problems in flash memory, as well, according to Edward Wyrwas, senior member of the technical staff at DfR Solutions. “We’re also seeing the effects on gate oxide integrity and we’re seeing more negative bias temperature instability (NBTI). And as components like graphics cards begin doing more critical thinking, and we start using more memories and more FPGAs, temperatures will go even higher.”

Those problems will likely be compounded as the Internet of Everything begins kicking into gear, because many devices will need to be always on, and as more features are added into systems that can be affected by different use models. Both can affect heat.

“What’s necessary here is that you design chips so they accomplish a certain workload,” said ARM’s Moore. “So you’re making predictions as you’re analyzing reliability in a set of workflows. Maybe this application will drive nearly overload voltage. This is more of an implementation issue, but it’s an increasing trend. Implementation is more and more important, and it affects where you are in your margin.”

The bottom line is that thermal issues are increasingly part of the design, and need to be addressed both independently and in conjunction with power, materials, architectures, processes and packaging. On a positive note, this does provide some very interesting multi-physics engineering problems to solve for years to come.

  • Dave Duchesneau

    I think there’s a discrepancy in one of the statements. Olivier Lauzeral, president & GM of iROC, said “One strong impact of higher temperature, when combined with high voltage, is an increased risk of latch up…”

    This first statement tells me that higher temps INCREASE the risk of latch-up, which makes sense.

    Mr. Lauzeral goes on to say, “Another impact from temperature is the actual flux of thermal neutrons… [which] …interact with Boron 10 dopants in silicon to produce alpha particles and lithium ions. The cross section (or probability of interaction) of thermal neutrons with Boron 10 varies as 1/√E… So the higher the temperature, the higher their energy, the lower the probability of interaction with [Boron 10], the lower the flux of alpha or lithium ions.”

    This second statement tells me that higher temps DECREASE the risk of latch-up, due to the inversely varying relationship (1/√E), which does NOT make sense to me. Admittedly I’m out of my element here, but intuitively I would think that increasing energy would INCREASE the flux of alpha ions, which would further INCREASE the risk of latch-up.