New Thermal Issues Emerge

Heat is becoming a bigger problem for chips developed at new process nodes and for automotive and industrial applications.


Thermal monitoring is becoming more critical as gate density continues to increase at each new node and as chips are developed for safety-critical markets such as automotive.

This may sound counterintuitive because the whole point of device scaling is to increase gate density. But at 10/7nm and 7/5nm, static current leakage is becoming a bigger issue, raising questions about how long finFETs will last before being replaced by gate-all-around FETs. In addition, dynamic power density is increasing with gate density. Both of these factors are causing a growing list of problems ranging from thermal hotspots to noise.

Fig. 1: GAA FET. Source: EPFL

In large SoCs, thermal variation between blocks can be drastic, sometimes as much as 40 to 50°C (a difference of 72 to 90°F).

“This is the difference in operating temperature between adjacent blocks due to the variation in activity level,” said Magdy Abadir, vice president of marketing at Helic. “This increase in temperature increases leakage power significantly. Moreover, resistance of a wire increases as temperature increases, which in turn means that the delay increases as temperature rises.”
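The resistance-delay relationship Abadir describes can be put in rough numbers. The sketch below is illustrative only; the copper temperature coefficient is a textbook figure, while the wire resistance and load capacitance are assumed values, not figures from the article:

```python
# First-order estimate of how a copper wire's resistance, and hence its
# RC delay, scales with temperature. R0 and the load C are assumptions.

ALPHA_CU = 0.0039  # temperature coefficient of resistance for copper, per deg C

def resistance_at(r0_ohms, t_celsius, t_ref=25.0):
    """Linear approximation: R(T) = R0 * (1 + alpha * (T - T_ref))."""
    return r0_ohms * (1.0 + ALPHA_CU * (t_celsius - t_ref))

def rc_delay_ps(r_ohms, c_farads):
    """Single-segment Elmore-style delay, 0.69 * R * C, in picoseconds."""
    return 0.69 * r_ohms * c_farads * 1e12

r0 = 100.0   # ohms at 25 C (assumed)
c = 50e-15   # 50 fF load (assumed)

for t in (25.0, 75.0):
    r = resistance_at(r0, t)
    print(f"{t:5.1f} C: R = {r:6.1f} ohm, delay = {rc_delay_ps(r, c):.2f} ps")
```

A 50°C rise in this model increases wire resistance, and therefore RC delay, by roughly 20%, which is why inter-block thermal gradients show up directly in timing.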

Temperature has an adverse effect on the reliability of devices, but not necessarily in expected ways and not always immediately.

“Increased temperature increases metal migration and causes wires to get thinner,” Abadir said. “That also means that self-inductance of these wires will increase, which can give rise to an increase in magnetic coupling, which compounds aging effects. The temperature of a block is a function of its activity profile as well as the temperature of the surrounding blocks. Hence, floor-planning and placement have a strong impact on the thermal profile of the various blocks. Monitoring temperature can identify hot spots, which can trigger control mechanisms to slow down the activity in that area. That, in turn, can reduce the temperature and reduce the negative effects of high temperature, such as power leakage and reliability. Also, thermal monitoring can give an early indication of areas in the design that are susceptible to early failure.”

Thermal issues have been well documented in device scaling, but they are getting significantly more difficult to address at each new node.

“With decreasing feature sizes and dielectric thickness, there is increasing consideration of thermal and thermal-stress effects that could exacerbate the over-stress and aging effects,” said Karthik Srinivasan, principal applications engineer at ANSYS. “Encompassing all reliability effects in circuit simulations is challenging, and alternate methods are being actively looked at to address such concerns.”

On a macro-scale, designers are looking at predicting thermal hotspots in the design as early as possible, Srinivasan said. “Although the idea is not new, semiconductor companies and system houses are looking at a more methodical approach of estimating thermal profile and hot spots through simulation and emulation vectors. They are used to perform thermally aware placement and may be even extending it to perform software optimizations to avoid/minimize thermal issues.”
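Estimating a thermal profile from a power map, as Srinivasan describes, can be sketched crudely. Real flows use calibrated thermal solvers driven by simulation or emulation activity vectors; the toy model below (grid, thermal resistance, and smoothing passes are all assumptions) only illustrates how a high-activity block heats its neighbors and where a hotspot lands:

```python
# Crude hotspot estimation from a per-block power map. Lateral heat
# spreading is approximated by averaging each cell with its neighbors;
# the thermal resistance r_theta (deg C per watt) is an assumed value.

def estimate_temps(power_grid, ambient=25.0, r_theta=2.0, passes=3):
    """power_grid: 2D list of watts per block. Returns estimated temps."""
    rows, cols = len(power_grid), len(power_grid[0])
    temps = [[ambient + r_theta * p for p in row] for row in power_grid]
    for _ in range(passes):  # diffuse heat toward neighboring blocks
        nxt = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                nbrs = [temps[x][y]
                        for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= x < rows and 0 <= y < cols]
                nxt[i][j] = 0.5 * temps[i][j] + 0.5 * sum(nbrs) / len(nbrs)
        temps = nxt
    return temps

power = [[1.0, 1.0, 1.0],
         [1.0, 8.0, 1.0],   # high-activity block in the center
         [1.0, 1.0, 1.0]]
temps = estimate_temps(power)
hot = max((t, i, j) for i, row in enumerate(temps) for j, t in enumerate(row))
print(f"hottest block at ({hot[1]},{hot[2]}): {hot[0]:.1f} C")
```

A placement tool doing "thermally aware placement" would iterate on exactly this kind of map, moving or duty-cycling the blocks that dominate the hotspot.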

Also, with increasing adoption of advanced packaging, thermal effects such as thermal coupling across chips are becoming a concern. “Needless to say, thermal-induced stress/fatigue are also a challenge when it comes to the copper pillars, solder joints and other thinner geometries used in 2.5/3D packages,” he noted.

Automotive issues
The issues are potentially more acute in automotive applications, where chips are being developed at 10/7nm for the central logic in assisted and autonomous vehicles. Most thermal modeling for advanced-node chips occurs in devices such as a smartphone, which shuts down when left in the hot sun until it reaches an acceptable operating temperature. But automotive electronics need to function under all conditions, and temperatures under the hood are sometimes extreme. In desert regions in some parts of the world, ambient temperatures of 140°F (60°C) are not uncommon during the hottest months.

“The bad thing about thermal is that reliability is related to temperature,” said Jerry Zhao, product management director at Cadence. “This is why it’s important for automotive chip designers. By nature, the automotive industry requires very high reliability. These chips sit inside a car that you should expect to run at least 10 years. Add to that the fact that sometimes it will be operating in a very hot environment, like Texas, where it is 115°F (46°C) outside. Who knows how hot that is under the hood. And when you have very high temperatures and when the current is high, electromigration tends to be more severe, which is basically caused by higher temperatures. There is more discussion about thermal effects with the increasing need to be cognizant of failure rates in devices. If you have 100,000 parts, in 10 years how many will fail? Statistically speaking, that will give you a parameter of how reliable the design is. Of course, there are some technologies for how to diagnose it and make a fix based on the design.”
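Zhao's "100,000 parts, 10 years" question is typically answered with a FIT rate (failures per billion device-hours) scaled by an Arrhenius temperature-acceleration factor. The sketch below is a back-of-the-envelope model; the FIT value, activation energy, and temperatures are illustrative assumptions, not data from the article:

```python
import math

# Back-of-the-envelope field-failure estimate from a FIT rate, with an
# Arrhenius factor for elevated operating temperature. All numeric
# inputs below are illustrative assumptions.

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV per kelvin

def arrhenius_af(t_ref_c, t_use_c, ea_ev=0.7):
    """Acceleration factor at T_use relative to T_ref (Ea assumed 0.7 eV)."""
    t_ref, t_use = t_ref_c + 273.15, t_use_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_ref - 1.0 / t_use))

def expected_failures(n_parts, fit, years, af=1.0):
    """FIT = failures per 1e9 device-hours; exponential-life approximation."""
    hours = years * 8760.0
    lam = fit * af / 1e9          # effective failure rate per hour
    return n_parts * (1.0 - math.exp(-lam * hours))

fit = 50.0                        # assumed FIT at a 55 C reference
af = arrhenius_af(55.0, 105.0)    # hotter under-hood environment
print(f"acceleration factor at 105 C: {af:.1f}x")
print(f"expected failures of 100,000 parts over 10 years: "
      f"{expected_failures(100_000, fit, 10, af):.0f}")
```

The point of the exercise is the sensitivity: running the same silicon tens of degrees hotter multiplies the effective failure rate, which is why under-hood electronics get a much tighter thermal budget than consumer parts.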

Fig. 2: Extreme heat warning in California’s Death Valley. Source: insideclimatenews

This is why automotive is becoming such an active market for identifying thermal effects for a variety of applications inside the vehicle, including high-power ICs.

“In terms of heat dissipation in ICs, things are pretty much under control,” said Mick Tegethoff, director of AMS product marketing at Mentor, a Siemens Business. “What engineering teams traditionally did was make sure that the junction temperatures on the chip were within a safe value, so they treated the whole chip as a unit as far as heat dissipation is concerned. They made sure that the junction temperatures were safe. Then, when they verified the design with SPICE simulation or standard cell characterization, they used the process, voltage, temperature (PVT) corners to make sure they’d meet specs across all of the temperatures. Obviously at high temperature things slow down. At low voltage things slow down. And at low temperature things get faster. These are all pretty solved issues.”

However, with high-power devices – specifically, things that are driving motors – big FETs that drive a lot of current and generate a lot of power are needed, and the heat has to be dissipated on those, Tegethoff said. “These big power FETs typically would be in a separate package, and design teams dealt with that by itself and figured out the same thing. Over the years they started to see, specifically when STMicro introduced the BCD process, that the thermal issues with the power device were so big that they had to worry about what it was going to do in real time in terms of thermal gradients across the chip.”

This resulted in close work with ST on electro-thermal co-simulation technology.

“On a SPICE simulator, the SPICE models allow you to specify a temperature per device, so that’s not a problem,” Tegethoff explained. “The question becomes what you say the temperature is because it is going to vary. The electro-thermal co-simulation performs an electrical time domain simulation, alongside a thermal evaluation of the localized temperature of the circuit, step by step. There is a thermal solver working in parallel with an electrical solver, and then the temperatures in all the devices at this time are updated. Then they solve for the electrical characteristics, and then they may want to go ‘n’ number of time steps and do it again. They can pick how often they run it.”
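The lock-step loop Tegethoff describes can be sketched in a few lines. The solvers below are stand-in toy models, not real SPICE or FEM engines, and every numeric parameter is an assumption; the structure, an electrical solve every time step with a thermal update every 'n' steps feeding temperatures back into the device models, is the point:

```python
# Toy electro-thermal co-simulation loop: the electrical solve runs every
# time step; every 'thermal_every' steps a thermal solve updates per-device
# temperatures, which feed back into the temperature-dependent models.

def electrical_step(temps_c, v=1.0, r0=10.0, alpha=0.004):
    """Toy electrical solve: per-device power with T-dependent resistance."""
    return [v * v / (r0 * (1.0 + alpha * (t - 25.0))) for t in temps_c]

def thermal_step(temps_c, powers_w, r_theta=40.0, tau=0.2, ambient=25.0):
    """Toy thermal solve: relax each device toward ambient + R_theta * P."""
    return [t + tau * ((ambient + r_theta * p) - t)
            for t, p in zip(temps_c, powers_w)]

def cosimulate(n_devices=4, steps=50, thermal_every=5):
    temps = [25.0] * n_devices
    powers = [0.0] * n_devices
    for step in range(steps):
        powers = electrical_step(temps)          # every time step
        if step % thermal_every == 0:            # user-chosen 'n'
            temps = thermal_step(temps, powers)  # update device temps
    return temps, powers

temps, powers = cosimulate()
print("final device temps (C):", [round(t, 2) for t in temps])
```

Because the electrical and thermal models are coupled, the loop settles to a self-consistent operating point: hotter devices draw slightly less power here, which in turn limits the temperature rise.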

This technology is used on chip designs today with very high power devices on the same IC with digital logic, such as for anti-lock brakes or any motor actuation that takes a lot of current. “This is even without talking about an electric motor driving the wheels,” said Tegethoff. “This is something that rolls the window up and down, or moves the seat, or puts on the brakes.”

There is more interest than ever in this approach for all kinds of SoCs, from high-reliability applications to safety and industrial applications. But it’s not the only approach. Another way to accurately monitor thermal conditions on an SoC is by using a locally placed, responsive embedded temperature sensor.

“To do this, high-precision low-power junction temperature sensors are commercially available, and they are meant to be embedded into ASIC designs,” said Stephen Crosher, CEO of Moortec. “These can be used for a number of different applications, including dynamic voltage and frequency scaling (DVFS), device lifetime enhancement, device characterization and thermal profiling.”
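One of the applications Crosher lists, DVFS driven by a junction-temperature sensor, amounts to a feedback controller. The sketch below is a hedged illustration: the operating points, thresholds, and hysteresis band are invented for the example and are not Moortec specifics:

```python
# Temperature-driven DVFS sketch: an embedded junction-temperature sensor
# reading selects an operating point, with hysteresis so the controller
# does not oscillate around a single threshold. All values are assumed.

OPERATING_POINTS = [               # (name, volts, GHz) - fastest first
    ("fast",      1.00, 2.4),
    ("nominal",   0.90, 1.8),
    ("throttled", 0.80, 1.0),
]
T_UP, T_DOWN = 95.0, 85.0          # throttle above 95 C, recover below 85 C

class DvfsController:
    def __init__(self):
        self.level = 0             # index into OPERATING_POINTS

    def update(self, junction_temp_c):
        if junction_temp_c > T_UP and self.level < len(OPERATING_POINTS) - 1:
            self.level += 1        # too hot: step down voltage/frequency
        elif junction_temp_c < T_DOWN and self.level > 0:
            self.level -= 1        # cooled off: step back up
        return OPERATING_POINTS[self.level]

ctrl = DvfsController()
for t in (70.0, 98.0, 97.0, 90.0, 80.0):   # simulated sensor readings
    name, v, f = ctrl.update(t)
    print(f"T={t:5.1f} C -> {name}: {v:.2f} V, {f:.1f} GHz")
```

The hysteresis gap between the two thresholds is the key design choice: without it, a die sitting right at the limit would flip between operating points on every sensor reading.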

The bottom line is that monitoring voltage is also a critical element in advanced-node SoC design. “Such circuits monitor voltage levels within the core logic voltage domains and provide accurate IR drop analysis,” Crosher said. “The measurement range can be customized to suit each technology. Supply monitors also can monitor analog (I/O) supply domains capable of monitoring supply droops and perturbations.”

This gets more difficult as SoCs are integrated into subsystems that are used in advanced packaging because thermal effects there are more difficult to predict. Heat resistance will increase through the package stack, as some of the die will not have direct contact with a heatsink. Also, when two cores are operating close to each other on different substrates, unexpected hotspots can occur that push both over their thermal limits. Hotspots also are likely to lead to more rapid aging in some regions of the SoC, giving rise to unexpected early failures in the field. A problem facing engineering teams is that they lack the data needed to identify where such hotspots will form.

Engineering groups are looking to run extensive TCAD/FEM simulations on real designs (or at least critical portions of them) to truly assess the impact of thermal stress, ANSYS’ Srinivasan said.

Other considerations
All of these effects are cumulative, too. In a system, thermal budgets are additive, and that system can be as large as the Falcon Heavy rocket, which was recently test-launched by Elon Musk’s SpaceX.

According to the company, Falcon Heavy is the most powerful operational rocket in the world by a factor of two, with the ability to lift into orbit nearly 64 metric tons (141,000 lb). That mass is greater than a 737 jetliner loaded with passengers, crew, luggage and fuel. Falcon Heavy’s first stage is composed of three Falcon 9 nine-engine cores, whose 27 Merlin engines together generate more than 5 million pounds of thrust at liftoff, equal to approximately eighteen 747 aircraft. All of that thrust generates a tremendous amount of heat.

While details were not readily available on exactly how the rocket scientists monitored their design for thermal impact, Cadence’s Zhao noted, “I’m sure that [Elon Musk] is more concerned about the other part of thermal than the Tesla, because automotive is all battery powered and that’s another level of thermal on the die for his rocket. It’s not just the consumption of the power because, remember, power causes the thermal rise. It’s also reliability issues. If you keep running at very high temperatures, the failure rate is going to be higher.”

The takeaway is that thermal effects need to be anticipated in context of other parts of a system, regardless of whether that system is an autonomous electric vehicle or a rocket. And as semiconductors find their way into new devices and take on new roles in existing devices, there is a lot of work ahead in terms of modeling thermal effects within a device, as well as characterizing these devices in the context of other devices, systems, and use models that today may not be fully understood or accounted for.
