Thermally Challenged


With leakage currents growing exponentially, thermal analysis is becoming more important for EM signoff. Will 3D-ICs help or hinder?

Chips run hot and the thermal densities increase with every reduction in fabrication geometry.

“When we go down to 16nm the local power density increases by 25% and the local gate density also increases by 25% to 30%,” explains Norman Chang, vice president of product strategy at Ansys/Apache.

In fact, this is becoming such a large problem that it is affecting the scaling process itself. “With 10nm and 7nm designs, one of the things we are finding is that they are not packing the standard cells to be smaller, in part because of the thermal density,” says Brandon Wang, director of 3D-IC solution at Cadence. Creating an optimized solution requires understanding the operating temperature of each part of the chip and how this can affect neighboring components.

Both types of power have to be considered—active or dynamic power, which is the power generated performing a useful function, and leakage power, which is the unintended wasted power that has no useful purpose. Leakage power has been increasing as a percentage of total power with each new technology node and at least one component of leakage, sub-threshold leakage, is very dependent on temperature.

Effect of Leakage Current at different Temperatures at 65nm. (Source: Analysis of the Effect of Temperature and Vdd on Leakage Current in Conventional 6T-SRAM Bit-Cell at 90nm and 65nm Technology. Shukla et al. International Journal of Computer Applications, 0975 – 8887, Volume 26, No.1, July 2011)
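To give a feel for why sub-threshold leakage is so dependent on temperature, here is a minimal Python sketch of the textbook sub-threshold current expression, in which current scales with the square of the thermal voltage times an exponential in the threshold voltage. The threshold voltage, its temperature coefficient and the slope factor are illustrative assumptions, not values from the article or from any particular process.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant / electron charge, in V/K


def subthreshold_leakage_ratio(temp_c, ref_temp_c=25.0, vth_ref=0.30,
                               vth_tempco=-1.0e-3, slope_factor=1.5):
    """Leakage at temp_c relative to leakage at ref_temp_c (dimensionless).

    Uses I ~ vT**2 * exp(-Vth / (n * vT)) with vT = kT/q.
    vth_ref (V), vth_tempco (V/K) and slope_factor (n) are assumed,
    illustrative device parameters.
    """
    def relative_current(t_c):
        v_t = K_OVER_Q * (t_c + 273.15)                  # thermal voltage
        vth = vth_ref + vth_tempco * (t_c - ref_temp_c)  # Vth falls as T rises
        return v_t ** 2 * math.exp(-vth / (slope_factor * v_t))

    return relative_current(temp_c) / relative_current(ref_temp_c)


if __name__ == "__main__":
    for temp in (25.0, 85.0, 110.0):
        ratio = subthreshold_leakage_ratio(temp)
        print(f"{temp:5.1f} C: leakage x{ratio:.1f} relative to 25 C")
```

Because hotter silicon leaks more and more leakage produces more heat, even a rough model like this shows why local temperature, rather than a single chip-wide number, matters for power estimates.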

Running Above the EM Cliff
With heat comes a host of other problems, most notably electromigration (EM). Most chips have to go through EM signoff at a specified temperature, normally around 110°C or 115°C. “This is above the cliff for EM degradation,” says Wang. “At 100° there is a huge degradation, and so at 115° you are adding 2X or 3X guard band to ensure that silicon will sign off.”

Chang agrees, adding that “for EM signoff, they may see too many violations and don’t have enough time to look at all of them. The EM limit is very sensitive to temperature so location specific EM violation due to temperature distribution is different compared to a uniform 110 degree EM violation.”
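The temperature sensitivity Chang describes is commonly modeled with Black's equation, MTTF = A * J^(-n) * exp(Ea / kT). The sketch below compares EM lifetime at a local temperature against a uniform 110-degree signoff point at equal current density, so the A and J terms cancel. The activation energy is an assumed, illustrative value; real values depend on the metal system and the foundry's reliability rules.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(temp_c, ref_temp_c=110.0, activation_energy_ev=0.9):
    """EM lifetime at temp_c divided by lifetime at ref_temp_c.

    Derived from Black's equation at equal current density.
    activation_energy_ev is an assumed, illustrative value.
    """
    t_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_k - 1.0 / ref_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Hypothetical local temperatures taken from a thermal map.
    for temp in (90.0, 100.0, 110.0, 115.0):
        ratio = em_lifetime_ratio(temp)
        print(f"{temp:5.1f} C: lifetime x{ratio:.2f} relative to 110 C")
```

Under these assumptions, a block that runs cooler than the signoff temperature has substantially more EM margin than a uniform-temperature check would credit it with, while a hot spot has less.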

Performing signoff without full thermal analysis is becoming problematic. Overdesign is no longer an option because it creates its own string of problems. “Because you have to guard band by 3X you have to increase your buffer sizes and many other aspects of the chip, such as metal width that adds to capacitance, and this will increase dynamic current,” explains Wang. “By guard banding you are increasing cost and [dynamic] power.”

But the problems don’t end there. Thermal fluctuations induce stress. The thermal expansion coefficient of silicon is quite different from that of PCBs, or even of the silicon interposers used for 2.5D assemblies. Differential expansion between the chip, package and other dies can create stress on the TSVs or the solder bumps, so stress analysis has to be considered. This could become even worse when dies of different materials are stacked. EM and stress both contribute to reliability.

3D to the Rescue
Surprisingly, the migration to 2.5D and even 3D-ICs is now being seen as a possible savior rather than as something that makes the problem worse. “You have a very good thermal conductor in the silicon interposer, which enables heat to be diffused through the bottom of the package,” Wang says.

The advantages continue. Adds Chang: “2.5D or 3D is used primarily to shorten the communications channel. With an interposer, the communications is much shorter compared to chip-to-chip designs through a PCB. Cload is much smaller. With 3D dies, the parasitics associated with the channel are even less.”

A typical chip measures 10mm x 10mm in the horizontal plane, and many signals have to cross a significant portion of that distance. When dies are stacked, the wafers are thinned, so the vertical dimension is only about 50µm.

An almost unlimited number of TSVs can be added, although they consume a considerable amount of silicon area. This enables much wider parallel buses to be used, but with this width come additional problems, such as jitter induced by large amounts of simultaneous switching noise. With decreased operating voltages, attention must be paid to the eye diagrams to ensure signal integrity. The TSVs also serve another purpose: they conduct heat out of the central chips.

Creating the Complete Thermal Model
One of the great benefits of 3D-ICs is that they will enable a kind of physical IP to exist, where the creator of the 3D stack will not have to develop all of the dies themselves. For example, a processor or SoC vendor may utilize memory dies from another company. While this reduces the amount that has to be designed, thermal analysis still has to be performed on the whole stack, and each die may have a very different operating temperature. A memory will often run at 85°C to 90°C, whereas an SoC may reach 115°C. That peak temperature is unlikely to be seen across the entire chip because there will be areas of very low activity. For example, a USB controller could stay at around 90°C due to its low activity.

To perform chip-level analysis, “the third party company will probably not want to send you the circuit design for the memory,” says Chang, “so when you come to perform power and thermal simulation, you have to ask them for a power/thermal model of their device.” This model contains two things: first, the external ports that connect to other chips and the parasitics associated with those ports, and second, the current demand of those ports. While this enables you to perform complete chip power integrity analysis, it will not enable the complete thermal map to be created.

Thermal map. Image courtesy of Ansys/Apache.

Finding the thermal map for each die is only part of the problem. “In a classic thermal analysis we only do gradient analysis,” says Wang, “so in the scenario when everything is running what is the temperature profile? In a low-power design and particularly for 3D design, what matters more is the transient temperature profile.”

This problem comes about because of the duty cycles and the thermal time constants in various parts of the device. For example, the peak junction temperature of active transistors will vary in µs, whereas for the die it will be on the order of ms, and for the package and substrate on the order of seconds. So, even though the entire chip may be below a critical temperature threshold, the instantaneous peak junction temperatures may be much higher than steady-state values, causing problems with circuit performance or reliability.
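The gap between average and instantaneous temperature can be illustrated with a simple lumped thermal model. The Python sketch below treats the junction, die and package as three first-order thermal RC stages in series (a Foster-style network) driven by a periodic power burst, and compares the peak temperature rise with the rise predicted from average power alone. The stage names, thermal resistances, time constants and the burst itself are all hypothetical, chosen only to span the microsecond, millisecond and second ranges.

```python
import math

# Hypothetical thermal stages: (name, thermal resistance in K/W, time constant in s)
STAGES = [
    ("junction", 0.5, 10e-6),
    ("die", 1.0, 5e-3),
    ("package", 4.0, 2.0),
]


def average_rise(power_w, on_time_s, period_s, stages=STAGES):
    """Temperature rise (K) predicted from duty-cycle-averaged power."""
    duty = on_time_s / period_s
    return duty * power_w * sum(r_th for _, r_th, _ in stages)


def peak_rise_periodic(power_w, on_time_s, period_s, stages=STAGES):
    """Peak temperature rise (K) at the end of each burst, in periodic steady state.

    Each first-order stage peaks at P * R * (1 - e^(-t_on/tau)) / (1 - e^(-T/tau)).
    All stages peak at the end of the on-time, so the per-stage peaks add.
    """
    total = 0.0
    for _, r_th, tau in stages:
        # expm1 keeps precision when the period is much shorter than tau.
        total += power_w * r_th * math.expm1(-on_time_s / tau) / math.expm1(-period_s / tau)
    return total


if __name__ == "__main__":
    # Hypothetical workload: 10 W bursts lasting 100 us, repeated every 1 ms.
    avg = average_rise(10.0, 100e-6, 1e-3)
    peak = peak_rise_periodic(10.0, 100e-6, 1e-3)
    print(f"rise from average power: {avg:.2f} K")
    print(f"peak rise during burst:  {peak:.2f} K")
```

In this model the slow package stage only ever sees the average power, while the fast junction stage follows each burst almost completely, which is why the peak can sit well above the average-power estimate.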
