3D-IC Reliability Degrades With Increasing Temperature

Electromigration and other aging factors become more complicated along the z axis.


The reliability of 3D-IC designs depends on the ability of engineering teams to control heat, which can significantly degrade performance and accelerate circuit aging.

While heat has been problematic in semiconductor design since at least 28nm, it is much more challenging to deal with inside a 3D package, where electromigration can spread to multiple chips on multiple levels.

“Because the power distribution network carries current to all the transistors, over time electromigration occurs, which is a drift of electrons that can cause structural changes in the wires,” said Rajat Chaudhry, product management director at Cadence. “The more unidirectional current in the power network, the bigger the electromigration issue. Analysis must be done to make sure there aren’t electromagnetic defects related to electromigration such as measuring how much current each wire in the power network will carry, and based on the current density in that wire, how long it is expected to function without having some structural damage to the wire.”
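The kind of check Chaudhry describes, going from the current in a wire to its current density and then to an expected lifetime, is often approximated with Black's equation, MTTF = A · J^(-n) · exp(Ea / kT). The sketch below is illustrative only and is not taken from any EDA tool. The function names, the activation energy (0.9 eV), the current exponent (n = 2), and the prefactor are assumptions chosen for the example. Real values come from foundry characterization.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def current_density(current_a, width_m, thickness_m):
    """Current density (A/m^2) for a wire with a rectangular cross-section."""
    return current_a / (width_m * thickness_m)


def black_mttf(j, temp_c, ea_ev=0.9, n=2.0, a=1.0):
    """Black's equation: MTTF = A * J^-n * exp(Ea / (k * T)).

    ea_ev, n, and a are assumed placeholder values, so the result is
    only meaningful as a ratio between two operating conditions.
    """
    temp_k = temp_c + 273.15
    return a * j ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def relative_lifetime(j, temp_c, j_ref, temp_ref_c, **params):
    """Lifetime at (j, temp_c) as a fraction of lifetime at the reference."""
    return black_mttf(j, temp_c, **params) / black_mttf(j_ref, temp_ref_c, **params)


if __name__ == "__main__":
    # Hypothetical wire: 1 mA through a 100 nm x 100 nm cross-section.
    j = current_density(1e-3, 100e-9, 100e-9)
    print(f"current density: {j:.3e} A/m^2")
    # Same wire, same current, but running at 115 C instead of 85 C.
    print(f"relative lifetime at 115 C vs 85 C: {relative_lifetime(j, 115, j, 85):.3f}")
```

Because the prefactor cancels in the ratio, comparing two conditions in this way avoids having to know A, which is why the sketch reports relative rather than absolute lifetimes.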

EDA tools can account for electromigration in planar chips, but it’s not so straightforward in 3D chips and packages. “Straight at the boundary, where these two things contact each other, there have been various studies to try to understand this,” said Rob Aitken, a Synopsys fellow. “For example, if you’re using a solder ball or microball to make the connection, electromigration in those solder balls can be significant. They can physically deform, and they can break, especially if they get a lot of current. That’s an effect you care about. If you’ve got things that are just touching each other in their copper, for instance, the effect is different.”

How to account for these effects isn’t entirely clear. “Where this gets really interesting is in the sign-off and margining part,” Aitken said. “This is different than a UCIe-type situation, where there is a PHY on one side and a PHY on the other side, and it’s inherently able to deal with a lot of weirdness and signals and do error correction on the fly. If you just have a bunch of logic connections, and say, ‘I’m going to stack this on top of this other one, and I’m going to have 10 million logic connections running up and down between them,’ electromigration is something you’re going to have to care about, because it’s going to undermine some of the assumptions that would exist in a 2D design.”

In a planar design, for example, it’s assumed that if there is a signal path between metal layers, those metal layers will behave the same way. But in a 3D design, Metal2 may be in a different chip that looks nothing like Metal2 in a planar chip. “If it goes up and down a couple of times, now you’ve got a path that contains multiple sources of correlated variation, and multiple sources of independent variation that you have to account for,” he said. “You’re going to have to do that either by signing something off and declaring, ‘There’s this much margin, so I’m going to ignore it,’ or I’m going to try to match these things up somehow such that when I stack two die together I’m going to make sure they come from similar process corners. That can be tricky if you’ve got one at 3nm and one at 7nm. What does it mean to be similar?”

In 3D, temperature variation needs to be considered architecturally. Because a processor generates heat, it’s logical to assume that if another circuit is added on top of that, it will heat up proportionately to the transistor density and usage. And likewise, stacking memory on memory might be expected to have minimal thermal impact, but the reality is much more complicated due to proximity effects and how heat is dissipated between various components and layers.

“One part of your memory may be at a different temperature than the other part of your memory, and that doesn’t happen normally in a memory,” Aitken said. “In a 2D memory, the sense amps or the address circuitry or word lines might heat up, but the bit cells are mostly just cool. In a 3D version, if it’s stacked on some heat source, then you may actually have a chunk of the bit cells that are hot and a different chunk of the bit cells that are cool, which potentially will produce unexpected behavior — unless the person architecting the chip thought about them. There’s more work for architects thinking about what the thermal implications of heat coming from elsewhere might be.”

Put simply, 3D-ICs add a whole new level of complexity that can affect everything from how well a chip performs to how quickly different parts age.

“The power used to come from the board, then through the package and onto the chip,” Chaudhry said. “Now what’s happening is there are multiple chips or chiplets packaged together, and sometimes they are stacked on top of each other, so the power distribution in many cases doesn’t just come through the package to the chip. It actually comes through the package, through one chip to the other chip. That adds another level of complexity for modeling the electrical characteristics of the power network, and it adds to the size of the power network because now you have multiple chips. That’s an area where more innovation will be required. We have to model all of the power, and include thermal at the same time. Thermal is a big issue in terms of reliability, and electromigration becomes exponentially worse with increasing temperature. That’s another reason you need to make sure that the chips operate within a certain temperature, because reliability degrades a lot as you increase your temperature.”

For 3D-ICs, the impact of heat is especially challenging. “The timing tool doesn’t expect the voltage to vary a lot,” Chaudhry said. “Similarly, when reliability and timing are modeled, we don’t expect a lot of thermal variation on the chip. Those models are built on a certain range of temperature on the chip, like maybe 115°C, but it should not be 20°C on one side of the chip and 115°C on the other side. If you have a lot of variation, that can cause a failure. In 3D-ICs, as the whole system becomes bigger and you have power going from one chip to the other, you can see a lot of thermal variation across the chips, and that can cause failure in timing.”

The chip or system needs to remain below a certain temperature, which is a top concern because very high temperatures can destroy both. “Temperature variation needs to stay within range,” Chaudhry said. “However, as these chips are stacked on each other, just controlling the total temperature itself is becoming a challenge. It requires a multi-physics solution, because thermal is a different kind of physics problem. Electrical analysis is a different kind of physics problem. There are multiple kinds of solvers interacting with each other. High thermal increases, like the resistance on the chip, change the electrical properties of the chip.”

Additionally, electromigration affects a 3D design differently than a 2D design in that the interconnects specific to 3D-ICs are more sensitive to multiple stresses within the systems.

“The interconnects specific to 3D-IC designs are very important for the reliability of the 3D-IC designs, and therefore designers need to check the electromigration,” said Norman Chang, fellow and CTO of the electronics, semiconductor and optics business unit at Ansys. “For 3D-IC design, different interconnects are used, including TSVs (through-silicon vias), interconnects in interposers, direct copper-to-copper bonding interconnect (on vertically stacked dies such as TSMC’s SoIC and Intel’s Foveros Direct), RDL in FOWLP on CoWoS, etc.”

Electromigration affects the useful lifetime of a device, as well. “Electromigration may cause the formation of voids or hillocks on interconnects, which consequently change the current density of the interconnects connected to the device and cause the aging of the interconnects and device,” Chang said.

Other factors that impact aging include negative bias temperature instability (NBTI), which causes a threshold voltage (Vth) shift and loss of drive current (Ion), with more impact on PMOS; hot carrier injection (HCI), with degradation of transistor switching performance caused by intense current; and time-dependent dielectric breakdown (TDDB), with transistor failure caused by persistent electric fields across dielectrics. “All these effects are very sensitive to temperature in 3D-ICs, and higher temperatures exacerbate the electromigration and aging of interconnects and devices,” Chang said.

While accounting for these effects is challenging, the impact is well understood. IR drop will increase systematically when voids start to occur, said Kristof Croes, scientific director for reliability, and group leader for imec’s Reliability Group. “It can either cause a slow degradation of the performance of the chip, or a catastrophic failure where a working chip ‘all of a sudden’ does not work anymore.”

This helps explain why disaggregating a chip into chiplets is so complicated. “There’s a lot of heating and cooling that goes on that has an impact,” said John Ferguson, director of product management at Siemens Digital Industries Software. “It introduces interstitial little cracks in your polysilicon or in your silicon itself, or in the metalizations, so even if you can DRC/guard-band and say, ‘I can get rid of the cases that I know are failures,’ when you have these little cracks, they don’t stay little forever. You start using the devices and they get hotter and colder, they’re shrinking and contracting, there is electromigration going on, and all kinds of weird stuff happens. Those problems get bigger until they blow up.”

To make matters worse, when all of these chiplets and interconnects are assembled into a package, it’s not possible to know the limits until the assembly system is defined. “That is very strongly dependent on all the pieces,” said Joseph Davis, senior director for Calibre interfaces and mPower product management at Siemens Digital Industries Software. “One of the great things people want to do now is put together chiplets from different manufacturers, which creates a system problem, where all those models and limits are now crossing from different foundries and going to a third party. This is incredibly challenging from an IP standpoint. If you really want to push the boundaries of what you can do from an integration standpoint, you’ve also drawn a box around what you can do because you have to do everything from a single manufacturer. The biggest reliability problem with electromigration between 3D versus a typical planar die is cracking. TSVs are huge compared to routing, but they don’t really get electromigration. There’s electromigration at the boundaries because of void accumulation, but the big thing is you’ve got this huge piece of metal, and it’s passivation and so forth in between these die. Now you take all the stresses and it cracks, but cracks are the way crystals relieve stress. You stress a piece of glass and eventually it fractures.”

The other main difference between electromigration in planar and 3D devices is that the power density has increased, and therefore the thermal density has increased in the package. “Electromigration is an exponential function of temperature,” Davis said. “The hotter it gets, the more likely it is to migrate, and therefore the lower current that you can handle. But with 3D, the research shows that reliability and electromigration with TSVs and the connections is more related to the stress induced by the thermal when you’ve got those dies together. So it’s not just current density. You’ve got current density in the individual chips that are exacerbated by the thermal. I’ve got a hot chip that is now impinging on other circuitry. I may not even have devices there, but I’m passing routing signals under that silicon, and that’s going to be affected. The three things come together are that you’ve got stress, thermal, and electromigration, which ultimately affect the current density that you can drive or your technology. It’s a circular problem. I’m going to choose this technology, which has these design rules in this area, and now I’ve got a thermal problem.”

With the complexity seen today, every stack is unique, which only complicates matters, he said. “Every time a customer says, ‘Hey, Mr. Foundry, I want to do this stack,’ they have to specify, ‘I want to put this chip with this other chip, and then this chip over here with this interposer.’ Then the foundry has to work with the EDA vendors to provide all the collateral. They can’t just take those standard PDKs and put some baling wire around them.”

The path forward
The only way to achieve design success is to start addressing the problems earlier in the design phase. “Previously, you would design the chip, and the power distribution check was typically what we’d call a sign-off before you tape out the chip,” Cadence’s Chaudhry said. “You make sure everything’s fine. You get a few errors, and you fix them. But now, because stress issues are becoming so localized, if you don’t solve these issues really early on as part of the design, you could end up with tons of errors near the tape out date and you won’t have time to fix them. With 3D-ICs, you need to do early analysis because once you set up that stack of different die, then later on find out that you can’t supply power properly, you can’t change your die stack up easily. Early power and thermal analysis becomes very critical for 3D-ICs.”

To avoid some of the issues in 3D-IC designs, design teams should work closely with foundries, performing thorough simulation/analysis on these electromigration effects on interconnects specific to 3D-ICs, in addition to the regular metal layers, said Ansys’ Chang.

But a lot of this also comes down to the same rules that have been around for a while. “Try to avoid putting too much current through wires that are too small,” said Synopsys’ Aitken. “This is something designers have been aware of for a long time. We all have this model in our heads of how electricity works that just says there are these little electric balls and they roll along the wires, but it doesn’t really work that way. Current comes from electron movement, but electrons don’t move at the speed of light, and electric current does. So the distinctions of what actually happens and how charge gets from here to there, that part is what matters for electromigration. You can have a wire that on average isn’t carrying very much current, but which locally in very brief periods carries a lot. That’s what you have to architecturally think about when you’re designing a power network.”
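Aitken's point about a wire that carries little current on average but a lot in brief bursts can be made concrete by comparing the average, RMS, and peak of a current waveform. The sketch below is a minimal illustration under assumed values. The function name and the sample waveform are hypothetical, and the idea that sign-off rules commonly set separate limits for these three quantities is stated here as general background, not as a description of any specific foundry rule deck.

```python
import math


def current_stats(samples_a):
    """Return (average, RMS, peak) of a uniformly sampled current waveform in amps."""
    n = len(samples_a)
    avg = sum(samples_a) / n
    rms = math.sqrt(sum(i * i for i in samples_a) / n)
    peak = max(abs(i) for i in samples_a)
    return avg, rms, peak


if __name__ == "__main__":
    # Hypothetical waveform: idle for nine samples, then one 10 mA burst.
    bursty = [0.0] * 9 + [10e-3]
    avg, rms, peak = current_stats(bursty)
    print(f"avg = {avg * 1e3:.2f} mA, rms = {rms * 1e3:.2f} mA, peak = {peak * 1e3:.2f} mA")
```

For this waveform the average is a tenth of the peak, so a check based on average current alone would miss the burst that the RMS and peak figures expose.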
