Thermal mismatch in heterogeneous designs, along with differing use cases, can affect everything from accelerated aging to warpage and system failures.
Thermal-induced stress is now one of the leading causes of transistor failures, and it is becoming a top focus for chipmakers as more and different kinds of chips and materials are packaged together for safety- and mission-critical applications.
The causes of stress are numerous. In heterogeneous packages, it can stem from multiple components composed of different materials. “These materials have different properties such as thermal expansion and conductivity,” said Melika Roshandell, product management director for Multiphysics System Analysis at Cadence. “When we have power in these devices that causes heat, various components in the design can behave differently. This can cause problems such as cracking of the BGA balls and buckling of the device, which can cause breakage.”
Norman Chang, fellow and CTO of the electronics, semiconductor and optics business unit at Ansys, explained that on a 3D-IC, stress is mainly caused by the mismatch in coefficient of thermal expansion (CTE) between two materials, which causes warpage and displacement.
“For example, silicon material has CTE of 2.6, while package material has CTE of 6, and FR4 of PCB has CTE of 17 PPM/°C,” Chang said. “Thermal-induced stress can be caused by the different expansion and contraction rates of different material in 3D-ICs, with temperature gradients from different workload or thermal cycling during testing. There are package-related mechanical stresses, including dielectric cracking, interfacial cracks, solder joint fatigue, copper trace cracking, and package delamination due to thermal-mechanical stress, moisture swelling failure and vapor pressure induced ‘popcorn’ cracking due to hygro-mechanical stress, and electromigration electro-thermal-mechanical trace failure due to electro-thermal-mechanical stress. And for large 3D-IC designs, stress can cause warpage and stress/strain in areas such as extreme low-k dielectrics.”
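To put Chang's numbers in perspective, the following is a minimal sketch of the first-order, unconstrained mismatch strain between two bonded materials, computed as the CTE difference times the temperature swing. The CTE values come from the quote above. The 100°C swing and the function name `mismatch_strain` are assumptions chosen for illustration, and real packages need finite-element analysis rather than this back-of-the-envelope estimate.

```python
# CTE values quoted by Chang, in ppm/°C.
CTE_PPM_PER_C = {"silicon": 2.6, "package": 6.0, "fr4_pcb": 17.0}


def mismatch_strain(material_a, material_b, delta_t_c):
    """First-order thermal mismatch strain (dimensionless) between two materials.

    strain = |alpha_a - alpha_b| * delta_T, ignoring stiffness, geometry,
    and any temperature dependence of the CTE itself.
    """
    delta_alpha = abs(CTE_PPM_PER_C[material_a] - CTE_PPM_PER_C[material_b]) * 1e-6
    return delta_alpha * delta_t_c


if __name__ == "__main__":
    delta_t = 100.0  # assumed temperature swing in °C, for illustration only
    pairs = [("silicon", "package"), ("package", "fr4_pcb"), ("silicon", "fr4_pcb")]
    for a, b in pairs:
        strain = mismatch_strain(a, b, delta_t)
        print(f"{a} vs {b}: {strain * 1e6:.0f} microstrain")
```

Under these assumptions, the silicon-to-FR4 pair sees roughly four times the mismatch strain of the silicon-to-package pair, which is one reason intermediate package materials matter.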
All of this affects the reliability of devices. “There is intentionally-designed stress in certain silicon crystal orientation for the speed-up of mobility in the device channel,” Chang said. “However, the warpage and stress can impact device and interconnect performance, as well. It is not easy to calculate the electrical performance impact, given stress distribution in 3D-ICs. Foundries may give a guideline on the maximum warpage/stress allowed in 3D-ICs, and designers should avoid exceeding the constraints.”
Fig. 1: Multi-physics stress in an electronic system. Source: Ansys
A number of mathematical equations are used to model thermal-induced stress, most of which contain an Arrhenius equation component. “This says that there is a particular activation energy for some particular effect, and once you create that amount of energy, then the effect will happen,” said Rob Aitken, a Synopsys fellow. “As you increase the temperature in a system, you increase the amount of energy, you increase the likelihood of these events happening, and so you see an exponential increase in whatever it is depending on the temperature.”
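The exponential dependence Aitken describes is commonly expressed as an Arrhenius acceleration factor between a use temperature and a higher stress temperature. The sketch below shows that calculation. The activation energy of 0.7 eV and the temperatures are assumed values for illustration, not figures from any foundry or from the sources quoted here.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor of a thermally activated mechanism.

    AF = exp[(Ea / k) * (1 / T_use - 1 / T_stress)], with temperatures in kelvin.
    AF > 1 means the mechanism proceeds faster at the stress temperature.
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    ea = 0.7  # assumed activation energy in eV, for illustration only
    for t_stress in (85.0, 105.0, 125.0):
        af = arrhenius_acceleration(ea, t_use_c=55.0, t_stress_c=t_stress)
        print(f"55 °C -> {t_stress:.0f} °C: acceleration factor {af:.1f}")
```

Because the temperature sits inside the exponent, each additional step in junction temperature multiplies the rate rather than adding to it.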
There’s also Black’s equation for electromigration, which has a similar temperature dependence on mean-time-to-failure (MTTF), as well as a current density dependence, where higher is worse.
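Black's equation is usually written as MTTF = A · J^(-n) · exp(Ea / kT), where J is current density and T is absolute temperature. Since the prefactor A is process-specific, the sketch below compares two operating points as a ratio so that A cancels. The exponent n = 2 and the activation energy of 0.9 eV are assumptions for illustration only, as are the operating points.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_mttf(j_ref, t_ref_c, j_new, t_new_c, n=2.0, ea_ev=0.9):
    """Ratio MTTF_new / MTTF_ref under Black's equation.

    MTTF = A * J**(-n) * exp(Ea / (k * T)); the prefactor A cancels in the ratio.
    n and ea_ev are illustrative defaults, not values for any real process.
    """
    t_ref_k = t_ref_c + 273.15
    t_new_k = t_new_c + 273.15
    current_term = (j_new / j_ref) ** (-n)
    thermal_term = math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_new_k - 1.0 / t_ref_k))
    return current_term * thermal_term


if __name__ == "__main__":
    # Same wire, same current density, but 20 °C hotter.
    print(f"hotter:       {relative_mttf(1.0, 85.0, 1.0, 105.0):.3f}x reference MTTF")
    # Same temperature, but twice the current density.
    print(f"2x current:   {relative_mttf(1.0, 85.0, 2.0, 85.0):.3f}x reference MTTF")
```

The ratio form makes the two dependencies in the text visible separately: the power law in current density and the Arrhenius-style term in temperature.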
As the temperature rises, a number of effects — hot carriers, electromigration, voltage stress, and temperature stress — worsen. “That is coupled with statistical variation in temperature across die, which means in planar devices, there were effects like, ‘Oh, this thing gets worse.’ Bias temperature instability (BTI) is the classic example,” Aitken explained. “It gets worse. You shut the thing off, and it gets better, but then gets worse again, in a sawtooth-like behavior. If it never gets a chance to relax, then the sawtooth doesn’t come down quite as far the next time, and it just gets worse over time.”
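The sawtooth Aitken describes can be illustrated with a deliberately simple toy model. It is not a calibrated BTI model. It assumes each stress phase adds a fixed amount of degradation, that a fixed fraction of it is permanent, and that each rest phase heals a fixed fraction of the recoverable part. All parameter names and values are hypothetical.

```python
def simulate_sawtooth(cycles, stress_step=1.0, permanent_fraction=0.2, relax_fraction=0.8):
    """Toy stress/recovery model producing (peak, trough) degradation per cycle.

    Each cycle adds `stress_step` of degradation, split into a permanent part
    and a recoverable part. The rest phase then removes `relax_fraction` of the
    accumulated recoverable part. Units are arbitrary.
    """
    permanent = 0.0
    recoverable = 0.0
    trace = []
    for _ in range(cycles):
        # Stress phase: degradation rises.
        permanent += stress_step * permanent_fraction
        recoverable += stress_step * (1.0 - permanent_fraction)
        peak = permanent + recoverable
        # Rest phase: only the recoverable component heals.
        recoverable *= 1.0 - relax_fraction
        trough = permanent + recoverable
        trace.append((peak, trough))
    return trace


if __name__ == "__main__":
    rested = simulate_sawtooth(5)
    never_rested = simulate_sawtooth(5, relax_fraction=0.0)
    for i, ((peak, trough), (_, no_rest)) in enumerate(zip(rested, never_rested), start=1):
        print(f"cycle {i}: peak {peak:.2f}, trough {trough:.2f}, no-rest level {no_rest:.2f}")
```

Even in this toy form, the troughs creep upward each cycle, and removing the rest phase entirely makes the degradation accumulate fastest, which matches the qualitative behavior in the quote.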
The relevant equations here are statistical models, which assume that if you have enough instances of something, they will follow this behavior. Put simply, you expect they will fail over time at a predictable rate. That is the foundation of almost all reliability analysis, which essentially is a failure-in-time mechanism.
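Failure rates of this kind are conventionally quoted in FIT, where 1 FIT is one failure per billion device-hours. The sketch below shows the arithmetic under the usual constant-failure-rate assumption. The test and fleet sizes are made-up numbers for illustration, and the function names are hypothetical.

```python
HOURS_PER_BILLION = 1e9  # 1 FIT = 1 failure per 1e9 device-hours


def fit_from_test(failures, devices, hours):
    """Point estimate of the FIT rate from a life test (no confidence bounds)."""
    return failures / (devices * hours) * HOURS_PER_BILLION


def expected_failures(fit, devices, hours):
    """Expected number of failures for a population, assuming a constant rate."""
    return fit * devices * hours / HOURS_PER_BILLION


if __name__ == "__main__":
    # Illustrative numbers only: 2 failures among 1,000 parts over 1,000 hours.
    fit = fit_from_test(failures=2, devices=1000, hours=1000)
    print(f"Estimated rate: {fit:.0f} FIT")
    # Project onto a hypothetical fleet of 100,000 parts over one year.
    fleet = expected_failures(fit, devices=100_000, hours=8760)
    print(f"Expected fleet failures in one year: {fleet:.0f}")
```

The estimate only holds when there are enough instances for the statistics to apply, which is the same caveat the text raises about these models.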
“While there are many equations, there’s also the challenge that you can model these things at a macro level, even though at the micro level you probably can’t,” Aitken said. “This means you have to assume that people who design the process are looking at the device level, figuring out what the reliability profile is at the device level, and creating some kind of aging model based on that. Again, you can have the fanciest solver in the world, but it’s not going to be able to solve all the instances that are out there. You have to pick and choose.”
When things got worse
Stress isn’t a new challenge, but prior to 90nm it was largely ignored.
“That’s when we started having to take it more seriously,” said John Ferguson, director of product management at Siemens Digital Industries Software. “In SoC design, there are some strange issues in the way that we typically do CMOS design layout such as shared source drains, and maybe multiple actives within a well. These features create localized stresses which impact the individual transistors. That begs the question of, ‘If I laid all of these out the same, but some work differently than others, what’s going on? How can I prevent that?’”
While there is no single solution, there are ways to lessen the impact of various types of stress. “DRC rules are always put in to say, ‘If you put these things too close or too far, we know they’re going to be horrific,’” said Ferguson. “The 90nm node is when we started introducing the concept of adding the stress impacts to post-layout simulation with advanced properties. This means each individual device in the netlist gets a property that indicates something about its stress.”
That data is then fed into the process models, which at 90nm was a big headache. “How to get those defined, including what happens when there are things in series, in parallel, as well as how those change, eventually got worked out,” Ferguson said. “We were good for a while.”
Now, as the industry prepares for chiplets in various packaging formats, stress becomes even more problematic.
“From a purely mechanical stress perspective, you’ve got to think about not just designing the chiplet and another one right below it,” Ferguson said. “There are many different considerations. You also have to understand that is putting stress on those devices. Thermal adds another impact. It’s basically thinking that if you put something on top of another thing, it makes it warmer. The more coats you put on, the heavier it is, and the warmer you get. This gets into the aging question of how long have you got? That’s the million-dollar question. On the verification side, I can tell you relatively, ‘These are nominal,’ or, ‘These are really risky and you should take a closer look, and possibly do some experiments with moving things around a little bit.’ But I don’t know how bad it is, so I can’t say, ‘You can use this 50 times before it’s done,’ or, ‘You can go a good 10 years, and you don’t have to worry about it.’”
AI’s impact
Another area where thermal-induced stress may wreak havoc is in designs containing AI/ML engines, which may be running at maximum speed for the majority of their lifetimes.
“There are some systems that people want to run as blazing fast as possible, and others they want to last forever, so they dial it way back,” said Steve Roddy, chief marketing officer at Quadric.io. “The same is true for the other effects of having these big machine learning chips — electrical integrity, voltage switching, and voltage droop. If I want to make an inexpensive piece of silicon, I am going to deliberately slow down my NPU. I’m going to stagger my clocks. I don’t want to switch everything on the edge. I want a deliberately bad clock tree (artfully designed) so I can not have big plates of metal stacked up on layers 8, 9 and 10 to prevent voltage drooping, and that kind of thing. There’s so much that’s idiosyncratic to how each chip designer is thinking about performance points, and longevity, and peak versus average compute capability, etc. It’s very situation dependent.”
Roddy puts thermal stresses into two categories. “There are transient issues, such as in cellphones when it’s doing one thing, and then you switch to your AI-enabled camera. Suddenly, now you’ve got the NPU cranked up and it’s doing bokeh effect, beautifying your face, making you look 20 years younger, and giving you a full head of hair, or whatever so you can take that perfect selfie. Phones and laptops are probably the only two categories where you’ve got really big beefy machine learning cores and general-purpose neural processor IP (GPNPUs) where they’re not always running. In everything else, you typically have designed the system for that NPU to be constantly on so you don’t get the transient thermal changes.”
Engineering teams building cell phones are accustomed to keeping the device cool and low power until the GPU is cranked up to play a game, for instance, which causes thermal spikes from the activity. The design team would have done thermal envelope management for these temporary peaks of power consumption.
“But if it’s your smart security camera on your front porch, looking for porch pirates who are stealing stuff, it’s running all the time,” Roddy said. “And if you’re running things really hot all the time, using the big Nvidia GPU that’s blazing hot, you’ve got longevity problems from thermal degradation, so the lifespan is shorter.”
On the other hand, if it is an application like an Nvidia GPU card in a data center, that may be replaced every two years, anyway. “Some next chip is going to come along, and on a useful work-per-megawatt basis, you’re going to want to get rid of the three-year-old GPU chip because it’s doing a fifth of the work of the new one for the same power budget. Whereas if it’s in your car, you want your car to last for more than three years, so people will run the junction temperatures of long-life products at a much less critical temperature. If it’s a GPU that’s doing data Bitcoin mining, you don’t care if it’s on the edge of dying for six months, because you’re going to throw it away anyway.”
Designing for thermal
Another big question is how to design in such a way that takes all of this into account.
Ferguson said this aligns to the general philosophy about shifting the work left. “Do it early, do it as soon as you can,” he said. “See what’s there. Adjust. Add more to it. Do another round, and then another round, and another. You have to keep going with each subsequent step. Once you take care of some of the problems, see what’s still remaining as you go. I don’t know a better way to do this. It’s an iterative process. You could make it an automated iterative process, but it’s still iteration in the end.”
A key aspect to visibility is using models early enough in the design phase to account for thermal-induced stress. Synopsys’ Aitken said the easiest way to model thermal-induced stress is at a macro level. “You can just say, ‘Given the effects that we know about – BTI, HCI (hot carrier injection), time-dependent dielectric breakdown (TDDB), and the like, we can model how those are affected by voltage and temperature.’ Then we can generate a modified library that accounts for all of that, and commercial tools can do this now.”
Another key capability of today’s tools is adding the workload component into the planning. “You really have to do that today because you get this weird behavior,” Aitken said. “A lot of these effects have this so-called healing property. If you run it for a while, and then you stop running it, you won’t be able to account for the aging. You don’t want to over-margin it because the device would never ship.”
Thermal models are essential to early analysis
Thermal models are complicated, but they are essential for chip and system architects to perform early thermal analysis.
Siemens’ Ferguson said thermal models have been around for packaging processes for quite some time. “Thermal models in the die? Not really so much. We are now putting multiple die into these packages, and to figure out the thermal issues you need to really understand the die itself. You can’t treat it like it’s a single, uniform object. It’s not. There are dense areas of metallization. There are not-so-dense areas. Glass and metal have very different properties. Stress properties and thermal properties, and the silicon itself, are all going to have different behaviors when they’re put into that system. This means it is a more system-level approach, and to get it accurate, you’ve got to have a certain level of detail.”
Joseph Davis, senior director for Calibre interfaces and mPower product management at Siemens Digital Industries Software, noted that understanding these effects is essential for high-end designs, particularly in the mobile space. “Mobile is all about form factor and battery life. Battery life is about reducing total power. Then you end up with 3D, but a very small package. It has to be thinned down as much as possible. Back in the day when you just put it on a board, you didn’t care how big that silicon was. Now, form factor makes a difference. There are things like die without packages, and die on die, and they’ve got to thin them down so they can stack three of them on top of each other. And with that thinning, you can’t relax the stress through all that mass.”
Thinning the metals affects how heat acts on the die. The heat is more intense, and there is less mass to dissipate that heat. That leads to a variety of difficult-to-find problems, such as the cause of silent data errors or why a device shut down.
“Is it a thermal thing? Is it a defect thing? Is it an all-of-the-above-thing? It’s hard to know because a side effect of having to print things with EUV and ultraviolet light is that you cannot see them with microscopes,” said Synopsys’ Aitken. “You have to blast it with ultraviolet or with electrons or something to see it. There’s a whole realm of issues around inspection tools. They’re very expensive. They’re complicated to use. And they may or may not find what you’re looking for. You wind up with a streetlight effect, where if I can predict exactly what the failure mechanism is, I know exactly where it is, then I can go look for it and say, ‘Yes, I found this.’ But if you’re not quite sure what the failure mechanism is, there’s a chance that if you go look for it, you won’t find it. There’s also a chance that you actually may destroy the part of the chip that had the failure mechanism on the way to finding what you were trying to find. The challenge is that it’s really hard to find them. You wind up having to get a behavior signature out of the chip to say, ‘It exhibits this behavior signature. The theory of this particular failure mechanism matches the observed behavior. Therefore, it’s probably that.’”
There’s also an art of repeating the tests at high voltages and different temperatures. “The hardest problems to track down are ones where you can see it failing in some kind of long sequence of operation, but you can’t make it fail a classic scan test for one reason or another,” Aitken said. “That makes it a whole lot harder to diagnose. Simulation tools can help you do that, but they’re not magic.”
Ferguson agreed. “One of the biggest challenges with thermal-related aging issues is that it is difficult to trace them if you haven’t done any level of analysis up front. All of a sudden you may get some problem, and you know it was working on the test bench. Or you send it out to customers, and three months later you’re getting a whole bunch of returns because they all failed. That’s a big issue, and they spent a lot of time figuring out, ‘I found what isn’t working. Why isn’t it working?’ You don’t want to be in that spot. We have seen customers in that position, so the first thing we do is help them figure out where these failures are likely to be occurring so that you can make a change before you build it. Further, what is not done so much today, but is starting, is very early on to look at the manufacturing phase at both the die-level manufacturing as well as the assembly-level manufacturing. When you have multiple die, a lot of times with C4 bumps and the like, they go through different levels of heating as you’re putting the compounds together. They grow and shrink at different rates. So all of a sudden in your design, all the balls may be lined up to the pins, but the balls can grow faster than the die did, and now they’re shifted out and don’t line up, so you don’t have a connection. You have to catch this stuff. You have to know about that.”
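The misalignment Ferguson describes can be estimated to first order as the CTE difference times the distance from the center of the die times the temperature change during assembly. The sketch below uses the silicon and package CTE values quoted earlier by Chang. The 10mm distance to the outermost bump and the 200°C assembly excursion are assumed numbers for illustration, and the function name is hypothetical.

```python
def lateral_shift_um(cte_a_ppm, cte_b_ppm, distance_mm, delta_t_c):
    """First-order relative in-plane shift between two joined parts, in micrometers.

    shift = |alpha_a - alpha_b| * distance_from_neutral_point * delta_T,
    assuming free, uniform expansion and ignoring warpage and constraint.
    """
    delta_alpha_per_c = abs(cte_a_ppm - cte_b_ppm) * 1e-6
    distance_um = distance_mm * 1000.0
    return delta_alpha_per_c * distance_um * delta_t_c


if __name__ == "__main__":
    # Silicon (2.6 ppm/°C) against package material (6 ppm/°C), as quoted earlier.
    # Distance and temperature excursion are illustrative assumptions.
    shift = lateral_shift_um(2.6, 6.0, distance_mm=10.0, delta_t_c=200.0)
    print(f"Shift at the outermost bump: {shift:.1f} um")
```

A shift of a few micrometers is small against a coarse bump pitch but becomes a meaningful fraction of the pad size as pitches shrink, which is why the check belongs before assembly rather than after.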
Conclusion
Stress is one of the biggest challenges at advanced nodes and in advanced packaging, and it needs to be dealt with early in the design cycle.
“Thermal stresses can have a significant effect on a structure’s strength and stability, potentially causing cracks or breaks within certain components,” said Cadence’s Roshandell. “Such failures risk the overall reliability of the electronics, which can lead to possible weakening and deformation, and ultimately breakage. Design teams can help avoid stress problems by doing early design analysis to mitigate any risk.”
That stress can manifest itself in multiple ways, from aging to warpage. “For example, in the latest AMD 3D-IC design with SOIC technology, which was mentioned in Hot Chips 2021, the structural silicon is used to balance the structural integrity of the design,” said Ansys’ Chang. “Stress/warpage simulation should be performed before tape-out, as well, to make sure the maximum warpage/stress do not exceed the foundries’ guidelines.”