Design Flow Challenged By 3D-IC Process, Thermal Variation

Rethinking traditional workflows by shifting left can help solve persistent problems caused by process and thermal variations.


3D-ICs are proving a challenge even for designers accustomed to dealing with power and performance tradeoffs, but they are considered an inevitable migration path for leading-edge designs due to the compute demands of AI and the continual shrinking of digital logic.

3D-ICs are widely viewed as the way to continue scaling beyond the limits of planar SoCs, and a way to add more heterogeneous devices developed at different process nodes into the same package. But regardless of whether it’s a planar SoC, or an assembly of chiplets, the laws of physics are insurmountable, and there are only so many tricks engineers can employ before they hit a wall.

The smaller wires in advanced nodes increase resistance, which increases heat. And larger structures, like 3D-ICs, can increase the range of thermal gradients. This is made worse by the fact that there are limited ways to dissipate heat in such structures. The negative results can range from subtle effects, like electromigration, to dramatic situations like chips catching fire. In addition, as manufacturing process nodes drop well into the single-digit nanometer range, and then the angstrom range, it becomes harder to control or account for variation, which can cause critical issues like increased noise and decreased reliability. All of this requires designers to navigate an ever-more fragile balance of optimal performance specs versus uncooperative physical realities.

The complexities of 3D-ICs increase the risk of what were once theoretical thermal issues, such as spontaneous DRAM refreshes, and thermal runaway that may force a device to shut down. In photonics applications, heat can interfere with communications by changing the wavelength of light.

Fig. 1: Temperature distribution of a chip and package assembly. Source: Ansys.

“Thermal can cause timing problems, too,” said Lang Lin, principal product manager at Ansys. “High temperature can cause high delay of the wire, which then reduces the speed of the circuit. We’ve heard from the foundries that thermal is the center of the world.”
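
The effect Lin describes can be approximated to first order with a linear temperature coefficient of resistance. The sketch below is an illustration, not a signoff model. It assumes the bulk-copper coefficient of roughly 0.0039 per °C, a hypothetical wire of 100 ohms at 25°C, temperature-independent capacitance, and delay proportional to R·C. Real nanoscale interconnect, with barrier layers and surface scattering, will deviate from these numbers.

```python
# First-order estimate of how wire delay grows with temperature.
# Assumptions (not from the article): linear TCR model, bulk-copper
# coefficient, and a delay proportional to R*C with constant C.

ALPHA_CU_PER_C = 0.0039   # approximate TCR of bulk copper near room temperature
T_REF_C = 25.0            # reference temperature for the nominal resistance


def wire_resistance(r_ref_ohm: float, temp_c: float,
                    alpha: float = ALPHA_CU_PER_C) -> float:
    """Resistance at temp_c, given the resistance at T_REF_C."""
    return r_ref_ohm * (1.0 + alpha * (temp_c - T_REF_C))


def relative_delay(temp_c: float, alpha: float = ALPHA_CU_PER_C) -> float:
    """RC delay at temp_c divided by RC delay at T_REF_C (C held constant)."""
    return wire_resistance(1.0, temp_c, alpha)


if __name__ == "__main__":
    for temp in (25.0, 85.0, 125.0):
        r = wire_resistance(100.0, temp)   # hypothetical 100-ohm wire
        print(f"{temp:6.1f} C  R = {r:6.1f} ohm  delay x{relative_delay(temp):.3f}")
```

Under these assumptions, a 100°C rise adds about 39% to the wire's RC delay, which is the kind of margin that has to be budgeted in timing analysis.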

Thermal and process variation can be standalone issues, or multipliers of each other’s problems. Either way, they can cascade and require foresight to solve. “These issues can be somewhat orthogonal,” said Michael White, senior director of physical verification product management for Calibre Design Solutions at Siemens EDA. “The thermal issues are going to have to be addressed at a more macro level. If I have a thermal issue, the easiest first step is changing the floor plan. If that isn’t enough, I can start contemplating things like putting thermal pillars through my active design to draw heat away from the hot area. After that, I need to start contemplating how the dies are sitting in the package. In the worst case, if I’m starting to run out of options, I have to change my package. I have to put heat sinks on it, and so on. Those are the options from simplest and lowest cost throughout the lifecycle of creating the overall assembly. If I can shift left, I can figure out those things early.”

Warpage a top issue
Currently, the biggest challenge for 3D-ICs is thermal-induced warpage. Warpage has gone from an occasional problem to a persistent one, as densely packed configurations of heterogeneous materials increase temperatures and require sophisticated modeling of thermal coefficients to avoid yield loss. On top of that, substrates are thinner, which reduces their ability to channel heat out of a device.

“Nobody talked about thermal-induced warpage and stress analysis about a year ago,” said Melika Roshandell, product management director at Cadence. “It’s picking up because of 3D-IC. Designs are getting a lot more aggressive with thermal, as well as getting smaller and thinner, and that has a big impact on warpage. The big topic [at conferences] was thermal, but right now it’s thermal-induced warpage.”

The increasing size of interposers is contributing to the problem as well, said Lin. “Today’s 3D-IC interposer has become larger, probably anywhere from 2,000 to 5,000mm². As a result, the warpage is becoming so large that it can’t be ignored anymore. Before it was approximately 5nm or so, but for a large interposer it could be even higher. That shift of distance between two materials can cause mechanical failures, a crack of the connections between the dies.”
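
One way to see why interposer size matters is that the difference in free thermal expansion between two bonded materials grows linearly with the span. The sketch below is a back-of-the-envelope estimate, not a warpage model, which would need layer thicknesses, stiffness, and finite-element analysis. The assumptions are a silicon expansion coefficient of about 2.6 ppm/°C, an assumed 15 ppm/°C for an organic substrate, square interposers, and a hypothetical 100°C temperature excursion.

```python
import math

# Assumed material values, for illustration only.
CTE_SILICON_PPM = 2.6     # ppm per deg C, approximate value for silicon
CTE_SUBSTRATE_PPM = 15.0  # ppm per deg C, assumed for an organic substrate


def edge_length_mm(area_mm2: float) -> float:
    """Edge of a square interposer with the given area."""
    return math.sqrt(area_mm2)


def expansion_mismatch_um(length_mm: float, delta_t_c: float,
                          cte_a_ppm: float = CTE_SILICON_PPM,
                          cte_b_ppm: float = CTE_SUBSTRATE_PPM) -> float:
    """Difference in free thermal expansion of two materials, in micrometers."""
    delta_cte_per_c = abs(cte_b_ppm - cte_a_ppm) * 1e-6
    return delta_cte_per_c * length_mm * delta_t_c * 1e3   # mm -> um


if __name__ == "__main__":
    for area in (2000.0, 5000.0):
        edge = edge_length_mm(area)
        gap = expansion_mismatch_um(edge, delta_t_c=100.0)
        print(f"{area:6.0f} mm2  edge {edge:5.1f} mm  mismatch {gap:5.1f} um")
```

With these assumed values, the end-to-end mismatch is roughly 55µm for a 2,000mm² interposer and roughly 88µm for a 5,000mm² one. The bonded stack has to absorb that difference as stress and bending.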

Mechanical stress, another inevitable result of advanced designs, adds to the 3D-IC multi-physics concerns. “There are issues around mechanical stresses that go hand-in-hand with thermal,” said John Ferguson, product management director for Calibre nmDRC applications at Siemens EDA. “As you make things hotter, the wires tend to morph, so that changes the mechanical stresses. In particular, the worry seems primarily to be around the bumps. Are we getting good adhesion and good definition of the bumps, such that they’re making the ohmic contacts properly and adequately? Wafer cracking is another level of concern, and so is warpage in general. If you’re stacking things together, then you have to expect that both sides are planar or you have a risk of air gaps or some other form of gapping. You can’t ignore any of them.”

Adding in TSVs becomes an issue, too. “You’re making big holes and filling them up with other materials,” Ferguson explained. “How to do that without inducing some forms of warpage along the way is not so easy. It all comes down to how well you can control the chiplet process in the beginning to make sure they’re as flat as possible. The next part is being careful with how you are stacking things. For example, if we’re talking about stacking a die on a die, or a second die on a wafer, the first step is to make sure you have good planarity approaches. This is dependent on how you do your fill, as well as the manufacturing, chemical-mechanical-polish processes, and how tightly they can be tuned. That can get a little bit more challenging when we’re talking about very thin dies. If you’re talking about placing individual chiplets in context with other chiplets, there are situations where a chiplet intentionally overhangs another chiplet. It’s like having a diving board, where you’re partly on land on one side, and then you’re overhanging into the pool on the other side. That’s definitely a place where you can get warpage.”

The situation has become so bad that it’s affecting basic priorities. “An SoC designer only cares about three things (power, performance, and area), but thermal is becoming the fourth one. Everything that used to be PPA is changing to PPA-T,” Cadence’s Roshandell said. “Your performance is going higher. That means the power is going higher, and you want to reduce the area, so your thermal is going to be worse. In all of these things, your thermal is always against you. For package and PCB, you also have to care about signal integrity, power integrity, warpage and other things for which thermal is a global problem. You cannot look at your die in a little corner, and your package and PCB in another little corner, and say, ‘My package is fine, so I don’t have to worry about anything else.’ You have to consider your whole design when you do your thermal analysis. That’s why it’s so important for any engineer to look at the whole problem and then do their thermal analysis. For that, especially in 3D-IC, you need a lot of capacity in your tool to be able to read the whole design without simplification. If you simplify it, you will lose accuracy, and a lot of engineers don’t know where to simplify. As an engineer, you need to keep in mind that it’s a global problem, and you need a tool that has the capacity to do the analysis in the earliest stages of design.”

If thermal analysis is not done early enough, problems are likely to occur. Many devices use process, voltage, and temperature (PVT) sensors to detect excessive heat, then employ thermal throttling as necessary. “Scattered thermal sensors on your chip can monitor the local temperature,” said Siemens’ White. “If the top local temperature gets too high, they then have the ability to slow down the local clock.”

However, that solution comes at a performance cost, which makes a device uncompetitive. “[Thermal throttling] doesn’t solve the problem,” said Marc Swinnen, director of product marketing for the Semiconductor Division of Ansys. “It detects it, and then pays the price to fix it. The nominally rated performance cannot be achieved because the chips will constantly be heating up and throttling themselves, so it’s very expensive, and it says we failed to predict this problem. It shows up and now we pay the price to reduce the power, but that’s not what you want. What you want is to be able to predict it.”
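
The cost Swinnen describes can be illustrated with a toy model: a single lumped thermal node, one sensor, and a clock that drops when a high threshold is crossed and recovers below a lower one. Everything here is a hypothetical assumption chosen for illustration, including the thresholds, the thermal resistance and time constant, and the simplification that power scales linearly with clock frequency. In real silicon, voltage scaling makes that relationship steeper.

```python
from typing import Tuple


def simulate_throttling(steps: int = 2000, dt_s: float = 0.01,
                        t_amb_c: float = 45.0, r_th_c_per_w: float = 0.5,
                        tau_s: float = 1.0, p_full_w: float = 150.0,
                        t_high_c: float = 105.0, t_low_c: float = 95.0,
                        throttled_clock: float = 0.6) -> Tuple[float, float]:
    """Return (average clock fraction, peak temperature in C)."""
    temp_c = t_amb_c
    clock = 1.0
    clock_sum = 0.0
    peak_c = temp_c
    for _ in range(steps):
        # Sensor reading with hysteresis: throttle when hot, restore when cool.
        if temp_c >= t_high_c:
            clock = throttled_clock
        elif temp_c <= t_low_c:
            clock = 1.0
        power_w = p_full_w * clock                      # assumed linear in clock
        t_steady_c = t_amb_c + power_w * r_th_c_per_w   # steady state at this power
        temp_c += (t_steady_c - temp_c) * dt_s / tau_s  # explicit Euler step
        clock_sum += clock
        peak_c = max(peak_c, temp_c)
    return clock_sum / steps, peak_c


if __name__ == "__main__":
    avg_clock, peak = simulate_throttling()
    print(f"with throttling: average clock {avg_clock:.2f}, peak {peak:.1f} C")
    avg_clock, peak = simulate_throttling(throttled_clock=1.0)
    print(f"no throttling:   average clock {avg_clock:.2f}, peak {peak:.1f} C")
```

In this toy setup the throttled device stays near its temperature limit but spends much of its time below the nominal clock, while the unthrottled one holds full speed and overshoots the limit. Neither outcome is acceptable, which is the argument for predicting the problem during design.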

Shift Left
EDA experts continue to emphasize that the answer to predicting, and thus reducing, problems is shifting left, meaning moving analysis and verification earlier in the design flow. It’s far better to understand and fix potential problems early on.

“It used to happen that the designer would design the IC, and after everything was done, then they would send it to the analysis engineer. But that doesn’t work anymore, especially in 3D-IC,” said Roshandell. “You have to start doing design and analysis on day one, so that if you need to change anything in your design you can address it right then. Some of the Band-Aid solutions that we had before, such as adding fans and heatsinks, don’t work anymore because the thermal is happening so fast, that by the time it gets to heatsinks or liquid cooling, you’re already in a thermal runaway. You have to have a risk mitigation plan. The best-case scenario is to know what are the risks from day one, so you can have a solution for it.”

The challenges of shaking up traditional workflows should not be underestimated. Given human nature, change is often a hard sell. Designers and companies may not recognize the cost and time advantages of changing long-established workflows. It may take some hard economic analysis to convince skeptics that shifting left, although initially disruptive, will be more cost-effective in the long run.

“Shifting left is all about making the designer more productive,” said White. “The way we convince folks to adopt this practice is show them that while it takes hours to do something today, it can be done in only minutes. And after you’ve done it in only minutes, it’s a lot cleaner than it was using the classic approach. ‘See how much speed you can gain and how many fewer errors you have to debug? Doesn’t that sound better?’”

Shifting left can help create more robust and reliable designs, but with the right tools it also can help the entire design process move faster by cutting down on the number of iterations. “If you start doing thermal feasibility analysis early on, it can give you an idea of where your floorplan issues are going to be,” said White. “So you can change your floorplan even before you’ve properly laid it out. If you don’t think about thermal until you get to final packaging, and you’ve done the physical layout of each IC, it’s too late.”
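
As a rough illustration of what an early feasibility check might look like, the sketch below screens a draft floorplan by power density before any layout exists. It is a coarse filter, not thermal analysis. The block names, power and area estimates, and the 1.0 W/mm² limit are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    name: str
    power_w: float    # early power estimate
    area_mm2: float   # planned floorplan area

    @property
    def power_density_w_per_mm2(self) -> float:
        return self.power_w / self.area_mm2


def flag_hotspots(blocks: List[Block], limit_w_per_mm2: float) -> List[str]:
    """Names of blocks whose power density exceeds the limit, worst first."""
    over = [b for b in blocks if b.power_density_w_per_mm2 > limit_w_per_mm2]
    over.sort(key=lambda b: b.power_density_w_per_mm2, reverse=True)
    return [b.name for b in over]


# Hypothetical draft floorplan.
FLOORPLAN = [
    Block("cpu_cluster", power_w=30.0, area_mm2=20.0),
    Block("npu", power_w=40.0, area_mm2=25.0),
    Block("sram", power_w=5.0, area_mm2=30.0),
    Block("io_ring", power_w=3.0, area_mm2=15.0),
]

if __name__ == "__main__":
    print(flag_hotspots(FLOORPLAN, limit_w_per_mm2=1.0))
```

Blocks flagged by a screen like this are candidates for being spread out, moved apart, or kept off stacked dies before detailed layout begins.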

The problems caused by heat and variation are no longer just short-term concerns. Reliability is now a major issue in multiple markets, where chips are used for critical applications and are expected to last longer. The best option for improving reliability is to plan ahead, and build in redundancy and resiliency wherever possible.

“A lot of reliability is just basic statistics,” said Rob Aitken, distinguished architect in Synopsys’ EDA Group at the time of this interview. “Say a particular event has some probability of happening. As we go to lower nodes, there are more devices, which means there’s a higher probability that something can happen. If you have a chip with 50 billion transistors, then there are 50 places where a one-in-a-billion event can happen.”
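
The arithmetic behind Aitken's example can be written out directly. The sketch below assumes the events are independent and equally likely for every device, which is a simplification. Under that assumption the expected number of events is the device count times the per-device probability, and the chance of at least one event is 1 − (1 − p)^n.

```python
import math


def expected_events(n_devices: float, p_event: float) -> float:
    """Expected number of events across n independent devices."""
    return n_devices * p_event


def prob_at_least_one(n_devices: float, p_event: float) -> float:
    """1 - (1 - p)^n, computed stably for tiny p and huge n."""
    return -math.expm1(n_devices * math.log1p(-p_event))


if __name__ == "__main__":
    p = 1e-9  # a one-in-a-billion event
    for n in (1e6, 1e9, 50e9):
        print(f"n = {n:.0e}  expected = {expected_events(n, p):8.3f}  "
              f"P(at least one) = {prob_at_least_one(n, p):.6f}")
```

For 50 billion transistors the expected count is 50, so at least one such event is a near certainty. At the same event rate, a design with 1 million devices would see one only about 0.1% of the time.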

The reliance on TSVs to help with thermal management also may increase reliability issues. “One of the big reliability failure points within silicon are the vias,” said Swinnen. “Those are traditionally failure points.”

Put simply, problems that once could be disregarded have become prominent issues, and heightened awareness is one of the best means of prevention.

Trying to solve thermal problems may finally push the industry to shift left and to rethink silos of activity.

“When we visit customers, we wind up talking to a lot of different teams that are working in silos,” said Suhail Saif, principal product manager at Ansys. “We advise them that to successfully sign off their chips more efficiently, much earlier, and with minimum cost, they need to work together. From their perspective, it’s hard to do, but these advanced effects are making them try much more than before. There’s a real need to have a cohesive solution that works across the domains, makes them talk to each other, and realize that what they do on one side has a ripple effect on the other, and vice versa. Thermal is not the only reason to come together and work collaboratively, but it might be the most serious one. It might be the one with the most consequences to the chip design overall in terms of costs and release delays. From that point of view, thermal is the most important and most critical issue that companies cannot afford to ignore if they want to win market share.”

In the end, the answers to these problems are most likely to come from human collaboration, in addition to materials science or physics. “From the chip designer all the way to the end product, the silos are changing significantly,” noted Cadence’s Roshandell. “People are using the same databases for their analysis, and it’s the chip designers who provide the databases. We see a lot of change in the industry toward people working together within the same database.”

