Thermal Integrity Challenges Grow In 2.5D

Work is underway to map heat flows in interposer-based designs, but there’s much more to be done.

Thermal integrity is becoming much harder to predict accurately in 2.5D and 3D-IC, creating a cascade of issues that can affect everything from how a system behaves to reliability in the field.

Over the past decade, silicon interposer technology has evolved from a simple interconnect into a critical enabler for heterogeneous integration. Interposers today may carry tens of die or chiplets, with millions of connections and ever-increasing performance, power, and area requirements. In fact, it is not uncommon to see a heterogeneous design integrated on an interposer with area above 2,000 mm², drawing 600 watts for the system, and requiring very high I/O bandwidth. With that kind of power, thermal integrity is now a first-order concern, and one that makes it much more difficult to sign off on schedule with high confidence.

A number of tools exist to understand and model thermal effects in heterogeneous silicon interposer designs, but most of those are disconnected today. Work is underway to link them, but it's not a simple fix. It isn't always clear exactly what the tools are supposed to do and how all the pieces fit together.

“The general challenge that people are facing is starting from the idea that, ‘Let’s just assume these things are tiny boards. We’ll apply the same techniques that we’ve applied at board level and package level forever, and we’ll just scale down and they will work,’” said Rob Aitken, distinguished architect at Synopsys. “That’s kind of true, but there are a couple of new dimensions to it. First, the people doing this now are different people than have been doing it in the past. Previously, package and board engineers did this. Now it’s chip people working on it, as well.”

There are other significant shifts. “As we go to these 3D assemblies, it’s important to keep in mind that we’re crossing the domains of what used to be,” said Joseph Davis, senior director for Calibre interfaces and EM/IR product management at Siemens Digital Industries Software. “There used to be chip people who would put the chips in a package. Then we had system-in-package and MCM options. A lot of those lines have been blurred. So who owns what? There are the packaging folks who are doing packaging and systems simulation. To them, the whole die has a temperature, so the resolution there is on the centimeter scale for looking at the heat dissipation on the board or inside an enclosure. Then there’s the IC team, which now doesn’t have just one IC. There’s an assembly of ICs stuck all together. This IC team looks at things at the resolution of microns. They need to know the distribution across the whole thing, and so forth. The resolution there is a challenge. But really, the physics and technical problems are the easiest part to solve here. The real problem is that whenever you cross organizational boundaries, you have a real problem. We are now putting multiple die together, sometimes from different technologies, sometimes from different foundries. Even within a single foundry, every die stack is unique. There’s not a process for getting all of that information into the tool.”


Fig. 1: Advanced packages using interposer, bumps, micro-bumps, and through-silicon vias. Source: Siemens

To make this work, there must be communication among all parties — the company designing the chip, the EDA tool providers, the foundries, and the packaging house.

“Even if it’s one foundry, we all have to figure out all the things that need to come in and get all the things set up to do the work,” Davis said. “Then there are the packaging and the systems guys who are thinking in millimeters. So there are two very different user bases, each with different resolutions they work on.”

And there are different pitches and interconnects. “Especially with silicon interposer, you’re dealing with a different material,” said Synopsys’ Aitken. “A board and an organic substrate are materially similar, so there are all the practices that people used to have in boards of keeping everything balanced and building a test vehicle to test out the limits of the system. It would be nice if those all worked, but nobody’s quite sure to what extent they do. Physics is physics, but it changes. Things that used to be second-order effects can become first-order effects if you’re not careful. Understanding how and where that works is important even when you talk about a mathematical model.”

Changing starting points
A typical heterogeneous integration system is built up step by step.

“Starting from the package substrate of a system, we mount the interposer on top of it with hundreds of thousands of bumps to connect,” said Lang Lin, principal product manager at Ansys. “Depending on the design integration plan, the designer would add a couple of die or chiplets on top of the interposer directly. Some of the die are connected through micro-bumps or copper-to-copper connections, and some other die can be further stacked up in a 3D fashion. Because of this integration, the interposer’s role is to connect millions of such micro-bumps or copper-to-copper connections reliably so that the whole system can survive in the field.”

This is often referred to as a chip-package system. The interposer is a bridge that contains the power delivery network for all these components. “It also delivers supply power to all the die and chiplets, and it hosts all the chiplets and die,” Lin said. “But now the chiplets mounted on top consume a large amount of power, which can cause power integrity problems. They also generate a huge amount of heat when operating in the field, so now there is a chance an IC could burn a neighboring IC due to a thermal integrity problem.”

Put simply, heterogeneous integration can result in both power and thermal integrity problems — and even more.

“If you have 3D stacked die with high bandwidth memory, the power and heat problem also could result in a significant signal integrity problem,” he said. “This means all of these problems are coming together in heterogeneous integration systems. Designers play an instrumental role in making sure the power is delivered successfully, the heat is dissipated successfully, and that signal integrity is not compromised.”

Modeling an interposer-based heterogeneous design raises questions about the integrity of the models themselves, because there are so many variables involved. “You make an assumption that under certain conditions the deflection of one material against another is linear. And then you’d say, ‘Well, actually, under some other conditions, it’s quadratic.’ But the quadratic models are much more complicated. Which one is the right one to use? People are still trying to figure out what the answer is, and how much of everything you need to care about.”

That is largely a function of the abstraction level. Davis noted that all of these thermal aspects could be solved with very gross modeling and averages. “With the newer technologies and the mixing of these technologies, we’ve got a lot of very good insulators in the system,” he said. “As we went to finFETs, things got worse. People started to say, ‘Heating problems are much worse.’ Why is that? It’s because with a planar transistor, all the heat was generated in the silicon. Bulk silicon is a pretty decent thermal conductor. Its thermal conductivity is around 150 W/m·K. With fins, you put the transistors on top, and it’s isolated by silicon dioxide, which is a really good insulator. There, the thermal conductivity is 1.4 W/m·K, roughly 100X lower. But wait a minute. I just wrapped my hot transistor in a glass pillow? What am I going to do with all that heat? I’ve got to have a way to get it out. That is done with TSVs and the like. Further, we can model this stuff. We model far more complex things than just the thermal every day with simulations and EM/IR. We have the capacity, yes. But getting all the data together is a real problem. The resolution in the system, where the system is not the electronics but the industry, is the biggest problem.”
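
To put rough numbers on that point, the back-of-envelope sketch below compares the temperature drop for the same heat flux across a thin layer of bulk silicon and of silicon dioxide, using the two conductivity values Davis cites. The layer thickness and flux are illustrative assumptions, not values from any real process.

```python
# Back-of-envelope comparison of bulk silicon (k ~ 150 W/m·K) and silicon
# dioxide (k ~ 1.4 W/m·K) as heat-conduction layers. The flux and layer
# thickness below are illustrative assumptions.
q = 1e6      # heat flux in W/m^2 (100 W/cm^2, an assumed hotspot level)
t = 1e-6     # 1 micron layer thickness (assumed)

for name, k in [("bulk silicon", 150.0), ("silicon dioxide", 1.4)]:
    delta_T = q * t / k  # 1-D Fourier's law: dT = q * t / k
    print(f"{name:16s}: dT = {delta_T:.4f} K across the layer")
```

The same flux produces roughly a 100X larger temperature drop across the oxide, which is exactly the "glass pillow" effect Davis describes.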

Aitken said there are two aspects to this. “There’s the aspect of, ‘I have a system, I have a bunch of equations that I’m going to use in modeling the system, and I have tools that implement those equations.’ Then I get the output. The input part to that is really important, too, because all the die are not the same. All the materials can be slightly different. In addition, the workloads are different and sometimes unknown, so again you’re dealing with thermal issues that we’ve always thought about on a package, and thermal issues we always thought about on a chip, but now they’re all merging together and can’t be viewed as independent. That leads to the need to go and do as much analysis as you can when you’re putting these things together, but also the need to monitor what’s going on when you build it to make sure that your assumptions continue to hold. So you’ve got something that says, ‘Oh, we’re heating up here. This is bad. Let’s slow down.’”

Understanding heat flow is critical. Heat moves from the hotter end of an object to the colder end, but not always uniformly. “The concept of thermal conduction is fairly easy to understand,” said Ansys’ Lin. “If you know Ohm’s Law from engineering 101, you know that you can model the object with an equivalent thermal resistance for the heat conduction path of the system. And given a particular power dissipation value as the heat source to this system, you can easily solve the temperature difference between the two ends of the conduction path. Fourier’s Law of heat conduction describes how a system dissipates heat, and how the heat sink affects the temperature difference of the whole system.”
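
As a minimal sketch of that analogy, the snippet below models a single conduction path as one equivalent thermal resistance and applies the thermal version of Ohm's law. The geometry and power values are assumptions for illustration only.

```python
# Minimal sketch of the thermal/electrical analogy Lin describes:
# a 1-D conduction path modeled as a single thermal resistance,
# analogous to Ohm's law (delta_T = Q * R_th, like V = I * R).
# Dimensions and the power value below are illustrative, not from
# any real design.

def thermal_resistance(length_m, conductivity_w_mk, area_m2):
    """R_th = L / (k * A), in kelvin per watt."""
    return length_m / (conductivity_w_mk * area_m2)

# Example: heat flowing through a 100 um silicon layer, 1 mm^2 cross-section.
R_si = thermal_resistance(length_m=100e-6, conductivity_w_mk=150.0, area_m2=1e-6)

Q = 5.0             # watts injected by the heat source (assumed)
delta_T = Q * R_si  # temperature difference across the path, Fourier's law
print(f"R_th = {R_si:.3f} K/W, delta_T = {delta_T:.2f} K")
```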

Mapping Fourier’s Law through a silicon interposer is more complex. “Assume there are two chiplets acting as heat sources,” Lin said. “The chiplets consume power in this silicon system, and the interposer is mounted on top of a package. In total there may be four different components or objects in the system. We can model the thermal resistance of all four. Given that two chiplets are heating up the system, we have two sources of Q (heat flow) that inject heat into the thermal conduction paths.”
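
A hedged sketch of what such a four-component model might look like: treat each component as one lumped thermal resistance and solve the resulting network with nodal analysis, the thermal analog of solving a resistive circuit. The topology, resistance values, and powers below are all illustrative assumptions, not from any real design.

```python
# Sketch of a four-component network like the one Lin outlines: two
# chiplets on an interposer on a package, each modeled as one thermal
# resistance to the node below it, with the package tied to ambient.
# Solves G * T = Q (nodal analysis); temperatures are rises above ambient.
import numpy as np

# Nodes: 0 = chiplet A, 1 = chiplet B, 2 = interposer, 3 = package.
R_a, R_b = 2.0, 2.5   # chiplet A/B paths to the interposer, K/W (assumed)
R_int = 0.5           # interposer to package (assumed)
R_pkg = 1.0           # package to ambient (assumed)

def g(r):  # thermal conductance, W/K
    return 1.0 / r

# Conductance matrix for the tree topology above (ambient grounded).
G = np.array([
    [ g(R_a),     0.0, -g(R_a),                  0.0],
    [    0.0,  g(R_b), -g(R_b),                  0.0],
    [-g(R_a), -g(R_b),  g(R_a)+g(R_b)+g(R_int), -g(R_int)],
    [    0.0,     0.0, -g(R_int),                g(R_int)+g(R_pkg)],
])
Q = np.array([10.0, 15.0, 0.0, 0.0])  # watts injected at each node (assumed)

T = np.linalg.solve(G, Q)  # steady-state temperature rise at each node
for name, t in zip(["chiplet A", "chiplet B", "interposer", "package"], T):
    print(f"{name}: +{t:.1f} K above ambient")
```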

Solving this can help designers understand the temperature difference between each of the components of the system, and then have a better understanding of the temperature distribution. But Lin said this model is not accurate enough, because each component has only an equivalent thermal resistance. “The thermal resistance is actually highly dependent on the material property of the entire object. Finite element analysis methods can be used to mathematically represent the physical component or system with its own material properties, and also boundary conditions across all the surfaces. Meshing technology is used to convert this IC layout geometry or object into recognizable elements. Two different mesh approaches can be used to model a whole IC layout, and with all the mesh elements we can solve this heat transfer law among all the elements of the system in a 3D fashion.”

This also solves for the temperature distribution, which gives a more accurate thermal model of a realistic problem.
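
The toy solver below gives a flavor of what a meshed field solution adds over lumped resistances. It uses simple finite differences on a uniform grid rather than the finite-element meshing production tools use, and every size, power density, and material value in it is an illustrative assumption.

```python
# Toy 2-D steady-state heat solve: two hotspot tiles heat a die with
# fixed-temperature (heat sink) edges, and Jacobi iteration relaxes the
# discrete Poisson equation. All values are illustrative assumptions.
import numpy as np

N = 64                  # grid points per side
k = 150.0               # thermal conductivity, W/(m*K) (bulk silicon)
h = 10e-3 / N           # cell size for an assumed 10 mm die, in meters
T = np.zeros((N, N))    # temperature rise above the sink, K (edges stay 0)

# Volumetric power density map with two hotspots (values assumed).
q = np.zeros((N, N))
q[10:20, 10:20] = 5e8   # hotspot 1, W/m^3
q[40:52, 30:44] = 3e8   # hotspot 2, W/m^3

src = q * h * h / k     # source term for the discrete Poisson equation

for _ in range(5000):   # Jacobi relaxation toward steady state
    T[1:-1, 1:-1] = 0.25 * (T[2:, 1:-1] + T[:-2, 1:-1] +
                            T[1:-1, 2:] + T[1:-1, :-2] + src[1:-1, 1:-1])

peak = np.unravel_index(T.argmax(), T.shape)
print(f"peak temperature rise: {T.max():.1f} K at cell {peak}")
```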

Looking at how meshing affects the accuracy of solving thermal conduction, Lin said meshing technology needs to keep improving. That is challenging because systems have become much larger in the move from older SoC technology to the latest 2.5D and 3D-IC technology, with tens to thousands of heat sources included in these complicated systems. As such, the meshing resolution has to improve from centimeters to micrometers, and possibly even nanometers. “We need a much more granular solution down to the sub-block level of the integrated chips so that we can accurately model the thermal conduction paths. That’s very challenging, but it is a must for solving the thermal throttling problem of this kind of system and for making sure there are no reliability or thermal integrity problems.”
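
Some quick arithmetic shows why that resolution jump is so painful. Using the 2,000 mm² interposer area cited earlier in the article, the sketch below counts surface mesh elements at each resolution; the 100-bytes-per-element memory figure is a loose assumption purely for scale.

```python
# Surface mesh element counts for a 2,000 mm^2 interposer at different
# in-plane resolutions. The per-element memory figure is a loose
# illustrative assumption.
area_mm2 = 2000.0
for cell_um in (10_000, 1_000, 100, 10, 1):   # 1 cm down to 1 um cells
    cells = area_mm2 * 1e6 / (cell_um ** 2)   # mm^2 -> um^2, then divide
    print(f"{cell_um:>6} um cells: {cells:>15,.0f} elements, "
          f"~{cells * 100 / 1e9:.3f} GB at 100 B/element")
```

Going from 1 mm cells to 1 µm cells multiplies the element count by a factor of a million, from thousands of elements to billions.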

This makes the whole divide-and-conquer approach much more difficult. John Ferguson, director of product management at Siemens Digital Industries Software, noted that historically thermal was done with a grid approach. “You break things up into little squares, and then how small you can go depends on the hardware you have, how much memory you have, and how long you are willing to wait for the answer to arrive. Those are the things you get to struggle with a little bit. But there’s another challenge with that kind of grid, too, because thermal by itself is an issue. It’s important to check whether you have a thermal problem that can’t be resolved in any way other than redesigning.”

Thermal also has an impact on mechanical stresses and electromigration/IR drop. “Now you get in this situation where you’ve got to make sure that the grids you use can be aligned in some manner across these different things,” Ferguson said. “If you’re going to try to add them up, you can have one window that is halfway overlapping another window. How do you figure all that stuff out? It’s confusing. That’s still a big challenge. The industry is working toward getting away from a gridded window approach and doing something a little more holistic, which means looking at it from more of an equation approach and thinking about things more from a true physics perspective. Where does the temperature fall off, for instance? The whole gridding situation is a challenge in the industry for exactly these reasons. How do I know the right resolution to choose? Is it going to be accurate enough? Is it going to integrate with everything else I need it to? Getting away from that approach is an important step.”

Evolutionary changes
What solutions ultimately will look like in this space remains to be seen. Aitken points to broad experimentation today, and believes at some point the industry will start to coalesce. “Even looking at package options, there used to be a fairly small number of packages, and those were characterized by the package vendors,” Ferguson said. “So you knew what was going to happen if you took your design and put it in there. But now there’s any number of different package designs, even if you’re just restricted to silicon interposers. There are lots of things people are doing with them, and lots of ways of putting them together. And there’s not really any consensus about which one is better. That means if you’re a manufacturer, now you effectively have to support a custom thing not just for every customer, but possibly for every design. This is, again, the search for, ‘Let’s try to overdesign where possible just so that we think we’re safe, but at the same time, just be careful.’”

As other materials are introduced, new issues will be added. “You can put in some new materials that the people designing these things don’t have a long history with,” he said. “And depending where you go, you can find some details in the literature on their thermal conductivity and tensile strength. But who’s actually doing the measurements to really dial these things in? They change. You get one lot of your oxides coming in, and the next lot is a little bit different. At least for thermal, we’ll have gridless analysis. But we still have the issue of needing to overlay that on a grid for the other thing I’m going to try to pass this data through, where it needs to be consumed upstream, downstream, every which way. Ultimately, with all of this, the way our industry has always worked is we guard-band it. We say, ‘They say this thing is good to 10%. Let’s give a 20% window to keep ourselves safe.’ But that means you’re always leaving something on the table, too.”

Additionally, there’s the challenge of making decisions early on.

“We ask architecture questions about what can be put together,” Siemens’ Davis said. “Do I have a big enough package? Do I have a big enough heat sink? The way our industry has always handled that is budgeting and approximate models, so as you go higher up, or earlier in the design flow, you’ve got models. Sometimes a model is just an approximation of, ‘I think this chip is going to generate this temperature, and therefore it’s going to conduct like this.’ But now we’re seeing customers who are asking, ‘I built this part. I put it in this package. The new version is going to have some extra drivers in it that I expect to generate a lot of heat. Is my package going to be enough? Is my heat dissipation enough before I start affecting reliability?’ This has to be answered before you get to other issues. Electromigration is an exponential function of temperature, so if the temperature is 10 degrees higher than you expected, it might bring your lifetime down by 5 years. Engineering teams want to do that earlier, but they don’t have the information about the technology they will use, other than to say, ‘We’re thinking about this.’”
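
Davis' 10-degree example can be made concrete with the Arrhenius term of Black's electromigration equation. The sketch below holds current density fixed and uses an assumed activation energy of 0.9 eV, a typical-looking value rather than one from any specific process.

```python
# Temperature sensitivity of electromigration lifetime, via the Arrhenius
# term of Black's equation: MTTF ~ A * J^-n * exp(Ea / (k_B * T)).
# Current density is held fixed; Ea is an assumed illustrative value.
import math

k_B = 8.617e-5   # Boltzmann constant, eV/K
Ea = 0.9         # activation energy, eV (assumed)

def mttf_ratio(t1_c, t2_c):
    """Lifetime at t2_c relative to lifetime at t1_c, same current density."""
    t1, t2 = t1_c + 273.15, t2_c + 273.15
    return math.exp(Ea / k_B * (1.0 / t2 - 1.0 / t1))

print(f"105C vs 95C: lifetime x{mttf_ratio(95, 105):.2f}")
```

With those assumptions, a 10°C rise cuts the expected lifetime roughly in half, which is why a small error in the thermal model can translate into years of lost reliability margin.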

Conclusion
To be sure, there are techniques designers can use now if they are aware of all of these challenges. But there are a lot of elements in a complex heterogeneous design, and this is becoming a much bigger challenge for design teams.

“They can apply a couple of different solutions at the device level,” said Lin. “They can reduce the power, balance the density of the power, and smartly partition their chips and chiplets so that the thermal conduction paths are well balanced without inducing any high temperatures. At the system level, we see a lot of cooling solutions that say, ‘If the temperature is too high, let’s just throttle the system, stop it from functioning, and let it sleep.’ We can also do things like thermal management, liquid cooling, and forced convection. All of this we have already seen in production systems.”
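
As a minimal sketch of the throttling policy Lin describes, the loop below steps the clock down when an on-die sensor crosses a trip point and back up once it recovers. The read_sensor() and set_frequency() hooks, the thresholds, and the frequency steps are all hypothetical placeholders, since the real mechanism depends entirely on the platform.

```python
# Minimal sketch of a thermal-throttling policy: step the clock down when
# a sensor crosses a trip point, step back up once it recovers. The
# read_sensor() and set_frequency() hooks are hypothetical placeholders
# for whatever the platform actually exposes.
import random

TRIP_C, RESUME_C = 100.0, 85.0          # trip / recovery thresholds (assumed)
FREQS_MHZ = [3000, 2400, 1800, 1200]    # throttle steps (assumed)

def read_sensor():                       # placeholder: on-die thermal sensor
    return 80.0 + random.uniform(-5.0, 25.0)

def set_frequency(mhz):                  # placeholder: DVFS / clock control
    print(f"clock -> {mhz} MHz")

level = 0
set_frequency(FREQS_MHZ[level])
for _ in range(50):                      # bounded loop for the sketch
    temp = read_sensor()
    if temp >= TRIP_C and level < len(FREQS_MHZ) - 1:
        level += 1                       # too hot: step the clock down
        set_frequency(FREQS_MHZ[level])
    elif temp <= RESUME_C and level > 0:
        level -= 1                       # recovered: step back up
        set_frequency(FREQS_MHZ[level])
```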

Related Reading
Chiplets: Deep Dive Into Designing, Manufacturing, And Testing
EBook: Chiplets may be the semiconductor industry’s hardest challenge yet, but they are the best path forward.



Riko Radojcic says:

Nice. Glad to see that this area is getting the attention. Thank You, Ann

David Kneedler says:

Nice high-level summary of some of the challenges this technology is facing.

Ed Korczynski says:

How to choose the right level of granularity within modeling, when even a rigorous PDK and test-structures may not predict the block-level thermal variations in final silicon ICs? Also, note that all thermal issues with finFETs get exponentially worse for ribbonFETs (GAA = insulator all around). Thermodynamics!
