How To Cool 3D-ICs

Tool chains need improvement as chipmakers begin stacking AI chips, increasing the thermal density and unpredictability over time.


Experts at the Table: Semiconductor Engineering sat down to discuss how to cool 3D-ICs and what’s missing from the tool chain today, with John Ferguson, senior director of product management at Siemens EDA; Mick Posner, senior product group director for chiplet and IP solutions in Cadence’s Compute Solutions Group; Mo Faisal of Movellus; Chris Mueth, new opportunities business manager at Keysight Technologies; and Amlendu Shekhar Choubey, senior director of product management for Synopsys’ 3D-IC compiler platform. To view part one of this discussion, click here and part three here.


L-R: Keysight’s Mueth; Synopsys’ Choubey; Siemens EDA’s Ferguson; Movellus’ Faisal; Cadence’s Posner.

SE: A lot of the 3D-IC designs are one-offs for systems vendors. In many cases they will not be sold commercially, right?

Choubey: Some of it. One thing I’m seeing is that power delivery and thermal can be custom to an application. But we think you can develop platforms where you have the capability to do incremental analysis. Let’s say you’re stacking and you don’t even have RTL. You haven’t even zeroed in on your technology. But you can still have a platform where you can understand, and analyze, and see how your power delivery is going to work, what your TSV or bump planning should be, and then do a pretty early thermal analysis to understand what thermal and mechanical stress are going to be. As they mature RTL, they keep on refining that, which is definitely possible. When I look at the use cases leading the market for electronic design, it’s AI workloads. You cannot manage AI on a monolithic die, or even a 2.5D interposer-based design. You have to grow vertically. You have to use all the real estate you have. And one part of the real estate is vertical. The innovators and companies that really want to advance into AI computing will have to take advantage of that. How do we make that possible? That’s the challenge for the ecosystem, for foundries, and for EDA companies. How do you make sure you can manage the whole process so it becomes cost-effective? This is a challenge that needs to be solved. You don’t have a way out of it.

Posner: TSMC gets to do that because they’re building everything. The one-size-fits-all is the only way to manage the complexity.

Mueth: You treat it like it’s a new foundry node, in a sense. But you don’t tell the foundry what to do.

SE: So it’s like a new set of restrictive design rules?

Mueth: Yes, with the manufacturing built in.

Choubey: If you look across foundries, there are a few fundamentals emerging. You have your organic interposer, silicon interposer, and bridge technologies, and then you have this wafer-on-wafer kind of thing, which is mostly TSMC right now. But if you look across foundries, there are three fundamental technologies helping you grow out of the reticle-size box. We can develop an ecosystem around that, which takes these three technologies and makes them more accessible, more cost-effective, and more predictable. That’s where we need to go.

SE: The biggest problem everyone has is thermal density due to the amount of logic being packed in. How do you cool these things? Is it all going to be direct cooling or immersion cooling, or something else?

Ferguson: All of the above. You have micro-coolers. There are thermal interposers, but then you have to watch out for CTE mismatches. And there is convection, where you can get it. It’s anything and everything.

Choubey: It depends on how much power you’re burning and what kind of cooling you will need. But that part of the technology also will have to progress along with the process part.

Ferguson: Another aspect is different materials that can transport the heat more effectively, like ceramics and glass substrates. They’re not popular yet, because they’re not really ready, but everyone is excited about the prospects.

Faisal: There’s another side to this. As a designer, I can try to measure and correct for the thermals, or I can predict. I foresee a pretty major effort from everybody who cares about this problem trying to predict the thermals. This is an aftereffect of the chip trying to run faster. There are earlier lead measures that are available to indicate the thermals. That’s going to become very important. If I wait for the chip to heat up, and then measure and take action, it’s already too late. I’m already multiple seconds past the event that caused the thermal. But there are things you can do in the chip to be able to predict where the temperature is going to be many seconds later. What are the actions I can take now so that I never get there, or I take action before I get there?

SE: That’s like throttling?

Faisal: Yes, but it’s throttling with high precision. You can overdo it, and then you end up giving up performance. So it has to be dialed in with high precision and be spatially aware. You don’t want to just throttle based on 1 core if there are 100 cores.
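
As a rough illustration of the predictive, spatially aware throttling Faisal describes, the Python sketch below forecasts each core's temperature a few seconds ahead and scales back only the cores that are headed past the limit. Everything in it is an assumption made for illustration: the first-order thermal model, the function names `predict_temp` and `throttle_factors`, and the resistance, time-constant, and limit values. None of it comes from the panel or from any vendor's implementation.

```python
import math

def predict_temp(t_now, power, r_th, tau, horizon, t_amb):
    """Forecast temperature (C) after `horizon` seconds.

    Assumed model: first-order thermal RC response per core, where the
    steady-state temperature is t_amb + power * r_th.
    """
    t_steady = t_amb + power * r_th
    return t_steady + (t_now - t_steady) * math.exp(-horizon / tau)

def throttle_factors(temps, powers, r_th=2.0, tau=5.0, horizon=3.0,
                     t_limit=95.0, t_amb=45.0):
    """Return a power scale factor in [0, 1] for each core.

    Cores whose forecast stays under t_limit keep a factor of 1.0.
    Other cores get just enough reduction to land on t_limit at the
    horizon, so performance is not given up unnecessarily.
    All default values are hypothetical.
    """
    decay = math.exp(-horizon / tau)
    factors = []
    for t_now, power in zip(temps, powers):
        if predict_temp(t_now, power, r_th, tau, horizon, t_amb) <= t_limit:
            factors.append(1.0)
            continue
        # Solve predict_temp(...) == t_limit for the allowed power.
        p_max = (t_limit - t_amb - (t_now - t_amb) * decay) / (r_th * (1.0 - decay))
        scale = p_max / power if power > 0 else 0.0
        factors.append(max(0.0, min(1.0, scale)))
    return factors

if __name__ == "__main__":
    # Two cores at the same temperature today, with different power draw.
    print(throttle_factors(temps=[70.0, 70.0], powers=[10.0, 50.0]))
```

In this toy case both cores read the same temperature now, but only the high-power core is throttled, because the decision is based on where each core is heading rather than on a single chip-wide reading.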

Choubey: All of this will go hand-in-hand. You predict your thermal at the design phase, you design your cooling system, and then you have mitigation strategies in your system. It’s not one solution like, ‘How do I cool it?’ It’s how do I cool it depending on how well I have predicted ‘this much’ heating is going to happen.

SE: And all of this is workload-dependent. But what happens if those workloads change?

Choubey: That’s a very important question, and that’s why this whole methodology (the emulation, software validation) is going to be a key part. Even before you have written any netlist, you have to emulate your software workloads and understand where you are going to burn more power. How do you partition that design based on that work? And then you stack them so your heating does not cause a problem. So when you’re designing in the early stage, you have to make partitioning, floor-planning, and stacking decisions based on how your software workload is going to impact your power distribution in the whole system. That’s why all these things have to be tightly coupled. You cannot do this and then decide later. It has to go concurrently.

Posner: At the Cadence Innovation Conference, which is internal R&D-to-R&D, one of the studies they were presenting was a customer’s 3D design. The focus was thermal balancing. The customer, which was spearheading 3D, had experienced power density, hot spots, and mechanical stress from warpage. What we had done with them was to run all the analyses, making sure the top and bottom dies were operating at the same temperature under maximum conditions. If you’ve got a hot thing next to a cold thing, that’s your thermal issue. Thermal becomes another optimization.

Choubey: That’s multi-objective optimization. You have to optimize for power distribution and for thermal behavior. A lot of automation and integration is needed. Validating your system, your software, running those workloads, and incorporating that information into your partitioning, floor-planning, and stacking is going to be critical.

Faisal: As an industry, we need to steal what the autonomous car industry is doing. When they finish the design of the car, it’s only half the story. The rest of the story is a runtime. How am I actually gathering a lot of data and optimizing that autonomous car throughout its lifetime? In San Francisco, Waymos are basically a continuous design. It’s learning from collecting runtime data and optimizing. Silicon needs to do that, as well, and it’s starting to happen. But it needs to happen at a much larger scale. If I know I can sense and correct in the field, then I can make a choice about whether I want to over-design at design time. That needs to happen, because anytime somebody hands off silicon to a packaging guy, or a package to a PCB person, there’s a tax added. It’s 2%, 3%, 5% added anytime the boundaries cross. The reason is that information and learnings don’t flow at high bandwidth between those steps. By the time you have the chip, it’s 20% less efficient than what your design was. Using autonomous cars as a model, they are constantly learning and updating. Silicon should be constantly learning and updating, as well. The story doesn’t end just at design.
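
To see how per-handoff margins of a few percent can add up to something near the 20% figure Faisal mentions, the short Python sketch below compounds them. The list of five handoffs and the margin assigned to each are hypothetical values chosen only to show the arithmetic, and `retained_efficiency` is an illustrative name.

```python
from math import prod

def retained_efficiency(taxes):
    """Fraction of the design-time efficiency left after every handoff.

    Each entry in `taxes` is the fractional margin lost at one boundary,
    for example 0.05 for 5%. Losses are assumed to compound.
    """
    return prod(1.0 - tax for tax in taxes)

if __name__ == "__main__":
    # Hypothetical chain of five boundaries with 2%, 3%, and 5% margins.
    handoff_taxes = [0.02, 0.03, 0.05, 0.05, 0.05]
    kept = retained_efficiency(handoff_taxes)
    print(f"retained: {kept:.3f}, lost: {1.0 - kept:.1%}")
```

With these assumed margins, about 18.5% of the original efficiency is lost, which is in the neighborhood of the figure quoted above.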

SE: A 3D-IC may age unevenly. How do we build redundancy into these designs in the right places? It really has to be managed as a system, but there’s no methodology for doing that at this point, right?

Mueth: It’s a hierarchical process. Your workflow has to take into account multiple hierarchies. But there’s also a potential to use machine learning to capture all of this expertise from humans and re-use it in future architectures. It’s not easy. You have many different handles and knobs you’ve got to turn. You have to go from the lowest level of implementation up to the architecture level, and everything in between, to optimize everything from dynamic operations to heat spreading and mechanical issues that will pop up. It has to be multi-dimensional, and those kinds of problems are hard to deal with.

Ferguson: It’s multi-dimensional, but there’s a circular loop to this. Power is generating heat, but then that heat is going to affect your wires, and that is going to change your power. So you have to iterate through in an automated fashion. And that’s just the beginning, because that also impacts your signals and your parasitics.
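
The circular dependency Ferguson describes (power makes heat, heat raises wire resistance, resistance changes power) can be sketched as a fixed-point iteration. The Python example below is a minimal, single-wire toy under stated assumptions: a constant current, a linear temperature coefficient of resistance (0.0039 per kelvin, a typical value for copper), and one lumped thermal resistance to ambient. The function name and all numeric inputs are hypothetical, and real flows iterate over full extracted networks rather than one resistor.

```python
def solve_electrothermal(current_a, r0_ohm, theta_k_per_w,
                         t_amb_c=25.0, t_ref_c=25.0, alpha=0.0039,
                         tol=1e-6, max_iter=100):
    """Iterate power -> temperature -> resistance until self-consistent.

    Returns (temperature_c, power_w, iterations). Raises RuntimeError
    if the loop does not settle, which in this toy model corresponds to
    thermal runaway.
    """
    temp = t_amb_c
    for iteration in range(1, max_iter + 1):
        # Heat changes the wire resistance.
        resistance = r0_ohm * (1.0 + alpha * (temp - t_ref_c))
        # The new resistance changes the dissipated power.
        power = current_a ** 2 * resistance
        # The new power changes the temperature.
        new_temp = t_amb_c + theta_k_per_w * power
        if abs(new_temp - temp) < tol:
            return new_temp, power, iteration
        temp = new_temp
    raise RuntimeError("electrothermal loop did not converge")

if __name__ == "__main__":
    temp, power, iters = solve_electrothermal(
        current_a=10.0, r0_ohm=0.05, theta_k_per_w=5.0)
    print(f"{temp:.2f} C, {power:.3f} W after {iters} iterations")
```

A single pass with the cold resistance would report 5 W and 50 C for these inputs. The converged answer is warmer and draws more power, which is the gap that an automated iteration is meant to close.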

Choubey: It’s all interdependent. You cannot think, ‘I’ll fix this, then I will go fix that.’ By the time I fix that, this one is out of shape. You have to think about multiple objectives. How do I get visibility into all these aspects, and how do I analyze and solve for them in a cohesive manner so that when I’m trying to fix this problem, it does not make that problem worse. That’s why we need a unified view of the design, where you can see everything in one place and can quickly see that if I’m making a change here, what happens there. We need automation — most likely AI — that can explore the design space for this multi-objective problem and refine the solution to give you a more optimal design.

Mueth: It’s getting beyond what a single-discipline engineer can deal with. Normally, it’s been going from a single-discipline engineer to a design team with specialists. Now, even that is getting taxed. You have to make everybody focused on multi-physics. That’s where AI probably is going to come in.

Choubey: That’s a big change. The traditional IC designers never thought about multi-physics. Now you have to be aware and make provisions for all these effects because they are central to the design.

Mueth: It’s not enough to talk about one little thing that can blow your design.

Choubey: Tools have to be smart enough to give you the foundation to do that. It’s not humanly possible to think of all the effects that are coming in. Your tools have to be so well integrated and connected that you don’t miss things. Take a single thing like mirroring bumps that are slightly misaligned. You cannot run your physical verification every time. You need simple tools to check for those things as you go along so that your design is correct by construction. You can’t do lengthy physical verification every time you make a change. The tools have to become smarter. They have to keep pace with where the technology is going.
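
The kind of lightweight, run-as-you-go check Choubey has in mind for mirrored bumps might look something like the Python sketch below. It assumes a face-to-face stack in which the top die is flipped about a vertical axis, that both dies share the same origin and width, and that bumps are matched by name. The data layout, the function name `misaligned_bumps`, and the 0.5 µm tolerance are all hypothetical, and this is not a substitute for full physical verification.

```python
import math

def misaligned_bumps(bottom, top, top_die_width_um, tol_um=0.5):
    """Report bumps that do not line up after mirroring the top die.

    bottom, top: dict mapping bump name -> (x_um, y_um), each in its
    own die's coordinate system. Returns a list of (name, offset_um),
    with offset_um set to None when the top die has no matching bump.
    """
    issues = []
    for name, (bx, by) in bottom.items():
        if name not in top:
            issues.append((name, None))
            continue
        tx, ty = top[name]
        # Flipping the top die face-down mirrors its x coordinates.
        mirrored_x = top_die_width_um - tx
        offset = math.hypot(mirrored_x - bx, ty - by)
        if offset > tol_um:
            issues.append((name, offset))
    return issues

if __name__ == "__main__":
    bottom = {"vdd_0": (100.0, 200.0), "sig_7": (300.0, 200.0)}
    top = {"vdd_0": (900.0, 200.0), "sig_7": (699.0, 200.0)}
    print(misaligned_bumps(bottom, top, top_die_width_um=1000.0))
```

Because the check is a single pass over the bump list, it is cheap enough to run after every edit, in contrast to the lengthy full verification runs mentioned above.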
