What Can Go Wrong In Heterogeneous Integration

Workflows and tools are disconnected, mechanical stress is ill-defined, and complete co-planarity is nearly impossible. But there are solutions on the horizon.


Experts at the Table: Semiconductor Engineering sat down to discuss heterogeneous integration with Dick Otte, president and CEO of Promex Industries; Mike Kelly, vice president of chiplets/FCBGA integration at Amkor Technology; Shekhar Kapoor, senior director of product management at Synopsys; John Park, product management group director in Cadence's Custom IC & PCB Group; and Tony Mastroianni, advanced packaging solutions director at Siemens Digital Industries Software. What follows are excerpts of that conversation. To view part one of this discussion, click here. Part three is here.


[L – R] Dick Otte, Mike Kelly, John Park, Shekhar Kapoor, and Tony Mastroianni.

SE: With heterogeneous integration, there are a whole bunch of potential problem areas. What do you see as the biggest concerns?

Kapoor: Many things can go wrong. We are entering 3DHI — HI is never used alone — where you have known reliability issues. But even before that, the very reason you’re trying to get into 3DHI is that you have some PPA goals and you may want some cost advantage. But what is the optimal design? It’s easy to say, ‘I’m going to disaggregate.’ But when you do that, the whole IC design paradigm quickly shifts to a system design paradigm. At that level, you’re dealing with hardware design, software, different workloads, and deciding which chiplets you need. There is no marketplace for chiplets today, but if you get some sub-systems, how do they fit into the overall scope?

Disaggregation planning is becoming a big challenge, and it’s where all these architectural pieces come into play. Where you face it first is in the implementation phase. The physical implementation engineers are feeling the brunt of it right up front because of the integration challenges they have to deal with. We’re quickly moving from co-design to co-optimization. In the implementation phase, we have to start thinking about all these second- and third-order effects, in addition to thermal. Optimizing the whole system, including the dies, the packages, and the materials, across all these physical dimensions using multi-physics is becoming important. And finally, you have to make these devices reliable. When you’re putting these pieces together, it’s not enough to determine whether there are known good dies. It needs to be a known good system. How do you remove the heat from this entire system, or put monitors across these dies?

Park: This whole notion of known good die, or what we would now call a known good chiplet, has a long history. If you go back to when we had multi-chip modules and systems-in-package, the big risk was known good die. We started to get a handle on it with better wafer-level tests and things like that, but now we have smaller geometries we have to test. We’re seeing designs today with dozens or more chiplets in them. If one of those chiplets fails, you can test it at the wafer level, you can test it after it’s been diced, and then you also have to test it after it’s been assembled. But the risk of one of 30 chiplets failing is much greater than in a monolithic chip, where you can test it and it’s pretty much good to go. Who owns that problem? And how do you debug it? Do you throw the whole device out? Going forward, known good chiplets will be a huge factor.

Also, when we design chips, we do that based on a PDK we get from the foundry. The foundry invests in a process design kit, which gives us the data we need as ASIC designers to know what the technology is. We get the libraries, the sign-off design rules, and connectivity verification information. We know that whatever we’re creating, we’re going to be able to assemble that thing inside the foundry that provided the PDK, because they’re guiding us. We don’t have that in packaging. Packaging is a little more like the Wild West. We don’t always have technology information, or placement information such as how close two chiplets can be to one another. All that formality we have on the ASIC side is still lacking on the packaging side. And that’s what gives designers confidence to say, ‘Hey, I created this multi-die system in my CAD tool. I’m going to send out the manufacturing output, and I know an OSAT can build this thing for me.’ It’s not quite that formal. We’re hearing a lot of talk about ADKs, or assembly design kits, which will help. And we’re hearing about assembly description languages like 3Dblox (TSMC), 3D Code (Samsung), and CDXML (Open Compute Project), which also will help. But those are some of the big potential problem areas going forward.


Fig. 1: Needs of IC and system designers are converging. Source: Cadence
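The placement rule Park mentions, such as how close two chiplets can be to one another, is the kind of constraint an assembly design kit could eventually formalize, much as a PDK formalizes design rules. As a rough illustration only, a minimal spacing check might look like the following sketch; the rule value, chiplet names, and data structures are invented for the example and are not drawn from any actual ADK or assembly description language.

```python
# Illustrative sketch only: a hypothetical placement-rule check of the kind an
# assembly design kit (ADK) could formalize. Rule names and values are invented.

from dataclasses import dataclass

@dataclass
class Chiplet:
    name: str
    x: float       # lower-left corner, mm
    y: float
    width: float   # mm
    height: float

def edge_gap(a: Chiplet, b: Chiplet) -> float:
    """Shortest edge-to-edge distance between two axis-aligned chiplets (mm)."""
    dx = max(b.x - (a.x + a.width), a.x - (b.x + b.width), 0.0)
    dy = max(b.y - (a.y + a.height), a.y - (b.y + b.height), 0.0)
    return (dx**2 + dy**2) ** 0.5

def check_min_spacing(chiplets: list[Chiplet], min_gap_mm: float) -> list[str]:
    """Return human-readable violations of a hypothetical minimum-spacing rule."""
    violations = []
    for i, a in enumerate(chiplets):
        for b in chiplets[i + 1:]:
            gap = edge_gap(a, b)
            if gap < min_gap_mm:
                violations.append(f"{a.name} and {b.name}: gap {gap:.3f} mm < {min_gap_mm} mm")
    return violations

placement = [
    Chiplet("cpu", 0.0, 0.0, 10.0, 10.0),
    Chiplet("hbm0", 10.2, 0.0, 8.0, 10.0),   # only 0.2 mm from the cpu chiplet
]
print(check_min_spacing(placement, min_gap_mm=0.5))
```

Today that kind of constraint typically arrives as informal guidance from the OSAT or foundry rather than as machine-readable rules, which is the gap Park is describing.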

Mastroianni: System disaggregation is a very challenging problem. With the flexibility and technology we have available, there are many different ways you can decompose that system. But it’s not obvious to the architects how best to do that. It’s critical to be able to do some early analysis and look at the implications of the implementation — early predictive analysis of thermal issues or performance issues — because it’s very expensive to do these chips. If you’ve picked the wrong micro-architecture, that could be game over. It’s a whole paradigm shift, where traditionally it’s been an over-the-wall process. The system designers would develop specs, and then the RTL guys would write the code, and the chip guys would do the chip, and the DFT guys would work with them. And then the packaging guys would get involved when the chips were almost done.

But now, all this stuff is being done concurrently. You have to look at all these pieces together, with all these disciplines. One of the challenges there is that the workflows and the tools are not integrated. The tools are still separate, and the designers have different areas of expertise. It’s going to take time to get the people using the tools, and the workflows, all working together to make sure they’re picking the right architecture. I see designers who now have to worry about thermal issues when they’re stacking dies and putting them in the package. That’s something only the packaging guys used to worry about. And you have to simulate mechanical stress. In traditional package design, you could do a test chip and evaluate the package, and be pretty confident it’s going to work. But when you have multiple chips in a package on an interposer, every design is different. So you can’t rely on test chips. You really need multi-physics simulations to make sure you don’t have problems once you put everything together.

The last point is known good die, which is a misnomer, because when you’re testing die, you’re testing them at the wafer level. You’re not testing performance, or doing the 100% testing that typically is done in the package. So you can definitely run into problems. It’s important to have the ability to test those chips, and to make sure you can test them in the context of the system, because once you put that die in the package, the environment is very different from what you test at wafer probe. There are different voltages, different thermal profiles, maybe even stress.

There’s a desire in the industry to deliver known good die, pop them in, and hope everything’s going to work. But in reality, we’re going to need the ability to test those die once they’re integrated into the package. There are design techniques and DFT methods out there, but it is going to put the onus on the chiplet developers to provide those capabilities, so that a chiplet can be tested within the context of the system and the package.

Otte: We have to go one step further than known good die and start designing the chips with redundancy in them, so that if a defect appears out of the blue, you have a workaround. There has to be extra capacity. The memory guys have done this by using error correction codes. We have to do a similar kind of thing, but more broadly, because there’s a limit to how far we’re going to be able to get with testing to guarantee they’re right. We have to build in redundancy somehow.
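To make the ECC analogy concrete, below is a minimal single-error-correcting Hamming(7,4) sketch of the kind of redundancy memory designers rely on. It is purely illustrative of the ‘detect and work around’ idea, not a proposal for how chiplet-level redundancy would actually be implemented.

```python
# Minimal Hamming(7,4) sketch: single-error correction of the sort memory ECC uses,
# shown only to illustrate the "build in redundancy" idea. Not production code.

def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c: list[int]) -> list[int]:
    """Detect and correct a single flipped bit, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # position of the flipped bit, 0 if none
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                          # inject a single-bit fault "out of the blue"
assert hamming74_correct(word) == [1, 0, 1, 1]
```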

Mastroianni: Absolutely. You need the ability to detect the failure, so you need testability. And then you need repairability so you can work around those through redundancy. That’s an excellent point. And when you get into automotive and medical-type applications, you need triple redundancy. That has to be part of the system design.

Otte: And it’s not just at the initial build. It has to work over the lifetime of the product for 10 years. Over time, as things start to fail, the system self-corrects.

Mastroianni: You can have aging sensors built into the chip, so you can predict when a part is starting to run out of gas. That’s important. In automotive and other applications, every time you start the car up, you’re going to test everything to make sure it’s all working properly. If it isn’t, then you either don’t run the car, or you kick in some of the redundancy and the yellow or red ‘check engine’ light comes on.
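As a simple illustration of the start-up flow Mastroianni describes, a power-on self-test decision might be sketched as follows; the block names, spare counts, and verdicts are hypothetical and stand in for whatever built-in self-test and repair infrastructure a real design would use.

```python
# Hypothetical power-on self-test flow: run built-in tests, try to cover any
# failures with spare resources, and decide whether it is safe to run.
# All names and values are illustrative.

from enum import Enum

class Verdict(Enum):
    RUN = "run"
    RUN_DEGRADED = "run, check-engine light on"
    DO_NOT_RUN = "do not run"

def power_on_self_test(block_results: dict[str, bool],
                       spares_available: dict[str, int]) -> Verdict:
    """block_results maps block name -> passed; spares_available maps block -> spare count."""
    failed = [name for name, passed in block_results.items() if not passed]
    if not failed:
        return Verdict.RUN
    if all(spares_available.get(name, 0) > 0 for name in failed):
        return Verdict.RUN_DEGRADED      # redundancy can cover every failure
    return Verdict.DO_NOT_RUN

print(power_on_self_test({"cpu": True, "memory": False}, {"memory": 1}))
```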

Kapoor: Automotive is one area where it’s becoming mandatory. It’s also happening in data centers, where you have to ensure quality of service.

Otte: It’s a whole new paradigm that we have not dealt with before. There are several of those being forced on us as things get smaller. Another problem we see in almost every job that comes to us related to chiplets — meaning devices with anywhere from 100 to 10,000 copper pillars to go on a substrate — is a lack of co-planarity of either the substrate or the die. Everybody thinks these organic substrates are flat. If you work the numbers, they have to be flat to 1 part in 10,000. In almost every job we get in-house, they will not work because of a lack of co-planarity. We bought a couple of high-end tools so that for any of these high-bump-count designs, the first thing we do is check planarity. Otherwise, there’s no way we’re going to be able to get 1,000 bumps to make connections, yet the designer’s assumption is that all 1,000 will work.
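Otte’s 1-part-in-10,000 figure is easy to turn into numbers. A back-of-the-envelope check, with an assumed 30 mm span and invented height-map readings, might look like the sketch below; real co-planarity metrology is far more involved.

```python
# Back-of-the-envelope co-planarity check, following the "1 part in 10,000"
# figure quoted above. The span, flatness ratio application, and measured
# heights are invented for illustration.

def coplanarity_budget_um(span_mm: float, flatness_ratio: float = 1e-4) -> float:
    """Allowed height variation (µm) across a span for a given flatness ratio."""
    return span_mm * 1000.0 * flatness_ratio

def worst_deviation_um(measured_heights_um: list[float]) -> float:
    """Peak-to-peak spread of a measured height map (µm)."""
    return max(measured_heights_um) - min(measured_heights_um)

span_mm = 30.0                           # die/substrate span
budget = coplanarity_budget_um(span_mm)  # 3.0 µm at 1:10,000
measured = [0.0, 1.2, 4.8, 2.1, 0.4]     # hypothetical profilometer samples, µm
spread = worst_deviation_um(measured)
print(f"budget {budget:.1f} µm, measured spread {spread:.1f} µm, "
      f"{'OK' if spread <= budget else 'NOT co-planar enough'}")
```

At a 30 mm span, 1 part in 10,000 works out to only about 3 µm of allowed variation, which is why nominally flat organic substrates so often fail the check.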

Kelly: With silicon, you think of an extremely elegant and nuanced system that has evolved over time to get to where we are today. If you look at PDKs and the work it took to get there, those are well-defined and very precise. And mechanically, with bulk silicon and a monolithic die, things are quite simple. You have a piece of bulk silicon that has a high modulus, a known CTE, and everything else is inorganic.

From a mechanical standpoint, when moving to a system-level heterogeneous package, you really have to be cautious. In the mechanical world, the scale is grossly different at the package level versus a piece of silicon. For example, with high-speed die-to-die buses, that’s fairly fine-line circuitry. It could be a couple thousand lines between die, but its scale is orders of magnitude larger than in a die. So things that don’t necessarily matter in a die really do matter at the package level. There are so many things that can go wrong mechanically in a package if you don’t do it right. You need an awful lot more mechanical and thermal simulation than you ever had to do before in order to get to a place where you have confidence. If you look at electromigration, we’re running a lot of high currents through the package, as we always have. But now we’re distributing those currents into finer lines. Electromigration needs to be better defined, and thoroughly tested and validated. We know how to do it, but you have to do more of that kind of validation than you’re used to doing. And thermal goes hand in hand with mechanical stress. With temperatures distributed in x, y, and z across the package, that impacts stresses as well. So you have to get a handle on those things, and you have to be cautious.
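Kelly’s point about currents in finer lines can be illustrated with a rough current-density calculation: the same current through a narrower trace gives a proportionally higher J. The line dimensions, current, and the electromigration limit below are placeholders for illustration, not qualified design values.

```python
# Rough current-density check to illustrate electromigration risk in finer lines:
# the same current through a narrower trace gives a higher J.
# The EM limit used here is an invented placeholder, not a qualified value.

def current_density_MA_per_cm2(current_a: float, width_um: float, thickness_um: float) -> float:
    """J = I / (w * t), returned in MA/cm^2."""
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)
    return current_a / area_cm2 / 1e6

ASSUMED_EM_LIMIT_MA_PER_CM2 = 1.0   # placeholder design limit

for width_um in (20.0, 5.0, 2.0):   # same 100 mA through progressively finer lines
    j = current_density_MA_per_cm2(0.1, width_um, thickness_um=3.0)
    flag = "over assumed limit" if j > ASSUMED_EM_LIMIT_MA_PER_CM2 else "ok"
    print(f"{width_um:5.1f} µm wide: J = {j:.2f} MA/cm^2 ({flag})")
```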

In the packaging world, we talk about test vehicles. You build test vehicles that look just like a product and you understand the stresses, the strains, the thermal performance, and you try and de-risk it over time so that you have something coming out at the end that is sufficiently reliable. Caution is the note of the day. Do your due diligence on test vehicles and don’t assume anything. You may have 10,000 70-micron-diameter bumps, and you want to attach them to a non-flat substrate. You’ve got to manage warpage and all those sorts of things during assembly and know that they’re going to be okay.

We took a four-chiplet module, about 30 x 30 millimeters, and assembled it into a package. We temp-cycled that to 5,000 temp cycles, Condition B, which is a very harsh temp-cycling condition that JEDEC specifies for laminate-based IC packages. There are places where, by switching up and changing the bulk CTE of a module compared to silicon, you can fundamentally lower stresses compared to a single big die on a very high-CTE organic substrate. So there are some positive things coming up, where, thankfully, the system is going to be quite forgiving with regard to stress. But there also are a lot of ways to do it wrong. Test vehicles and lots and lots of due diligence are required at this early stage, especially where you don’t have broad swaths of truth out there that you can always rely on. You’re sort of creating them piece by piece.
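The CTE point can also be put in rough numbers. The sketch below estimates first-order thermal-mismatch strain (the CTE difference times the temperature swing) for a bare silicon die versus a hypothetical module with a higher bulk CTE, both on an organic substrate. The material values and temperature swing are typical textbook numbers, not data from the test vehicle described above, and real stress depends on geometry and compliance that this ignores.

```python
# First-order CTE-mismatch estimate: mismatch strain ~ (alpha_a - alpha_b) * dT.
# Material values and the temperature swing are typical textbook numbers, not
# qualified data; real stress also depends on geometry and compliance, which
# this sketch ignores.

def mismatch_strain_ppm(alpha_a_ppm: float, alpha_b_ppm: float, delta_t_c: float) -> float:
    """Thermal mismatch strain (ppm) between two materials over a temperature swing."""
    return abs(alpha_a_ppm - alpha_b_ppm) * delta_t_c

CTE_SILICON = 2.6             # ppm/°C, typical bulk silicon
CTE_ORGANIC_SUBSTRATE = 17.0  # ppm/°C, typical organic laminate
CTE_MODULE = 8.0              # ppm/°C, hypothetical bulk CTE of a chiplet module

DELTA_T = 180.0               # °C, illustrative temperature-cycling swing
for name, cte in [("single big silicon die on organic", CTE_SILICON),
                  ("chiplet module on organic", CTE_MODULE)]:
    strain = mismatch_strain_ppm(cte, CTE_ORGANIC_SUBSTRATE, DELTA_T)
    print(f"{name}: mismatch strain ~ {strain:.0f} ppm over a {DELTA_T:.0f} °C swing")
```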

Kapoor: At the end of the day, it boils down to your strategy. Analytics and monitoring sensors are a big part of this. You can address all these problems with preventive measures, like redundancy, but then you need some sort of predictive analytics to take all this data and collate it in a meaningful way.

Kelly: In situ monitoring can be very extensive if the designers choose to include it. This is a really big deal, especially if you’re shooting for high reliability, where you can monitor the degradation of things like outputs, I/O, and even eye diagrams in situ during reliability testing. Some customers are definitely using it heavily.

Kapoor: It’s in situ and in-field. You need both kinds of data and workflows both ways. That’s an emerging need.

Kelly: People started out simply with one or two chiplets. Now some companies are going to have 15 or 20 chiplets in a package from different foundries and different nodes. Being able to monitor this complex system in situ is an attractive proposal.
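A minimal sketch of the kind of in-situ degradation check being described, assuming a notional per-lane eye-height monitor with invented baselines and an arbitrary alarm threshold; real monitor IP, telemetry formats, and limits would differ.

```python
# Sketch of an in-situ degradation check: compare a monitored parameter (here, a
# notional eye height per lane) against its baseline and flag drift.
# Lane names, baselines, and thresholds are invented.

BASELINE_EYE_HEIGHT_MV = {"lane0": 120.0, "lane1": 118.0, "lane2": 121.0}
DRIFT_ALARM_FRACTION = 0.15     # flag a lane that has lost >15% of its margin

def degraded_lanes(readings_mv: dict[str, float]) -> list[str]:
    """Return lanes whose monitored eye height has drifted past the alarm threshold."""
    flagged = []
    for lane, baseline in BASELINE_EYE_HEIGHT_MV.items():
        reading = readings_mv.get(lane)
        if reading is not None and (baseline - reading) / baseline > DRIFT_ALARM_FRACTION:
            flagged.append(lane)
    return flagged

print(degraded_lanes({"lane0": 119.0, "lane1": 95.0, "lane2": 120.5}))  # ['lane1']
```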



