Chiplets And 3D-ICs Add New Electrical And Mechanical Challenges

Reliability is now a system-level concern that includes everything from materials and packaging to testing with backside power.

Key Takeaways
• Chiplets and 3D-IC architectures add new thermal-mechanical stresses that can affect the reliability of entire systems.
• As chiplets are assembled into packages, defectivity targets become more stringent for each component in a system.
• Traditional silos are breaking down, forcing design teams to address issues such as materials choices that previously were handled by the foundry.


The rapid adoption of chiplet-based architectures in data centers is forcing big changes in every aspect of design, from chiplets to packaging and into the field. Costs are rising rapidly, concerns about reliability are increasing, and previous approaches to keep costs down and make sure devices work as expected are running out of steam.

The focus is no longer just on electromigration and power integrity. It now includes thermal-mechanical stresses that can vary by workload, by the number and types of interconnects, and by how far designs extend along the z axis. Modeling needs to be precise, and mitigation strategies need to be well understood at both the circuit and system levels. EDA tools are evolving to address these issues, incorporating new capabilities for stress analysis, materials management, and interface verification.

“Reliability is the biggest challenge for chiplets and 3D-ICs,” said Pratyush Kamal, director of central engineering solutions at Siemens EDA. “It becomes extreme. Chiplets are designed to a certain level of defectivity, reliability, and constraints. Let’s say you have a monolithic chip. You design it for 10 defective parts per million (DPPM). When you decide to go the chiplet route, you put in two of those instead of one monolithic design, or maybe three or four. Each of these parts can now individually fail in a package, and the package itself can add new failure modes. That means each of these chiplets now needs to be designed with a very low DPPM to collectively achieve the single DPPM target. That means the fundamental approach to 2D design itself must change.”
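The arithmetic behind Kamal's point can be sketched in a few lines of Python. This is a minimal illustration, not a method from the article: it assumes failures are independent, that any single failing chiplet or assembly defect fails the whole package, and the function names and the 2 DPPM assembly figure are hypothetical.

```python
def package_dppm(chiplet_dppm, n_chiplets, assembly_dppm=0.0):
    """Package-level DPPM when any single failure fails the package.

    Assumes independent failures, which is a simplification.
    """
    p_die = chiplet_dppm / 1e6
    p_asm = assembly_dppm / 1e6
    p_good = (1.0 - p_die) ** n_chiplets * (1.0 - p_asm)
    return (1.0 - p_good) * 1e6


def chiplet_budget(target_dppm, n_chiplets, assembly_dppm=0.0):
    """Per-chiplet DPPM that keeps the package at target_dppm."""
    p_target = target_dppm / 1e6
    p_asm = assembly_dppm / 1e6
    if p_asm >= p_target:
        raise ValueError("assembly defects alone exceed the package target")
    # Solve (1 - p_die)^n * (1 - p_asm) = 1 - p_target for p_die.
    p_good_die = ((1.0 - p_target) / (1.0 - p_asm)) ** (1.0 / n_chiplets)
    return (1.0 - p_good_die) * 1e6


# Hypothetical scenario: 10 DPPM package target, 2 DPPM lost to assembly.
for n in (1, 2, 4):
    budget = chiplet_budget(target_dppm=10, n_chiplets=n, assembly_dppm=2)
    print(f"{n} chiplet(s): about {budget:.2f} DPPM allowed per chiplet")
```

Under these assumptions, four chiplets sharing a 10 DPPM package target are each left with roughly 2 DPPM once the hypothetical assembly loss is subtracted, which is the tightening the quote describes.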

In theory, chiplets can help because they are smaller and more limited in function than an SoC, thereby reducing complexity and variation and making them easier to verify, inspect, and test. “They do not bring more challenges, and sometimes they even simplify the problems,” said Moshiko Emmer, distinguished engineer at Cadence. “Think of someone designing a system, and instead of doing it with a single SoC, breaking it down to multiple chiplets, usually around some specific functionality. This means that every chiplet is a smaller piece of silicon and has less content. Sometimes it’s the NoC and some features around specific functionality. In some cases, it needs to support fewer clock frequencies or lower power scenarios. Chiplets can simplify a lot of things.”


Fig. 1: Disaggregation and specialization. Source: Bryon Moyer/Semiconductor Engineering

Others agree. “Chiplets can enable more reliability because you’re able to use the technology that’s appropriate for the given circuit,” said Nigel Drego, CTO and co-founder of Quadric. “As we’re scaling down, analog becomes increasingly hard. SRAM has stopped scaling once you get past, say, 3nm or so, and if you’re struggling to just get what you need out of a process, then that doesn’t leave you as much time/effort for trying to make it reliable because you’re just focused on functionality. If you can use the technology that’s most appropriate for the circuitry you’re trying to build for a given application, then there are a couple of things that happen. One is cost reduction. If you can keep your analog at, say, 12nm — where it’s a very well understood process, it’s depreciated, circuit designers have done a lot with it already, and you’re not getting any gains from going down further — then why wouldn’t you just use that?”

Chiplet reliability and yield are only part of the picture, however. Packaging is more variable today, and so are the interconnects and bonds used to attach these chiplets to an interposer or substrate. All of that will likely change as chiplets become more mainstream, but it will take time.

“This is very temporary and will be sorted out, because what we do when we’re getting chips onto PCBs is a harder problem than trying to put chiplets onto silicon,” Drego said. “On another silicon substrate, where you have more control of that, you likely have less overall interference.”

Still, there are a lot of new elements in multi-die assemblies, from thinner dies and different bond materials to complex interconnect schemes and floor plans.

“Well-known reliability issues have been joined by a Pandora’s Box of new reliability issues that were not relevant before, or were relegated to the package level,” said Marc Swinnen, director of product marketing at Synopsys. “In a monolithic design, somebody would look at that in packaging. But now it’s in the floor planning of the 3D-IC. The primary reliability issue with chiplets and 3D-ICs is mechanical warpage and stress, because the warpage can lead to mechanical cracking. The stress, however, can lead to long-term failure, but it also changes electrical properties.”

EDA companies are now working with the foundries to understand how stress impacts electronic performance. “Transistors are made with stress built into them explicitly to get the properties they want, so stress is not a stranger to design,” Swinnen said. “But external stresses change the electrical properties of the transistors. Can we do a calculation and see how much stress? That loop is not quite closed yet.”

Die-to-package methodologies and technologies are still evolving. “We used to work in a world where every package had only one silicon die, and as part of a divide-and-conquer approach the SoC die world was completely separated from the package world,” said Cadence’s Emmer. “The SoC architect, designer, verification engineer, physical designer, etc., were focused on everything inside the SoC. The package work only came afterward, and there was a complete separation. Of course, there was a handshake that said, ‘These are my bounding conditions on the chip that I need to tell the package. I need to indicate where the bumps are, the electrical characteristics, what I need from power, and so on. And of course, I’m designing based on these agreements, and I’m meeting these specs.’ Then the package would take them as inputs and make sure that everything around it supports it. But it was completely separate. You do the die, you go to tapeout. The package starts very close to tapeout and continues afterward. The world of chiplets is changing that completely.”

Thermal-mechanical stress
One of the biggest changes with chiplets is the need to focus on thermal-mechanical stress, often due to different coefficients of thermal expansion (CTEs) in different materials.

“When they’re assembling these chiplets, they have to push these chips on top of each other to get the bonds to adhere, and there’s mechanical stress from the outside simply doing the manufacturing,” Synopsys’ Swinnen explained. “We’ve had repeated requests from customers to be able to model the manufacturing stresses, as well. In talking to one of the foundries, they said when you put these chips together and press them down on each other, the force required to squash these little solder bumps together is not big. But when there are a million of them, it becomes quite a big pressure on them. Also, the chips are allowed to bend a certain amount in a concave shape. There’s a limit as to how much they can bend concave, but they cannot ever bend convex. So there’s not just endogenous thermal-mechanical stress. External stresses also have to be considered. Then there are thermal cycling, delamination problems, and cracking of the bonds. There are so many of these little bonds, they’re so tiny, and they’re carrying huge currents that they become reliability issues.”
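A rough sense of scale for both effects, CTE mismatch and aggregate bonding force, can be had from first-order formulas. The sketch below is illustrative only: the material constants are approximate textbook values, the 100 K temperature swing and the 1 mN per-bump force are hypothetical, and the thin-film biaxial stress formula is a simplification that ignores geometry, plasticity, and underfill.

```python
def mismatch_strain(alpha_a, alpha_b, delta_t):
    """Thermal mismatch strain between two bonded materials (dimensionless)."""
    return abs(alpha_a - alpha_b) * delta_t


def biaxial_film_stress(youngs_modulus_pa, poisson_ratio, strain):
    """First-order biaxial stress in a thin film constrained by a thick substrate."""
    return youngs_modulus_pa / (1.0 - poisson_ratio) * strain


def total_bond_force(force_per_bump_n, bump_count):
    """Aggregate force when every bump needs the same small force."""
    return force_per_bump_n * bump_count


# Approximate, illustrative values (assumptions, not from the article).
ALPHA_SI = 2.6e-6    # silicon CTE, 1/K
ALPHA_CU = 17e-6     # copper CTE, 1/K
E_CU = 120e9         # copper Young's modulus, Pa
NU_CU = 0.34         # copper Poisson ratio

strain = mismatch_strain(ALPHA_SI, ALPHA_CU, delta_t=100.0)
stress = biaxial_film_stress(E_CU, NU_CU, strain)
force = total_bond_force(force_per_bump_n=1e-3, bump_count=1_000_000)

print(f"mismatch strain: {strain:.2e}")
print(f"first-order stress: {stress / 1e6:.0f} MPa")
print(f"total bonding force: {force:.0f} N")
```

Even with a tiny hypothetical force per bump, a million bumps add up to a load on the order of a kilonewton, which matches the point Swinnen relays from the foundry.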

3D-ICs add other challenges, such as connecting through-silicon vias (TSVs). “That’s where reliability has come in, along with some of the traditional problems like power integrity,” Swinnen said. “Now, when it’s done, it’s not just the chip. It’s the whole system, and that makes it a really gnarly problem. It’s the same with electrostatic discharge. Now you need to have electrostatic discharge paths that go through several chiplets. How do you verify that path is secure? It really amplifies it. And there are some new ones, like mechanical stress and warpage.”

Different materials add other challenges. “The chip designer never had to worry about the materials,” he said. “The foundry laid them down, it was all very fixed, and that was it. But once you get to interposer, there are various choices and options when it comes to the cooling and thermal interfaces. There was always a little bit of an issue with the packaging people, and now the chip people have to get involved in it much more. So there’s a looming materials choice and materials management challenge.”

Start with the process technology
Since circuit reliability starts with process technology, much of the focus is directed there. “When we look at finFETs and nanosheets and the sea of logic gates, they look very uniform — at least up to a certain metal level,” Siemens EDA’s Kamal said. “At the front end of line (FEOL), there is a continuity of the fins on the transistor layers. But there are challenges even there. For example, one of the foundries failed massively with a basic NAND gate, where there are two transistors laid out in series. In a standard cell, there are two types of connections in the standard cell, and one is the I/O. The other is the power delivery. So which node is more susceptible to noise?”

When 1,000 instances of a standard cell are placed across the die, they all see a very different context. “There is a lot of local and global variation in these processes, and the variation is increasing as the process complexity increases,” Kamal said. “You want to make sure that the circuit nodes are not noise-susceptible. You want to control the timing of your I/Os so they don’t change a lot in your place-and-route flow. You want to keep the I/O more focused layout-wise, with as much as you can internal to the standard cell, not exposed to the boundary of the standard cell. This foundry had done the opposite. You should use your power and ground supplies to connect the left and right sides of the standard cell on the outside, and use the standard connection for the I/Os. They did the opposite. As a result, when the teams were trying to place-and-route those library cells, they could not close the timing across the sigma.”

Transistor-level issues are more complex when it comes to chiplets and 3D-IC design. “While the designer can’t do much at the standard cell level, the foundry can, because they are the ones providing the library cells,” Kamal said. “The foundries must focus on making sure that they are designing these library cells with these fundamental things in mind. When you look at a flop, any latch structure you have now, that’s where the failures happen. When you look at cross-domain crossings, voltage domain crossings, domain crossings, and reset domain crossings, all of those must be looked at very, very carefully when you’re designing the standard cell. You must target a lower DPPM than you have ever done before.”

Then, at the block level, reset domain crossings must be designed carefully because with chiplets and 3D-ICs, there is a showstopper in the flow used today. When silicon comes back and it’s not working, the engineering team goes through the debug process.

“You are using your IJTAG interfaces to look inside the chip, but sometimes you realize that half the parts are showing a state of zero, other parts are showing a state of one, and that’s the cause of the failure,” Kamal explained. “But then, before you redesign and re-mask everything, because mask costs are $20 million or $30 million, you want to do more debugging. You want to make sure your assumptions are correct. We use focused ion beam (FIB), and go from the back side of the silicon and make changes to the circuit because the transistor is going to be in the FEOL layer. We generally limit the use of FIB close to the source and drain terminals or the gate terminal of the transistor. Going from the back side is easy. There is no metal there today. On the front side, there are so many metal layers, and you must not cut through them, or you will destroy the circuit.”

But that changes with backside power delivery, which Intel began using at 2nm (20A). “If you look at a 3D-IC stack, every 3D-IC stack will have one layer with backside metal, so you can’t do FIB anymore,” he said. “And since failure is not an option anymore, how do we manage this? In the case of analog, we do basic redundancy. Today, we do dual and triple redundancy in automotive. In automotive, we have lockstep cores and the like, but that’s a very expensive way of doing redundancy. Now we must take this coarse-level idea of redundancy, and take it a little more granular, because 3D-IC is going to be expensive. These 2nm nodes are super expensive. We must keep optimizing them. We can’t afford to have two dual cores, where in case one fails they use the other. That’s what the server class is doing today. That’s why Intel still manages to use theirs at maximum, because of the nature of their design, which is a repetition of lots so some can fail. In several computing classes, we have done that, but in the mobile space and where the 3D-IC comes in, you don’t have that homogeneity of layout or architecture. Redundancy is important, but you need to do it at a lower level so you can do an optimal redundancy rather than just duplicating the cores.”
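The trade-off between duplicating whole cores and adding finer-grained spares can be illustrated with a simple binomial model. This is a toy sketch under strong assumptions: blocks fail independently, every block has the same survival probability, and spare blocks are interchangeable with any failed block. The last assumption is precisely what Kamal says mobile and 3D-IC designs often lack. All names and numbers are hypothetical.

```python
from math import comb


def p_at_least(k, n, p_good):
    """Probability that at least k of n independent blocks are good."""
    return sum(
        comb(n, i) * p_good**i * (1.0 - p_good) ** (n - i)
        for i in range(k, n + 1)
    )


def duplicated_core(blocks, p_good):
    """Two full copies of a core; the system works if either copy is fully good."""
    p_core = p_good**blocks
    return 1.0 - (1.0 - p_core) ** 2


def spared_blocks(blocks, spares, p_good):
    """One core with extra interchangeable blocks; needs `blocks` good ones."""
    return p_at_least(blocks, blocks + spares, p_good)


# Hypothetical core of 8 blocks, each with a 99% chance of being good.
blocks, p_good = 8, 0.99
print(f"duplicated core (2.00x area): {duplicated_core(blocks, p_good):.4f}")
print(f"one spare block (1.125x area): {spared_blocks(blocks, 1, p_good):.4f}")
```

In this toy case a single spare block gives a higher chance of a working core than full duplication, at a fraction of the area. Real designs with heterogeneous blocks would need a more detailed model.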

These cell- and block-level strategies only go so far. Once multiple chiplets share a package, architectural and packaging decisions become just as important to reliability.

When designing with multiple chiplets, packaging considerations must be taken into account at the architecture and planning stages, which is very early compared to a classic SoC project lifecycle.

“When you look at a system that you want to build with multiple chiplets, first, you can build bigger systems, and you can bring more silicon into the same package,” Cadence’s Emmer said. “It will be separate chiplets, separate dies, and there are some considerations that you need to make to ensure your design meets the specs. For example, if you want to do something for edge devices, you need to meet specific reliability considerations. If you want to do something for the data center or infrastructure world, those have different aspects. When I’m architecting a system that is built with chiplets, it doesn’t matter if the chiplet is on a mature process node or a newer process technology. Usually, we see a mix of them, and I need to think not only about how I’m distributing the component between the chiplets, but also how I’m going to integrate these chiplets together. I also need to choose what kind of integration solution I will use.”

There are multiple options for chiplet integration. “We can do simpler integrations through substrate, like organic substrate, simple UCIe, for example, standard packaging integration,” Emmer said. “We can do more advanced ones with interposers or bridges, and the aspects of side-by-side or stack-up dies with hybrid bonding are also coming into effect. All of these need to be thought of already in the architecture and design stage. The industry also needs to introduce new EDA solutions and tools to be able to verify these conditions, because if you think about reliability, you can split the world into two paths for where interconnect reliability fails. One is through the interconnect, through the line, whether the signal or the current that goes through it, so that the metal line degrades over time. This is one type. The more common and problematic failure is wherever there is a connection, whenever there is an interface between a line and something.”
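The first failure path Emmer describes, a metal line degrading as current flows through it, is electromigration. A common textbook model for it is Black's equation, MTTF = A · J^(-n) · exp(Ea / kT). The sketch below uses it only to compare two operating points, so the constant A cancels. It is not a method attributed to Emmer or Cadence, and the activation energy and current exponent are placeholder values that depend on the metal system and process.

```python
from math import exp

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(j_ref, j_new, t_ref_k, t_new_k, ea_ev=0.9, n=2.0):
    """Return MTTF(new) / MTTF(ref) under Black's equation.

    j_* are current densities in any consistent unit, t_* are in kelvin.
    ea_ev and n are illustrative placeholders, not characterized values.
    """
    current_term = (j_ref / j_new) ** n
    thermal_term = exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_new_k - 1.0 / t_ref_k)
    )
    return current_term * thermal_term


# Hypothetical comparison: same current density, line running 20 K hotter.
ratio = em_lifetime_ratio(j_ref=1.0, j_new=1.0, t_ref_k=358.15, t_new_k=378.15)
print(f"relative lifetime when 20 K hotter: {ratio:.2f}x")

# Hypothetical comparison: same temperature, current density doubled.
ratio = em_lifetime_ratio(j_ref=1.0, j_new=2.0, t_ref_k=358.15, t_new_k=358.15)
print(f"relative lifetime at 2x current density: {ratio:.2f}x")
```

The second failure path, at connections and interfaces, is not covered by this model and typically needs separate mechanical and thermal analysis.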

Addressing these architectural and integration choices does not settle the reliability question, because chiplet technologies and their packaging are still changing quickly.

“With chiplets today, especially since this world is still emerging and all the developments are still ongoing, it’s not a mature technology where everything is completely in production, and we only see small advancements,” Emmer said. “It’s still in the very high gradient time. We have new elements that we need to take care of that create reliability on both parts — on the connection parts, as well as on the actual material that we use to transfer the signals. You can think of the RDL interposer. This is something new. How do the signals propagate through it? What’s the impact on the reliability?”

Some designs meet the spec only marginally. Even when testing after silicon comes back, and again after packaging, shows a pass, the chip might still fail in the field.

“As the system becomes more problematic, you need to be able to do this type of verification at the package level, where you also include the information beyond boundary specs, as we used to do — as well as the internal information of the die when you do that package-level analysis,” Emmer said. “Reliability is a significant part of it, and it looks at both the actual interface and actual connection, as well as the signal that needs to go through a line. If there is a side-by-side integration of two chiplets, the distance the signal needs to go through is extended. Think of UCIe as the interface that connects two dies. This is a side-by-side connection. The dies are not above it. It’s not like the distance between them is zero. There is an interposer. There is some interface in the middle that connects them. With UCIe, that connection can be up to 25 millimeters. So we need to think of moving from microns at the chip level, to millimeters at the die-to-die level. This signal needs to stay reliable and immune. I need to be able to test or simulate my entire system ahead of building it. Otherwise, I will not have the ability to guarantee the operation. Looking 5 or 10 years down the road — with talk about a chiplet marketplace in which you can put a chiplet on the shelf and someone can just buy it and integrate it into a system — all these things need to be specified. The boundary of the chiplet needs to be specified, because you won’t know who is going to integrate it, in which package, and with what other components. All of these things have to be defined and standardized, and there’s no standardization for this as of today.”
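The jump from microns to millimeters can be made concrete with an idealized time-of-flight estimate. This sketch assumes a lossless line where the signal travels at c / sqrt(eps_r). The relative permittivity of 3.5, the 50-micron on-die wire, and the 16 GT/s data rate are illustrative assumptions, and real on-die wires are usually limited by RC delay rather than time of flight.

```python
from math import sqrt

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, mm per picosecond


def flight_time_ps(length_mm, eps_r):
    """Idealized propagation delay of a lossless line in picoseconds."""
    return length_mm * sqrt(eps_r) / C_MM_PER_PS


EPS_R = 3.5              # hypothetical dielectric constant of the channel
DATA_RATE_GT_S = 16.0    # hypothetical per-lane data rate
unit_interval_ps = 1000.0 / DATA_RATE_GT_S

for label, length_mm in (("50 um on-die wire", 0.05), ("25 mm die-to-die link", 25.0)):
    delay = flight_time_ps(length_mm, EPS_R)
    print(f"{label}: {delay:.2f} ps, {delay / unit_interval_ps:.2f} unit intervals")
```

Under these assumptions the 25 mm link holds more than one bit in flight at a time, while the on-die wire's flight time is a small fraction of a unit interval. That difference is one reason die-to-die channels call for system-level signal analysis.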

Conclusion
Chiplets have the potential to transform the chip industry, adding both flexibility and scalability. But they also create some complex challenges involving reliability, integration, and standardization that must be carefully addressed from the earliest stages of development. While advances in packaging and interface standards are promising, lingering concerns around cost and interface IP highlight the need for ongoing collaboration and innovation.

To make this all work, the industry must prioritize robust verification methods to ensure seamless integration and long-term functionality. Ultimately, the success of chiplet-based systems will depend on balancing technical progress with practical solutions to these unresolved issues.


Related Reading

Chiplet Fundamentals For Engineers: eBook
A 65-page in-depth research report on the next phase of device scaling.
Chiplets Knowledge Center



1 comment

Guy van der Walt - Thermco says:

Great piece. The point about each chiplet needing a fundamentally lower DPPM to achieve the same system-level reliability target is critical and it has implications upstream that don’t get enough attention.

If every die in a multi-chiplet package must hit tighter defectivity specs, that pressure flows directly back to FEOL process equipment: diffusion, deposition, oxidation, annealing. The thermal uniformity and process repeatability of the furnace stack becomes a first-order variable in whether you can actually hit those per-chiplet yield targets at scale.

The thermal-mechanical stress discussion is also relevant here. The intrinsic film stresses that ultimately contribute to warpage and delamination in advanced packages are largely set during FEOL thermal processing. Tighter control at the wafer level before you ever get to assembly is one of the most cost-effective mitigation strategies available.

The chiplet era doesn’t just change design and packaging. It raises the bar on the process tools that build the silicon in the first place.
