Why lane swapping is essential to meet assembly yield.
Redundancy in chiplet interfaces is now a prerequisite for achieving sufficient yield in high-performance computing devices, which today are packed with tens of thousands of interconnects. And as the number and density of those interconnects increase, the prospects for yield only worsen.
For more than two decades, high-speed I/O interfaces have included reliability strategies to manage in-field system board failures. For example, the PCI Express 2.0 standard, introduced in 2007, included 16 lanes for transactions. But if a lane failed, only 8 of those lanes were used, cutting the transaction rate in half. Commonly referred to as graceful degradation, this remains a strategy for computing chiplet interfaces such as UCIe.
But now HBM and UCIe interfaces include contingencies for single-lane failures in multi-die assemblies, specifying spare lanes to replace failing signal interconnects between chiplets. Once a failing lane is detected, it is swapped with a spare lane, which ‘repairs’ the interface. Engineering teams can swap out bad lanes throughout the manufacturing test flow. This capability also can be used in customer systems to accommodate assembly-related reliability and wear-out failure mechanisms.
Still, as advanced assembly processes enable dense 2.5D and 3D architectures, the number of signal interconnects in a chiplet-based system will climb to hundreds of thousands. That will challenge I/O test and repair strategies for both manufacturing and system-level reliability. As a result, test content needs to comprehensively detect I/O-related failures.
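A rough yield model shows why spares matter at these interconnect counts. The sketch below assumes that connection failures are independent, that every connection shares one failure probability, and that any spare can replace any failed lane. None of these assumptions comes from the interface standards, and the function name and the numbers in the usage lines are purely illustrative.

```python
from math import comb


def interface_yield(lanes: int, spares: int, p_fail: float) -> float:
    """Probability that an interface is usable after repair.

    Assumes independent failures and that any spare can replace any lane,
    so the interface survives if at most `spares` connections fail.
    """
    total = lanes + spares
    return sum(
        comb(total, k) * p_fail**k * (1.0 - p_fail) ** (total - k)
        for k in range(spares + 1)
    )


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    for spares in (0, 1, 2):
        print(spares, round(interface_yield(64, spares, 1e-3), 6))
```

Clustered defects violate the independence assumption, so a model like this is likely to be optimistic for assembly processes in which one defect takes out several neighboring connections.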
“For high-volume, high-data rate interconnect lanes, designers need to consider various types of faults that may manifest because of cross-coupling and other defect classes,” said Faisal Goriawalla, principal product manager for multi-die test-related SLM products at Synopsys. “Based on a comprehensive algorithmic set of patterns, if any of these lanes are deemed defective, then a mechanism must exist to perform lane repair by swapping out a bad lane with a good one.”
The upside is that tracking where and when failures occur can assist engineering teams with yield learning. Chiplet signal path defects can affect the PHY (a.k.a. the I/O circuit), the metallurgical connection from chiplet to interposer to package substrate, as well as the corresponding interconnects. With short transmission lines to drive, the I/O circuit design layout is more relaxed with respect to design rules. This, in turn, lowers the I/O circuit failure rate. Similarly, RDL and interposer interconnects do not push the process envelope.
Lane repair exists primarily to address the assembly process defects associated with TSVs, microbumps, and hybrid pads. Historically, only one or two connections were impacted by defects. However, depending on the assembly technology, the predominant defect mechanism can shift and the number of defective connections increases.
“If you are using EMIB (Intel’s Embedded Multi-die Interconnect Bridge) with a 25-micron pitch, the primary mechanism is open because you create a lot more vias,” said Sreejit Chakravarty, Ampere fellow and chair of IEEE P3405, the working group tasked with standardizing chiplet interconnect test and repair. “Thus, you’ll mostly see single-lane failures. Looking at a silicon interposer with microbumps (as small as 25-micron pitch), shorts occur more than opens. You see solder bumps merge, resulting in two adjacent lanes being shorted. Now with hybrid bonding, a dust particle or some kind of impurity could get in between the two surfaces. Then a whole bunch of connections around that dust particle will not be formed properly. Failures become clustered, affecting more than two bumps.”
Excessive warpage and bump/pad process variations also can span multiple connection points, especially at the die and wafer edges. These could result in multiple opens or non-wet contacts that eventually fail in the field.
Lane-swapping specifics
Each chiplet’s I/O and the interconnect between the I/Os together constitute a lane. Lane repair swaps out a bad lane in response to manufacturing defects and in-field failures.
Chiplet interface standards vary in their signals-to-spares ratio.

Table 1: Chiplet interfaces and their repair strategy. Source: Advantest
“One important and additional consideration in multi-die designs is that lane shift must be implemented in both dies,” said Synopsys’ Goriawalla. “The faulty lane output I/O is then placed into standby mode. These are important requirements for multi-die IC applications such as data center/HPC and automotive, which require high reliability from the perspective of lowered total cost of ownership or safety.”

Fig. 1: Redundant lanes in multi-die system. Source: Synopsys
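The requirement that both dies apply the same lane shift can be illustrated with a small sketch. It assumes a simple shift scheme in which logical lanes skip over faulty physical lanes and spill into spares placed at the end of the lane group. The function names and lane numbering are hypothetical and are not taken from the UCIe or HBM specifications.

```python
def shift_repair_map(n_lanes: int, n_spares: int, bad_lanes: set[int]) -> list[int]:
    """Map each logical lane to a working physical lane by skipping bad ones.

    Physical lanes are numbered 0 .. n_lanes + n_spares - 1, with the spares
    at the end. Raises ValueError when there are more bad lanes than spares.
    """
    good = [p for p in range(n_lanes + n_spares) if p not in bad_lanes]
    if len(good) < n_lanes:
        raise ValueError("not enough spare lanes to repair this interface")
    return good[:n_lanes]


def standby_lanes(n_lanes: int, n_spares: int, mapping: list[int]) -> list[int]:
    """Physical lanes left unused by the mapping (faulty lanes or idle spares)."""
    used = set(mapping)
    return [p for p in range(n_lanes + n_spares) if p not in used]


# Both dies must apply the same mapping so TX logical lane i still reaches
# RX logical lane i over the same physical connection.
tx_map = shift_repair_map(4, 1, {2})
rx_map = shift_repair_map(4, 1, {2})
assert tx_map == rx_map == [0, 1, 3, 4]
```

In this sketch, `standby_lanes` returns the physical lanes whose I/O would be placed into standby, which after a repair is the faulty lane itself.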
For HBM interfaces, research teams continue to explore various repair schemes. In a recent paper [1], the authors highlighted the need to consider clustered I/O failures. While graceful degradation schemes could be followed if there are clustered failures, the result is a significant reduction in data bandwidth.
Stacked-die designs therefore need a way to cope with clustered failures.
The proposed UCIe 3D standard takes into consideration the clustering of failures. This is very important for assembly processes using hybrid bonding, for which 5- to 10-micron pitches increase the likelihood that more than two hybrid pads will be affected by a particle.

Fig. 2: An assembly defect can affect multiple connections within a 3D die stack. Source: UCIe Consortium
To do so effectively, the standard will recommend that the pad and I/O layout be modular in nature. In a presentation last year, Debendra Das Sharma, UCIe Consortium chairman and Intel senior fellow, discussed this approach [2], noting that a defect may affect a 5 x 5 bump/pad area. Thus, the UCIe 3.0 architecture will be designed with modules of I/O for repair, along with the subsequent rerouting to the redundant module.
The specifics were described as follows:

Fig. 3: UCIe repair modularity in terms of bundles, showing an unrepaired row, a repaired row when one bundle has a defect, and two repaired rows when a defect impacts four modules. Source: UCIe Consortium
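The idea of repairing whole modules rather than single lanes can be sketched as follows. The code assumes a pad array divided into rectangular bundles, with each row of bundles carrying some number of spare bundles. The bundle dimensions, the per-row spare count, and all function names are hypothetical choices for illustration, not values from the UCIe specification.

```python
def square_defect(top, left, size):
    """Pads covered by a size x size defect whose top-left pad is (top, left)."""
    return {(r, c) for r in range(top, top + size) for c in range(left, left + size)}


def affected_bundles(bad_pads, bundle_rows, bundle_cols):
    """Return the set of (bundle_row, bundle_col) that contain a bad pad."""
    return {(r // bundle_rows, c // bundle_cols) for r, c in bad_pads}


def repairable_rows(bad_pads, bundle_rows, bundle_cols, spares_per_row):
    """Per bundle row, report whether its bad bundles fit in that row's spares."""
    per_row = {}
    for b_row, _ in affected_bundles(bad_pads, bundle_rows, bundle_cols):
        per_row[b_row] = per_row.get(b_row, 0) + 1
    return {row: count <= spares_per_row for row, count in per_row.items()}


# A 5 x 5 defect aligned with one 5 x 5 bundle touches a single bundle.
aligned = affected_bundles(square_defect(0, 0, 5), 5, 5)
# The same defect straddling a corner touches four bundles in two rows.
straddling = affected_bundles(square_defect(3, 3, 5), 5, 5)
```

The straddling case is why the number of spare bundles per row matters: in this sketch, one defect consumes two bundles in each of two rows.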
Still, to fully enable heterogeneous integration, the ability to connect chiplets to each other without revealing their proprietary internal design requires a standard means of testing I/O and repairing the connections between chiplets.
“If you are buying a chiplet that has UCIe or HBM, then you understand its standard protocol,” said Vidya Neerkundar, product director for Tessent at Siemens EDA. “How do I connect it? How do I test it? How do I do the lane repair? Existing chiplet interconnect standards come with built in mechanisms for these actions. But today there is no standard process for I/Os between two dies when you don’t know what the other die is expecting. Now, P3405 is a proposed standard that addresses I/O test and repair for these situations.”
P3405 [3, 4] is needed to enable interoperability, and thus integration of chiplets from different sources. This integration cannot be done without a standard to support I/O test and repair between chiplets. P3405 defines an I/O test and repair architecture that verifies whether an I/O connection needs replacement and performs the repair. It also will address the reality of clustered I/O failures.
“In the next two years, high-performance computing (HPC) will continue to drive the majority of 2.5D and 3D silicon designs,” said Bob Bartlett, director of test technology sales at Advantest America. “For I/O repair, UCIe and HBM are good enough to drive heterogeneous integrations as you start going to more 3D packages. But I don’t think it’s the cost of test any more. It’s the cost of yield. When all the design tools in the 3D space get better, we can have lower-cost 3D products. These would support a more diverse set of applications, which would be smaller, less complex, and lower power than HPC products. And for I/O lane repair, P3405 is something that could support these lower-cost applications.”
Manufacturing test implementation
Testing identifies the lanes to be repaired. Each redundancy scheme affects test programs and the test insertions from wafer to system test. With that in mind, the actual test flows that engineering teams implement can vary.
“Cost and quality always have driven test flow decisions,” explained Ken Lanier, principal technologist and director of strategic business development at Teradyne. “All things being equal, balancing scrap cost versus test cost and coming up with the lowest cost of manufacturing will always determine what tests are done at each insertion — classically, probe, package, system-level, and final product. Advanced packaging tends to contort things. For example, die-level packaging, memory stacks, and multi-die packaging, in general, will either make certain insertions impossible or drive test targeting faults introduced during later manufacturing steps.”

Fig. 4: A chiplet-based product test flow. Source: Teradyne
Both wafer and package test of chiplet interfaces require DFT to provide access and test content. At wafer test, not all of a die’s microbumps will be probed. Often, only test pads are probed. Once the die is assembled, there is no direct ATE contact to chiplet interfaces. Post-assembly I/O test needs to include pattern coverage for bridging and open defects. Because of the tight bump pitches and high data rates, coupling and crosstalk faults also need to be sensitized with appropriate test patterns.

Fig. 5: A high performance computing multi-die product and limited test access. Source: Teradyne
“For wafer-level testing, internal and near-end loopback BiST will then be performed to test the local TX and RX,” said Synopsys’ Goriawalla. “Specifically for the post die-to-die bonding stage, far-end loopback BiST and die-to-die BiST need to be executed to cover the TX and RX of both chiplets. The results of these tests that indicate defective lanes need to be written into the lane repair registers of the PHY, and aggregated across PVT corners so that the final repair signature will be programmed into local e-fuse/OTP elements. For subsequent power-on self-test (POST), the additional step is that we will need to first offload the repair signature (prior to running the test) to ensure that the known defective lanes don’t need to be re-tested.”

Fig. 6: Three types of I/O loopbacks. Source: Synopsys
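The aggregation step Goriawalla describes, in which defective lanes found at each PVT corner are combined into one repair signature, can be sketched as a set union. The corner names, lane numbers, and function names below are made up for illustration. How a real PHY encodes its repair registers or e-fuse/OTP contents is not covered here.

```python
def aggregate_repair_signature(failures_by_corner: dict[str, set[int]]) -> list[int]:
    """Union the failing lanes seen at every PVT corner into one signature."""
    signature: set[int] = set()
    for failing in failures_by_corner.values():
        signature |= failing
    return sorted(signature)


def lanes_to_test_at_post(all_lanes: range, signature: list[int]) -> list[int]:
    """Load the stored signature first, then test only lanes not repaired out."""
    known_bad = set(signature)
    return [lane for lane in all_lanes if lane not in known_bad]


# Hypothetical BiST results per corner.
results = {"slow_cold": {5}, "typical": set(), "fast_hot": {5, 11}}
signature = aggregate_repair_signature(results)  # value to program into e-fuse/OTP
post_lanes = lanes_to_test_at_post(range(16), signature)
```

Taking the union is the conservative choice: a lane that fails at any corner is treated as defective everywhere, so a power-on self-test can skip it.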
As to whether I/O lane repair should occur at wafer-level test, it depends. If the predominant I/O interconnect failure is assembly-related, then a die with an I/O failure in near-end loopback or a DC test at wafer level may simply be marked bad. But not always.
“It’s all a question of how much yield you want at final test,” said Chakravarty. “If it is very low for the overall repair flow, then you can look at the possibility of repairing the die at wafer test. Such a choice would be data-driven because you also have to assemble the dies and test the packaged product.”
If customers require I/O repair at wafer-level test, then repair status becomes another die property to consider when matching dies with similar characteristics. Information about repaired/swapped-out lanes is stored in an e-fuse/NVM, which makes matching possible.
“Lane repaired information is stored on the device insofar as the device is reconfigured to use some number of redundant lanes, if available, for die-to-die interfaces,” Teradyne’s Lanier explained. “The interesting part is that now the packaging process needs to ensure that whatever die that repaired die is connected to has a matching set of working lanes. This is one dimension of the overall problem of die-matching for chiplet-based devices.”
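One way to picture the matching problem is to treat a physical lane as unusable if it is bad on either die. The sketch below assumes that lane i on one die always connects to lane i on the other, and that both dies share the same spare budget. The die identifiers and function names are hypothetical.

```python
def can_pair(bad_a: set[int], bad_b: set[int], n_spares: int) -> bool:
    """Two dies can share a repaired interface if the lanes that are bad on
    either side still fit within the spare budget."""
    return len(bad_a | bad_b) <= n_spares


def find_partner(bad_a: set[int], candidates: dict[str, set[int]], n_spares: int):
    """Return the id of the first candidate die that can pair with die A, or None."""
    for die_id, bad_b in candidates.items():
        if can_pair(bad_a, bad_b, n_spares):
            return die_id
    return None


# Die A already has lane 3 repaired out and the interface has one spare.
partner = find_partner({3}, {"B1": {7}, "B2": {3}, "B3": set()}, n_spares=1)
```

Here die B1 is rejected because the pair would need two spares, while B2 is accepted because its repaired lane coincides with the one on die A.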
Whether redundant lanes are tested before they are swapped in depends on customer requirements. Multi-die products destined for automotive, aerospace, and other safety-critical applications will require pre-test of all spares.
“Let’s say we have 30 lanes and 2 spares,” Chakravarty said. “Initially, the I/O BiST will exercise 30 lanes. At the end of the test it finds 1 bad lane, number 17. The BiST will give that information to the muxing logic to swap out lane 17 with one of the redundant links. Next, the I/O BiST sequence will be rerun to make sure the connections are good. But with lots of redundant lanes a customer may require you to test that the redundant lanes are functioning, and this may be required across all test insertions as well as requiring that the redundant circuitry is exercised during burn-in.”
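Chakravarty’s 30-lane example can be walked through in a few lines of code. This is a minimal sketch of the test, swap, and retest loop, with a stand-in `run_bist` function in place of real I/O BiST hardware. The zero-based lane numbering, the function names, and the optional spare pre-test flag are assumptions made for illustration.

```python
def run_bist(active_lanes, defective):
    """Stand-in for I/O BiST: return the active physical lanes that fail."""
    return [lane for lane in active_lanes if lane in defective]


def bist_and_repair(n_lanes, n_spares, defective, pretest_spares=False):
    """Run BiST, swap failing lanes for spares, and rerun until clean.

    Returns the final logical-to-physical lane map, or None if spares run out.
    """
    lane_map = list(range(n_lanes))                    # logical -> physical
    spares = list(range(n_lanes, n_lanes + n_spares))  # physical spare lanes
    if pretest_spares:
        # Optional step: drop spares that fail their own test.
        spares = [s for s in spares if not run_bist([s], defective)]
    while True:
        failing = run_bist(lane_map, defective)
        if not failing:
            return lane_map
        for bad in failing:
            if not spares:
                return None
            lane_map[lane_map.index(bad)] = spares.pop(0)


# 30 lanes, 2 spares, lane 17 defective: lane 17 is rerouted to the first spare.
repaired = bist_and_repair(30, 2, {17})
```

Rerunning BiST after each round of swaps also catches a defective spare that was not pre-tested, at the cost of an extra pass.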
I/O swapping also can be performed in the field to cover reliability- or aging-based failures. This presents a dilemma for device manufacturers when a device has used all the spare lanes at the end of the test flow. The device still functions, but no spares remain. One possibility is to bin a device based upon the remaining number of spares.
“This is a good question, because there may be merit to bin parts based on the number of repair lanes,” said Vineet Pancholi, manufacturing test technologist at Amkor Technology. “This may be important while the technologies are continuing to mature. We have not yet observed customers binning parts with respect to available repair lanes post-manufacturing for both HPC and automotive. This may change once there is adequate data to document the benefits of in-field repair requirements.”
Teradyne’s Lanier concurred. “It depends. If it’s an external interface, maybe you can still sell the final device as a lower-performance version. It seems like the same would be true of an internal (die-to-die) bus, but that sounds really difficult to manage.”
Conclusion
Just as memory device designers embraced the reality of bit cell failures, multi-die device designers are embracing lane failures. I/O test and the subsequent repair by swapping in a spare will become more prevalent as chiplet interconnect numbers rise to the hundreds of thousands.
“It all comes down to defect density,” said Advantest’s Bartlett. “Assembly contacts, microbumps, pads — everything is getting smaller. Hybrid bonding is at 10 microns and headed toward 5 microns. Your I/O interfaces need to be robust. You need enough repairability to enable functionality for 10 years. And as you keep shrinking, you are going to need more repair capability.”
Put simply, redundant lanes have become necessary for higher yield after assembly.
References