Ensuring the reliability of multi-die systems with UCIe test and repair.
Multi-die systems are made up of several specialized functional dies (or chiplets) that are assembled in the same package to create the complete system. Multi-die systems have recently emerged as a solution to overcome the slowing down of Moore’s law by providing a path to scaling functionality in the packaged chip in a way that is manufacturable with good yield.
Additionally, multi-die systems enable product SKU flexibility through performance scaling to match the needs of different market segments, optimization of the process node per function by mixing and matching various process nodes in the same product, faster time to market, and lower risk.
To enable higher die-to-die routing density and support higher bandwidth traffic between dies, package technology has evolved to create new, advanced packages, based on silicon interposers (with TSVs) or silicon bridges and, more recently, redistribution layers (RDL), fanouts and HD substrates.
A key aspect of the success of multi-die systems is the ability to ensure testability of the system in the different phases of manufacture and assembly, as well as reliable operation in the field. Because they use extra assembly steps and more complex bumping and packaging technologies, multi-die systems require test and reliability procedures that go beyond what was state-of-the-art for monolithic designs.
The naked dies, and the package itself, should be pre-tested so that any defective die or package is detected before assembly. If a defective die is detected only after assembly, the complete multi-die system must be scrapped, with a serious impact on cost. The process of testing naked dies is called Known Good Die (KGD) testing.
The assembly process itself varies with the selected packaging technology. For example, chip-first technologies, where dies are placed first and the interconnect is built on top of them, do not allow for “known-good-package” testing, potentially resulting in scrapping good dies if the interconnect is faulty. On the other hand, chip-last technologies, where the interconnect is built separately and dies are assembled on top of it, enable pre-testing of the package prior to assembly, reducing the probability of good dies being scrapped.
The multi-die system testability solution can be divided into several aspects:
This article describes the benefits of a comprehensive testability solution that leverages UCIe IP to ensure multi-die system reliability.
High test coverage for the UCIe interface is achieved by implementing extensive testability features in the UCIe IP to screen out defective dies at the naked die testing phase. Some of the features include:
In addition, functionality that extends coverage to the die-to-die link after package assembly can help achieve a high level of test coverage, including:
Advanced packages enable high density routing with fine pitch micro-bumping and routing on silicon or RDL interposers. During the assembly process, some micro-bump connections may not be well formed and may break down. UCIe offers the ability to test and repair these connections after assembly in a way that recovers the potential yield loss.
UCIe test and repair is executed during production test and at link initialization. In the test phase each individual link is checked for defects at slow speed. Defective links are repaired by re-routing data to spare links that are pre-defined by the UCIe standard.
UCIe configurations targeting advanced packages include up to 8 spare pins per direction (TX and RX) to enable repair of all the functional links:
Test and repair is executed when there is no valid traffic on the die-to-die link. After repair is complete and the link is initialized, the link is assumed to be good and able to pass traffic without problems. The resulting PHY configuration, called the PHY repair signature, is stored in internal registers at both ends of the link.
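The test and repair flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the remapping scheme defined by the UCIe specification: the lane counts, the `lane_is_good` callback (standing in for the slow-speed per-lane check), and the `RepairSignature` type are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairSignature:
    """Logical-to-physical lane mapping produced by test and repair."""
    lane_map: tuple  # index = logical lane, value = physical lane


def run_test_and_repair(num_lanes, num_spares, lane_is_good):
    """Check each physical lane and remap defective ones to good spares.

    lane_is_good is a hypothetical callback that models the slow-speed
    per-lane test. Physical lanes num_lanes .. num_lanes + num_spares - 1
    are treated as the spare lanes (an assumption of this sketch).
    """
    # Spares are tested too: a defective spare must not be used for repair.
    spares = [p for p in range(num_lanes, num_lanes + num_spares)
              if lane_is_good(p)]
    lane_map = []
    for logical in range(num_lanes):
        if lane_is_good(logical):
            lane_map.append(logical)          # healthy lane keeps its position
        elif spares:
            lane_map.append(spares.pop(0))    # re-route to the next good spare
        else:
            raise RuntimeError(
                f"lane {logical} is defective and no spare is left")
    return RepairSignature(tuple(lane_map))


# Example: 16 functional lanes, 4 spares, physical lanes 3 and 17 defective.
defective = {3, 17}
signature = run_test_and_repair(16, 4, lambda lane: lane not in defective)
```

In this example logical lane 3 is re-routed to physical lane 16, the first spare that passes the check, while spare 17 is skipped because it is itself defective. The returned mapping plays the role of the PHY repair signature that both ends of the link retain.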
Degradation of microbump characteristics during operation, due to aging or other causes, may impact link performance. This will be detected at the protocol level by an increase in bit-error rate (BER) or, worse, by data being lost. In that case, the link is expected to be interrupted and a new test and repair step carried out.
However, some applications have stringent requirements in terms of continuity of traffic on the die-to-die link – they cannot tolerate interruption of traffic during operation. For these cases, a testability solution adds Signal Integrity Monitors (SIM) to each UCIe receiver pin.
Fig. 1: Link repair using built-in spare links.
SIMs are small blocks embedded in the receiver. During normal operation they continuously sense the signal at the receiver pin to identify variations in signal characteristics that can impact link performance or indicate that the link is no longer healthy and may fail in the near future.
The data gathered by the individual sensors is collected in a Monitoring, Test and Repair (MTR) controller, outside of the interface, for further processing. Aggregating the data from multiple UCIe links can provide instant insights into the health of the multi-die system and enable predictive maintenance of links.
If a specific link is predicted to be at risk of malfunction through this procedure, it can be disabled and data re-routed to one of the spare links, leveraging the UCIe PHY repair mechanism, even without traffic interruption.
Fig. 2: Health monitoring solution for UCIe links.
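As a rough illustration of the monitoring flow, the sketch below models an MTR-style controller that collects per-lane readings and flags lanes whose averaged signal margin has dropped below a threshold. The `HealthMonitor` class, the normalized "margin" metric, the window size, and the threshold value are assumptions made for this example; the article does not specify what the monitors measure or how the controller processes the data.

```python
from collections import defaultdict, deque
from statistics import mean


class HealthMonitor:
    """Toy stand-in for a controller that aggregates per-lane SIM readings."""

    def __init__(self, window=8, min_margin=0.6):
        self.window = window            # number of recent samples to average
        self.min_margin = min_margin    # hypothetical normalized threshold
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, lane, margin):
        """Store one reading for a lane, discarding the oldest if full."""
        self.history[lane].append(margin)

    def lanes_at_risk(self):
        """Lanes whose full-window average margin is below the threshold."""
        return sorted(
            lane for lane, samples in self.history.items()
            if len(samples) == self.window and mean(samples) < self.min_margin
        )


# Example: lane 0 is stable, lane 5 degrades over four samples.
monitor = HealthMonitor(window=4, min_margin=0.6)
for step in range(4):
    monitor.record(0, 0.9)
    monitor.record(5, 0.7 - 0.1 * step)
at_risk = monitor.lanes_at_risk()
```

Averaging over a window rather than reacting to a single sample is one simple way to avoid re-routing a lane because of a transient reading. A lane returned by `lanes_at_risk()` would be a candidate for re-routing to a spare before it actually fails.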
While the traffic pattern for most die-to-die interface use cases, for example in server splitting or scaling, is assumed to be stable during operation, in some use cases traffic may exhibit a bursty behavior. In such cases, it is desirable to bring the interface into a low power mode to save power while there is no traffic. Link re-initialization can be accelerated by avoiding the test and repair process and relying on the UCIe PHY repair signature that was created during the previous PHY initialization.
This concept can be further extended to situations where the die is completely powered down. In these cases, the PHY repair signature is retrieved from the PHY and stored in on-die non-volatile memory (eFuse or flash). The memory could store multiple signatures, covering different use cases or conditions, which gives the user additional flexibility.
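A minimal sketch of this save-and-restore idea follows, assuming the repair signature can be represented as a list of physical lane indices. The serialization format, the CRC check, and the function names are illustrative choices for this example, not part of UCIe or of any specific product.

```python
import json
import zlib


def pack_signature(lane_map):
    """Serialize a lane map and append a CRC32 so corruption is detectable."""
    payload = json.dumps(list(lane_map)).encode("ascii")
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack_signature(blob):
    """Return the stored lane map, or None if the blob fails its CRC check."""
    if len(blob) < 5:
        return None
    payload, crc = blob[:-4], blob[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != crc:
        return None
    return tuple(json.loads(payload.decode("ascii")))


def initialize_link(stored_blob, full_test_and_repair):
    """Reuse a valid stored signature; otherwise run full test and repair.

    full_test_and_repair is a hypothetical callback returning a lane map.
    """
    lane_map = unpack_signature(stored_blob) if stored_blob else None
    if lane_map is not None:
        return lane_map, "fast"       # test and repair skipped
    return tuple(full_test_and_repair()), "full"


# Example: a signature saved before power-down is reused at the next start.
saved = pack_signature((0, 1, 2, 16))
restored, path = initialize_link(saved, lambda: (0, 1, 2, 3))
```

The integrity check is a design choice of the sketch: if the stored signature cannot be trusted, falling back to the full test and repair sequence is slower but safe.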
Test time is an expensive commodity. It can be reduced by partitioning the test strategy hierarchically so that tests of different dies run in parallel. The hierarchy can be extended across multiple dies in a multi-die system by connecting the test infrastructure of the individual dies hierarchically. Such an approach allows access to all the dies in the multi-die system from a single JTAG (or similar) test interface in the main die.
Often, test time is limited by the time needed to load test vectors into, or read them from, the dies. JTAG interfaces can become a speed bottleneck. To overcome this limitation, designers can use existing high-speed interfaces, such as PCI Express (PCIe) or USB, as interfaces to the test equipment. Test vectors and commands are packetized for that interface and depacketized on the die during the production test phase.
Many dies do not have a high-speed interface. However, the UCIe die-to-die interface can be used during test to transport large test vectors and commands between dies at high speed. The UCIe die-to-die interface extends high-speed DFT access across the complete multi-die system without increasing the number of pins, which is particularly important for IO- and area-limited dies.
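The packetize and depacketize steps can be illustrated with a small sketch. The 4-byte header (sequence number and payload length) and the payload size are hypothetical; a real implementation would follow the packet format of the transport in use (PCIe, USB, or UCIe) rather than this toy framing.

```python
import struct

# Hypothetical header: 16-bit sequence number, 16-bit payload length.
HEADER = struct.Struct(">HH")


def packetize(vector, max_payload=64):
    """Split a test vector (bytes) into header-prefixed packets."""
    packets = []
    for seq, start in enumerate(range(0, len(vector), max_payload)):
        chunk = vector[start:start + max_payload]
        packets.append(HEADER.pack(seq, len(chunk)) + chunk)
    return packets


def depacketize(packets):
    """Reassemble the vector on the die, tolerating out-of-order arrival."""
    chunks = {}
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        chunks[seq] = packet[HEADER.size:HEADER.size + length]
    # Sequence numbers must form 0..n-1 with no gaps.
    if sorted(chunks) != list(range(len(chunks))):
        raise ValueError("missing packet")
    return b"".join(chunks[seq] for seq in range(len(chunks)))


# Example: a 200-byte vector travels as four packets and is rebuilt intact.
vector = bytes(range(200))
packets = packetize(vector, max_payload=64)
rebuilt = depacketize(list(reversed(packets)))
```

Carrying a sequence number in each packet lets the receiving die rebuild the vector regardless of arrival order and detect a gap in the sequence.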
Besides the UCIe die-to-die interface, the common denominator that enables all these test and reliability enhancement features is a test, repair and monitoring fabric that can connect all the internal blocks. The test, repair and monitoring fabric spans the various dies in the multi-die system, providing a structured hierarchical infrastructure that achieves the following important functions:
Synopsys provides a comprehensive and scalable multi-die system solution, including EDA and IP, for fast heterogeneous integration. For secure and reliable die-to-die connectivity, Synopsys offers a complete UCIe Controller, PHY and Verification IP solution. As part of the Synopsys SLM & Test Family, a complete UCIe Monitoring, Test and Repair (MTR) solution is available along with the STAR Hierarchical System (SHS) solution. The MTR solution includes signal integrity monitors for measuring signal quality on the UCIe lanes, BIST for self-test, and repair logic for redundant lane allocation, while the SHS solution serves as the connectivity fabric supporting the industry-standard IEEE 1687, IEEE 1149.1, and IEEE 1838 interfaces. This complete solution enables efficient and cost-effective health monitoring of UCIe during all phases of the silicon lifecycle, which is critical for reliable operation of multi-die systems.