From Known Good Die To Known Good System With UCIe IP

Ensuring the reliability of multi-die systems with UCIe test and repair.

Multi-die systems are made up of several specialized functional dies (or chiplets) that are assembled in the same package to create the complete system. Multi-die systems have recently emerged as a solution to overcome the slowing down of Moore’s law by providing a path to scaling functionality in the packaged chip in a way that is manufacturable with good yield.

Additionally, multi-die systems enable product SKU flexibility by scaling performance to match the needs of different market segments, allow the process node to be optimized per function by mixing and matching process nodes in the same product, and offer faster time to market and lower risk.

To enable higher die-to-die routing density and support higher bandwidth traffic between dies, package technology has evolved to create new, advanced packages, based on silicon interposers (with TSVs) or silicon bridges and, more recently, redistribution layers (RDL), fanouts and HD substrates.

A key aspect for the success of multi-die systems is the ability to ensure testability of the system in different phases of manufacture and assembly, as well as ensuring reliable operation in the field. Because they use extra assembly steps and more complex bumping and packaging technologies, multi-die systems require test and reliability procedures that go beyond what was state-of-the-art for monolithic designs.

The naked dies, and the package itself, should be pre-tested to ensure that all defective dies or packages are detected before they are assembled in a package. If a defective die is detected only after assembly, then the complete multi-die system must be scrapped with serious impact on cost. The process of testing naked dies is called Known Good Die (KGD) testing.
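As a rough illustration of why pre-testing matters, the sketch below estimates the fraction of assembled packages in which every die is good, with and without a KGD screen. The model and the names die_good_probability and package_yield are assumptions made for this example: defects are treated as independent, the package itself is taken to be fault-free, and the yield and coverage figures are hypothetical.

```python
def die_good_probability(die_yield: float, kgd_coverage: float) -> float:
    """Probability that a die reaching assembly is actually good.

    die_yield:    fraction of dies that are good before any screening.
    kgd_coverage: fraction of bad dies caught by KGD test (0.0 = no test).
    """
    bad = 1.0 - die_yield
    escapes = bad * (1.0 - kgd_coverage)  # bad dies that slip through
    return die_yield / (die_yield + escapes)


def package_yield(die_yield: float, kgd_coverage: float, num_dies: int) -> float:
    """Probability that all dies in one package are good (independent dies)."""
    return die_good_probability(die_yield, kgd_coverage) ** num_dies


if __name__ == "__main__":
    # Hypothetical numbers: 90% die yield, four dies per package.
    for coverage in (0.0, 0.99):
        print(coverage, round(package_yield(0.9, coverage, 4), 4))
```

Under these assumed numbers, a four-die package built from unscreened dies is good only about two thirds of the time, and each failing package takes its good dies with it.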

The assembly process itself varies with the selected packaging technology. For example, chip-first technologies, where dies are placed first and the interconnect is built on top of them, do not allow for “known-good-package” testing, potentially resulting in good dies being scrapped if the interconnect is faulty. Chip-last technologies, on the other hand, where the interconnect is built separately and dies are assembled on top of it, enable pre-testing of the package prior to assembly, reducing the probability of good dies being scrapped.

The multi-die system testability solution can be divided into several aspects:

  1. Test coverage of individual blocks within the die
  2. Test coverage of the individual dies (naked dies)
  3. Test of the assembled system (with die-to-die coverage)
  4. Access to the test fabric in naked dies
  5. Hierarchical access to test fabric after assembly

This article describes the benefits of a comprehensive testability solution that leverages UCIe IP to ensure multi-die system reliability.

DFT for the UCIe interface

High test coverage for the UCIe interface is achieved by implementing extensive testability features in the UCIe IP, so that defective dies are screened out at the naked die testing phase. Some of the features include:

  1. Scan chains covering all synthesized digital circuitry
  2. Dedicated block-specific BIST functionality
  3. Loopback built-in self-test (BIST) functionality covering the complete signal chain up to the IO pin
  4. Programmable pseudorandom binary sequence (PRBS) and user defined test patterns generators and checkers
  5. Error injection to eliminate false passes
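The interplay between a PRBS generator, a checker and error injection can be sketched in a few lines. This is a minimal software model, not the UCIe IP implementation: it assumes the common PRBS7 polynomial x^7 + x^6 + 1, a checker that shares the transmitter's seed, and hypothetical helper names (prbs7, generate, inject_errors, count_errors).

```python
from itertools import islice


def prbs7(seed: int = 0x7F):
    """Yield PRBS7 bits (x^7 + x^6 + 1) from a 7-bit LFSR."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, or the LFSR locks up")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def generate(num_bits: int, seed: int = 0x7F) -> list:
    """Transmit side: produce num_bits of the pattern."""
    return list(islice(prbs7(seed), num_bits))


def inject_errors(bits: list, positions: list) -> list:
    """Flip the bits at the given positions to model injected errors."""
    corrupted = list(bits)
    for pos in positions:
        corrupted[pos] ^= 1
    return corrupted


def count_errors(received: list, seed: int = 0x7F) -> int:
    """Receive side: compare against a locally generated reference."""
    reference = islice(prbs7(seed), len(received))
    return sum(r != e for r, e in zip(received, reference))


if __name__ == "__main__":
    tx = generate(254)
    print(count_errors(tx))                           # clean loopback
    print(count_errors(inject_errors(tx, [3, 100])))  # two injected errors
```

Error injection guards against false passes: if the checker still reports zero errors after bits have been deliberately flipped, the checker itself (or its connection to the data path) is broken.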

In addition, functionality that extends coverage to the die-to-die link after package assembly can help achieve a high level of test coverage, including:

  1. Far side (die-to-die) BIST loopback functionality
  2. Die-to-die link BIST
  3. 2D eye margining to analyze marginalities
  4. Per lane test and repair functionality
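To illustrate the idea behind 2D eye margining, the sketch below sweeps the sampling point across phase and voltage offsets, records pass or fail at each point, and reports the eye width and height through the nominal sampling point. Everything here is hypothetical: the link_passes callback stands in for a hardware BIST run at a given offset, and toy_link is an invented elliptical eye used only to exercise the code.

```python
from typing import Callable, Dict, Iterable, Tuple

Grid = Dict[Tuple[int, int], bool]


def sweep_eye(link_passes: Callable[[int, int], bool],
              phase_offsets: Iterable[int],
              voltage_offsets: Iterable[int]) -> Grid:
    """Run the pass/fail check at every (phase, voltage) offset."""
    voltages = list(voltage_offsets)
    return {(p, v): link_passes(p, v) for p in phase_offsets for v in voltages}


def eye_opening(grid: Grid, axis: int) -> int:
    """Contiguous passing steps through the (0, 0) sampling point.

    axis=0 measures the width (phase), axis=1 the height (voltage).
    """
    def point(step: int) -> Tuple[int, int]:
        return (step, 0) if axis == 0 else (0, step)

    if not grid.get(point(0), False):
        return 0
    count = 1
    for direction in (1, -1):
        step = direction
        while grid.get(point(step), False):
            count += 1
            step += direction
    return count


def toy_link(phase: int, voltage: int) -> bool:
    """Invented stand-in for a lane: passes inside an elliptical eye."""
    return (phase / 8) ** 2 + (voltage / 4) ** 2 < 1


if __name__ == "__main__":
    grid = sweep_eye(toy_link, range(-10, 11), range(-6, 7))
    print(eye_opening(grid, axis=0), eye_opening(grid, axis=1))
```

A lane whose measured opening is much smaller than that of its neighbours is marginal even if it passes a plain BIST run, which is the kind of marginality this feature is meant to expose.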

UCIe test and repair

Advanced packages enable high density routing with fine pitch micro-bumping and routing on silicon or RDL interposers. During the assembly process, some micro-bump connections may not be well formed and may break down. UCIe offers the ability to test and repair these connections after assembly in a way that recovers the potential yield loss.

UCIe test and repair is executed during production test and at link initialization. In the test phase each individual link is checked for defects at slow speed. Defective links are repaired by re-routing data to spare links that are pre-defined by the UCIe standard.

UCIe configurations targeting advanced packages include up to 8 spare pins per direction (TX and RX) to enable repair of all the functional links:

  1. Four spare pins for data pin repair, 2 pins for each group of 32 data pins
  2. One spare pin for clock and track pin repair
  3. Three spare pins, one each for valid pin, sideband data pin and clock data pin repair
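The data pin repair can be pictured as a remapping of logical lanes onto the physical pins that passed the test. The sketch below is a simplified model for one group of 32 data pins with 2 spares. It skips over defective pins and shifts the remaining lanes toward the spares, which is an assumption made for illustration; the exact remapping rules are defined by the UCIe specification, and repair_group is a hypothetical name.

```python
from typing import Dict, Iterable, Optional


def repair_group(bad_pins: Iterable[int],
                 data_pins: int = 32,
                 spare_pins: int = 2) -> Optional[Dict[int, int]]:
    """Map logical lanes 0..data_pins-1 onto working physical pins.

    Physical pins 0..data_pins-1 are the nominal data pins and the
    following spare_pins indices are the spares. Returns None when
    there are more defects than spares (the group cannot be repaired).
    """
    total = data_pins + spare_pins
    bad = set(bad_pins)
    if any(p < 0 or p >= total for p in bad):
        raise ValueError("bad pin index outside this group")
    good = [p for p in range(total) if p not in bad]
    if len(good) < data_pins:
        return None
    # Logical lanes keep their order; each one takes the next good pin.
    return dict(zip(range(data_pins), good))


if __name__ == "__main__":
    lane_map = repair_group([5])
    print(lane_map[4], lane_map[5], lane_map[31])  # lanes at and after pin 5 shift by one
    print(repair_group([1, 2, 3]))                 # three defects, two spares
```

The returned mapping plays the role of a repair signature for the group: both ends of the link need the same mapping so that transmitter and receiver agree on which physical pin carries which logical lane.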

The test and repair execution occurs when there is no valid traffic on the die-to-die link. After repair is complete and the link is initialized, the link is assumed to be good and able to carry traffic without problems. The resulting PHY configuration, called the PHY repair signature, is stored in internal registers at both ends of the link.

Degradation of microbump characteristics during operation, due to aging or other causes, may impact link performance. This will be detected at the protocol level by an increase in bit-error rate (BER) or, at worst, by data being lost. In that case, the link is expected to be interrupted and a new test and repair step carried out.

However, some applications have stringent requirements in terms of continuity of traffic on the die-to-die link – they cannot tolerate interruption of traffic during operation. For these cases, a testability solution adds Signal Integrity Monitors (SIM) to each UCIe receiver pin.

Fig. 1: Link repair using built in spare links.

Signal integrity monitors

SIMs are small blocks embedded in the receiver. During normal operation they continuously sense the signal at the receiver pin to identify variations in signal characteristics that can impact link performance or indicate that the link is no longer healthy and may break in the near future.

The data gathered by the individual sensors is collected in a Monitoring, Test and Repair (MTR) controller, outside of the interface, for further processing. Aggregating the data from multiple UCIe links can provide instant insights into the health of the multi-die system and enable predictive maintenance of links.

If a specific link is predicted to be at risk of malfunction through this procedure, it can be disabled and data re-routed to one of the spare links, leveraging the UCIe PHY repair mechanism, even without traffic interruption.
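One way to picture the prediction step is a controller that compares each lane's recent monitor readings against the readings taken just after initialization and flags lanes that have drifted too far. The sketch below is an assumption-laden illustration, not the MTR algorithm: the metric (an eye-margin style figure where higher is better), the thresholds and the name lanes_at_risk are all hypothetical.

```python
from statistics import mean
from typing import Dict, List


def lanes_at_risk(history: Dict[int, List[float]],
                  baseline_samples: int = 3,
                  recent_samples: int = 3,
                  max_drop: float = 0.2) -> List[int]:
    """Return lanes whose recent readings fell more than max_drop
    (as a fraction) below their own early-life baseline."""
    flagged = []
    for lane, samples in sorted(history.items()):
        if len(samples) < baseline_samples + recent_samples:
            continue  # not enough data to judge this lane yet
        baseline = mean(samples[:baseline_samples])
        recent = mean(samples[-recent_samples:])
        if baseline > 0 and (baseline - recent) / baseline > max_drop:
            flagged.append(lane)
    return flagged


if __name__ == "__main__":
    readings = {
        0: [100, 101, 99, 100, 100, 98],   # stable lane
        1: [100, 100, 100, 80, 70, 60],    # steadily degrading lane
        2: [100, 90],                      # too few samples so far
    }
    print(lanes_at_risk(readings))
```

A flagged lane would then be a candidate for re-routing to a spare through the PHY repair mechanism before it actually fails.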

Fig. 2: Health monitoring solution for UCIe links.

Accelerating wakeup time

While the traffic pattern for most die-to-die interface use cases, for example in server splitting or scaling, is assumed to be stable during operation, in some use cases traffic may exhibit a bursty behavior. In such cases, it is desirable to bring the interface into a low power mode to save power while there is no traffic. Link re-initialization can be accelerated by avoiding the test and repair process and relying on the UCIe PHY repair signature that was created during the previous PHY initialization.

This concept can be further extended to situations where the die is completely powered down. In these cases, the PHY repair signature is retrieved from the PHY and stored in on-die non-volatile memory (eFuse or flash). The memory could store multiple signatures, covering different use cases or conditions, for additional user flexibility.
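A minimal sketch of saving and restoring a repair signature follows. The storage format is invented for this example (a CRC32 followed by a JSON payload), as are the names pack_signature and unpack_signature; a real design would use whatever register layout the PHY defines. The point it shows is the fallback: if the stored signature does not check out, the link runs the full test and repair flow instead of trusting it.

```python
import json
import zlib
from typing import Dict, Optional


def pack_signature(lane_map: Dict[int, int]) -> bytes:
    """Serialize a logical-to-physical lane map with a CRC32 prefix."""
    payload = json.dumps(sorted(lane_map.items())).encode("ascii")
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack_signature(blob: bytes) -> Optional[Dict[int, int]]:
    """Return the stored lane map, or None if the blob is corrupted.

    None means: do not take the fast path, run full test and repair.
    """
    if len(blob) < 4:
        return None
    stored_crc = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored_crc:
        return None
    return {int(lane): int(pin) for lane, pin in json.loads(payload)}


if __name__ == "__main__":
    # Hypothetical store holding one signature per operating condition.
    store = {"nominal": pack_signature({0: 0, 1: 2, 2: 3})}
    print(unpack_signature(store["nominal"]))
```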

Accelerating die testing with UCIe

Test time is an expensive commodity. It is possible to reduce test time by partitioning the test strategy hierarchically so that tests of different dies run in parallel. The hierarchy can be extended across multiple dies in a multi-die system by connecting the test infrastructure of the individual dies hierarchically. Such an approach allows access to all the dies in the multi-die system from a single JTAG (or similar) test interface in the main die.
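The benefit of running die tests in parallel can be estimated with a simple scheduling sketch. It assumes each die's test is one indivisible job, that at most max_parallel dies can be tested at once (a stand-in for power or tester limits), and it uses a greedy longest-first heuristic; the die names and times are hypothetical.

```python
import heapq
from typing import Dict


def schedule_time(die_times: Dict[str, float], max_parallel: int) -> float:
    """Total test time when up to max_parallel dies are tested at once.

    Greedy heuristic: take the longest remaining test and give it to
    the slot that becomes free earliest.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    slots = [0.0] * max_parallel  # finish time of each parallel slot
    heapq.heapify(slots)
    for duration in sorted(die_times.values(), reverse=True):
        earliest_free = heapq.heappop(slots)
        heapq.heappush(slots, earliest_free + duration)
    return max(slots)


if __name__ == "__main__":
    times = {"cpu": 40, "io": 10, "mem0": 25, "mem1": 25}  # arbitrary units
    for n in (1, 2, 4):
        print(n, schedule_time(times, n))
```

With one slot the total is the sum of all die test times; with a slot per die it drops to the longest single die test, which is the best case a hierarchical, parallel strategy can reach.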

Often, test time is limited by the time needed to load test vectors into the dies and read results back. JTAG interfaces can become a speed bottleneck. To overcome this limitation, designers can use existing high-speed interfaces, such as PCI Express (PCIe) or USB, as interfaces to the test equipment. Test vectors and commands are packetized for that interface and depacketized on the die during the production test phase.

Many dies do not have such a high-speed interface. However, the UCIe die-to-die interface can be used during test to transport large test vectors and commands between dies at high speed. The UCIe die-to-die interface extends high-speed DFT access across the complete multi-die system without increasing the number of pins, which is particularly important for IO- and area-limited dies.
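A back-of-the-envelope comparison shows why the transport matters. All figures below are hypothetical and chosen only to make the arithmetic easy to follow: a serial test port shifting one bit per clock at 50 MHz, and a die-to-die link with 16 lanes at 4 Gb/s per lane and 80% payload efficiency after packetization. The function name load_time_seconds is likewise invented for this sketch.

```python
def load_time_seconds(vector_bits: float,
                      lanes: int,
                      bits_per_second_per_lane: float,
                      efficiency: float = 1.0) -> float:
    """Time to move vector_bits over a link with the given raw rate.

    efficiency is the fraction of raw bandwidth left for payload after
    packet headers and other protocol overhead.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    payload_rate = lanes * bits_per_second_per_lane * efficiency
    return vector_bits / payload_rate


if __name__ == "__main__":
    vectors = 8e9  # 8 Gbit of test data (hypothetical)
    serial_port = load_time_seconds(vectors, lanes=1, bits_per_second_per_lane=50e6)
    d2d_link = load_time_seconds(vectors, lanes=16,
                                 bits_per_second_per_lane=4e9, efficiency=0.8)
    print(f"serial port: {serial_port:.1f} s, die-to-die link: {d2d_link:.3f} s")
```

Real numbers depend on the tester, the link configuration and the packet format, but the ratio between a single slow serial port and a wide high-speed link is what drives the test-time saving.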

Summary

Besides the UCIe die-to-die interface, the common denominator that enables all these test and reliability enhancement features is a test, repair and monitoring fabric that can connect all the internal blocks. The test, repair and monitoring fabric spans the various dies in the multi-die system, providing a structured hierarchical infrastructure that achieves the following important functions:

  1. Manages the testing of the individual dies in the multi-die system
  2. Optimizes test scheduling to reduce test time
  3. Supports the high-speed test access across the dies, via the UCIe interface
  4. Collects information from the health monitoring interfaces embedded in the UCIe interface and enables further system-level processing
  5. Manages the storage of the PHY repair signature in a non-volatile memory
  6. And more

Synopsys provides a comprehensive and scalable multi-die system solution, including EDA and IP, for fast heterogeneous integration. For secure and reliable die-to-die connectivity, Synopsys offers a complete UCIe Controller, PHY and Verification IP solution. As part of the Synopsys SLM & Test Family, a complete UCIe Monitoring, Test and Repair (MTR) solution is available along with the STAR Hierarchical System (SHS) solution. The MTR solution includes signal integrity monitors for measuring signal quality on the UCIe lanes, BIST for self-test, and repair logic for redundant lane allocation, while the SHS solution serves as the connectivity fabric supporting the industry-standard IEEE 1687, IEEE 1149.1, and IEEE 1838 interfaces. This complete solution enables efficient and cost-effective health monitoring of UCIe during all phases of the silicon lifecycle, which is critical for reliable operation of multi-die systems.
