Enabling Seamless Monitoring, Test, And Repair In Multi-Die Designs

Fixing HBM and UCIe interconnects in the field.

By Yervant Zorian and Sandeep Kumar Goel

Anyone who follows the semiconductor industry knows that the accelerating performance, scale, and energy efficiency demands of the AI revolution are outpacing the advances achievable by simply pushing the performance of monolithic, single-die designs. Multi-die design using 2.5D and 3D technologies has emerged as a necessity to keep pace with innovation. For all their benefits, multi-die designs present new challenges. This post considers monitoring, test, and repair of multi-die designs.

Any device in which multiple independent dies are contained within a single package is a multi-die design. This has several advantages over monolithic system on chip (SoC) designs. Product variants for specific markets can be assembled quickly using existing dies (often called chiplets). Designers can combine dies from very different technologies, save cost by using the latest node only for the most speed-critical chiplets, and save redesign by scaling only digital chiplets while reusing existing analog and mixed-signal designs.

Multi-die designs shorten time to market (TTM) and reduce project risk by reusing proven dies whenever possible. Die-to-die connectivity within a single package also provides better throughput than chip-to-chip connectivity over a printed circuit board (PCB). Performance may also be better than a large monolithic SoC since communicating horizontally or vertically within a multi-die package may be faster. Much more functionality is possible in the same footprint and system power is reduced. With all these advantages, why isn’t everyone moving to multi-die design immediately?

Unfamiliarity with the 2.5D or 3D design process is one reason, and the increased thermal density requires advanced thermal management. But perhaps one of the biggest challenges of multi-die design is monitoring, test, and repair.

Typically, test refers to manufacturing test, broken into stages as shown in the figure below. Traditionally, wafers are probed so that individual dies can be tested, and the known good dies are selected for packaging. The full chip is tested before it is shipped to the integrator, who puts it on a PCB and runs system-level test with software and hardware together.

As this diagram shows, multi-die design adds a stage to the test process. After all the individual dies are tested, they are assembled into 2.5D or 3D stacks. The stacks are then tested so that only known good stacks are packaged. Final test and system-level test work the same as for traditional 2D chips. When test failures occur, repair may be possible. Many chiplets and stacks have some built-in redundancy, such as inter-die interconnects with spare lanes, that can be leveraged for repair purposes. Of course, repair can only be done if the test process can detect and diagnose the failure.
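To make the spare-lane idea concrete, here is a minimal sketch of a repair planner. It is a generic, simplified model and not the actual UCIe or HBM repair procedure, whose lane counts and remapping rules are defined by the respective specifications. The function name `plan_lane_repair`, the lane layout (primary lanes first, spares last), and the lane counts are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def plan_lane_repair(
    num_lanes: int, num_spares: int, failed: List[int]
) -> Optional[Dict[int, int]]:
    """Map each logical lane to a working physical lane, or return None.

    Hypothetical layout: physical lanes 0..num_lanes-1 are the primary
    lanes, and the next num_spares physical lanes are the spares.
    """
    bad = set(failed)
    # Physical lanes that passed test, in ascending order (spares last).
    good = [p for p in range(num_lanes + num_spares) if p not in bad]
    if len(good) < num_lanes:
        return None  # more failures than spares: not repairable
    # Shift logical lanes past the failed ones, preserving lane order.
    return {logical: good[logical] for logical in range(num_lanes)}


# Example: 8 data lanes, 2 spares, physical lane 3 diagnosed as failing.
remap = plan_lane_repair(num_lanes=8, num_spares=2, failed=[3])
print(remap)
```

In this example, logical lanes 3 through 7 each shift up by one physical lane, so one spare is consumed and one remains. The planner depends entirely on the `failed` list, which mirrors the point above: repair is only possible when test can detect and diagnose the failing lane.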

This is an issue since many of the chiplets in a multi-die stack are not directly accessible for test and are “hidden” behind the chiplets that have direct connections to the package inputs and outputs. Test content increases non-linearly as the number of chiplets in a stack grows. One possible access mechanism is the IEEE 1838 standard, which defines a test access architecture for 2.5D and 3D designs. It supports slow-speed test of inter-die interconnects across all test stages. For high-speed interconnects using die-to-die PHYs, test access is provided through the PHYs using specialized IPs, such as the Synopsys UCIe Monitoring, Test & Repair (MTR) IP.

Another wrinkle of multi-die design is the need for test and, when possible, repair in the field. Many of these multi-die designs are used in artificial intelligence or safety-critical applications where chiplet or interconnect failures could be catastrophic. Thus, an in-field test and repair solution is needed. In addition, monitoring silicon health to detect signs of silicon aging or other indications of potential upcoming failure in chiplets and interconnects enables proactive maintenance to avert disaster. Various types of sensors and monitors can be embedded in the chiplets in accordance with effective silicon lifecycle management (SLM) principles.
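As a rough illustration of how monitor data could drive proactive maintenance, the sketch below fits a straight line to periodic margin readings and estimates how many sampling intervals remain before the margin reaches a limit. This is an assumed, deliberately simple trend model and not a description of how any SLM product analyzes data. The function name, the readings, and the threshold are hypothetical.

```python
from typing import List, Optional


def intervals_until_threshold(
    margins: List[float], threshold: float
) -> Optional[float]:
    """Estimate remaining sampling intervals before margin hits threshold.

    Fits a least-squares line to equally spaced readings. Returns None
    when there are too few readings or the margin is not shrinking.
    """
    n = len(margins)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(margins) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(margins))
    slope = sxy / sxx
    if slope >= 0:
        return None  # flat or improving: no crossing predicted
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # sample index at crossing
    # Intervals left after the most recent reading (index n - 1).
    return max(0.0, crossing - (n - 1))


# Hypothetical timing-margin readings (arbitrary units), one per interval.
readings = [100.0, 98.0, 96.0, 94.0, 92.0]
remaining = intervals_until_threshold(readings, threshold=80.0)
print(remaining)
```

A maintenance policy could schedule repair or reconfiguration once the estimate drops below some chosen lead time. Real degradation is rarely linear, so a production flow would likely need a more robust model.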

The overall scope of multi-die monitoring, test, and repair is quite broad:

  • Chiplets (wafer sort and pre-bond stages)
    • Test access: probing dedicated/shared test port
    • Quality/yield: SLM for known good die
  • Interconnects (mid/post-bond stages)
    • Die-to-die algorithmic test and incremental repair
    • In-field signal integrity monitoring, self-test, and repair
  • Multi-die design (post-bond stage)
    • System-level test: stack/package test management
    • Reliability: in-field SLM for degradation monitoring
    • Availability: in-field reconfiguration and repair
    • Serviceability: remote debug and diagnosis

Addressing this scope requires the advanced process technology for single-die implementation and the advanced multi-die packaging technologies provided by TSMC, together with a suite of multi-die IPs and tools available from Synopsys. The Synopsys SLM IPs for multi-die interconnects, namely SLM EXTRAM and UCIe MTR, support the two most common inter-die connection mechanisms, High-Bandwidth Memory (HBM) and Universal Chiplet Interconnect Express (UCIe), respectively. Both standards feature redundant lanes for reconfiguration and repair, including in the field. In addition, Synopsys SMS IP provides test and repair for embedded memories in chiplets, Synopsys CDM IP provides memory performance monitoring capability, and Synopsys SLM High-Speed Access and Test (HSAT) IP enables adaptive high-bandwidth testing over functional interfaces, such as PCI Express (PCIe). Testing over functional interfaces shortens test time, lowers pin count, and reduces test hardware cost.

To demonstrate the combined solution in actual silicon, Synopsys and TSMC developed a demo multi-die vehicle as shown in the diagram below. It contained two identical dies, rotated with respect to each other for connection via UCIe. Each die was fabricated in TSMC’s N3P (3nm Performance-enhanced) FinFET process with 15 metal layers plus aluminum pads, and the two dies were packaged using TSMC CoWoS-S packaging. Synopsys IP included UCIe PHY, UCIe MTR controller, SLM Signal Integrity Monitor (SIM), SMS, CDM, HSAT, ARC CPU, and an adapter from the JTAG test interface to the internal AXI bus (JTAG2AXI).

The grayed-out blocks in the “hidden” Die 1 were not active since the only external access to the multi-die design was through the JTAG interface on Die 0. The goal of the demonstration vehicle was to show that the monitoring, test and repair architecture could fully access Die 1 from Die 0. The development team put this multi-die design through its paces in the lab, focusing on three main use cases:

  • UCIe SIM monitoring
    • Bring up UCIe link at high rate
    • Enable traffic generators
    • Enable SIM to start monitoring data for clock timing
    • Read out SIM results via MTR and assess signal integrity
  • Drive logic test patterns into Die 1 and collect results
    • Bring up UCIe link at low rate (4 Gbps)
    • Load CPU code into memory via SMS
    • Scan with HSAT and check for expected results
    • Use external JTAG master (tester) to read the status register
  • Drive memory SMS patterns into Die 1 and collect results
    • Similar process to previous use case
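Each use case above is an ordered sequence in which a later step only makes sense if the earlier ones succeeded. A minimal sketch of that control flow follows. The step names come from the first use case, but the runner, the `ok` and `fail` placeholder actions, and the pass/fail convention are hypothetical; a real flow would issue commands to the hardware through the tester rather than call stubs.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_sequence(steps: List[Step]) -> Optional[str]:
    """Run steps in order and stop at the first failure.

    Returns the name of the failing step, or None if all steps pass.
    """
    for name, action in steps:
        if not action():
            return name
    return None


# Placeholder actions standing in for real hardware accesses.
def ok() -> bool:
    return True


def fail() -> bool:
    return False


sim_use_case: List[Step] = [
    ("bring up UCIe link at high rate", ok),
    ("enable traffic generators", ok),
    ("enable SIM monitoring", ok),
    ("read out SIM results via MTR", ok),
]

first_failure = run_sequence(sim_use_case)
print("all steps passed" if first_failure is None else first_failure)
```

Reporting the first failing step, rather than a single pass/fail flag, gives a starting point for diagnosis when bring-up stalls partway through a sequence.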

This project met its goal for successful silicon bring-up of a multi-die test demonstrating use of the high-bandwidth UCIe die-to-die communication, in addition to IEEE 1838 for slow-speed die-to-die access. The scope covered interconnect test and repair, logic test and diagnosis, memory built-in self-test (BIST) and repair, and monitoring die-to-die signal integrity.

TSMC and Synopsys are the industry’s leaders in multi-die design enablement, offering cutting-edge tools, IPs, and technologies. This demonstration vehicle showed that the combination of the two companies provides a proven, unrivaled solution. Much more information on this project and its results was presented in a recent webinar, with a recorded version available at https://www.synopsys.com/webinars/multi-die-test-monitoring-flows.html.

Sandeep Kumar Goel is an academician and senior director at TSMC.


