New technologies for AI designs present a significant challenge to the design-for-test process.
Traditional processors designed for general-purpose applications struggle to meet the computing demands and power budgets of artificial intelligence (AI) and machine learning (ML) applications. Several semiconductor design companies are now developing dedicated AI/ML accelerators that are optimized for specific workloads so that they deliver much higher processing capability at much lower power consumption (performance per watt). These accelerator designs are typically massive, containing billions of gates, highly parallel architectures with thousands of replicated processing units (PUs), and large amounts of distributed on-chip memory connected over a highly optimized network for high throughput. Each of these PUs can further contain multiple cores and tightly integrated local memories. An example of such an accelerator from a leading AI chip company may have features like the one shown in figure 1. To achieve design scalability and a faster development cycle, AI SoC designers are using an abutted design method in which multiple instances of identical blocks (PUs or cores) are connected through their physical layout using a standardized interface, without any top-level routing.
Fig. 1: AI chip architecture with Processing-Unit (PU) and local memory (LM).
The local memories within these processing units (PUs) allow for power-efficient, low-latency operations on local data. However, conventional DRAM technology is unable to provide the bandwidth and capacity required for parallel access and processing of large amounts of external data. As a result, many AI SoCs are deploying novel memory technologies such as high bandwidth memory (HBM) or hybrid memory cube (HMC), based on advanced 2.5D or 3D packaging, which provide much higher data throughput and capacity. These AI accelerators are eventually integrated into AI systems that process large workloads in a distributed fashion, where accelerators communicate using high-bandwidth interfaces such as PCIe present on the chip. More recently, designers have started leveraging these interfaces for high-bandwidth test. Several applications of AI accelerators, such as automotive, demand high levels of reliability, security, and functional safety throughout the lifecycle of the silicon. Silicon Lifecycle Management (SLM) technologies address these challenges by adding features into the silicon that gather and monitor SoC data through all phases, from design and manufacturing to test and in-field deployment.
The size, complexity, and the adoption of new technologies for AI designs described earlier present a significant challenge to their design-for-test (DFT) process. In general, this challenge can be addressed by focusing on two main functions: adopting an effective test methodology and defining an efficient DFT architecture. The DFT architecture is defined based on the test methodology to achieve test goals which include fast DFT sign-off, minimized test time, high test coverage and efficient diagnosis. The following sections discuss the details and factors to be considered for these two functions.
Hierarchical test methodology is ideal for AI designs due to their massive design size and replicated architecture. Hierarchical test has two major advantages over a flat test methodology. First, it employs a divide-and-conquer approach, dividing the design into smaller hierarchical partitions for quick DFT sign-off, which includes DFT insertion, test-mode setup, pattern generation, and verification. Deploying a flat test methodology at the top level of a large AI design is not practical. Second, since AI designs consist of replicated blocks, DFT sign-off is performed only once for each unique block at a given level and reused for all other instances. Replicating and integrating the signed-off block, followed by the parent's own DFT sign-off, completes the DFT implementation at the parent level. The same approach of hierarchical block sign-off can be followed up to chip level if the AI design contains multiple levels of hierarchy, as shown in figure 2.
Fig. 2: Hierarchical test methodology enables hierarchical DFT sign-off and replication for accelerated DFT sign-off of the design.
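The reuse argument above can be made concrete with a toy count. The sketch below is purely illustrative: the block names and instance counts are hypothetical, not taken from any real design.

```python
# Hypothetical sketch: sign-off effort with hierarchical reuse.
# Block names and instance counts are illustrative, not from a real design.

def signoffs_flat(hierarchy):
    """Flat methodology: every instance is handled individually."""
    return sum(hierarchy.values())

def signoffs_hierarchical(hierarchy):
    """Hierarchical methodology: each unique block type is signed off
    once, then replicated for all of its instances."""
    return len(hierarchy)

# design: block type -> total instance count across the chip
design = {"PU": 64, "ParentBlock": 8, "ChipTop": 1}

print(signoffs_flat(design))          # → 73
print(signoffs_hierarchical(design))  # → 3
```

With replication factors in the hundreds or thousands, as in the accelerators described earlier, the gap between the two counts grows accordingly.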
Hierarchical test enables faster DFT sign-off and maximizes reuse; however, the DFT architecture of the design still needs to be established, which will dictate the DFT logic implementation details. Different AI chips have different requirements and constraints for DFT. Essentially, the DFT architecture for an AI chip can be broadly formulated by determining a few key components, discussed in the following sections: the hierarchical sign-off blocks, the scan test data delivery mechanism, and the test configuration (test-setup) mechanism.
When following the hierarchical methodology for a design with multiple levels, simply designating either the lowest-level block or a higher-level block as the first hierarchical sign-off block may not be ideal. Designers need to weigh the DFT implications on the design (area, power, timing, etc.) together with the test requirements to determine the sign-off block levels. Looking at the example AI design in figure 2, DFT sign-off at the lowest hierarchical level (the PUs) could result in excessive area overhead, routing congestion, and unnecessary test modes that add to test time without significant benefit. On the other hand, designating the chip level for DFT sign-off would cause longer pattern generation times, longer test times, larger tester memory requirements, routing congestion, higher power, and so on. It is essential to find a middle ground where the impact of DFT is minimized and the test goals are achieved. In figure 2, the Parent Block could be the first hierarchical sign-off block instead of the PU. Within the sign-off block, designers would then determine the DFT configurations for logic test and memory test, again based on multiple factors such as test time, power, physical design, and so on. Figure 3 shows some examples of such design-dependent DFT configurations for the Parent Block (the codec is the scan test compression logic and SMS is the memory BIST test controller).
Fig. 3: Example DFT configurations for hierarchical sign-off block. (i) one codec and one SMS for the entire block. (ii) one codec for each PU but one SMS for testing all memories in the block. (iii) one codec for each PU, one SMS for block memory and one SMS for testing all memories in PUs.
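The test-time side of this trade-off can be sketched with a first-order model. The sketch below is an illustration only: the PU count, pattern count, and scan shift length are hypothetical, and it deliberately ignores the opposing area cost of adding more codecs.

```python
# Hypothetical first-order test-time model for codec-sharing choices
# like those in Fig. 3. All numbers are illustrative.

def scan_test_cycles(num_pus, patterns, shift_len, codecs_in_block):
    """Codecs run in parallel; PUs that share a codec are tested serially,
    so test time scales with the number of PUs per codec."""
    pus_per_codec = -(-num_pus // codecs_in_block)  # ceiling division
    return pus_per_codec * patterns * shift_len

NUM_PUS, PATTERNS, SHIFT = 8, 10_000, 200

shared = scan_test_cycles(NUM_PUS, PATTERNS, SHIFT, codecs_in_block=1)
per_pu = scan_test_cycles(NUM_PUS, PATTERNS, SHIFT, codecs_in_block=NUM_PUS)

print(shared)  # → 16000000  (one codec shared by the whole block)
print(per_pu)  # → 2000000   (one codec per PU, tested in parallel)
```

A real analysis would weigh this cycle-count reduction against the added codec area, routing, and test power of the finer-grained configurations, which is exactly the balancing act the text describes.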
With simpler designs, test requirements were less stringent and the DFT logic had minimal impact on the design. As designs became more complex, as with AI designs, the DFT logic needed to maintain test quality and cost grew significant enough to impact several factors such as test time, test power, physical design, DFT planning time, and scalability. This required designers to develop innovative solutions to meet design and test goals. In this context, the following sections look at the requirements and progression of test compression, test data delivery, and test configuration mechanisms based on IEEE 1687 and IEEE 1500.
When it comes to scan test data delivery, static pin assignment has been the traditional default approach, as shown in figure 4. It involves connecting the input and output pins of the codecs in the blocks to top-level scan pins. Since AI designs contain replicated sign-off blocks, the same input test data can be broadcast to test multiple blocks in parallel, subject to power constraints. This reduces both the required test pin count and the test time. Where chip-level scan pins are limited, top-level multiplexing of codec outputs would also be implemented. Such an architecture, however, lacks flexibility and suffers from several drawbacks, chief among them the complex, block-count-dependent routing and custom pipelining of codec signals to the top level, the extra top-level multiplexing logic, and poor scalability for abutted designs.
To mitigate these issues, test-bus-based data delivery mechanisms have been developed. Some of the initial test-bus-based solutions use existing codecs and connect them to a new test-bus that delivers the data to the PUs. A local controller between the codec and the test-bus manages the data interfacing between the two. Instead of complex codec input/output signal routing, the same test-bus passes through each block, providing a standardized interface at the block boundaries. This eases the physical design of the chip immensely and provides scalable, easy-to-implement test-data delivery for abutted designs by avoiding block-count-dependent custom pipelining and signal routing. The local controller also avoids the need for top-level multiplexing by providing the flexibility, through pattern generation, to test or bypass blocks in order to manage test time and test power.
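The test-or-bypass flexibility of the local controller is typically used to group identical blocks under a power budget. The sketch below models that scheduling decision; the block counts and power numbers are hypothetical.

```python
# Hypothetical sketch: grouping identical blocks on a shared test-bus
# under a power budget. Block counts and power figures are illustrative.

def schedule_groups(num_blocks, power_per_block, power_budget):
    """Greedy grouping: test as many identical blocks in parallel as the
    power budget allows; the remaining blocks are bypassed until the
    next pass of the same patterns."""
    per_pass = max(1, power_budget // power_per_block)
    passes = -(-num_blocks // per_pass)  # ceiling division
    return per_pass, passes

parallel, passes = schedule_groups(num_blocks=64,
                                   power_per_block=3,
                                   power_budget=24)
print(parallel, passes)  # → 8 8  (8 blocks per pass, 8 passes)
```

Because the grouping lives in the patterns rather than in top-level multiplexers, a different power budget or block count only changes the pattern set, not the silicon.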
While this approach addresses most of the issues mentioned earlier, it still requires determining the correct input and output pin counts for each codec to minimize test data volume, and it cannot fully leverage the capabilities of the test-bus to achieve a further level of test time reduction. This is mainly because the codec and the test-bus are developed independently of each other. Newer test-bus-based solutions consist of a sequential-compression codec, a test-fabric, and a fabric-socket, which are developed and optimized together to provide lower test data volume and higher test time savings. They achieve this through a higher degree of block test overlap, which distributes the test bandwidth efficiently. Figure 4 shows the progression of scan test data delivery mechanisms.
Fig. 4: Progression of scan test data delivery mechanism from static pin assignment to co-optimized test-fabric and sequential compression technology. SEQ stands for sequential compression codec and FS is fabric-socket.
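The bandwidth-distribution benefit of block test overlap can be illustrated with a toy model. The sketch below is an idealization under hypothetical numbers: it compares a fixed per-block lane assignment against a fabric that can redistribute freed bandwidth to blocks that are still shifting.

```python
# Hypothetical sketch of block test overlap. Test-data volumes and lane
# counts are illustrative; the fabric case is an idealized upper bound.

def static_time(volumes, lanes_per_block):
    """Static assignment: each block keeps its fixed share of lanes, so
    the session lasts as long as the slowest block."""
    return max(v / lanes_per_block for v in volumes)

def fabric_time(volumes, total_lanes):
    """Idealized fabric: bandwidth freed by finished blocks is reassigned,
    so total time approaches total volume over total bandwidth."""
    return sum(volumes) / total_lanes

volumes = [100, 40, 20, 20]        # test-data volume per block
print(static_time(volumes, 2))     # → 50.0
print(fabric_time(volumes, 8))     # → 22.5
```

The more uneven the per-block test data volumes, the more the static scheme is held hostage by its largest block, and the more a co-optimized fabric has to gain.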
The two key benefits of a test-fabric with sequential compression are lower test data volume and greater test time savings, achieved through a higher degree of block test overlap and efficient test bandwidth distribution.
After determining the hierarchical sign-off blocks and the corresponding logic test and memory test implementation details, another important component of the DFT architecture is the test configuration, or test-setup, mechanism. This mechanism is usually a serial network based on the IEEE 1687 and IEEE 1500 standards and is used to configure the test logic, which includes programming the test data registers of codecs, clock controllers, memory test controllers, in-system test controllers, and so on. Like logic test and memory test, the test-setup architecture also needs to address the challenges mentioned earlier; the key requirements in this case are an efficient architecture for reducing test time, easier physical design, and abutted design support. Test power is generally not an issue here because only a handful of design flops toggle during the process, unlike logic scan test or memory test. Due to the inherently serial nature of these networks, the test-setup operation is much slower than scan test, which increases test time significantly and can easily become the bottleneck for large designs. An AI design can require the same configuration for hundreds of replicated blocks, and using a serial method to send the same data to all the blocks would not be ideal. Broadcasting the data to identical cores would significantly reduce the test-setup time in this scenario. However, if the AI chip uses the abutted design methodology, broadcasting signals would cause logic implementation and physical design issues, whereas a serial bus can pass from one block to the next without needing top-level routing. While both IEEE 1687 and IEEE 1500 support serial and broadcast networks, IEEE 1687 lends itself more easily to serial implementation and the IEEE 1500 network is more convenient for data broadcasting.
An example AI chip is shown in figure 5. It assumes an abutted design and contains a central block with a test-access-port (TAP) from which test-setup data is distributed to a number of replicated identical PUs. Such a design benefits from leveraging the strengths of both 1500 and 1687 to reduce test-setup time and support the abutted design style simultaneously. The main controller broadcasts the test-setup data to the five PU columns over the 1500 network, but the data shifts serially within a column over a 1687 ring network due to the daisy-chaining of abutted PUs. Sub-controllers generate the control signals internally in each PU, instead of broadcasting them from the main controller, which suits abutted PUs. In this example, the test-setup time savings would potentially be 4x. Many industry designs contain abutted blocks within abutted blocks, and designers can follow a similar approach hierarchically.
Fig. 5: Broadcasting test setup to PU-columns over 1500 network and daisy-chaining PUs within a column over 1687 network to support abutted PU connections.
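The shift-time arithmetic behind this hybrid scheme can be sketched as follows. The column count, PUs per column, and register length below are hypothetical, and the model is idealized: it predicts savings equal to the column count, whereas a real design sees somewhat less (the text's "potentially 4x" for five columns) once control and TAP overhead are included.

```python
# Hypothetical sketch of the test-setup arithmetic for a broadcast-over-
# columns, serial-within-column network. All parameters are illustrative.

def serial_setup_cycles(columns, pus_per_column, bits_per_pu):
    """Fully serial baseline: one long 1687 chain through every PU."""
    return columns * pus_per_column * bits_per_pu

def broadcast_setup_cycles(columns, pus_per_column, bits_per_pu):
    """1500 broadcast feeds all columns at once, so shifting is bounded
    by a single column's 1687 ring (column count drops out)."""
    return pus_per_column * bits_per_pu

cols, pus, bits = 5, 10, 400
print(serial_setup_cycles(cols, pus, bits))     # → 20000
print(broadcast_setup_cycles(cols, pus, bits))  # → 4000
```

Because the saving scales with the number of broadcast targets, the same idea applied hierarchically, to abutted blocks within abutted blocks, compounds at each level.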
AI chips are large and complex, which makes it challenging to devise a DFT and test strategy that achieves the desired test goals with fast turnaround time and minimal design impact. A hierarchical test approach is a must for such designs, but deploying it with traditional DFT methods would yield sub-par test results for AI chips. Designers must carefully plan the entire DFT architecture for logic test, memory test, and test setup while considering multiple factors such as test time, test power, physical design impact, the abutted design method, and scalability. The current generation of test-bus-based mechanisms is being adopted to replace static multiplexing for efficient data delivery, especially in large designs. However, these are limited in how much the test-bus and compression codec can leverage each other. Newer architectures consisting of a co-optimized sequential-compression codec, test-fabric, and fabric-socket have been shown to deliver higher levels of test volume reduction and test time reduction.
Part II will focus on Silicon Life Cycle Management (SLM) and testing of AI chip package features.