Tackling growing chip size, rising test cost, and much more complexity.
Large digital integrated circuits are becoming harder to test in a time- and cost-efficient manner.
AI chips, in particular, have tiled architectures that are putting pressure on older testing strategies due to the volume of test vectors required. In some cases, these chips are so large that they exceed reticle size, requiring them to be stitched together. New testing efficiencies are needed to accommodate these devices.
“We have a lot of customers doing large chips, or with interest in large chips,” said Robert Ruiz, director of product marketing, digital design group at Synopsys. “This creates a couple of challenges. One is that ATPG [automatic test-pattern generation] becomes magnified. The second is runtime on the tester.”
Tile-based chips create their own challenges. They require different approaches to ensure that only good products are shipped. But they also provide an opportunity for reducing the amount of test data required. New test architectures allow test vectors to be created more efficiently and to be scanned into chips at much higher speeds.
“One of our key challenges is helping our customers to reduce final test costs without sacrificing quality,” said Prasad Bachiraju, director of sales and customer solutions for software at Onto Innovation.
Building on existing infrastructure
Internal chip testing was forever transformed by the addition of JTAG (IEEE 1149.1) a few decades ago. Originally intended to provide connectivity testing between components on a board, JTAG also allowed for internal chip testing through its INTEST mode. Many chips leveraged that, creating multiple internal scan chains that could be controlled via the JTAG test access port, or TAP. Data is serially loaded into those scan chains. That data then can be transferred into circuits to generate a response, with the outputs being captured and scanned out serially.
The original JTAG function ran both test data and control information through that port. Frequency was modest, because there was no desire to have test circuitry compete with application circuitry for routing resources. This limited the test data bandwidth. In addition, there was the nagging fact that JTAG wasn’t really intended at the outset to be used for internal chip testing, INTEST notwithstanding.
Two changes occurred that boosted what could be done with internal scan-chain testing. First was the addition of IEEE 1687, commonly referred to as internal JTAG, or IJTAG. It provides a more comprehensive set of control capabilities for setting up and managing the tests of various internal blocks.
IJTAG enabled the use of a separate set of pins for loading data more quickly. No one wants to dedicate pins to testing, so by multiplexing functional pins for testing, one can choose a number of pins to use for loading data without commanding any physical footprint. Care must be taken, however, to ensure this muxing doesn’t affect the performance of any high-speed or analog signals.
Next came the “Design-for-Test,” or DFT, era. Testing no longer was a side process. Tests and infrastructure were synthesized, placed, and routed alongside the application logic.
DFT involves test compression, which works by compressing the test inputs as stored on a tester. As the compressed data is scanned into the device, it’s decompressed and sent to the various scan chains for execution. The outputs then are hashed together to create a signature, and that signature can be verified as correct or not. In addition, many more pins can be used to scan the data in at rates faster than the JTAG TAP would allow. The higher bandwidth plus the reduced data footprint reduces the time it takes to execute tests.
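The signature step can be sketched in a few lines of Python. This is only an illustration, assuming a multiple-input signature register (MISR), which is one common way to compact scan outputs. The source does not say which compaction scheme any vendor uses, and the register width, feedback taps, and the function name `misr_signature` are hypothetical choices.

```python
def misr_signature(responses, width=8, taps=(0, 2, 3, 4)):
    """Fold a sequence of scan-out words into one signature.

    responses: iterable of ints, one `width`-bit word per shift cycle.
    width and taps are illustrative values, not taken from any standard or tool.
    """
    mask = (1 << width) - 1
    state = 0
    for word in responses:
        # Feedback bit is the XOR of the tapped state bits.
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift left by one, insert the feedback bit, then XOR in the new word.
        state = (((state << 1) | feedback) & mask) ^ (word & mask)
    return state


good = [0x3A, 0x5C, 0x00, 0xFF]
bad = [0x3A, 0x5C, 0x01, 0xFF]  # single-bit error in the third word

golden = misr_signature(good)          # signature of a known-good response
print(misr_signature(good) == golden)  # True: matching response passes
print(misr_signature(bad) == golden)   # False: this error changes the signature
```

A compactor of this kind reports only pass or fail, and it can alias: some faulty responses can map to the golden signature, which is one reason diagnostics need more than the signature alone.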
One other new standard, IEEE 1500, was created to deal with the pervasive use of IP in designs. By standardizing a test wrapper for IP, it allowed IP providers to include both the circuits and the test infrastructure in their deliverables, decoupling them from the rest of the circuit and making them portable and reusable.
That created a landscape consisting of three standards – the base JTAG, along with IJTAG, and IEEE 1500 – plus proprietary implementations of test compression. The control path is used by all vendors, but the data path normally works only for a single vendor.
“There was an effort by some people to decouple the compression hardware from the pattern generation to use compression hardware from vendor A and pattern generation from vendor B,” said Geir Eide, product marketing director for Tessent at Mentor, a Siemens Business. “So there is a standard for how to describe compression circuitry, but nobody’s using it.”
New challenges: AI and massive SoCs
Developments in the scale of ICs being designed, as well as some architectural changes, have conspired to put more pressure on the test system. Of concern are the amount of time it takes to generate test patterns, the amount of memory required to store those patterns, and the amount of time it takes to execute tests. All of that increases the overall cost.
“We certainly are seeing the cost of tests increase,” said Alan Liao, director of product marketing at FormFactor. “When you get down to 7nm and 5nm, the pitch of the bumps you need to test is shrinking, and the power or the features or the functions that you need to test are increasing because people are packing more transistors inside of each processor. And because this technology is so new, you probably need to over-test it. So you put a lot of tests into the first wafers. That drives up the cost of tests. And the cost of the probe cards is going up, too.”
The sheer size of these new chips, coupled with the reduction in feature size, creates a daunting problem. “Some SoC designs had multiple cores, meaning four, eight cores,” said Ruiz. “But now we’re seeing in the range of tens to hundreds.” Each of these blocks is the same, but using the traditional approach, each must be loaded with its test pattern – even if it’s the same as all of the other test patterns. This can cause the size of the test pattern footprint to explode.
The second challenge is the way some of these arrays of computing blocks are arranged. IEEE 1500 wraps each block, so there’s a hierarchy that puts the full-chip at the top, with each block hooking into the full-chip infrastructure below it. That works if each block is independent of the other blocks, with functional signals exiting one block and moving to another through full-chip resources.
But these blocks are now being “abutted.” That means there are no chip-level resources between them. Signals flow out of one and directly into the next one, leaving no room for wrappers between blocks. If testing is done using traditional scan chains, those chains would have to traverse the entire row of blocks.
Using existing data-scanning approaches would mean scanning identical versions of the compute-block data through all of the blocks before test could be executed. While testing frequencies have increased, they tend to top out around the 100 MHz level. As compared to many of the higher-speed I/Os on advanced chips, this becomes a bottleneck when loading the data.
Re-using test data
One solution is to add a side port to allow each block to be directly loaded and unloaded from a central scan bus. The inputs no longer have to traverse a long chain to get into place. This is a departure from past architectures, which were all serial scan-based. Blocks now can be loaded in parallel.
Fig. 1: The top shows traditionally wrapped blocks in a scan chain. The middle shows several blocks abutted, with no space between. The bottom shows a bus-like structure with a “side port” into the blocks. Source: Bryon Moyer/Semiconductor Engineering
There’s still some inefficiency here, because the automatic test-pattern generation (ATPG) algorithms would need to create and store the test data for each identical block separately. A traditional approach to the bus would have one block load its data from the side, followed by the next one, and so on until all of them were loaded. This adds both to test generation time and test execution time.
Having multiple identical blocks helps with test generation. “Since these [blocks] are logically the same, you can run the ATPG engine on one, replicate it, and then re-use it at the sub-system and the chip level,” said Ruiz.
Data delivery is improved by adding a broadcast capability to the bus. That allows a single test data pattern to be loaded into multiple blocks.
Fig. 2: The top shows five sets of identical data that will be scanned in, with each set of data settling into its respective core once the scanning is complete. The bottom shows a single set of data broadcast to all blocks. Source: Bryon Moyer/Semiconductor Engineering
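The difference between the two halves of Fig. 2 can be put into rough numbers. The sketch below is a simplified model, not a description of any tool: it ignores compression and the overlap of loading and unloading, and the chain length and pattern count are made-up values chosen only for illustration.

```python
def shift_cycles(num_blocks, chain_length, broadcast):
    """Shift cycles needed to load one pattern into every identical block."""
    # Serial scanning pushes one copy per block down the chain;
    # broadcast loads all selected blocks with a single copy.
    return chain_length if broadcast else num_blocks * chain_length


def stored_bits(num_blocks, chain_length, num_patterns, broadcast):
    """Stimulus bits the tester must hold for the identical blocks."""
    copies = 1 if broadcast else num_blocks
    return copies * chain_length * num_patterns


BLOCKS = 5          # matches the five cores in Fig. 2
CHAIN_LENGTH = 1000 # hypothetical bits per block
PATTERNS = 200      # hypothetical pattern count

serial = shift_cycles(BLOCKS, CHAIN_LENGTH, broadcast=False)
shared = shift_cycles(BLOCKS, CHAIN_LENGTH, broadcast=True)
print(serial, shared)  # 5000 1000
print(stored_bits(BLOCKS, CHAIN_LENGTH, PATTERNS, broadcast=False))  # 1000000
print(stored_bits(BLOCKS, CHAIN_LENGTH, PATTERNS, broadcast=True))   # 200000
```

In this model both load time and tester memory shrink by the number of identical blocks, which is why the savings grow as core counts move from a handful to hundreds.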
Even though there may be many identical blocks, however, not all blocks will be identical. Simply broadcasting test inputs to all blocks would present inappropriate data to blocks for which the data was not intended. This is where IJTAG helps: it allows blocks to be selected for input. “The bus goes to all the blocks,” said Satish Ravichandran, product engineering architect in the digital and signoff group at Cadence. “And then you have a mechanism to select which block you want.”
Prior to broadcasting data, IJTAG would identify which blocks should listen for data. They’ll receive the broadcast data. Any other blocks will ignore it. “We test the identical blocks at one timeframe, then we go to the non-identical blocks,” added Ravichandran.
Even with large numbers of identical blocks, it still may not be possible to load them all and test them at the same time. “With multiple blocks selected, it’s possible that you could burn your chip because of power,” he said.
So, it may be necessary to break the set of identical blocks into groups that can be tested separately, even if they use the exact same test data. “Some customers may test only these cores at one time, and these other cores at a different time, because they have a power issue during test,” said Ruiz.
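The select-then-broadcast sequence, including the power-driven grouping, might look like the following sketch. Everything here is hypothetical: the block names, the per-core power figure, the budget, and the functions `power_groups` and `broadcast` are illustrative stand-ins, not IJTAG commands or any vendor's API.

```python
def power_groups(core_ids, power_per_core_mw, budget_mw):
    """Split identical cores into groups whose combined test power fits the budget."""
    per_group = budget_mw // power_per_core_mw
    if per_group < 1:
        raise ValueError("budget cannot cover even one core")
    return [core_ids[i:i + per_group] for i in range(0, len(core_ids), per_group)]


def broadcast(pattern, all_blocks, selected):
    """Deliver one pattern to every selected block; unselected blocks ignore the bus."""
    return {block: pattern for block in all_blocks if block in selected}


cores = ["core0", "core1", "core2", "core3", "core4"]
all_blocks = cores + ["other_block"]

# Assumed numbers: 300 mW per core under test, 1000 mW allowed at once.
groups = power_groups(cores, power_per_core_mw=300, budget_mw=1000)
print(groups)  # [['core0', 'core1', 'core2'], ['core3', 'core4']]

for group in groups:
    # The same stored pattern is reused for each group, one group at a time.
    loaded = broadcast("core_pattern_0", all_blocks, selected=set(group))
    print(sorted(loaded))
```

The pattern is stored once and sent once per group, so splitting for power costs extra bus time but no extra tester memory.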
If power permits, it may be useful to allow patterns generated for one block to be loaded into another unrelated block, as well. “In ATPG, we have this concept of ‘testing by accident,’” said Ravichandran. Even though tests haven’t been structured for that block, they may “accidentally” provide coverage that keeps the vector count down.
Fig. 3: The top shows test data broadcast to all blocks, but only the first three have been selected via IJTAG. The bottom shows that same packet then being sent to the two remaining cores, as well as the other block. While the test data may have been generated specifically for the core, the other block may also get some extra coverage “by accident.” Source: Bryon Moyer/Semiconductor Engineering
Decoupling test-data bus and scan speeds
Another bandwidth improvement could come from increasing the test clocking frequency. But doing so would risk having internal block scan chains become critical timing paths, further complicating timing closure. So the internal test frequency remains around 100 MHz.
Instead, the frequency of the scan bus is being decoupled from the internal test scanning frequency. “One of the important elements for fully utilizing the bandwidth is a network or fabric that will interface between the fast data coming in and the slower scan chains at the block level,” said Ruiz. That puts a rate-matching circuit at the interface to each block, taking high-speed serial data and parallelizing it within the block for actual test execution.
Fig. 4: A simplistic view of a bus whose frequency is decoupled from the testing frequency. An added block provides both rate matching and other control (like block selection). Source: Bryon Moyer/Semiconductor Engineering
While the preceding figure provides a stylized, generalized view, the bus may be continuous or it may be segmented so that values scan along the bus. Mentor’s Streaming Scan Network (SSN) is one example of the segmented structure. This arrangement will result in data being scanned into and out of the blocks at the same time.
Fig. 5: Mentor’s Tessent SSN structure. The blue bus is the data plane, carrying the data in (on the left of each block) and out (on the right of each block) from block to block. The green scan bus is IJTAG. Source: Mentor, a Siemens Business
“For every bus cycle where a block picks up one or more bits from the bus, the same number of bits are always put back onto the bus,” said Eide. “For the first several cycles (corresponding to the first scan load), everything that’s unloaded from the scan chains and put back on the bus is garbage, but that’s really the same as in traditional scan.”
The details of the Tessent control block are shown below. IJTAG control registers are on the bottom; access to the test scan chain is on top, along with various control signals. The bus goes from left to right. The bus register must be able to buffer enough bits to account for the difference between the bus clock and the scan clock. For instance, if the scan clock is 100 MHz and the bus clock is 400 MHz, then the bus register must be able to buffer four bits.
Fig. 6: An example control block. The clouds represent compression and decompression. IJTAG is at the bottom; the scan chain is at the top. Source: Mentor, a Siemens business
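The buffering arithmetic above (a 400 MHz bus feeding 100 MHz scan chains) can be checked with a short sketch. It assumes one bit arrives per bus cycle for a given scan input, which is a simplification; the function names are hypothetical and do not come from any tool.

```python
import math


def buffer_depth(bus_clock_mhz, scan_clock_mhz):
    """Bits that accumulate on the bus side during one scan clock cycle."""
    return math.ceil(bus_clock_mhz / scan_clock_mhz)


def deserialize(bus_bits, ratio):
    """Group bits arriving one per bus cycle into one word per scan cycle."""
    words = []
    buffer = []
    for bit in bus_bits:
        buffer.append(bit)
        if len(buffer) == ratio:
            # A full buffer is handed to the slower scan side in one step.
            words.append(tuple(buffer))
            buffer = []
    return words


ratio = buffer_depth(bus_clock_mhz=400, scan_clock_mhz=100)
print(ratio)  # 4

stream = [1, 0, 1, 1, 0, 0, 1, 0]  # eight bus cycles of data
print(deserialize(stream, ratio))  # [(1, 0, 1, 1), (0, 0, 1, 0)]
```

Eight bus cycles yield two scan cycles of work, so the block sees a four-bit-wide input at the scan clock while the bus keeps running at its own rate.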
The size of the input bus also may be further decoupled from the width of the test-data packets in each block, which will not necessarily be the same for all blocks. That means that data positioning for each block on the bus will rotate with each clock cycle. Test pattern generation will need to perform the necessary bookkeeping to ensure that the various bits go where intended, because with each clock cycle that assignment will change.
Fig. 7: In this example from Mentor, the packet size is nine bits – four for Core B and five for Core A. The bus is only eight bits wide, so the packets wrap from one cycle to the next, causing them to rotate with each cycle. Source: Mentor, a Siemens business
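The bookkeeping behind this rotation can be sketched as a position calculation. The packet sizes and bus width come from the figure; the ordering of Core B before Core A within a packet, and the function `bus_slots`, are assumptions made for illustration rather than details of Mentor's implementation.

```python
def bus_slots(packet_index, packet_layout, bus_width):
    """Map each core's bits in one packet to (bus_cycle, bus_lane) positions.

    packet_layout: ordered list of (core_name, bits_per_packet).
    """
    packet_size = sum(width for _, width in packet_layout)
    position = packet_index * packet_size  # offset in the continuous bit stream
    slots = {}
    for core, width in packet_layout:
        slots[core] = [divmod(position + i, bus_width) for i in range(width)]
        position += width
    return slots


layout = [("B", 4), ("A", 5)]  # nine-bit packet; B-then-A order is assumed
BUS_WIDTH = 8

for packet in range(3):
    first_cycle, first_lane = bus_slots(packet, layout, BUS_WIDTH)["B"][0]
    print(packet, first_cycle, first_lane)
# 0 0 0
# 1 1 1
# 2 2 2
```

With a nine-bit packet on an eight-bit bus, each packet starts one lane later than the previous one, and the alignment repeats after eight packets (72 bits, or nine bus cycles).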
Taking in more test data more quickly
Even with the added efficiencies, there’s still a growing demand for test input bandwidth. “The CPU needs about 2.5 Gb/s,” said Ruiz. “The bigger GPUs are operating at about 50 Gb/s. To maintain costs, we see the need for several hundred Gb/s.”
This now provides an opportunity to use some of the existing high-speed ports – parallel or serial – as the input pins for the test data to satisfy the growing bandwidth requirements. “The larger AI and GPU designs need more patterns, more data,” said Ruiz. “More data generally means more cost at the tester, but that can be addressed by using high-speed functional ports.”
In the case of a high-speed serial port, for instance, normal functional mode would implement both the physical layer (PHY) and higher protocol layers. “One way you can leverage the high-speed I/O pins is to reuse that whole stack,” said Eide.
The ability to leverage high-speed pins has been standardized as a high-speed test access port, or HSTAP, via IEEE 1149.10, and it doesn’t use the whole protocol. “The standard that we are working with, 1149.10, just reuses the physical layer,” said Eide. “That makes it more flexible across different types of high-speed I/Os.”
But tapping into the high-speed I/O circuitry must be done in a way that doesn’t impact the functional speeds. “The idea here is to be non-intrusive and not impact the high-speed I/O,” noted Eide.
Fig. 8: On the left, the entire high-speed I/O stack is shared between normal functionality and testing. On the right, only the PHY is shared. Source: Bryon Moyer/Semiconductor Engineering
One further challenge within blocks has been the distribution of compression logic. Test compression originally involved one set of compression circuits for an entire chip. That created a placement and routing challenge. Moving the compression into each block solved that concern. Even though the compression logic was replicated within each block, rather than once for the entire chip, the area impact was improved because routing became easier.
But even within a block, that routing can be an issue. The place-and-route algorithms want to put the test circuitry right in the middle, fanning the signals out to the different parts of the block. “There is a way to spread this decompressor and compressor logic such that it becomes more physically aware,” said Ravichandran. “That’s 2D physical compression.” Distributing that circuitry throughout the block – a step that has required some changes to the compression math – relieves the congestion.
Fig. 9: The left shows a block with centralized compression circuitry. The right shows compression distributed throughout the block. Congestion is greatly relieved, with no impact on wire length. Source: Cadence
These changes help both with ATPG and execution time. Each block can have its pattern generated independently. Those patterns are then migrated to the chip level, with identical blocks leveraging the single pattern generated for them. Data is stored in the tester more efficiently, and the high-speed capabilities allow tests to be set up more quickly.
There is, of course, always a tradeoff between coverage and test time. Each second is expensive on a chip tester. System test, however, tends to take longer, and so a few extra vectors at that point may not be so expensive. “Test engineers are saying, ‘I’ve only got so much chip test budget,’ and so they chop off their patterns,” said Ruiz. “Maybe these extra patterns can be run at system test to help boost quality.”
Tests also can be supplemented with other tests earlier in the flow. The key here is to be able to have continuity of results, and to be able to leverage the data at multiple steps. “We can link our test results not just on the production, but in an early phase,” said FormFactor’s Liao. “When they start to characterize an IC at an early phase, we can link to those test results and provide early feedback so customers don’t run into multiple iterations before they get the product out. If you can get the product ready earlier using less iterations, that helps improve the cost.”
The need for diagnostics
One final change to the architectures reflects the increasing sensitivity of designs done on advanced nodes. Traditionally, the results of chip scan tests are combined into a signature with a pass/fail result. “Designs using mature and established nodes, which tend to be lower-cost parts, use pass/fail tests,” said Ruiz. But for chips on aggressive nodes, engineers are looking for more diagnostic information so they can manage yields better. “With emerging nodes, there’s an interest in understanding why a part failed. Is there an opportunity to tweak the manufacturing process or the design?”
The need for diagnostics makes it possible to use pass/fail for routine devices, with offline tools handling further diagnostics. But the ability to pull higher-granularity data also sets up the possibility for adaptive tests. For instance, a device with 1,000 computing blocks that has some failing blocks might be sold instead as a 500-block partial, keeping the device in the revenue stream instead of scrapping it. “During the manufacturing process, if one of these blocks is defective, then that one can be turned off for manufacturing purposes, and the chipmaker can sell it as a lower-performance part,” said Ruiz.
So the newer architectures also provide this diagnostic data in addition to pass/fail results. “This lends itself to an architecture where the stimulus is [simultaneously] driven to identical blocks, but with a unique output for each block,” said Ruiz. Exactly which diagnostics are made available and specifically how they’re provided will depend both on the brand of DFT and on the application itself.
Fig. 10: In addition to the traditional signature-based pass/fail signals, more granular diagnostics are being made available on advanced nodes both for yield management and for adaptive testing. Source: Bryon Moyer/Semiconductor Engineering
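Per-block results are what make the partial-good scenario workable. The sketch below shows one way a binning decision could be derived from them. The 1,000- and 500-block configurations echo the example in the text, while the function `bin_device` and the rule of picking the largest configuration that fits are assumptions for illustration.

```python
def bin_device(block_results, bins=(1000, 500)):
    """Return the largest sellable block count the good blocks support, or None.

    block_results: iterable of booleans, True where a block passed its tests.
    bins: sellable configurations (hypothetical product definitions).
    """
    good = sum(1 for passed in block_results if passed)
    for size in sorted(bins, reverse=True):
        if good >= size:
            return size
    return None  # too few good blocks for any configuration


all_good = [True] * 1000
few_bad = [True] * 997 + [False] * 3
mostly_bad = [True] * 400 + [False] * 600

print(bin_device(all_good))    # 1000
print(bin_device(few_bad))     # 500
print(bin_device(mostly_bad))  # None
```

A single chip-level signature could only say that the second device failed; the per-block view shows it still has far more than 500 working blocks.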
Such diagnostics also start to blur the lines between testing – especially when running tests in a deployed system – and monitoring. The main difference is that test tends to occur at moments when the chip is paused from its main function, like at startup. Monitoring can deliver data while the chip is executing.
“During boot and when the system is initializing, you have the opportunity to run BiST logic,” said Gal Carmel, general manager for automotive at proteanTecs. “But if you are in operational mode, you need something non-intrusive, with high coverage, checking not only core logic but also interconnect.”
All of these enhancements are coming together to increase the efficiency (and the cost-efficiency) of testing chips as they become ever larger and more complex on ever more challenging silicon nodes. They arrive just in time for the wave of enormous chips being readied for release.
Dear Mr. Moyer,
I read the article with great interest. A couple of items should be noted. INTEST was doomed from the start, but that didn’t prevent a lot of JTAG accessible BIST and BISRs from being implemented. Anyone who has not studied IEEE 1149.1-2013 may not be aware of JTAG at all by today’s standards. The standard is 4 times as big and includes internal TDRs designed compliant to IEEE 1500 (it can describe things like an SoC with an HBM memory for instance), hierarchical register descriptions, segmented registers (including the boundary register), support for configuration of weird non-LVCMOS type pins, support for TDRs which cross IEEE 1801 type domains, PDL – Procedural Description Language and more. What brings its value is its very prescriptive requirements such that if you’re placing an order for IP or for an SoC you can specify it as IEEE 1149.1-2013 compliant and it means something. You’ll know what you are getting for test-ability. No other standard addresses “I/O” and a description language for registers connected to an I/O pad or a through silicon via (TSV).
I enjoyed reading about IEEE 1149.10 as I was chairperson and editor. It was designed to address increased bandwidth needed for chip test, create “virtual” scan in/outs to solve SI/SO pin limitations, allow distributed scan chain I/O to alleviate routing constraints (you can physically have HSTAP and PEDDAs in different parts of the SoC) and enable in-situ SoC testing via a small number of pins. It can alleviate the need for codecs to virtualize scan-channels or make trade-offs as needed for a given design. And as Mr. Eide has pointed out, it re-uses the mission mode PHY (SERDES, SPI, I2C, etc) and has its own protocol such that ATE can communicate via the protocol without being aware of PCIe, USB or a future standard. (Your block diagram is also essentially correct albeit lacking some details).
IEEE 1149.10 leverages all of the hierarchical register descriptions and PDL language of 1149.1-2013. And it’s not just for SERDES; other mission mode I/O like SPI and I2C are supported by the standard to be used as an HSTAP.
It’s a great start to better Design for Test!!
Bryon,
Whether it is physical testers and their limitations (pins supported, speeds, memory, digital vs analog, etc), testers will be a limiting factor and affect how chips are designed. Often the ‘effects’ are increased power/area consumed and potential impact on performance. Decades ago, we did a military turnkey design (from spec) that required self testing to be performed in x microseconds. No question, a challenging design but when completed, this unit was easy to test from a test environment all the way to this IC being integrated into its final product. At some point, more chips will need to migrate to self testing that requires minimal pins and can be performed at full speed cutting all costs associated with test.
It will be interesting to see when/if this takes hold and how Test HW and SW (EDA) are affected.