Speeding Up Scan-Based Volume Diagnosis

Where the bottlenecks are, and what can be done to eliminate them.


In the critical process known as new-product bring-up, it’s a race to get new products to yield as quickly as possible. But the interplay between increasingly complex designs and processes makes it difficult to find the root causes of yield issues so they can be fixed quickly.

Advanced processes have very high defectivity, and learning must be fast and effective. While progress has been made, there are still bottlenecks in the ability to run volume diagnostics on scan-based failures.

“It’s really high defectivity at the beginning of the ramp on these advanced nodes,” said Matt Knowles, director, operations product management at Siemens EDA.

The challenge lies in identifying and fixing those defects. “The whole goal is to try to hit your expected yield and then get into volume production,” said Guy Cortez, staff product marketing manager, silicon lifecycle management, digital design group at Synopsys. “If your yield is not where it needs to be, and you suspect something is going on with the product, how are you going to be able to resolve that? And how quickly can you resolve that?”

While yield is always important, it’s particularly so today with chip shortages plaguing the industry. “Capacity is so scarce, and squeezing every good die out of that wafer is important just to meet commitments in the market,” said Knowles. “In this global supply chain and chip crisis, it’s critical.”

EDA and test tools help to narrow down the range of root-cause candidates, but attention is still focused on collecting data and evaluating the different candidates in order to focus physical failure analysis.

The origin of scan-based failures
Integrated circuit tests have many components, but one prominent one is the use of scan chains for implementing deterministic logic tests. In contrast with self-tests, which are algorithmic, scan tests provide a way for submitting specific test vectors to ensure that the internal logic is working properly.

Since these vectors can be very large, they are compressed for storage in the tester. Once they’ve been delivered to the device by the tester, they are decompressed internally and executed. Due to the number of internal nodes, it’s impossible to route them all to the outputs, so the resulting vector is again compressed into a signature and sent to test outputs to be compared against an expected signature.
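Conceptually, the output-side compaction behaves like a multiple-input signature register (MISR) that folds each scan-out word into a running signature. The toy Python sketch below (the polynomial taps, register width, and data words are invented for illustration, not any real DFT implementation) shows why a single flipped response bit changes the final signature but can no longer be localized from it:

```python
# Toy MISR-style compactor: each captured scan-out word is folded
# into a running signature, so only the final value needs to be
# compared on the tester. Taps and width are arbitrary demo choices.
def misr_signature(words, width=16, taps=(0, 2, 3, 5)):
    sig = 0
    mask = (1 << width) - 1
    for w in words:
        # Feedback bit from the selected taps of the current state.
        fb = 0
        for t in taps:
            fb ^= (sig >> t) & 1
        # Shift, inject feedback, then XOR in the new response word.
        sig = ((sig << 1) | fb) & mask
        sig ^= w & mask
    return sig

good = [0x1A2B, 0x3C4D, 0x5E6F, 0x0001]
bad  = [0x1A2B, 0x3C4D, 0x5E6E, 0x0001]   # single-bit flip in word 3

expected = misr_signature(good)
print(misr_signature(bad) == expected)     # the flip is detected,
# but which cycle failed is no longer recoverable from the signature
```

The signature mismatch proves a failure occurred, which is exactly why diagnosis needs the richer fail-cycle data discussed below.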

Fig. 1: Input patterns are decompressed for internal scan chains. The results are re-compressed into an output signature. Source: Synopsys


As a result, this largely acts as a pass/fail test because the specifics of the failure are often lost in the compression. These days, additional results may be available to help identify specific failures, although effects still may remain confounded.

Once volume production starts, failures need to be logged and evaluated to determine which ones are most critical to improving yield. This typically can be done using a Pareto chart to identify priorities.

When scan-based failures rise to the top of the Pareto chart, there is an urgent need to analyze large volumes of failure data in order to identify the changes needed to eliminate those failure mechanisms.
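A Pareto view of failure bins takes only a few lines to build; the bin names and counts below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical per-die failure labels from volume diagnosis logs;
# the mechanism names and counts are illustrative, not real data.
fail_bins = (["scan_chain"] * 46 + ["bridge_M2"] * 27 +
             ["via_open"] * 15 + ["sram_bitcell"] * 8 + ["other"] * 4)

counts = Counter(fail_bins)
total = sum(counts.values())

# Pareto view: sort mechanisms by frequency and track the running
# share of all failures so the top contributors stand out.
cum = 0
for mech, n in counts.most_common():
    cum += n
    print(f"{mech:12s} {n:4d}  {100 * cum / total:5.1f}% cumulative")
```

The ranking makes the prioritization call concrete: the top one or two bars absorb most of the yield loss, so they get the diagnosis effort first.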

“You need a lot of data from a lot of devices, because often you are debugging not just your own latest design. You’re also debugging the process,” observed Michael Braun, product manager at Advantest.

This process isn’t one for use only when there seems to be a problem. With early process bring-up, it needs to be a way of life. “Volume diagnosis involves routine sampling, where you use it not just for a panic excursion, but routinely to track these mechanisms over time,” said Knowles.

Early ramp-up also will take more engineering involvement than might be needed for a mature process or device. “It takes design engineering capability to diagnose and fix an early device failure,” noted Eli Roth, smart manufacturing product manager at Teradyne. “There are analytics and learning tools that can look at this large volume of data and tell us that there’s a systemic issue. And then can you build a process around that which says, based on this learning, these are the things I do next.”

The analysis process
There are four major phases to diagnosing high-volume scan-based failures:

• Extracting information from the tester after the device has failed;
• Running automated software tools to narrow down the likely root-cause candidates;
• Further analysis to identify the one or two most likely causes; and
• Physical failure analysis to confirm the failure on known-failing devices.

Fig. 2: The diagnosis flow. Red items are potential bottlenecks; orange is automated and compute-bound. Source: Bryon Moyer/Semiconductor Engineering (with one element from Advantest)


Prior to physical failure analysis, the causes are only considered to be likely. It’s through physical confirmation that a root cause can be firmly established. But that physical analysis process is time-consuming and uses expensive equipment. Ideally, only one candidate would be submitted for verification. Failing that, one must narrow down the candidates to the absolute fewest possible.

That puts a large burden on the prior steps to effectively and accurately identify the best candidates for confirmation. Each of those phases contains potential bottlenecks that could be improved to speed up the overall process.

Getting data from the tester
If a device fails, there are two competing efforts. On one side, the need to maximize test throughput means that a failing device should be ejected as soon as possible so that a new device can be tested. From a production metrics standpoint, there is no value in keeping the probes down on a failing die beyond the time when the failure occurs.

Competing with this is the need for additional data to be collected to understand the failure. At the very least, data that already has been collected needs to be downloaded to some trove for later offline analysis. All of this takes time – time that could work against production metrics if not done carefully.

“The performance of getting the failed cycles back from the hardware and into the data log is as important as the performance of the rest of the test application,” noted Advantest’s Braun.

Tester companies have paid close attention to this tension, and for the most part they’ve been able to balance both requirements.

For new processes, it’s insufficient to stop testing after the first failing cycle. Additional cycles must be run. This is necessary both for understanding any logic failures and for confirming that the scan chains themselves are working properly.

“For logic and chain diagnosis, the first failing cycles will not tell you too much,” said Braun. “It gives you a rough idea, but it’s not nearly enough for diagnosis. What you typically need is at least one full unload of the scan chains.”

The situation is different for mature process nodes that aren’t in a ramp-up phase. In that case, it’s not usually necessary to capture multiple cycles beyond the initial failing vector. “The first failing cycle gives you enough to have a sense that there’s something systematically recurring too often,” said Klaus-Dieter Hiliges, platform extension manager at Advantest. “Then you may look into detail, but otherwise there’s no need for general diagnostics.”

Because additional cycles need to be run, downloads can happen in their shadow.

“The collection of these failing cycles is done automatically in the background, while the test stimulus is applied and the comparison to the expected data is being done,” explained Braun. “Before you run the pattern, you set some configurations to tell the hardware in which mode to acquire the pages. And so the acquisition of the data from the hardware doesn’t cost you anything, unless you totally overdo it in terms of the amount of failing cycles you collect. But people usually try to avoid that anyway, because otherwise you end up with gigantic log files, which are hard to process later on.”

This can be assisted by carefully partitioning the vector set into blocks. “You cut your scan pattern into chunks of 5 or 6 or 10 pieces, like bursts of patterns, and then you execute the first one,” he said. “And while you execute the second one, you upload the failures from the first one in the background and send them into the data log.”
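The burst scheme described above is a simple two-stage pipeline: execute burst i while the fails from burst i-1 upload in the background. A minimal Python sketch of that overlap, with hypothetical stand-ins for the tester operations (the function names and data are invented, not a real tester API):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for tester operations; names are invented.
def execute_burst(i):
    return f"fails-from-burst-{i}"        # pretend fail-cycle data

def upload_fails(data, log):
    log.append(data)                      # background transfer to data log

bursts = range(5)
log = []
with ThreadPoolExecutor(max_workers=1) as uploader:
    pending = None
    for i in bursts:
        fails = execute_burst(i)          # apply stimulus for burst i
        if pending is not None:           # previous upload must be done
            pending.result()
        # Kick off the upload of burst i's fails while burst i+1 runs.
        pending = uploader.submit(upload_fails, fails, log)
    pending.result()

print(log)  # every burst logged, uploads overlapped with execution
```

Because the upload of one burst rides in the shadow of the next burst's execution, the fail-cycle transfer adds little or nothing to overall test time, which is the point of the partitioning.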

The emphasis is on simply getting the data out of the devices and onto a workstation. Any further manipulation can be done as time permits. “Any processing of the data, like writing it into the standard STDF format, is a lower priority task on the workstation, and it happens when there’s time,” said Braun.

Multi-site testing
When using a multi-site tester, downloads may be hidden yet further. All of the sites are tested in lockstep, so when one site fails it can’t be ejected immediately and replaced with a new device. All of the sites must complete their testing before a new set of devices can be started. If the good devices need more testing after some of the devices fail, there is time for the data download while the remaining devices finish their tests.

The number of sites can have an impact here. “As a rule of thumb, if you test one or two devices in parallel, like a big AI chip, GPU, or NPU, it doesn’t really matter. Collect as much as you want,” said Braun. “If you go to a higher site count, like for mobile devices, you want to make sure you set a reasonable limit for the number of failing cycles to log, because the transfers from the hardware are pretty fast, but not infinite.”

While multiple sites tend to run their programs in parallel, it’s important that failing devices have their data downloaded without interrupting the passing devices, or forcing a download from them too. “If you have a 10-site tester and site 2 has a failure, you don’t want to do deep diagnostics on all 10 devices,” said Teradyne’s Roth. “You want to look only on site 2.”

And as more and more data is needed for diagnosis, the limits may push out in the future. “As we move from one generation to the next, we keep increasing this limit of how much can you capture without having any impact,” said Hiliges.

One situation that may require additional data to be gathered is when the compressed failure data is ambiguous as to where the failure occurred. Additional work may be needed to unconfound the results.

“Some diagnoses you can do with these compressive structures in place,” said Braun. “With others, it depends on the configuration. You may need to switch to a bypass mode and daisy chain all these internal scan chains in order to form one super-long chain to get better diagnostic resolution.”

Alternatively, special diagnostic vectors could be used to tease out better data. But those patterns must be generated ahead of time and stored if there’s room. “If you pre-generate the special diagnostic patterns, then during test execution you only have to select which special diagnostic pattern to run,” he said. “We kicked around the idea of generating the extra diagnostic patterns at run time, but that takes too much compute power.”

In some cases, an engineer may need to use a tester to confirm or further elaborate on a failure. This can be even more disruptive, because an entire tester needs to be taken offline for engineering work. It can be hard for engineers to get permission to use the precious production systems for non-production work – even if it would ultimately help yield.

“If you want diagnostics, you got to have more data about failures,” said Roth. “And that means testing more failing parts, which might mean more time or cost to test. The expert shows up, he’s available, and he’s processed all the data, but he’s got to get on to resources – that’s a logistics challenge.”

One way around this could be to create a digital twin of a wafer. “Then you just spend time enhancing your digital twin capability so that the insights and inferences and capabilities, and all the things those experts are doing, are on a twin rather than on an actual asset,” he added.

Identifying possible causes
Given appropriate failure data, EDA and other test tools can automatically take the failure information and push back up the “cone of influence” to identify a few likely ways in which the failure could have occurred.

“These tools come up with a list of suspects and a probability,” said Braun. “And as a rule of thumb, the more data you feed in, the better these probabilities get. But there is a threshold beyond which you will not get higher resolution.”

We’re often not at that point, however. “The industry is saying we need higher and higher resolution from these tools,” said Knowles. “And what that means is lower suspect numbers, lower ambiguity.”

These tools have been around for a while, but they still take time, and different approaches are being proposed for accelerating the process.

At this year’s ITC, a team from National Taiwan University [1] proposed an approach where, rather than analyzing every failing device, statistical methods could be used to generate a single “virtual” device that would stand for the group.

The idea is partly that each failure will have a number of characteristics unrelated to the specific failure – effectively, test noise. So the failures won’t all lie on top of each other if plotted. Instead, they’ll look more like a cluster. Identifying a “centroid” of that cluster would help to average out the unrelated characteristics, and would allow analysis of that single centroid rather than of every device in the cluster, saving valuable time.
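One simple way to picture the “virtual device” idea: given failing-position bit vectors from several devices in one cluster, a per-position majority vote keeps the shared failure and drops device-specific noise. The sketch below is an illustration of that intuition only, with invented data and a plain majority threshold, not the paper’s actual method:

```python
# Majority-vote "centroid" of a cluster of failure bit vectors.
# A 1 at position i means that device failed at observation i.
def virtual_failure(cluster):
    n = len(cluster)
    length = len(cluster[0])
    # Keep a position only if more than half the devices fail there.
    return [1 if sum(dev[i] for dev in cluster) * 2 > n else 0
            for i in range(length)]

cluster = [
    [1, 1, 0, 0, 1, 0],   # shared fails at positions 0, 1, 4 ...
    [1, 1, 0, 1, 1, 0],   # ... plus device-specific noise
    [1, 1, 1, 0, 1, 0],
]
print(virtual_failure(cluster))  # noise positions voted away
```

Diagnosis then runs once on the consensus vector instead of once per device, which is where the time savings come from.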

Another team, from Huawei [2], proposed using a neural network to classify failures. Feature engineering would involve both fault features and report features, allowing diagnosis reports to be ingested directly and root-cause candidates to be identified. While this could complement, or even replace, some of the current statistical approaches, it could be used in the next phase as well.

Narrowing down likely causes
The prior phase involves a fair bit of automation, but the next phase is more manual. Here, the candidates identified in the prior phase are further narrowed down — ideally to one most likely candidate.

Exactly how that process works will vary depending on the candidates put forth. And it typically involves expert engineering analysis, which is where it takes some time.

This is also an opportunity to improve the test set. “Modern tools can run the diagnosis on these failure files and then, based on that failure, they can iterate with the automatic test pattern generation (ATPG) tool and make the patterns a little more specific to this type of mode,” said Knowles.

Adding extra capability to the internal test logic also can help. “We’ve got a technology called ‘reversible scan chain,’” he said. “Instead of just sending patterns in from one side and out the other side, there’s a way where you can send the vector in and out in the reverse direction, so you can tell exactly where in the chain the defect came from. That increases resolution incredibly.”

A repeating failure mechanism may make further analysis unnecessary. “In some cases, they can look at that Pareto that’s created from the diagnosis and decide not to do failure analysis at all because they recognize a defect,” Knowles further noted. “They’ve seen it before in the population, and they know what the root cause is.”

It all leads to physical failure analysis
Physical analysis always has been a laborious process that requires delicate deconstruction of a chip to provide visual confirmation of a fault. While tools have improved over the decades, they still require heavy investment and skilled technicians and engineers to generate and evaluate the results.

Each candidate will require a lot of work to confirm physically. “That’s probably a $20 million equipment set being used for days,” observed Knowles.

For that reason, candidates must be specific not only about the suspected cause, but also the locations where the issue is likely to show up on the die. “If you’re going to take some candidates to failure analysis, you want to make sure that you give highly accurate locations where the failure analysis team needs to cut into the silicon,” said Cortez. “To try to get more accurate location, we can take some other types of data, like inline defectivity data or acceptance test data. If you bring those two (finding the area and finding the right candidates) together, that hopefully should reduce this whole effort.”
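The overlay Cortez describes can be pictured as a simple spatial cross-check: a diagnosis candidate whose layout location coincides with an inline-inspection defect hit is a much stronger FA target. The sketch below is purely illustrative; the net names, bounding boxes, and defect coordinates are all invented:

```python
# Hypothetical diagnosis candidates with layout bounding boxes
# (x1, y1, x2, y2) in microns; all values are invented.
candidates = {
    "net_a": (10.0, 5.0, 12.0, 6.5),
    "net_b": (40.2, 8.1, 41.0, 9.9),
    "net_c": (22.5, 30.0, 25.0, 31.2),
}
# Invented inline-inspection defect coordinates for the same die.
inline_defects = [(40.6, 9.0), (3.3, 17.8)]

def hits(box, pt):
    x1, y1, x2, y2 = box
    return x1 <= pt[0] <= x2 and y1 <= pt[1] <= y2

# Keep only candidates corroborated by an inline defect hit.
ranked = [net for net, box in candidates.items()
          if any(hits(box, d) for d in inline_defects)]
print(ranked)  # prints ['net_b']
```

Cross-checking the two data sources both trims the candidate list and hands the FA team precise coordinates for the cut, addressing both halves of the problem at once.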

It’s for that reason that there is so much emphasis on improving and accelerating the prior steps. Large improvements in narrowing down candidates can work alongside incremental improvements in physical analysis to ramp up new-product yields more quickly than is possible today. As long as new processes have high starting defectivity, priority will be placed on eliminating systematic failures ever faster.


1. M.-T. Wu et al., “Improving Volume Diagnosis and Debug with Test Failure Clustering and Reorganization,” National Taiwan University/Qualcomm, ITC 2021.
2. X. Huang et al., “Adaptive NN-Based Root Cause Analysis in Volume Diagnosis for Yield Improvement,” Huawei, ITC 2021.
