Pinpointing Timing Delays Can Improve Chip Reliability

Focus shifts to internal chip assessments of timing margin and changes who’s responsible for what.

Growing pressure to improve IC reliability in safety- and mission-critical applications is fueling demand for custom automated test pattern generation (ATPG) to detect small timing delays, and for chip telemetry circuits that can assess timing margin over a chip’s lifetime.

Knowing the timing margin in signal paths has become an essential component of that reliability. Timing relationships are central to the function of digital logic and high-speed I/Os, and they need to be well understood and planned for early in the design process. In fact, DFT/DFx-based methods have displaced the need for high-performance ATE-based measurements for detecting defects in high-speed I/O. As a result, test engineers today are generating very specific test vectors to screen for small timing delays caused by defects and premature circuit aging.

In the past few years, subtle timing delays in hardware have gained notoriety as one of the chief causes of silent data errors, as described by Meta and Google. [1,2] That has been compounded by extremely high-quality requirements for automotive chips, which require more effective and efficient testing. Timing is a key component in that testing. As part of achieving its 10- parts per billion failure rate, NXP now includes targeted tests to screen out subtle timing-related issues. [3]

“When you have critical applications, then of course you need to make sure that all possible means of ensuring the quality are in place, especially if you desire these very low failure rates,” noted Dieter Rathei, CEO of DR Yield.

These quality expectations create a greater interest in ATPG algorithms that can expose small delays.

“Given that many of these SDE (silent data error) defects seem to be context and data related, it seems like we are missing the critical paths or marginal paths in the testing process. A slack-based transition delay ATPG ends up with the best structured approach in the industry to find small delay defects within a device,” said Adam Cron, distinguished architect at Synopsys. “Slack-based transition delay ATPG guides the pattern generator to create a pattern for a fault through the least slack path. This is probably the highest-resolution manufacturing test available for structured, scan-based pattern generation.”

At advanced process nodes there also is a push to identify differences between the mission profile and actual field usage. This is exacerbated by accelerated circuit aging, which can slow signal paths, and it can be made even worse by the number of I/Os in connected devices. But the adoption of adaptive circuits in I/Os has changed the production test expectations for ATE timing measurement capability.

“DFT can and has relaxed external timing requirements during test. Emerging on-chip monitor technology gets DFT to a new level by providing a great opportunity for gathering accurate timing information for what seems like a wide range of targets – both for die-to-die (D2D) interfaces and for internal critical timing paths,” said Ed Seng, strategic marketing manager for advanced digital at Teradyne. “In these cases, ATE’s job is to facilitate the setup/execution of the tests, efficiently obtaining the monitor data, and likely executing very fast local edge machine learning processing algorithms based on these measurements.”

Measuring time
Measurement objectives determine what is important in assessing timing margin. For instance, when looking for delay defects the engineer typically applies a test vector that stresses the timing relationship. Technically, that doesn’t measure the time. Instead, the engineer chooses a pattern under which the signal is least likely to arrive by the next clock edge.

Measuring the margin in an actual path requires a specific number of transitions. An insufficient change in the data state impacts confidence in the margin measurement.

A relative measurement dictates different needs than an absolute measurement. With an absolute measurement, the precision represents the smallest increment of time, which can be nanoseconds or picoseconds. Measurement accuracy is the difference between actual and measured. And because a relative measurement assesses a change in the timing from the base measurement, the accuracy of the timing margin is less important.
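
The distinction can be illustrated with a minimal Python sketch. It assumes a hypothetical monitor whose readings carry a fixed, unknown offset (8 ps here) and a 1 ps resolution. Neither value comes from a real instrument. The absolute reading is wrong by the offset, while the change between two readings is not.

```python
def quantize(value_ps, step_ps):
    """Round a time value to the nearest measurement step (the resolution)."""
    return round(value_ps / step_ps) * step_ps


def read_margin_ps(true_margin_ps, offset_ps, step_ps):
    """Hypothetical monitor reading: true margin plus a fixed systematic offset."""
    return quantize(true_margin_ps + offset_ps, step_ps)


# Assumed values for illustration only.
OFFSET_PS = 8.0   # systematic error of the monitor (an accuracy problem)
STEP_PS = 1.0     # smallest time increment the monitor can report

baseline = read_margin_ps(60.0, OFFSET_PS, STEP_PS)  # reading at time zero
aged = read_margin_ps(52.0, OFFSET_PS, STEP_PS)      # reading after aging

absolute_error = baseline - 60.0  # the offset shows up in the absolute value
drift = aged - baseline           # the offset cancels in the difference

print(f"absolute error: {absolute_error} ps, measured drift: {drift} ps")
```

The true margin shrank by 8 ps, and the relative measurement reports that change correctly even though every absolute reading is off by the offset.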

In addition to precision and accuracy, a timing measurement methodology should include repeatability. With repeated measurement, slight variations in a flip-flop’s setup and hold times can result in different data being captured. To improve repeatability, averaging over several repeated measurements minimizes the impact of device noise that causes flip-flop timing variations.
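
As a rough illustration of why averaging helps, the following sketch models each measurement as the true margin plus Gaussian noise. The noise model, the 40 ps margin, and the 5 ps sigma are all assumptions made for illustration, not data from a real device.

```python
import random
import statistics


def measure_margin_ps(true_margin_ps, noise_sigma_ps, rng):
    """Hypothetical single measurement: true margin plus Gaussian device noise."""
    return true_margin_ps + rng.gauss(0.0, noise_sigma_ps)


def averaged_margin_ps(true_margin_ps, noise_sigma_ps, repeats, rng):
    """Average several repeated measurements of the same path."""
    samples = [
        measure_margin_ps(true_margin_ps, noise_sigma_ps, rng)
        for _ in range(repeats)
    ]
    return statistics.mean(samples)


def spread_of_estimates(repeats, trials=2000, seed=1):
    """Standard deviation of the reported margin across many trials."""
    rng = random.Random(seed)
    estimates = [
        averaged_margin_ps(40.0, 5.0, repeats, rng) for _ in range(trials)
    ]
    return statistics.stdev(estimates)


single = spread_of_estimates(repeats=1)
averaged = spread_of_estimates(repeats=16)
print(f"spread with 1 reading: {single:.2f} ps, with 16 readings: {averaged:.2f} ps")
```

Under this noise model the spread falls roughly with the square root of the number of readings, so 16 readings give about a quarter of the single-reading spread.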

Consider the measurement requirements for path margin monitors. “By going through all the delay elements, it searches for the margin in every path. Once you find them you set the monitor and wait for a change,” said Firooz Massoudi, solutions architect at Synopsys. Describing the repeatability needed with the margin measurements, he added, “We finalize the margin after we have actual field data, and we check the margin every millisecond, because that’s the best way to build a profile and track changes. And for an ATE it is recommended to make the same measurement two or three times. You want to capture the settings that are right on the clock edge.”

Manufacturing test and timing measurement
Previously the resolution and accuracy needed for ATE pin electronics cards had been on the order of 25 to 50 ps. To keep up with I/O data rates greater than 5 giga-transfers per second requires a resolution of 1 to 5 ps.
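
To put those numbers in context, a short calculation compares each resolution to the unit interval, which is the time allotted to one transfer. The sketch assumes one transfer per unit interval and uses 5 GT/s as the reference rate.

```python
def unit_interval_ps(transfers_per_second):
    """Time per transfer in picoseconds, assuming one transfer per unit interval."""
    return 1e12 / transfers_per_second


def fraction_of_ui(resolution_ps, ui_ps):
    """Share of the unit interval taken up by one resolution step."""
    return resolution_ps / ui_ps


ui = unit_interval_ps(5e9)  # 5 GT/s gives a 200 ps unit interval

for resolution_ps in (50, 25, 5, 1):
    share = fraction_of_ui(resolution_ps, ui)
    print(f"{resolution_ps:>2} ps resolution = {share:.1%} of a {ui:.0f} ps unit interval")
```

At 5 GT/s, a 25 to 50 ps step spans 12.5% to 25% of the unit interval, while a 1 to 5 ps step spans 0.5% to 2.5%.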

“From an ATE point of view, DFT and features like timing leveling (i.e., tuning timing relationships) for DDR interfaces have long obsoleted the need to do at-speed measurements on high-speed I/Os for SoC devices in volume production,” said Teradyne’s Seng. “While ‘standard’ ATE digital channels will go faster over time, frequency is only a small part of the story. The timing accuracy needed to test skew between different I/Os on a parallel bus is only achieved through very specialized instrument design, as well as rigorous calibration of the combined errors of instruments, interface boards, and device sockets. All of this would cost far more to achieve than SoC manufacturers would find cost-effective in high volume production.”

To detect defects that impact timing performance, product and test engineers can apply functional test patterns at higher clock rates. For designs with scan DFT, they can apply ATPG transition fault patterns. However, at the most advanced CMOS process nodes these are no longer good enough. “Regular transition faults are extremely good at detecting gross timing delays, but can potentially miss what we define as a small delay defect,” said Lee Harrison, director of Tessent IC solutions at Siemens EDA.

Small delay defects have become important in large SoC designs due to the rise in systematic defects. “Clearly, there’s a very tough set of design rules that you have to use in bleeding-edge technologies. Contrary to the previous technologies, some failures are more of a systematic nature,” said Andrzej Strojwas, CTO at PDF Solutions. “This plays a much more important role because you will not be able to eliminate the systematics. What we see from real volume production data is that the systematics are important and need to be screened for. This is a real weakness in current ATPG fault models.”

Finding small defect-driven delays requires a more deliberate approach, propagating a transition fault via a signal path that takes longer to travel between flip-flops. This is also known as the path with the least amount of slack.

“Slack-based transition fault test generation tries to cause the transition to follow the path with the least amount of slack. That means the smaller the defect you have, the bigger the opportunity you have to detect that defect,” said Synopsys’s Cron. “And then, by adding cell-aware, you get a couple more defect types than a traditional transition or stuck-at fault model. This is especially relevant with finFET processes, as you have the multi-gate cells with more area inside the cell.”

Fig. 1: A small delay defect and possible capture flip-flops. Source: Synopsys

To generate such patterns requires design data on path delays, also known as timing margin or slack.

“With slack-based transition delay test, the fault travels a particular trajectory topology through the device. A transition delay fault is slow to rise or fall on a gate output. That’s it, and you can detect that same fault perhaps 100 different ways within the combinational logic between flip-flops,” said Cron. “But if you know the slack on these paths through that node, then you can attempt to follow the transition from that node backward to the worst-case slack launching flop, and forward to the worst-case slack receiving flop. You are testing this fault on a path that has less slack than your generic transition delay test.”
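
The backward and forward search Cron describes can be sketched in a few lines of Python. This is a simplified illustration, not how any commercial ATPG tool is implemented. The netlist, the node names, and the delays are invented, the graph is assumed to be acyclic, and slack is taken as the clock period minus the path delay, ignoring setup time and clock skew.

```python
# Hypothetical netlist: each edge carries a delay in ps from driver to load.
EDGES = {
    "FF_A": {"G1": 120},
    "FF_B": {"G1": 300},
    "G1": {"G2": 150, "G3": 80},
    "G2": {"FF_X": 400},
    "G3": {"FF_Y": 100},
}


def fanin(edges):
    """Reverse the graph so it can be walked from loads back to drivers."""
    reverse = {}
    for src, loads in edges.items():
        for dst, delay in loads.items():
            reverse.setdefault(dst, {})[src] = delay
    return reverse


def longest(node, graph):
    """Return (delay, path) of the longest-delay walk from node to an endpoint."""
    best = (0, [node])
    for nxt, delay in graph.get(node, {}).items():
        sub_delay, sub_path = longest(nxt, graph)
        if delay + sub_delay > best[0]:
            best = (delay + sub_delay, [node] + sub_path)
    return best


def least_slack_path(fault_site, edges, clock_period_ps):
    """Find the longest path through fault_site and its slack."""
    back_delay, back_path = longest(fault_site, fanin(edges))  # to launch flop
    fwd_delay, fwd_path = longest(fault_site, edges)           # to capture flop
    path = back_path[::-1] + fwd_path[1:]
    slack = clock_period_ps - (back_delay + fwd_delay)
    return path, slack


path, slack = least_slack_path("G1", EDGES, clock_period_ps=1000)
print(" -> ".join(path), f"(slack {slack} ps)")
```

In this made-up netlist, a generic transition test could just as easily detect the fault at G1 through FF_A, G3, and FF_Y, a path with 300 ps of delay and 700 ps of slack. A defect that adds 200 ps would escape on that path but would fail on the 150 ps slack path.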

Fig. 2: Slack-based cell-aware ATPG chooses the longer path delay, the path with the least amount of slack. Source: Synopsys

However, it’s too expensive to use on all transition faults, and it makes no sense to apply it to paths that have large enough slack so that small delays won’t impact functionality. Engineers need to follow a process that permits judicious use of these generated patterns.

“To enable the ATPG tool to select the correct path to sensitize, it needs to become timing-aware, and this can be done by reading a full SDF for the design database,” said Siemens EDA’s Harrison. “It is then important to define some boundaries for the tool to work with target faults that sit on a path where the calculated delay is 80% of the clock period, for example, or to target the paths with a timing slack of less than X ns. These boundaries will generate a very targeted set of small delay defect patterns. The engineer targets small delay defects in designs with high-speed data paths, where specific paths would not naturally be detected when using regular transition fault ATPG. This technology is regularly used for testing devices that are pushing the boundaries of clock speed.”
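
The boundaries Harrison mentions amount to a filter on the fault list. The sketch below shows one possible form of that filter, using the 80% figure from his example. The `FaultPath` record, the fault names, and the delays are hypothetical, and in a real flow the delays would come from the SDF data rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultPath:
    """Hypothetical record: a fault and the delay of its longest path in ps."""
    fault: str
    path_delay_ps: float


def select_by_delay_fraction(paths, clock_period_ps, min_fraction=0.8):
    """Keep faults whose longest path uses at least min_fraction of the clock period."""
    threshold_ps = min_fraction * clock_period_ps
    return [p for p in paths if p.path_delay_ps >= threshold_ps]


def select_by_slack(paths, clock_period_ps, max_slack_ps):
    """Keep faults whose longest path has less than max_slack_ps of slack."""
    return [p for p in paths if clock_period_ps - p.path_delay_ps < max_slack_ps]


# Invented fault list for a 1,000 ps clock period.
candidates = [
    FaultPath("u1/Z slow-to-rise", 850.0),
    FaultPath("u2/Z slow-to-fall", 790.0),
    FaultPath("u3/Z slow-to-rise", 300.0),
]

by_fraction = select_by_delay_fraction(candidates, clock_period_ps=1000.0)
by_slack = select_by_slack(candidates, clock_period_ps=1000.0, max_slack_ps=250.0)

print([p.fault for p in by_fraction])
print([p.fault for p in by_slack])
```

Either boundary keeps the pattern count down by sending only the tight paths to the more expensive timing-aware generation.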

Embedded timing measurements
A natural extension of detecting small delay defects is characterizing the timing margin, or slack, in a path. This is done with in-circuit monitors, sometimes referred to as telemetry circuits. The assessment of timing margin can be done with the derived patterns, functional patterns or naturally occurring activity.

“Path margin monitor measures the delay of actual paths. The monitor is small in size, so a design can have 100 to 1,000 monitors spread throughout the die. This allows monitoring the state of the design in terms of timing margin across the die. With a resolution of 5 to 7 ps, it enables the capture of variations due to temperature and voltage profile on the die,” said Synopsys’s Massoudi. “Basically, you can determine which part of the die can run at which speed. This is connected together via a serial link to a central controller that manages the calibration and monitoring. In addition, there is a software driver and its associated analytics that collects the data and makes sense of it.”

Designers need to consider several attributes when placing monitors.

“For accurate measurement, they need to be carefully placed alongside the monitored flop and with the same clock, balanced as much as possible, to eliminate possible errors,” said Nir Sever, senior director of business development at proteanTecs. “At the same time, care must be taken to not disrupt the logic placement and cause excess congestion in already critical areas. Also the ‘other’ monitors must be placed in vicinity to help with root cause analysis.”


Fig. 3: Measuring minimum path margin with Margin Agents during normal operation at high coverage. Source: proteanTecs

A margin monitor requires characterization and then monitoring. “The critical path delay response depends on the local voltage variation, the gate type and size combination in the timing path, the signal activity that affects NBTI degradation, the process variation and layout context,” Sever said. “This makes it nearly impossible to capture actual timing margins with simple ring oscillators, or even critical path replicas that are not exercised under the exact same workload conditions. Consequently, measuring the actual margin accurately in critical paths in-situ during normal operation becomes a must. The choice of the critical path to monitor though is key, as not all top critical paths can be monitored (for practical reasons) and the path order under real applications will not be the same as seen in the models. proteanTecs offers intelligent algorithms to select the right timing path to monitor, to achieve high coverage in terms of number of nodes and critical paths covered, as well as representative groups of paths.”


Fig. 4: Latent defect detection using Margin Agents. Source: proteanTecs

During characterization, the engineer determines the margin for each individual path. Typically, the delays are generated using a delay-locked loop (DLL) circuit. This results in delay steps that are independent of process variation. Because margin monitors are triggered when the timing margin changes, each delay step should be about the same size.
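
One way to picture the characterization step is as a sweep: add one delay step at a time until the data no longer arrives before the clock edge, then report the last passing setting. The sketch below is a behavioral model only, under assumed numbers (a 6 ps step, a 1,000 ps clock, a 30 ps setup time). It is not a description of any vendor's monitor circuit.

```python
def captures_correctly(path_delay_ps, added_delay_ps, clock_period_ps, setup_ps):
    """True if the delayed data still meets setup time at the capture flop."""
    return path_delay_ps + added_delay_ps + setup_ps <= clock_period_ps


def characterize_margin_ps(path_delay_ps, clock_period_ps, setup_ps, step_ps, max_steps):
    """Sweep equal-sized delay steps and return the largest passing added delay."""
    passing_steps = 0
    for steps in range(1, max_steps + 1):
        if not captures_correctly(path_delay_ps, steps * step_ps, clock_period_ps, setup_ps):
            break
        passing_steps = steps
    return passing_steps * step_ps


# Assumed values for illustration only.
margin = characterize_margin_ps(
    path_delay_ps=823, clock_period_ps=1000, setup_ps=30, step_ps=6, max_steps=64
)
print(f"characterized margin: {margin} ps")  # true margin is 147 ps
```

The reported margin is within one step of the true value, which is why equal-sized steps matter: a later change of one step then corresponds to a known change in time.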

“Margins are measured at all phases of the product cycle, from wafer testing, packaged device testing, characterization, system test, and in-field usage,” Sever said. “This offers invaluable insights of the system reliability, performance, power, and cost optimization opportunities. An additional benefit of using the Margin Agents is to check the coverage of the critical paths per workload.”

Fig. 5: Setup for a path margin monitor. Source: Synopsys

Engineers also want to characterize the timing relationship in embedded memory access time.

“A clock delay monitor can measure any delay going through any path, including networks with high precision, i.e., a resolution on order of 1 picosecond,” said Massoudi. “This is mostly used in memory characterization, either in a test chip or during the characterization phase of a production device, to gauge the performance of the device. Data access time is a limiting factor of SoC performance. Also, it can measure different characteristics, like the clock duty cycle that can change because device transition from high-to-low and low-to-high can differ. One of the critical items that usually is not monitored is the delay of your clock. So there’s usually a 50 picosecond or more delay, depending on technology.”

Fig. 6: Clock delay monitor functionality. Source: Synopsys

Conclusion
A shift in the testing of timing requirements has changed the ATE’s role during production test of modern SoCs. Specifically, DFx has displaced the need for high-resolution instrumentation for measuring I/O timing relationships. As a result, it has elevated the need for DFx earlier in the design-through-manufacturing flow.

Testing of digital logic for delay defects now requires transition fault patterns, but exposing small delay defects requires carefully crafted test vectors. And alongside both of these shifts is a recognition that testing has to continue throughout a chip’s lifetime, which has opened the door to internal telemetry circuits that can be used to assess timing margin in digital, memory, and I/O circuits.

References

  1. Meta’s “Silent Data Corruption at Scale,” by Harish D. Dixit, et al.
  2. Google’s “Cores That Don’t Count,” by Peter H. Hochschild, et al.
  3. NXP’s “Multi-Transition Fault Model (MTFM) ATPG patterns towards achieving 0 DPPB on automotive designs,” by J. Corso, et al.



