Pinpointing Timing Delays in Complex SoCs

In-circuit monitors become essential to understand the causes of failures over time and under real-world operating conditions.


Telemetry circuits are becoming a necessity in complex heterogeneous chips and packages to show how these devices are behaving post-production, but fusing together relevant data to identify the sources of problems adds its own set of challenges.

In the past, engineering teams could build margin into chips to offset any type of variation. But at advanced nodes and in advanced packages, tolerances are so tight that small defects and/or fluctuations in power and temperature can reduce timing margin enough to cause a failure. As a result, engineering teams need to find the root causes of variation, and to monitor any changes throughout the expected lifetime of a device, which in some cases can be a decade or more.

There can be many causes for failures, but as more features are added into devices the potential for unexpected interactions increases. So does the number of factors that could have caused those failures. For example, due to the growing emphasis on hardware/software co-design and software-defined hardware, identifying the source of a software failure becomes much more difficult. The problem may stem from the software code itself, or it may arise from interactions between software and hardware with or without the presence of small defects.

Software applications can vary functional workloads, bus transactions, and thermal profiles, and they can cause variation in localized power distribution, any of which can affect the performance of digital circuits. Consequently, software engineers have found themselves outside their comfort zone, because the timing margin of a data path with respect to a clock path now can impact a set of code, something that previously concerned only design, test, and system engineers.

That’s just one facet of the problem. At advanced CMOS process nodes, signal paths with just enough timing margin are susceptible to small defects, as well as to variations in heat, process, or power rails. Any of these can cause signal delays. And because they cannot be guard-banded, they must be monitored. As a result, demand is surging for in-circuit monitors that measure timing margins during functional operation, augmenting existing methods that rely on ring oscillators and local copies of a set of paths used for part characterization. This is proving useful both for new product introductions and for understanding the causes of failures, at time zero and during field operation.

The difference between the assumed mission profile and actual field usage adds yet another set of issues. This can be caused by different environments, applications, or use cases, and in some cases it can result in accelerated circuit aging and longer digital path delays. The setting of Vmin, i.e., the lowest voltage setting that still ensures desired circuit performance, can cause other problems.

Power usage is of particular concern for data centers and automotive chips. As a result, SoC designs for both target applications are incorporating more and more telemetry circuits (also known as on-chip monitors).

“Devices for data centers are trending in this direction, which is a large driver behind the interest in on-die sensing technology and the data analysis done on collected data,” said Ken Lanier, director of strategic business development at Teradyne. “These sensors look at electrical data, as well as thermal data, which is something traditional automotive devices have not necessarily done in the past. The challenge with data center devices is they are so large that the observability you get at device I/Os doesn’t always tell the whole story.”

The internal data provided by telemetry circuits enables engineers to better comprehend the cascading effect of changes in one or more timing margin parameters in data paths of interest.

“By co-designing the specialty monitors and algorithms to evaluate health, manufacturers are getting new insights,” said Alam Akbar, director of product marketing at proteanTecs. “For example, temperature sensors have existed in semiconductors for many years, but providing a heat map vs. application workload leads to new ways of optimizing power and compute resiliency. As a result, test is migrating into important system features for end products. It now can account for operational and performance marginalities, as well as software impact.”

Silent data errors in hyperscaler systems illustrate the complex interactions in which computing cores with defects fail for a specific set of data and computations. Both Meta and Google engineers noted in their descriptions of silent data errors that some failures exhibit intermittent behavior. The same core and software execution does not fail consistently. Many experts have theorized that activity in neighboring cores leads to a localized thermal and power distribution environment, which in turn causes a timing relationship to flip from just enough margin to no margin.

This speaks to the complex system interactions that can occur in large SoCs, as well as to why these manufacturing defects are so hard to detect during manufacturing test and in the field.

“We know that silent data errors are usually very elusive. They’re very difficult to detect and they show up during functional operation of the device. And there are certain environmental conditions like power profiles, certain temperatures, certain software workloads that result in the failure of the device. But when those devices come back and we apply structural patterns, they operate fine,” said Nilanjan Mukherjee, senior engineering director for Tessent at Siemens EDA.

Data about internal operations can provide feedback to improve manufacturing test coverage, and it can be used in the field to optimize system operation. Telemetry circuits and on-die built-in self-test (BIST) play a role in both applications, and the data can be used to modify whole-chip or partial-chip settings, such as supply voltage levels, as well as to better understand actual system usage.

Measuring timing during functional operation
Timing relationships between data paths, and between data and clock paths, are foundational. With extremely large SoCs (i.e., billions of transistors and vias) bound for either data centers or ADAS applications, designers and software engineers rely on these relationships remaining valid under all operational scenarios and over the lifetime of the device.

For these ICs at the bleeding edge of CMOS processes, the decreased timing margin coupled with the exponential increase in subtle defects creates a perfect storm. When specific software exercises a specific set of paths, a device that passed all manufacturing tests can fail during use. At times, this failure occurs only under particular thermal and power supply conditions, exposing the reduced timing margin.

During the operational lifetime of these SoCs or heterogeneous advanced packages, design engineers develop mission profiles to guard-band circuit performance against aging and for conditions at the minimum power supply level, Vmin. The assumed toggle activity may differ in actual systems, though, which is natural in large SoCs with many identical computing cores, such as CPUs, GPUs, NPUs, or TPUs. The associated power supply levels and temperatures also may differ.


Fig. 1: Setup for a path margin monitor. Source: Synopsys

In the past, timing margin data was inferred from design-specific ring oscillators co-located near the functional block of interest, as well as from a design block with paths similar to the actual paths. However, today’s designers need to understand timing margin during functional operation in the context of the actual paths, not a proxy. That capability now exists, so they can insert path margin circuits at the ends of a set of functional paths.

With accurate measurements of time, often provided by a delay-locked loop (DLL), paths can be characterized during functional operation. Measurements can be taken both in test mode and during functional operation. Characterization can be done during new product introduction, and in the field using cloud technology. Timing margin analytics can then be documented over the lifetime of the part.

“Once you know the margin for every path, then you program the margin threshold. In monitor mode, this path margin monitor block can look at thousands of paths at every clock cycle versus the set threshold,” said Firooz Massoudi, solutions architect at Synopsys. “This is a powerful feature that gives you insight within a cycle of the event. You just wait, and for every clock cycle, as long as everything is fine, nothing happens. But as soon as one of them reports that margin has deteriorated, then you go to a re-characterization of all the paths for that monitor.”
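
Behaviorally, that monitor-mode loop can be sketched in a few lines of software. This is a minimal illustration with hypothetical path names, delays, and thresholds; the actual monitor is dedicated on-die hardware sampling every clock cycle:

```python
from dataclasses import dataclass

CLOCK_PERIOD_PS = 1000.0  # hypothetical 1 GHz clock

@dataclass
class MonitoredPath:
    name: str
    delay_ps: float  # latest measured data-path delay

def slack_ps(path: MonitoredPath) -> float:
    """Timing margin: how much earlier the data arrives than the capture edge."""
    return CLOCK_PERIOD_PS - path.delay_ps

def monitor_cycle(paths, threshold_ps):
    """One monitor-mode check: return any paths whose margin fell below threshold."""
    return [p.name for p in paths if slack_ps(p) < threshold_ps]

# Two healthy paths and one that has degraded (e.g., through aging).
paths = [MonitoredPath("alu_carry", 850.0),
         MonitoredPath("mul_stage2", 900.0),
         MonitoredPath("cache_tag", 960.0)]

violations = monitor_cycle(paths, threshold_ps=60.0)
if violations:
    # Per Massoudi above: a violation triggers re-characterization of
    # all paths attached to this monitor.
    print("re-characterize:", violations)  # -> re-characterize: ['cache_tag']
```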

But it’s not only data signal paths that can impact the timing margin. “One of the critical items that is not usually monitored is clock delay,” Massoudi explained. “There’s usually a 50 picosecond or more delay depending on technology. Two factors can affect the failure of a timing path. One is the actual delay through the path. Another one is the clock that feeds those flip-flops. We can measure both delays. The advantage is it helps with detecting the source of deterioration, which sometimes can be due to a change in the clock tree paths between two flops.”
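
Both delays enter the standard setup-slack relationship used in static timing analysis. A small worked sketch with hypothetical numbers shows how a shift in the clock tree alone can erase margin:

```python
def setup_slack_ps(clock_period_ps, launch_clk_delay_ps, capture_clk_delay_ps,
                   data_delay_ps, setup_time_ps):
    """Setup slack between two flops: data launched on one clock edge must
    arrive at the capture flop a setup time before the next edge. A shift in
    either the data path or the clock tree changes the margin."""
    data_arrival = launch_clk_delay_ps + data_delay_ps
    data_required = clock_period_ps + capture_clk_delay_ps - setup_time_ps
    return data_required - data_arrival

# 50 ps of added launch-clock delay erodes margin even though the
# data path itself never changed.
print(setup_slack_ps(1000, 50, 50, 900, 30))   # 70
print(setup_slack_ps(1000, 100, 50, 900, 30))  # 20
```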

Where to place monitors
Historically, engineers depended on synthetic path monitors. “For example, you take a data path, e.g., a carry chain, and stick four of them around the chip, then measure them. This approach still adds value,” said Randy Fish, director of silicon lifecycle management at Synopsys. “Now, with state-of-the-art design flows, we have the ability to insert these monitors directly into functional paths, as well as the ability to generate the test sequences that exercise them. This full solution is very hard for somebody to do in an ad hoc way.”

Because designers simply cannot select all paths, they need to choose paths that test circuit performance marginality with respect to process variation, power supply sensitivity, temperature, and aging. The monitor circuitry also consumes valuable die area.


Fig. 2: Illustration of path margin monitor placement and analyzer. Source: Synopsys

“In the field, we need to look at all the parts that are corresponding to different process corners, and then monitor them and generate patterns for them,” said Siemens EDA’s Mukherjee. “You can come up with those monitors using standard delay format (SDF) for various process corners. But how do you place them? Where do you place them? And how many do you require? We believe in the long run that ATPG can help in figuring out the locations where those monitors have to be placed in the context of process corners.”

This adds a whole new wrinkle to place-and-route. “The absolute number of these elements does not matter as much as the coverage achieved — as long as the design size doesn’t increase,” said Nir Sever, senior director of business development, proteanTecs. “We offer automated ROI analysis tools to select the optimal mix, number and placement of our monitors, but in the end, designers can alter the final decision. Our EDA tools will also assist in the optimal implementation and verification post insertion.”

Engineers also want to evaluate power supply sensitivity at the lowest operating level, i.e., Vmin.

Meanwhile, engineers are working on optimizing pattern design with respect to the process, which can indicate the ideal location for monitor insertion.

“In a design there are millions of paths,” said Synopsys’ Massoudi. “Path selection is an integral part of this solution. For example, path selection based upon Vmin is of interest. You may look at the paths that have elements of devices that are more susceptible to voltage variation. When selecting paths, you want paths with sufficient delay, because accumulated change in the delay improves detection.”
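
A toy version of that selection step, assuming purely illustrative delay and voltage-sensitivity figures for each candidate path, might rank candidates like this:

```python
# Hypothetical candidates: (name, nominal delay in ps, delay shift in ps
# per 10 mV of supply droop). Values are illustrative only.
candidates = [
    ("alu_carry", 850, 4.0),
    ("fpu_norm",  920, 7.5),
    ("cache_tag", 400, 6.0),
    ("io_sync",   300, 1.0),
]

MIN_DELAY_PS = 500   # prefer long paths: accumulated delay change is easier to detect
MONITOR_BUDGET = 2   # area limits how many paths get a monitor

ranked = sorted((c for c in candidates if c[1] >= MIN_DELAY_PS),
                key=lambda c: c[2], reverse=True)
selected = [name for name, _, _ in ranked[:MONITOR_BUDGET]]
print(selected)  # ['fpu_norm', 'alu_carry']
```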

A final analysis parameter of interest is aging. The mission profile dictates activity levels, so the aging analysis should be able to identify sensitive design blocks and paths to monitor.

Understanding margin in context
Discerning interactions from on-die telemetry data requires the data from individual parameter monitors to be connected with one another, as well as with monitors for architectural functional behavior.


Fig. 3: Path margin is influenced by process variation, temperature, voltage and functional transactions. Source: A. Meixner/Semiconductor Engineering

“On the functional side, you can have analytic modules that gather and monitor the data from various functional blocks, e.g., bus monitors, direct memory access monitors, test encoders,” said Siemens EDA’s Mukherjee. “You will be able to collect various data from these monitors. You have communication between those monitors. There will be certain parameters you can set for those monitors so that it can start collecting the data.”

Next comes use. “How do they communicate that data to the external world? There’s a messaging infrastructure in place that will communicate this data on-chip (i.e., between each other) and off-chip,” he said. “In order to communicate with the external world, monitors will need to use either the functional interfaces that already exist in the design, or they will need to communicate via some sort of debug interfaces, like a JTAG interface.”
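
As a purely hypothetical illustration of such a messaging scheme (not any vendor’s actual protocol), each reading might carry a monitor ID, a timestamp, and a measured value so that off-chip software can correlate streams:

```python
import struct
import time

def pack_reading(monitor_id: int, timestamp_us: int, value_ps: int) -> bytes:
    """Little-endian: uint16 monitor ID | uint64 timestamp (us) | uint32 value (ps)."""
    return struct.pack("<HQI", monitor_id, timestamp_us, value_ps)

def unpack_reading(msg: bytes):
    return struct.unpack("<HQI", msg)

msg = pack_reading(monitor_id=7,
                   timestamp_us=int(time.time() * 1e6),
                   value_ps=940)
print(unpack_reading(msg))  # (7, <timestamp>, 940)
```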

Understanding voltage, frequency, and temperature over the lifetime of an individual SoC, and of the computing units within it, provides useful information for system owners wanting to extend a system’s useful life in the field. That data also can be used to compare different die with each other, which is useful across a fleet of SoCs to identify outliers and drive actions during operation.

“The whole point is going beyond just pass/fail by providing full parametric measurement,” proteanTecs’ Sever said. “In addition, the clear distinction is that our monitors run during fully functional mode (real system running functional software in the field). The amount of data gathered from all the chips and all systems throughout their lifetime to predict TTF and allow for ‘predictive maintenance’ is done using a highly capable, ML-powered, cloud-based data analytics software platform.”
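
A deliberately simple sketch of the underlying idea is to extrapolate a path’s margin trend to estimate time to failure (TTF); production platforms, as Sever notes, use far richer ML models over fleet-wide data:

```python
def predict_ttf_hours(hours, margins_ps, fail_threshold_ps):
    """Least-squares line through the margin history, then solve for the
    time at which the fitted margin crosses the failure threshold."""
    n = len(hours)
    mx = sum(hours) / n
    my = sum(margins_ps) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(hours, margins_ps))
             / sum((x - mx) ** 2 for x in hours))
    if slope >= 0:
        return None  # no degradation trend observed
    intercept = my - slope * mx
    return (fail_threshold_ps - intercept) / slope

hours = [0, 1000, 2000, 3000]
margins = [120.0, 112.0, 106.0, 97.0]  # slowly eroding path margin
print(predict_ttf_hours(hours, margins, fail_threshold_ps=20.0))  # ~13333 hours
```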

Aging, temperature, activity, and Vmin are dynamic metrics that affect path margin monitor data. Design resilience strategies can be deployed when a degradation in timing margin is detected. Such degradation also can signal the need to modify the Vmin and/or clock frequency in response to the actual mission profile so the SoC’s life can be extended.

“First, we all use the foundries’ aging models. Absolutely everybody characterizes aging, but it is done with assumptions. One of the big assumptions is the workload. With a mission profile, you make assumptions on what you characterize for,” said Synopsys’ Fish. “But what if the software that’s running results in an activity factor different from what you estimated, and the environmental conditions (i.e., temperature and voltage) are not what you thought they were going to be? Either results in your part not having enough margin, or you put too much margin in. Being able to actually measure changes in path timing becomes very useful.”
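
A toy model, assuming purely for illustration that aging-induced delay shift scales linearly with toggle activity, makes the point about mistaken workload assumptions:

```python
def aged_delay_ps(fresh_delay_ps, max_shift_ps, activity_factor):
    """Hypothetical linear model: aging-induced delay shift scales with
    toggle activity over the mission life."""
    return fresh_delay_ps + max_shift_ps * activity_factor

CLOCK_PERIOD_PS = 1000
assumed = aged_delay_ps(900, 80, activity_factor=0.3)  # mission-profile guess
actual  = aged_delay_ps(900, 80, activity_factor=0.9)  # what the software really did
print(CLOCK_PERIOD_PS - assumed, CLOCK_PERIOD_PS - actual)  # 76.0 vs. 28.0 ps of margin
```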

Characterizing parts for Vmin can assist both with settings for ATPG test application during manufacturing and with mission-mode applications such as preventive maintenance.

“I would say that a big headache our customers tell us is Vmin in the field. They say it’s hard to figure out what the Vmin is for a part. And so they generally leave a lot of power on the table by setting it too high,” noted Synopsys’ Fish. “That’s an area of extreme interest.”

Others concur this is an absolute need for silicon lifecycle management for data centers and automotive applications.

“Automotive is power-conscious because you’re just running on a battery. But for data centers, because they’re at megawatts, every millivolt counts. And those path margin monitors are meant to get your part running at as low voltage as possible while still being reliable and usable,” said Adam Cron, distinguished architect at Synopsys. “When the monitors detect activity starting to slide out of the usable domain, then raising Vmin is one way to address it. Lowering Vmin actually helps you save energy, but also lowers the stress on the component. So it’s kind of double duty.”
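
Conceptually, that is a feedback loop around the supply. A minimal sketch, with hypothetical step sizes and margin bands:

```python
VMIN_FLOOR_MV, VMAX_MV, STEP_MV = 550, 800, 5  # hypothetical limits

def adjust_supply(vdd_mv, worst_margin_ps, low_ps=40, high_ps=90):
    """Raise the supply when margin slides toward failure; lower it when
    margin is comfortable, saving power and reducing stress. The band
    between low_ps and high_ps provides hysteresis."""
    if worst_margin_ps < low_ps:
        return min(vdd_mv + STEP_MV, VMAX_MV)
    if worst_margin_ps > high_ps:
        return max(vdd_mv - STEP_MV, VMIN_FLOOR_MV)
    return vdd_mv

print(adjust_supply(700, worst_margin_ps=35))   # 705 -> back off
print(adjust_supply(700, worst_margin_ps=120))  # 695 -> save power
```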

Gathering statistics about the localized environment can help project behavior over time. Another response to a degradation in circuit timing margin is to take advantage of redundant design blocks in a resiliency strategy, taking the degraded block offline or signaling to the system that it is about to fail.

Different options
Telemetry circuits run the gamut from basic environmental measurements, such as process, voltage, or temperature, to circuit performance monitors and monitors of architectural performance under specific workloads. Tying all this information into a bigger picture of what’s happening is non-trivial to implement, and connecting these various data sources within the context of each one’s measurements is a work in progress. Several industry experts confirm that such capability will evolve gradually.

“We have always faced the challenge that the test technique must be significantly better than the device being evaluated. What is changing is where and what is being tested,” said proteanTecs’ Akbar. “Testing is moving from something done in the isolation of a factory at the beginning of a product’s life with specialized equipment, to continuous and in-situ evaluation throughout the product life. It’s not an easy evolution, but an inevitable one that will shift the entire ecosystem.”

Part of that shift is inferring what engineers can learn from this new source of internal data.

“We don’t directly see the connections,” said Siemens EDA’s Mukherjee. “But as we collect more data from the field, there will always be ways to see how different monitor data relates. For example, a bus monitor looks at the performance of the communication between a CPU and memory. You also have timing monitors within the CPU. If we see an impact on the path timing performance, that might actually also impact the transactions.”
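
One simple way to test whether two such telemetry streams move together is a correlation coefficient. A sketch with made-up numbers:

```python
def pearson(xs, ys):
    """Correlation coefficient between two equally sampled telemetry streams."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

path_margin_ps = [100, 95, 92, 85, 80]  # CPU path margin shrinking
bus_latency_ns = [40, 41, 43, 47, 50]   # CPU-memory latency lengthening
print(round(pearson(path_margin_ps, bus_latency_ns), 2))  # -0.99
```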

Monitoring during functional operation provides a new set of lenses for comprehension. And therein lies the biggest challenge: the ability to connect data across the different telemetry circuits and drive on-chip actions in real time.

“You are collecting the data and telling the instruments when to perform measurements, basically orchestrating data collection across the chip,” noted Geir Eide, director of product management for Tessent Embedded Analytics at Siemens EDA. “An SoC is a fairly complex scenario in which to do this. Along with getting not just the data, but the data with a timestamp, you could get information about not just the measurement itself, but when it is done. That whole orchestration is going to be as important as the instruments themselves. You need to not just have the instrument, but know where to put it and how you will deal with the data. It’s important to make sure the instrumentation is smart enough to not just produce boatloads of data, but to have some embedded processing, as well. Then it can be clever enough about capturing the right amount of data and sending the right amount of data. That is not raw data but an aggregation of it, along with the ability to trigger a set of actions with other monitors, or an in-field test running structural/functional tests.”
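
A minimal sketch of that aggregation idea, with hypothetical window sizes and thresholds: summarize each window of readings on-die and flag only threshold crossings, rather than streaming raw samples:

```python
from statistics import mean

def summarize(window):
    return {"min": min(window), "mean": round(mean(window), 1), "max": max(window)}

def aggregate(readings, window_size=4, alarm_below_ps=50):
    """Emit one summary record per window instead of raw samples,
    flagging windows where margin crossed the alarm threshold."""
    for i in range(0, len(readings) - window_size + 1, window_size):
        window = readings[i:i + window_size]
        record = summarize(window)
        yield record, record["min"] < alarm_below_ps

margins_ps = [92, 90, 88, 91, 70, 55, 48, 52]  # second window dips below 50 ps
for record, alarm in aggregate(margins_ps):
    print(record, "ALARM" if alarm else "ok")
```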

Conclusion
Timing relationships are central to an SoC’s correct operation. As evidenced by reports from hyperscaler applications, traditional manufacturing test patterns, from stuck-at faults to cell-aware and slack-aware transition fault testing, are not sufficient. Timing margin needs to be measured in-situ with telemetry circuits during functional operation. In addition, to comprehend these timing assessments, additional data is needed to understand the local thermal and voltage/power environment, as well as the corresponding functional transactions occurring within the larger functional block. With the various telemetry circuits now available, the industry is on the cusp of putting it all together.

Related stories
Pinpointing Timing Delays Can Improve Chip Reliability
Focus shifts to internal chip assessments of timing margin and changes who’s responsible for what.

What Data Center Chipmakers Can Learn From Automotive
Higher quality, lower cost, and faster time to market are requirements for both as rising complexity in vehicles overlaps with defectivity concerns in data centers.

Screening For Silent Data Errors
More SDEs can be found using targeted electrical tests and 100% inspection, but not all of them.

Designing For In-Circuit Monitors
Data from sensors is being used to address a wide variety of issues that can crop up at any point in a chip’s lifetime.


