Designing For In-Circuit Monitors

Data from sensors is being used to address a wide variety of issues that can crop up at any point in a chip’s lifetime.


In every application space the semiconductor ecosystem touches, in-circuit monitors and sensors are playing an increasing role in silicon lifecycle management and in concepts around reliability and resiliency, both during design and in the field.

The combination of true system-level design, in/on-chip monitors, and improved data analysis is expected to drastically improve the reliability of electronic devices throughout their lifetimes. In the future, as more sensors and monitors are strategically placed to collect data, and as that data is combined and analyzed, it will enable a much more granular understanding of what exactly goes wrong in real time, and it will open the door to recovery schemes that keep devices functioning, at least until they can be repaired or replaced.

“Product complexity is the big driver,” said Chris Mueth, senior manager of new markets and digital twin program manager at Keysight. “You could say there’s some regulatory standards and miniaturization going on, but it’s all really around complexity. And it’s just going to keep getting worse as consumers want more capability in the palms of their hands. For example, the aerospace/defense industry wants more capability in their hands, so they’re going to continue to push more and more functionality into products. To illustrate that, if you look at where a 2.5G chip was, say, 15 years ago, there may be 100 requirements for a PA chip that would go in your phone. Now it’s a multi-function 5G chip, which could have 2,000 requirements. And those are the specs, but there are multiple bands. Also, it has to operate on multiple voltages, multiple operating modes. All of these have to be managed and verified. We have heard stories of chip manufacturers who missed verifying a requirement, and only caught it after it was already in a chip in a phone. It was a very costly mistake.”

Where we are today
The concepts of observability, predictability, and resilience enabled by on-chip monitors and sensors used to be a hard sell. With more attention now being paid to how devices and systems behave over time, along with issues such as silent data corruption, that sell is becoming easier.

“The architects get it,” said Gajinder Panesar, chief architect at Picocom. “They’re the ones who normally understand why it’s needed. What was missing and hard to articulate was the business reason for doing it, because either you develop it yourself or you buy it. But it takes real estate, and in some cases it will take power. That’s shifting somewhat, but there’s still a question of, ‘Why do we need it? Why can’t I just have something running every now and again, sending some information somewhere?’ Both are needed. You’re not going to have all the compute resources or the complexity of the monitors-on-chip for all scenarios, from low-end to high-end systems.”

The real value comes when a collection of systems relays back what each one sees. From that come the in-life monitoring and predictive maintenance aspects.

“Designers need to understand that today’s advanced design techniques—combined with manufacturing complexities associated with the latest process nodes—are leading to new challenges that increase variability in power, performance, and the useful life of semiconductors,” noted Alam Akbar, director of product marketing at proteanTecs. “A chip’s power and performance characteristics start to change as it moves through the silicon value chain from pre-silicon design to new product bring-up, system integration, and finally to in-field usage.”

It is also understood that many semiconductor failures, such as bias temperature instability (BTI), can be predicted by monitoring how a chip degrades over time. “In-circuit monitors have evolved to be able to measure areas including power and performance degradation, workload stress, on-die temperature variations, and interconnect die-to-die monitoring for heterogeneous designs,” Akbar said. “Since reliability and safety are key differentiating specs today in mission-critical systems, and since device functionality is compromised over time, test has evolved to include lifetime operation as well.”
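The idea of predicting failures such as BTI from degradation trends can be sketched in a few lines. The sketch below is a simplified illustration, not a vendor implementation: it fits a straight line to periodic readings from a hypothetical in-circuit monitor and extrapolates to the point where the reading crosses a failure threshold. Real systems apply machine learning across many parameters, but the core idea of fitting a trend and projecting it forward is the same.

```python
def predict_end_of_life(readings, threshold):
    """Fit an ordinary least-squares line to periodic monitor readings
    (e.g. a ring-oscillator frequency in MHz, sampled once per month)
    and extrapolate to the time the reading crosses a failure threshold.

    readings  -- list of (time, value) pairs from an in-circuit monitor
    threshold -- value below which the chip no longer meets timing
    Returns the predicted crossing time, or None if the trend is flat
    or improving (nothing to extrapolate).
    """
    n = len(readings)
    mean_t = sum(t for t, _ in readings) / n
    mean_v = sum(v for _, v in readings) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in readings)
    den = sum((t - mean_t) ** 2 for t, _ in readings)
    slope = num / den
    if slope >= 0:
        return None  # no degradation trend
    intercept = mean_v - slope * mean_t
    return (threshold - intercept) / slope

# Hypothetical data: frequency degrading roughly 1 MHz per month.
samples = [(0, 1000.0), (1, 999.0), (2, 998.1), (3, 996.9)]
eol = predict_end_of_life(samples, threshold=990.0)  # ~10 months out
```

With the trend projected, a lifecycle-management layer can schedule maintenance, or compensation, well before the threshold is actually reached.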

To be fair, these are not new ideas. The likes of Intel always have had this kind of technology, but it was homegrown, custom, and hard to penetrate. It also was built more for their own use than for the systems they were building.

What’s changing is that this approach is becoming much more widely adopted. Still, Panesar is not convinced people understand how much information can be collected and how widely it can be used. “Tesla and other car manufacturers, for example, send a lot of telemetry information, and correlate information collected telemetry-wise in conjunction with what the on-chip is doing. Then they send the metadata back so [analysis] in the cloud can correlate and say, ‘This stuff’s going on, for these reasons.’”

These concepts feed into the test landscape, as well. “When we’re testing chips in the traditional way, we’re testing them at time zero and we know that they’re meeting timing,” said Lee Harrison, director of product marketing in the Tessent group at Siemens Digital Industries Software. “But increasingly there are a lot of requirements, especially from the automotive ecosystem, where they want to continue that testing. We’re in the process of looking at ways to monitor things around how the timing is changing. This means looking at how to monitor the most prominent slack paths within the design to see how the whole aging effect is changing the timing of the chip, and what impact that has on its reliability.”

This is part of a bigger focus to analyze designs in order to find optimal places to add monitors, and from there to figure out how to add a feedback loop so a chip can self-heal. As timing starts to change, some elements in a design can be adjusted. For example, as a device ages, the voltage can be ramped up to counteract that aging.

“That has a number of consequences,” Harrison said. “One, it means that when the device is new, you can minimize the system voltage to make sure the device is functioning without failure, without over stressing it. Also, as the device ages, those adjustments can be made to effectively extend the life of the device over time. In other words, you’re improving the reliability of the device to get every last bit of life out of it — especially in the automotive world, where the requirements for silicon extend past its natural life. That’s where we’re really seeing this push. But when adding in the technology, you don’t want to add thousands of monitors across the device and just do it randomly. You want to cherry pick the key points in the device where there are really tight timing slacks, where there are critical data paths, to make sure that you’ve got the best coverage of the device with the minimal amount of area and overhead.”
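The feedback loop Harrison describes, minimizing voltage when the device is new and raising it as the device ages, can be sketched as a simple bang-bang controller driven by the worst-case slack that the path monitors report. All parameter names, voltage ranges, and thresholds here are hypothetical; an actual implementation lives in firmware or power-management hardware and is tuned per process node.

```python
def adjust_vdd(vdd_mv, slack_ps, vdd_min_mv=700, vdd_max_mv=900,
               target_ps=20, step_mv=5):
    """One iteration of a hypothetical closed-loop voltage controller.

    vdd_mv    -- current supply voltage in millivolts
    slack_ps  -- worst-case slack reported by in-circuit path monitors
    target_ps -- desired guard band: below it, raise VDD to restore
                 margin; well above it, lower VDD to save power
    Returns the new supply voltage, clamped to the safe range.
    """
    if slack_ps < target_ps:
        vdd_mv += step_mv          # aging ate into the margin: speed up
    elif slack_ps > 2 * target_ps:
        vdd_mv -= step_mv          # excess margin: trade it for power
    return max(vdd_min_mv, min(vdd_max_mv, vdd_mv))
```

A fresh device with ample slack (say, 50ps) steps down toward the minimum safe voltage, while an aged device whose slack has shrunk below the guard band steps back up, extending useful life exactly as described above.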

Adam Cron, distinguished architect at Synopsys, believes there’s going to be an explosion in redundancy techniques in the sensor infrastructure. “We already have the big die with hundreds of cores. Suppose just one of them is going bad, but the rest seem to be working okay. We don’t want to throw the whole thing away. That means having some robust automation around removing a chunk and handling it from not just a hardware standpoint, but from an operating system and software standpoint, to still make use of what’s left in a reasonable, safe way, and secure way.”

Gone are the days when sensors for characterization and test were enough. Monitors today need to provide health tracking during usage, as well. But that means designing highly sophisticated hardware monitoring systems that can work in mission mode without disrupting system operation.

“Although in-circuit monitors are the first crucial step to enabling the needed visibility,” Akbar remarked, “it is imperative to apply specialized expertise to the data they provide. We can only achieve that with focused machine learning and advanced analytics. This is the only way to predictively find what you are looking for, instead of looking for a needle in a haystack.”

For the design team, some of this may look familiar.

“The challenges are not just about aging,” Cron noted. “They’re going to encounter some of the same things in cross-die variation, where something’s going to work well over here, and not work so well over here. That is what you need to design for initially. Luckily, those same things that you might stick into the design to help you manage that spread between what works fine over here and doesn’t work fine over there, you can then leverage for solving some aging issues — such as the voltage and frequency dynamically being set by the measurements being made in the system, based on monitoring data, not just across time, but at that moment, as well. We’re seeing in the hyperscaler domain that silent data corruption issues are a problem. There’s no data to go with that yet, as far as we know, but part of the issue is that at test time, there’s not a realistic environment set up. So, for example, having logic BiST in the design is probably something you’re going to have anyway. Maybe you can leverage that in an initial characterization of the design and system to make noise all around, as if it’s working actively, and then characterize in that kind of environment so you can get initial setups and tweaks of frequency and voltage. So you can save energy in the front end while bringing the voltage down as low as possible, but then have some upside and longevity extension in the back end.”

These techniques and approaches are gaining traction, specifically in automotive, noted Randy Fish, marketing director at Synopsys. “The automotive ecosystem went from using very mature technologies, such as 40nm, which was probably fairly advanced until very recently. Now suddenly, there’s N5A. Those are advanced nodes that don’t have a legacy yet. In automotive terms, they don’t have a history, and so they can’t work off historical data. It all comes down to the need to monitor, the need to test, and the need to repair or mitigate, and that’s going to continue through the life. The in-field test, which used to be something in automotive that they’d do with MBiST or LBiST, you’re going to see it more in the hyperscalers or the data centers or HPC realm. In-field scan is happening now, where you continue to test the part during its life. You are going to monitor things, and then you will mitigate or repair as needed.”

Cron added that the same techniques used in the factory, such as outlier detection, will migrate to the field. That will make it possible to find the next outlier a couple of years down the road when some chip is way off from all the other devices of the same kind, in the same rack, in the same farm.
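The factory-style outlier screen Cron mentions reduces, in its simplest form, to flagging any device whose monitor reading sits far from its peers. The sketch below uses a plain k-sigma test over a fleet of identical chips; the device IDs and readings are invented, and production flows typically use more robust statistics and many parameters at once.

```python
import statistics

def find_outliers(fleet, k=3.0):
    """Flag devices whose monitor reading deviates from the fleet mean
    by more than k standard deviations -- the same outlier screen used
    at production test, applied here to in-field telemetry.

    fleet -- dict mapping device ID to a monitor reading (e.g. a path
             delay in ps from identical chips in the same rack)
    Returns the IDs of suspect devices.
    """
    mu = statistics.mean(fleet.values())
    sigma = statistics.pstdev(fleet.values())
    if sigma == 0:
        return []
    return [dev for dev, v in fleet.items() if abs(v - mu) > k * sigma]

# Hypothetical telemetry: one chip drifting far from its peers.
readings = {"die_%d" % i: 100.0 + 0.1 * i for i in range(20)}
readings["die_7"] = 130.0  # way off from the other devices
```

Run periodically over telemetry from the same rack or farm, this finds the "next outlier" years into deployment, rather than only at time zero on the tester.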

Others agree. “The expectation is that ICs have longer lifetimes, with minimal degradation over time, especially when dealing with harsh environments or those that need a 24/7 operation,” said Pradeep Thiagarajan, principal product manager in the Custom IC Verification Division at Siemens Digital Industries Software. He noted that IC reliability requirements are a trending topic across many application areas. “The two things to consider here are whether the aging of a device can be modeled accurately, and whether the aging of devices actually will be slowed down over time, thereby extending the lifetime of the device, the circuits, and the product. Both of these need innovative solutions by the foundry to properly model these, as well as to make the right choices with the fabrication material that’s chosen and the processes that are involved in making a device. That enables you to extend the lifetime of a device, and also to ensure the device does not become a victim of some unexpected failure mechanism over time.”

Additionally, new monitoring techniques can connect the stages of the electronics value chain, creating a streamlined, common language across all steps and disciplines. This has enabled apples-to-apples correlations, data-based predictions, and shift-left decisions.

“Monitors today are much more sophisticated than in the past. Coupled with ML, they can provide visibility of the full material distribution, including detection of on-die variation for process speed, leakage, RC delay, and path delay. They can monitor performance and performance degradation at every stage, from characterization and qualification to volume testing and lifetime operation in the field. Users also can see not only where they have scarce margins, but also where they might have excess margin that they can tune down. They can exploit this ‘extra’ margin to push the limits even more by increasing frequency or reducing voltage, to save power for instance,” Akbar explained. “With this information, developers are now able to optimize their design decisions, remove excess guard bands that account for poor visibility, reduce DPPM rates, and make unique decisions for each chip depending on its characteristics in the field. For example, an auto company can utilize data coming from performance monitoring agents to identify chips that might be headed for failure, and then take pre-emptive actions, such as lowering a chip’s frequency to extend the life of the part, or notifying the owner to bring in the car for maintenance.”

Defining the data
To make in-circuit monitoring work, it’s also essential to collect the right data. That may sound obvious, but knowing what to collect and why is complicated. Different engineering teams want different types of data.

“If you are putting hardware blocks into the silicon, you can watch what’s going on and get all this information out to analyze in software on your host machine — all to do with the capabilities and performance of what you’d actually been doing — so you can see that it is behaving correctly and that it was performant,” said Simon Davidmann, CEO of Imperas Software. “What are you collecting data for, and what type of data are you collecting? It comes down to the ‘what’ and ‘why.’ The ‘how’ is obviously a necessity for us engineers because, for example, with our modeling, we are speed freaks. We don’t want to put anything in the model that will slow us down. If you want to start doing analysis on what’s going through the model, it’s going to slow it down. So we’re very concerned about what data people want because it’s going to have performance impacts.”

It usually comes down to how the data that’s collected will be used, and for what purposes.

“Some people are trying to tune software and need very specific data,” Davidmann said. “If they are gathering cycle-by-cycle data, and someone’s trying to verify something, they’ll need different data. And once you’ve got the data, there are different abstraction levels. What do you do with it? For example, if we’re helping someone port Linux, they don’t want to look at the events in the RTL. They don’t want to look at the register values. They want the abstraction of C or the abstraction of functions, or the abstraction of the scheduler of the jobs within the OS. And that’s all data that can be collected. Then they can do analysis on it to see how well it performs or what bits of the OS they’ve explored.”

When the engineering team has collected all of the data it requires, what do they do with it? What sense can they make of it?

It depends what you’re trying to do. “A lot of the hardware and processor world puts performance counters into the hardware so you can see how well the branch predictor did, how well the cache hits and misses were going, so you can see how many cycles each instruction took,” said Davidmann. “You can look to see if the branch predictor or the cache worked well. One option is to build smart designs into the processor micro-architecture. They put counters into the hardware so the folks designing the hardware can see what works. But once they built it, the software guys also can use it to tune the software based on what’s in the silicon.”
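The raw counters Davidmann describes rarely get read directly; software tuners look at derived ratios such as cache hit rate, branch misprediction rate, and cycles per instruction. The event names below are illustrative only; the counters actually exposed (for example, through Linux perf) vary by microarchitecture.

```python
def counter_summary(counters):
    """Turn raw hardware event counts into the ratios software tuners
    actually look at. Event names here are hypothetical stand-ins for
    whatever the microarchitecture exposes.
    """
    hits, misses = counters["cache_hits"], counters["cache_misses"]
    branches = counters["branches"]
    return {
        # Fraction of cache accesses served without going to memory.
        "cache_hit_rate": hits / (hits + misses),
        # Fraction of branches the predictor got wrong.
        "branch_mispredict_rate": counters["branch_mispredicts"] / branches,
        # Average cycles spent per retired instruction (CPI).
        "cycles_per_instr": counters["cycles"] / counters["instructions"],
    }

# Invented counter values for a short program run.
raw = {"cache_hits": 9_600, "cache_misses": 400, "branches": 2_000,
       "branch_mispredicts": 60, "cycles": 15_000, "instructions": 10_000}
summary = counter_summary(raw)
```

The same summary serves both audiences: hardware designers checking that the predictor and cache worked well, and software teams tuning code against the silicon they actually have.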

Putting the pieces together
The more data points, the more visibility into a chip. Troubleshooting can take place anywhere throughout the design-through-manufacturing flow, and even looped back from the field.

“For in-silicon debug, for example, because the initial silicon verification is a very difficult process, there are lots of things that come into play,” said Kam Kittrell, vice president of product management in the Digital & Signoff Group at Cadence. “If you look at physical monitors that go throughout a chip, if you find out that your chip passed your test and then it went on the ATE but then dies on the test board, you can find out that maybe the IR drop was a mess on the test board. You’ve got a lot of more information that you can pull back and compare to what you did before, all the way back to the design process. Over time, these will pull together. Also, if we see that people are putting together data in one platform, and somebody puts data in another platform as long as they’re using these open standards, you don’t have to translate everything from one to another. We believe this is going to happen, because we’re using open standards for this. We’ve looked at it closely enough, because we’ve already had other big data projects going on here that can stack. You can write some tools that recognize these two different databases as having unique information and do analytics on top of it. Technologies like Python have lots of utilities to facilitate things like this. All of this speaks to the megatrend toward more resilient design, more reliable design. On many levels that covers design, test, and manufacturing.”
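The cross-platform analytics Kittrell describes start with something mundane: linking records from different databases by a shared key, typically a die ID, so test-floor results can sit next to field telemetry. The record fields below are hypothetical; real flows would use a data-frame library and open-standard schemas, but the join itself looks like this.

```python
def join_by_die_id(ate_records, field_records):
    """Correlate production-test data with in-field telemetry by die ID.
    Field names (vmin_mv, temp_c, errors) are invented for illustration.

    ate_records   -- list of dicts from the tester database
    field_records -- list of dicts from the field-telemetry database
    Returns one merged record per die that appears in both sources.
    """
    field_by_id = {r["die_id"]: r for r in field_records}
    merged = []
    for r in ate_records:
        f = field_by_id.get(r["die_id"])
        if f is not None:
            merged.append({**r, **f})  # field data overlays test data
    return merged

ate = [{"die_id": "A1", "vmin_mv": 710}, {"die_id": "A2", "vmin_mv": 742}]
field = [{"die_id": "A2", "temp_c": 88, "errors": 3}]
linked = join_by_die_id(ate, field)
```

Once linked, a failure seen on the test board or in the field can be compared against what was measured at wafer sort, which is exactly the pull-back-and-compare loop described above.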

This becomes more important as designs become more complex at advanced nodes and in advanced packages. “Especially now at 5nm, aging is affecting the timing, and so you’ve got these models of how it will age, but how do you know how it will age until it’s been out there for five years?” said Kittrell. “Also, if you find a peculiar case that is failing, maybe a soft failure like a bit flipping every once in a while, that’s not a manufacturing defect, but there’s something you can do at manufacturing test. You can go back and re-do your manufacturing tests with known bad die, and then also disperse that test out to every unit to see if there are some weak links in the chain. There’s lots of goodness that comes from that.”

“If we want these monitors to provide the level and depth of coverage that is truly needed, they must be strategically placed. Integration tools allow designers to review the design by analyzing the connectivity, libraries, and design style within each block, and suggest the type and quantity of agents needed. Decisions are geared toward the specific business case they are intended for, which allows the designer to increase or reduce agents to meet their visibility goals and constraints. These agents are built for analytics, which is what allows them to be very small and widespread, with minimal impact on power, performance or area. They are sensitive to many vital parameters in the chip, and can sense both issues in the chip, as well as the surrounding electronics, application effects and environmental impact,” Akbar added.
