Ensuring Chip Reliability From The Inside

In-chip monitoring techniques are growing for automotive, industrial, and data center applications.


Monitoring activity and traffic is emerging as an essential ingredient in complex, heterogeneous chips used in automotive, industrial, and data center applications.

This is particularly true in safety-critical applications such as automotive, where much depends on the system operating exactly right at all times. To make autonomous and assisted driving possible, a mechanism to ensure systems are operating correctly in real time needs to be in place. Today, this typically is referred to as in-chip monitoring, runtime monitoring, in-system monitoring, and even built-in self-test.

The advantages and restrictions of in-chip monitoring depend on several factors, such as the expected failure mechanisms, which are both application- and technology-specific, as well as on the chosen monitoring approach, said Jens Warmuth, engineer in Fraunhofer EAS’ advanced physical verification working group. “One approach is detecting damage directly but early enough that no failure has occurred yet. Another is to measure environmental or operational parameters contributing to damage formation and growth and thereby foresee failures long before they happen.”

With monitoring of contributing parameters, think of a chip used in high-stress environments with big temperature changes or excessive vibration.

“This can be expected to experience high levels of mechanical stress,” Warmuth said. “For most circuits this is no significant problem, and only the package and other periphery is in danger. For some circuits, however, characteristics may change. To determine accurate values of mechanical stress on the chip level, it has to be measured directly on-site. Moving away from on-chip failure mechanisms to those at packaging level, it also often becomes necessary to move the monitoring circuit out of the chip and into a device in its periphery.”

As electronic system sophistication within cars increases, so does the requirement to monitor in-chip conditions, whether that’s within the engine bay itself, as a part of radar/LiDAR systems, or as part of the infotainment infrastructure.

“The requirement for energy efficiency, power performance and reliability in high-volume manufactured vehicles has also caused in-car monitoring and sensor systems to increase in number and complexity, allowing the dynamic physical conditions to be managed and to optimize against the manufacturing variability of each engine produced,” said Stephen Crosher, CEO of Moortec. “At the semiconductor level, in-chip monitoring allows for self-test and self-diagnosis, providing a scheme to enhance the reliability of devices situated within harsh automotive environments.”

This applies to network activity, as well, but monitoring needs to be done with a certain degree of awareness of the impact of the monitoring itself. “We’ve found that at current data rates active monitoring of signals disturbed those signals. You need to monitor the quality of those signals on chip rather than actively monitoring the signals themselves. For example, what does the receiver actually receive? You need to sense for exactly what a device sees. That involves equalization. You need to shape the signal in a way that it is intended to be received,” noted Steven Woo, vice president of systems and solutions and distinguished inventor at Rambus.

Automotive monitoring
Much of the current activity surrounding on-chip monitoring in the automotive arena has been driven by the ISO 26262 standard, because that’s what everyone is using to measure a system’s resiliency, noted Anush Mohandass, vice president of marketing and business development at NetSpeed Systems. “ISO 26262 does a good job in terms of asking where the problem is and running diagnostics. It says whether you have a problem or not, but it doesn’t tell you how to recover from it. And it doesn’t tell you what to do in a live situation. When you’re building cars, you care about ISO 26262, but not so much. You care about what’s happening to the chip as you’re driving the system.”

One way automotive OEMs address this is with online monitoring that checks to see if the hardware is operating properly. This is optimal when there is nothing running other than the built-in self-test (BiST) traffic, and it works well at ignition start or when bringing up the system for the first time.

“However, if you’re driving down 101 and you want to figure out whether your chip is doing well or not, you cannot suddenly bring the performance down to zero,” Mohandass stressed.

Another approach uses a technique called network BiST, which works in a system at regular performance. This could be an attractive approach for automotive OEMs that want diagnostics to understand failure before it happens, not after. Particularly in autonomous systems, understanding why something is broken after the fact is only the absolute minimum.

“In the same way, ISO 26262 is necessary but not sufficient,” Mohandass said. “For other things you need online monitoring to make sure you have diagnostics done very, very well. The next step includes recovery techniques, which must be architected into the system up front. This is true for pretty much any IP or anything in the system. You can’t think of it as an additional layer you add on; you need to architect it in.”

Again, in the context of runtime monitoring/mission-mode monitoring, it’s all about making sure that when a product is shipped it is defect-free. Even after the devices are assembled in the car, they can be tested. But questions remain about whether they will maintain their operational quality. Are they going to maintain their ability to operate safely in the long term and especially when the car is actually running? That is where additional capabilities are needed, said Stephen Pateras, product marketing director at Mentor, a Siemens Business.

“In the past when you talked about DFT or test, it was typically at the manufacturing stage,” said Pateras. “You manufactured the part, you ran tests to cover all of the defects, and then the assumption was these parts would stay good for a long time. On the rare occasion they didn’t, you could deal with that when it occurred. In an automotive functional safety environment, anything going down is unacceptable, so infrastructure capabilities must be preemptive to ensure this never happens.”

There is much more that can be done with this monitoring approach, as well. “You can set up performance counters to use at runtime to map quality of service and track errors,” said Kurt Shuler, vice president of marketing at ArterisIP. “You want to know whether the system is behaving as you expect it to behave. You can use probes to look at the traffic and set it to click every X number of times something happens so you can have statistical sampling. You also can use probes to view data in motion at runtime, similar to taking a trace off the data path.”
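The counting and sampling behavior Shuler describes can be sketched in software. This is only an illustrative model, not ArterisIP's implementation: the `SamplingProbe` class, its `sample_every` parameter, and the event names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SamplingProbe:
    """Hypothetical software model of an observation probe.

    Counts every event it sees and captures one in every `sample_every`
    events for closer inspection (statistical sampling).
    """
    sample_every: int
    counters: Counter = field(default_factory=Counter)
    samples: list = field(default_factory=list)
    _seen: int = 0

    def observe(self, kind: str, payload: int) -> None:
        self.counters[kind] += 1      # performance counter per event type
        self._seen += 1
        if self._seen % self.sample_every == 0:
            self.samples.append((kind, payload))  # keep every Nth event


# Made-up traffic: every third transaction is flagged as an error.
probe = SamplingProbe(sample_every=4)
traffic = [("read", i) if i % 3 else ("error", i) for i in range(12)]
for kind, payload in traffic:
    probe.observe(kind, payload)

print(dict(probe.counters))
print(probe.samples)
```

The counters give the full error and traffic totals, while the sampled list stays small no matter how much traffic passes the probe.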

In addition, this kind of approach can be used for tracking security breaches, which in the case of automotive, industrial and medical applications can have safety implications.

“The observation probes can look at the traffic, but you also insert programmable firewalls into the datapath,” Shuler said. “From there you can determine what to let through and what not to let through. Sometimes you don’t want an adversary to know you’re onto them, so you may want to block malicious traffic without firing an interrupt and letting the adversary know. This capability is in addition to the hardware diagnostics. When you combine all of this together, you get a functionally safe solution where you have ECC or parity for the data path, hardware duplication, BiST, safety controllers, and these additional runtime diagnostics. It’s the combination of all of these that makes it so attractive.”

In the data center
Inside of data centers, as CPU temperatures rise, the server power consumption drastically increases due to CPU leakage current. Real-time temperature monitoring systems are necessary here, as well, to allow for power optimization.

“In-chip embedded temperature sensors also can help extend device lifetimes or provide protections through the enablement of server shutdown schemes, the latter being the result of rising temperatures from sudden increases in dynamic CPU load profiles,” Moortec’s Crosher noted.

Additionally, the needs of the data center and the automotive space have much in common, and many times the difference is simply one of terminology.

“For data centers, they call these RAS (reliability, availability, serviceability) features,” said NetSpeed’s Mohandass. “Data center guys get paid and get monitored based upon what percentage of the time they are online. It’s not the 99, but it’s 99 followed by how many nines you have. We’ve seen this many times. One data center goes offline for United Airlines and millions of people get stranded. It’s bad publicity for United, bad publicity for the data center guys.”
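To put the “how many nines” remark in numbers, here is a short worked calculation of the yearly downtime budget each availability level allows. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` nines.

    For example, nines=3 means 99.9% availability.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):8.2f} minutes/year")
```

Each additional nine cuts the budget by a factor of ten, from about 3.7 days a year at 99% to a little over five minutes a year at 99.999%.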

The way data center managers have coped with this is to not worry about RAS at the chip level. Rather, they focus on the rack level or system level, he said. “How do you build reliability at a system level? For a network-on-chip, the key is if you have different racks or different PCIe, what happens if it gets pulled out? Do you reset your entire chip or can you gracefully handle things coming online and offline? The mechanisms are very similar to how the automotive segment works, but out there it’s about how graceful you are in accepting errors and recovering from errors.”

Let IP handle it
The promise of design IP has always been to tame the systemic complexity that plagues engineering teams. Another approach to on-chip monitoring, therefore, is to implement specialized IP that gets scattered inside a big digital chip to help engineers understand how that chip is actually working in operation, in real time and with real software. It shows the system as a whole at a high level of abstraction so they can find problems quickly, noted Rupert Baines, CEO of UltraSoC.

For example, an embedded logic analyzer can report on any signal. “You might use that to just look at common use cases, buffer level and checking the buffers are being filled and emptied,” Baines said. “Quite often, you can improve performance by noticing you were very conservative on one buffer that’s only fluctuating between 20% and 30%. If you let it go to 60%, it could do twice as much data in a given amount of time.”
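As a rough sketch of the buffer check Baines mentions, the snippet below reduces a series of fill-level samples to a low and high watermark. The sample values, the 256-entry capacity, and the 50% cutoff for calling a buffer conservative are invented for the example.

```python
def occupancy_range(samples, capacity):
    """Return (low, high) buffer occupancy as fractions of capacity."""
    if not samples:
        raise ValueError("need at least one occupancy sample")
    return min(samples) / capacity, max(samples) / capacity


# Hypothetical fill levels captured by an embedded logic analyzer.
fill_levels = [52, 60, 75, 58, 77, 66]
low, high = occupancy_range(fill_levels, capacity=256)

# Assumed rule of thumb: a buffer that never passes half full has headroom.
has_headroom = high < 0.5
print(f"occupancy {low:.0%} to {high:.0%}, headroom: {has_headroom}")
```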

Alternatively, protocol analyzers can be used to understand different interconnects and show how efficiently traffic is flowing through a bus.

IP is the foundation of nearly every SoC system today, which is why IP is such a key piece of the on-chip monitoring puzzle. And IP in the automotive space is undergoing some interesting changes.

Navraj Nandra, senior director of product marketing for interface IP at Synopsys, said that each application has its own set of challenges. For example, automotive electronics operate in a harsh electrical operating environment with very high temperatures, while industrial applications don’t always operate under similar stress conditions.

“The difference with the data center is that processing demands vary quickly, which equates to peak demands in the data center, followed by quiet periods,” Nandra said. “You get very big local changes of the temperature that impact the temperature of the chip so you need to figure out in all of those areas how to throttle the speed of the SoC or limit its functionality so that you’re not going into unsafe ranges that impact the performance.”

Some of the ways that engineering teams are looking into the on-chip PVT (process, voltage and temperature) variation is by adding more intelligence on chip. “There is a concept of distributed sensors around the SoC, and what these distributed sensors do is measure the voltage and the temperature around the SoC, but it’s very local,” Nandra said. “For example, there can be one sensor close to each processor in a multiprocessor SoC. The sensors each communicate to a central manager, which is gathering how to maintain the PVT monitoring across the whole SoC. The central manager will set up a specified range of the threshold voltage, for example, and that can be used to reduce the clock speed, balance the load and change other aspects in the processor.”
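The arrangement of local sensors reporting to a central manager can be modeled in a few lines. This is a minimal sketch under stated assumptions: the `SensorReading` type, the temperature and voltage limits, and the fixed 0.5 throttle factor are hypothetical and do not come from any particular PVT monitoring IP.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    core: str         # which processor the sensor sits next to
    temp_c: float     # local temperature in degrees Celsius
    voltage_v: float  # local supply voltage in volts


def plan_clock_scaling(readings, max_temp_c=105.0, min_voltage_v=0.72,
                       throttle_factor=0.5):
    """Central manager: choose a clock multiplier for each core.

    A core keeps full speed (1.0) while its local reading stays inside the
    allowed range, and is throttled otherwise. All limits are illustrative.
    """
    plan = {}
    for r in readings:
        out_of_range = r.temp_c > max_temp_c or r.voltage_v < min_voltage_v
        plan[r.core] = throttle_factor if out_of_range else 1.0
    return plan


readings = [
    SensorReading("cpu0", 88.0, 0.80),
    SensorReading("cpu1", 111.5, 0.79),  # too hot
    SensorReading("cpu2", 90.0, 0.70),   # supply voltage too low
]
plan = plan_clock_scaling(readings)
print(plan)
```

A real manager would likely add hysteresis and load balancing rather than a single fixed throttle step, but the data flow is the same: local measurements in, per-core decisions out.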

Another way of monitoring for power management is voltage control, which can be used to reduce the voltage to save on power. “You can do something more sophisticated like set a level on that voltage control to specify a minimum level allowed for a specific performance goal,” Nandra said. “The voltage-monitoring system can then detect voltage transients that can be helpful for the safe operating region of the SoC. Finally, process control can be implemented to allow the SoC to reach the maximum possible speed by adjusting the voltage levels accordingly.”

A more granular approach can be taken in automotive applications. “In safety critical applications like ADAS, where we’re trying to detect and control failures, and to minimize the impact of random hardware failures, to accomplish this there is some on-chip monitoring that’s required,” he said. “For hardware safety functionality that’s added to the SoC, things like parity checking, cyclic redundancy codes (CRCs), and error correcting code (ECC) all basically allow the errors to be identified. Once you’ve identified the error, the automotive IP can then add some hardware safety mechanisms.”
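To illustrate how such checks identify errors, the sketch below compares a single even-parity bit with a CRC on the same message. It uses Python's standard `zlib.crc32` purely for demonstration; the message and the flipped bit positions are made up, and real automotive IP performs these checks in hardware.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(b).count("1") for b in data)
    return ones % 2


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (a simulated fault)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


message = b"brake torque request"  # hypothetical payload
parity = even_parity_bit(message)
crc = zlib.crc32(message)

one_flip = flip_bit(message, 5)
two_flips = flip_bit(one_flip, 42)

parity_catches_one = even_parity_bit(one_flip) != parity
parity_catches_two = even_parity_bit(two_flips) != parity
crc_catches_two = zlib.crc32(two_flips) != crc

print(parity_catches_one, parity_catches_two, crc_catches_two)
```

Parity flags the single flipped bit but misses the double flip, because two inversions restore the original bit count parity. The CRC catches both cases. Neither scheme repairs the data, which is the role of ECC.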

This is reflected in revamped automotive-grade IP. With PCI Express, for example, DDR now includes support for CRC, parity protection and ECC.

How much will it cost?
One of the major arguments in favor of in-circuit monitoring is cost: it is less expensive than redundancy.

“Compared to redundancy, only one more device has to be installed, not many,” said Fraunhofer’s Warmuth. “However, this device has to be extremely reliable so it does not become the weak link itself. Still, this approach can be feasible and cost-efficient because it reduces these additional reliability considerations to just this one device. The challenge of necessary changes in design and consistent implementation of monitoring circuits or devices differs little from the problems faced by all approaches to enhance reliability used or considered today, such as redundancy. It can be approached by defining and mandating new guidelines across the whole industry.”

How additional monitoring requirements play out for the design team is still evolving, but a lot more specific functions are already being added into design IP, where the IP is targeted for a particular end application. That includes diagnostics which are internal to the IP. These diagnostics then interact with external diagnostics.

“Both types of these diagnostics allow the IP and the SoC to communicate with each other through software, and that is a way to communicate errors and error status,” said Nandra. “These diagnostic mechanisms are then recorded as part of a functional safety record as FMEDAs (failure mode effects and diagnostic analysis), which allow the error status of a system to be monitored. This has been part of the SoC, but is increasingly becoming part of the IP, as well.”

In time, this may evolve into a functional safety subsystem that includes safety items, which can be part of a processor, along with safety tools such as memory BiST technology that periodically test those items within the subsystem while making sure that the functional safety testing isn’t interfering with the operational load of the SoC. “It’s actually becoming very sophisticated,” Nandra said. “It’s essentially a functional safety subsystem, which performs periodic on-chip monitoring due to this mission-mode operation.”
