Adding monitors or traceability into an SoC is not new, but it is rapidly becoming a significant opportunity across the entire silicon lifecycle.
New regulations and variability of advanced process nodes are forcing chip designers to insert additional capabilities in silicon to help with comprehension, debug, analytics, safety, security, and design optimization.
The impact will be far-reaching as the industry works out which capabilities can be shared among these divergent tasks, how much silicon area to dedicate to them, and how much value can be extracted from the resulting data. New business models may emerge to deal with data ownership.
Is this the next frontier for silicon? “Yes,” says Steve Pateras, senior director for test marketing at Synopsys. “We spent many years trying to optimize designs, tape them out, and then forget about them. We have come to the realization this is no longer possible. You need a methodology, a platform, and an approach that provides a way of monitoring and managing electronics throughout their operational life.”
There are multiple drivers for doing this. “The challenge of ensuring everything is functioning as expected throughout a chip’s lifetime has become more difficult,” says Aileen Ryan, senior director for Tessent portfolio strategy at Mentor, a Siemens Business. “Structural monitoring is required to detect defects, degradations and aging effects. Additionally, there may be functional issues caused by bugs and even malicious attacks which must be detected and mitigated. For chips that are deployed in situations where safety, security and resilience are a priority, it is critical that detection of these issues takes the minimum amount of time.”
Technology pressures certainly are making it necessary. “As process technology advances, designers can no longer insert large margins to obtain yield, and ensuring that chips behave within specification is becoming a lot more difficult across different workloads,” says Norman Chang, chief technologist at Ansys. “In addition, with large AI chips and other large SoCs, the workload cannot be accurately predicted.”
Reliability issues are growing. “With technology transitions, such as finFETs, a lot of new types of defects got introduced in these devices during the manufacturing phase,” says Faisal Goriawalla, senior staff product marketing manager for Synopsys. “Some of these will manifest quickly, during production. Some of these will manifest later in the SoC lifecycle. The process rules are getting increasingly complicated. The memory bitcell transistor is very sensitive to the process. It is 2X more sensitive than the logic transistor used to synthesize the rest of the chip.”
Aging is becoming a big concern, too. “Things are getting less reliable, and the reliability requirements are going up,” says Synopsys’ Pateras. “It is not just reliability. It’s security, safety, it’s even performance, it’s power.”
This is not just for the benefit of the chip manufacturers. “A big application is preventive maintenance,” says Ansys’ Chang. “Consider unexpected system shutdowns, which can cost a lot of money. If you can catch that before the system breaks down and replace a particular chip or particular PC board, that can save a lot of money. Intentional shutdowns of the system can be done safely, whereas unexpected shutdowns can cause damage. Preventive maintenance can save a lot of money.”
In other industries it is likely to become mandatory. “A new regulation from the United Nations Economic and Social Council, WP.29/GRVA (The Working Party on Automated/Autonomous and Connected Vehicles) is due to go live in January 2021,” says Mentor’s Ryan. “This work relates closely to ISO 21434 and ISO 26262 standards which also address cybersecurity and safety in vehicles – and how these ultimately impact vehicle design and passenger safety. What this means to vehicle OEMs is that they will be automatically and ultimately responsible for the cybersecurity of a vehicle, not only at the point of sale or throughout its warranty period, but throughout its entire lifecycle.”
The basics
Synopsys’ Pateras provides us with a primer on the subject. “It’s all about managing silicon throughout the lifecycle. The approach has two components to it. First, we need visibility into the chip. We need to know what’s going on and so we embed various forms of instrumentation – monitors, sensors. Think of PVT sensors, think of structural monitors like looking at the margins, looking at the path delays, looking at clocking abnormalities, and even more macro monitoring. If you think of security, you need to be looking at activity on buses. What is being read and written to memories? What’s being accessed? And so, the first component of this approach is to get visibility into the chip for various components.”
That’s step one. “Once you have all these monitors and sensors in the chip, and you’ve placed them properly in an efficient way, you need to make use of that information. So the second part of this approach is analytics — taking this data at different points in time. We may want to take it during manufacturing. We may want to take it during production test. We may want to take it while we’re trying to bring up the system. And I certainly want to be taking it throughout the operational life of the system. And we want to perform analytics to figure out what’s going on and then react to it. Analytics can be on-chip, they can be off-chip but locally within a system, sort of edge analytics, or they can be centralized, where we send data to some central repository where we’re doing some sort of big data analytics.”
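For a concrete picture of that two-part approach, the sketch below assumes a simple reading format and three analytics tiers: a fast on-chip check, edge-level summarization, and a central fleet comparison. The data structure, thresholds, and function names are illustrative assumptions rather than any vendor's platform.

```python
# Illustrative sketch only: embedded monitors produce readings, and
# analytics run at three tiers (on-chip, edge, central).
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class MonitorReading:
    chip_id: str
    sensor: str          # e.g. "pvt_temp", "path_margin", "bus_activity"
    timestamp_s: float
    value: float

def on_chip_check(reading: MonitorReading, limit: float) -> bool:
    """Fast, local check: flag a reading that exceeds its limit."""
    return reading.value > limit

def edge_summarize(readings: List[MonitorReading]) -> dict:
    """Edge analytics: reduce raw samples to a compact summary record."""
    values = [r.value for r in readings]
    return {"chip_id": readings[0].chip_id, "sensor": readings[0].sensor,
            "count": len(values), "mean": mean(values), "max": max(values)}

def central_analytics(summaries: List[dict]) -> List[str]:
    """Central big-data stage: find outlier chips across a whole fleet."""
    fleet_mean = mean(s["mean"] for s in summaries)
    return [s["chip_id"] for s in summaries if s["mean"] > 1.1 * fleet_mean]
```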
This becomes particularly important for whatever are the most important functions for a device. “Mission-mode monitoring opens up new possibilities of predictive maintenance, adaptation to maintain performance with aging, protection from side channel attacks, and a wealth of data to be mined with artificial intelligence,” said Richard McPartland, Moortec technical marketing manager. “Central to this is accurate, distributed in-chip monitoring where non-intrusive sensors can be placed closer to critical circuits and provide highly granular mapping to enable analysis of thermal, power distribution and other issues across often very large die. Mission-mode data is valuable for many diverse parties involved throughout the chip’s lifecycle. Analytics enablement revolves around the concept of making such a data pool available, which then can be analyzed and insights extracted — not just by the immediate user, but potentially subject to safeguards also by the system developer, chip developer, test house, and others.”
Owning the data
One of the big questions that comes from this is who owns the data. “The whole question of data analytics involves how to run analytics, who runs it, and who can access it,” says Frank Schirrmeister, senior group director, solutions marketing at Cadence. “When I look at some of my personal devices, there are machines which sync data locally to an app, but they do not interface to others. That is because of the value of the data, because you really want to control who can access the data because if they have access to the data, they could build the same thing again. It will be an interesting challenge for the whole notion of data platforms.”
That could affect business models within the industry. “A lot of designers are putting in dynamic voltage and frequency scaling,” says Chang. “Based on the reading from the thermal sensors, they can slow down the frequency or lower the Vdd of operation in order not to exceed the thermal hotspot threshold. That’s the current way of doing things, but with post-silicon on-chip sensors, we can think about a new working scenario where we bring together the multi-physics simulation vendor, combined with a chip vendor and the customers using those chips in the system. Can post-silicon monitoring provide information that becomes beneficial to all the parties? It is a new problem to think about in terms of the business model.”
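Before getting to that business question, the current DVFS mechanism Chang describes can be pictured as a simple control loop: back off frequency (and then voltage) when a thermal sensor approaches a hotspot limit, and recover when it cools. This is only a sketch; the thresholds, step tables, and the read_thermal_sensor, set_frequency, and set_voltage hooks are hypothetical rather than a real driver interface.

```python
# Hypothetical thermal-aware DVFS loop; hooks and thresholds are assumptions.
import time

FREQ_STEPS_MHZ = [3000, 2600, 2200, 1800]
VDD_STEPS_V = [0.90, 0.85, 0.80, 0.75]
T_HOTSPOT_C = 95.0      # illustrative throttle threshold
T_RECOVER_C = 85.0      # hysteresis to avoid oscillating

def dvfs_loop(read_thermal_sensor, set_frequency, set_voltage):
    """Throttle frequency/voltage near the hotspot limit, recover when cool."""
    level = 0
    while True:
        t = read_thermal_sensor()                  # hottest sensor, in C
        if t > T_HOTSPOT_C and level < len(FREQ_STEPS_MHZ) - 1:
            level += 1                             # step down f and Vdd
        elif t < T_RECOVER_C and level > 0:
            level -= 1                             # restore performance
        set_frequency(FREQ_STEPS_MHZ[level])
        set_voltage(VDD_STEPS_V[level])
        time.sleep(0.01)                           # 10 ms control interval
```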
And it may not be an easy problem to solve. “If you’re a vertically integrated company, it’s no issue,” says Pateras. “You can use the data throughout the stack. If not, there needs to be a way of sharing the results of the analytics. If you’re a chip provider who put the sensors and monitors into the chips, and that data is being extracted from the chips, you need to be able to at least allow the results of some basic analytics to be provided to the system vendor. That data could be post processed analytics rather than raw data coming off the chip. This is why there’s value in doing analytics at different stages. If there’s a lot of raw data coming at you, you may want to be able to filter out what’s really important and then you can do system level analytics.”
Securing the data
Allowing this type of data off-chip could be a security vulnerability. “Security is needed for silicon, regardless,” says Pateras. “You want security for anti-hacking, for IP theft and so forth. Security is a critical aspect of chips going forward. So there need to be techniques used to ensure the security of this data, and you can think of using encryption keys to provide access to the chip. That same mechanism would be used for accessing this monitor data. There’s no way to access the monitored data unless the proper access is provided. And then access can be controlled by the chip provider, and by the system provider as needed.”
The data cuts both ways for security. “Could you use this data for side-channel attacks?” asks Chang. “Possibly. Usually you put a thermal sensor in sensitive locations, or a thermal hotspot. If the thermal hotspot location is coincident with the security-sensitive location, and with a different payload in an AES security chip, you may create different thermal readings from different payloads, and that information may enable an attacker to crack the key. That’s why the information coming out should be secured. But it also works the other way. If the attacker is exercising a specific workload, or the sequence of the payload, in order to extract the key, the on-chip thermal sensors or other sensors can have a machine learning agent to try to detect the pattern of the specific workload.”
Functional monitoring is becoming an important issue. “State-of-the-art solutions for functional issues in the field deploy embedded monitors on the chip, which constantly watch for anomalous or unexpected behavior,” says Ryan. “These monitors are instantiated in hardware and operate at clock speed, constantly collecting data about how the chip is operating. This data can then be correlated and analyzed to identify the root cause of a functional problem.”
Or a vulnerability. “In some forms of attack, it may create a spike in voltage, or it may create a temperature increase that is abnormal,” says Pateras. “Being able to monitor voltage and temperature on an ongoing basis, and then be able to understand the history of those parameters, allows you to discover or detect changes in the behavior of those parameters that are abnormal. Another example is monitoring bus activity. Again, you could look at what is a normal amount of transaction data, or the type of transaction data that is normal for that system. If you start seeing changes or anomalies, you can then determine that something’s going on and quickly turn off access to those components. It requires the monitors to be able to see what’s going on, and it also requires the data history and the ongoing analysis of that data.”
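A simple way to picture that history-based detection is a rolling statistical baseline: keep recent readings of a parameter (supply voltage, temperature, bus transactions per interval) and flag values that deviate sharply from it. The window size and sigma threshold below are illustrative assumptions; production schemes would likely be more sophisticated.

```python
# Rolling-baseline anomaly check; window and sigma threshold are assumptions.
from collections import deque
from statistics import mean, stdev

class AnomalyDetector:
    def __init__(self, window: int = 256, n_sigma: float = 4.0):
        self.history = deque(maxlen=window)   # recent readings of one parameter
        self.n_sigma = n_sigma

    def update(self, value: float) -> bool:
        """Return True if this reading looks abnormal versus recent history."""
        abnormal = False
        if len(self.history) >= 32:           # need some baseline first
            mu = mean(self.history)
            sigma = stdev(self.history) or 1e-9
            abnormal = abs(value - mu) > self.n_sigma * sigma
        self.history.append(value)
        return abnormal

# Toy demonstration: a steady signal (e.g. bus transactions per interval)
# followed by one sudden spike, which gets flagged.
detector = AnomalyDetector()
samples = [100.0 + (i % 5) for i in range(200)] + [400.0]
flags = [detector.update(v) for v in samples]
print("spike flagged:", flags[-1])            # True
```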
And finally, security needs to be updated regularly throughout the lifetime of a product and best practices need to be maintained. “Nothing is secure unless you practice secure principles,” said John Hallman, product manager for trust and security at OneSpin Solutions. “Usually, an insider is the weakest link. So after you’ve done all the due diligence, this needs to be part of the normal behavior.”
That also includes regular monitoring for data leakage, Hallman said. “You may have several organizations that know a piece of the design, but nobody really knows all the pieces. If you look at complex designs, very few people ever know the larger scheme for a chip. Those controls need to be kept pretty tight, and you need good checks and balances.”
Accessing the data
There are design decisions that have to be made about how to get data off-chip. Is it intrusive, utilizing existing communications channels which may also affect operation or performance, or does it have dedicated resources to be able to obtain and transmit the data in an unobtrusive manner? How do you balance the on-chip computation and storage requirements against bandwidth?
“We see the same thing during the development process when running in the hardware engines,” says Cadence’s Schirrmeister. “We often refer to them as accelerated verification IPs, which are optimized to collect data. The challenge is very similar to large-scale edge processing. There is a network from your sensor to the data center. Where do you do the compute? You can push out everything, which is very intrusive and will impact speed, versus computing things internally. Then you don’t have all the data available anymore. You only have the computed derived data. But then you have a much better chance to collect that and stream it out.”
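The arithmetic behind that tradeoff is easy to sketch. With assumed (illustrative) sensor counts and sample rates, streaming every raw sample differs from streaming one on-chip-computed summary per sensor per second by roughly two orders of magnitude:

```python
# Back-of-the-envelope comparison; sensor counts and rates are assumptions.
SENSORS = 1000             # on-chip monitors
SAMPLE_RATE_HZ = 1000      # samples per sensor per second
BYTES_PER_SAMPLE = 2

raw_bandwidth = SENSORS * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE     # bytes/s
print(f"raw stream:     {raw_bandwidth / 1e6:.1f} MB/s")

# Computing min/mean/max on chip and emitting one summary per sensor per
# second cuts link bandwidth sharply, at the cost of losing the raw waveform.
SUMMARY_BYTES = 16
summary_bandwidth = SENSORS * SUMMARY_BYTES                     # bytes/s
print(f"derived stream: {summary_bandwidth / 1e3:.1f} KB/s")
print(f"reduction:      {raw_bandwidth / summary_bandwidth:.0f}x")
```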
Test standards are paving the way toward collecting data within complex packages. “IEEE 1838 is a new testability standard which was developed for 2.5D and 3D-IC testing,” says Synopsys’ Goriawalla. “It enables you to think about inter-die test and intra-die test management. You have a situation where you no longer have access to all of the middle die that are stacked in a 3D structure, and so you need to have a framework and infrastructure which is able to test through the bottom die to reach the other dies that are stacked on top of it. In addition, there is also die-to-die connectivity, or the interconnect, which needs to be tested.”
Some systems may use existing interfaces. “There may be the need for some standardization of monitor data,” says Pateras. “Then we can use standard bus accesses, going through standard functional high-speed interfaces to access that data. If any chip or chiplet or die has a PCI Express, or USB interface, we will piggyback on that to gain access to the monitor data. We have IP inside the chip that provides access to the monitor data on-chip and sends it through these existing functional interfaces. That means we do not add additional infrastructure to gain access to the monitor data.”
Part of the decision may be based on the speed with which you need to detect and react. “For safety-critical use-cases, this kind of on-chip embedded analytics system will give the fastest possible detection and response times,” says Ryan. “In other use-cases, where time is not so critical, the data collected by on-chip monitors can be loaded into an off-chip analytics system, where it can potentially be correlated with an even richer set of data to determine root cause and next steps.”
Understanding the data
Many years ago, GE started to collect huge amounts of data from its aircraft engines. At the time, the company did not know how it would use all that data, but today it does. “Part of the problem at the beginning is you often don’t know what to measure,” says Schirrmeister. “That’s why you want to have the option to push out all the data, but then the dire consequence of this is that, just by virtue of interface bandwidth, you will have to slow things down to do that. The tendency is to measure everything and store everything, and that’s only possible for a certain amount of time.”
Machine learning can be helpful. “We are applying machine learning to these analytics,” says Pateras. “Some are algorithmic, but we’re also looking to apply standardized neural net-based approaches to looking at trends. This is an ongoing, evolving space where we’re going to continually try to improve our understanding of what’s going on in those chips and be able to better predict what’s going to happen.”
Some of the data may need to be tied back to more extensive models. “Post-silicon monitors are a very good idea that provides you with a better reading of real-time chip operation, but they also create a problem,” says Chang. “The problem is that it’s not model-based. It is purely based on the reading from the on-chip sensors. And so they only provide a one-dimensional view. But the workload is changing, the threshold is changing. You need a model-based digital twin to complement the on-chip sensor reading. Consider a sensor that can tell you resistance is getting larger, but if you don’t have a physics-based model, you cannot predict how much time before it will fail. You need a reliability model to complement on-chip monitoring sensors.”
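One way to picture how a model complements raw readings is the sketch below: fit a trend to a monitored interconnect resistance (the kind of trend analysis Pateras mentions above) and extrapolate it to a failure limit. The data, the 10% resistance-rise criterion, and the linear extrapolation are all illustrative assumptions standing in for a real physics-based reliability model.

```python
# Illustrative trend extrapolation against an assumed failure limit.
# Requires Python 3.10+ for statistics.linear_regression.
from statistics import linear_regression

def hours_to_failure(hours, resistance_ohm, r_nominal_ohm):
    """Extrapolate an interconnect resistance trend to a failure threshold."""
    slope, intercept = linear_regression(hours, resistance_ohm)
    r_fail = 1.10 * r_nominal_ohm        # assumed limit: +10% resistance drift
    if slope <= 0:
        return float("inf")              # no measurable degradation yet
    t_fail = (r_fail - intercept) / slope
    return max(t_fail - hours[-1], 0.0)  # remaining useful life, in hours

# Made-up sensor history: resistance creeping up over operating hours.
hours = [0, 1000, 2000, 3000, 4000]
r_ohm = [1.000, 1.004, 1.009, 1.015, 1.022]
print(f"estimated hours to failure: {hours_to_failure(hours, r_ohm, 1.0):.0f}")
```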
How much overhead?
Defining the area or performance penalty for the insertion of this type of monitoring is difficult. Not only will systems have different amounts of monitoring, but much of that circuitry may already exist for other functionality. “There are early indications that customers are willing to give up a small amount of area,” says Pateras. “Area is always important, but if this becomes part of the functional requirements of the chip, then it becomes more accepted and less of an afterthought. These monitors are useful for many things. They are useful for bring-up, and they are useful for performance improvements. PVT sensors are used for doing things like dynamic voltage and frequency scaling. This is a functional requirement. You can’t make this chip work without it. So these monitors and sensors have already been placed into chips, at least in these lower nodes, just to make a chip work. Taking a little more area is not that much of a stretch in most cases.”
Sometimes it may be coupled to safety requirements. “There is likely to be a mechanism or infrastructure in the chip which is performing some kind of testing,” says Goriawalla. “This is not just the traditional manufacturing test, but it could be in-system. It could be in the form of power-on self test, it could be a periodic test during mission mode functional operation. This safety infrastructure has to be in the chip so that when it finds an error that it cannot recover from, that cannot be corrected, it needs to have a mechanism to aggregate the errors and report them to a functional safety manager for then reporting to a higher level system software or to the user.”
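A bare-bones sketch of that aggregate-and-escalate flow might look like the following, with the severity levels, field names, and notification hook all illustrative assumptions rather than any standard's terminology.

```python
# Illustrative error aggregation and escalation; names are assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List

class Severity(Enum):
    CORRECTED = 1         # e.g. an ECC single-bit fix; log only
    UNCORRECTABLE = 2     # must be escalated

@dataclass
class TestError:
    block: str            # e.g. "mem_ctrl", "cpu_cluster0"
    severity: Severity
    detail: str = ""

@dataclass
class SafetyManager:
    notify_system_sw: Callable[[List[TestError]], None]
    log: List[TestError] = field(default_factory=list)

    def report(self, errors: List[TestError]) -> None:
        """Aggregate self-test results and escalate what cannot be corrected."""
        self.log.extend(errors)
        fatal = [e for e in errors if e.severity is Severity.UNCORRECTABLE]
        if fatal:
            self.notify_system_sw(fatal)    # e.g. raise an interrupt or alert

# Usage with a stand-in notification hook.
mgr = SafetyManager(notify_system_sw=lambda errs: print("escalate:", errs))
mgr.report([TestError("mem_ctrl", Severity.CORRECTED, "ECC single-bit"),
            TestError("cpu_cluster0", Severity.UNCORRECTABLE, "LBIST fail")])
```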
Tying it back to the development flow
There is a lot to be gained from this data in the development flow. “Chip design is based on simulations and on basic models, but there’s very little real data being used to optimize this design,” says Pateras. “We’re now looking at using these analytics. Think about path delays and margin analysis, and using it to better understand the distribution of margins and frequencies and then driving that back to the design implementation tools, to our models, to calibrate those models based on the actual silicon data. That will allow us to tighten our margins and tighten the timing on our designs and make them more optimized. The link from silicon analytics back to design is something we’re very excited about.”
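As a toy illustration of that calibration loop, measured path delays from in-chip monitors could be compared with the delays the signoff models predicted to derive an average correction. The numbers and the simple ratio model below are made up purely for illustration.

```python
# Toy comparison of measured vs. predicted path delays; data is made up.
from statistics import mean

predicted_ns = [0.82, 0.95, 1.10, 1.25, 1.40]   # from signoff timing models
measured_ns  = [0.78, 0.90, 1.02, 1.17, 1.30]   # from in-chip path monitors

ratios = [m / p for m, p in zip(measured_ns, predicted_ns)]
correction = mean(ratios)
print(f"models pessimistic by {(1 - correction) * 100:.1f}% on average")
# A designer might use such a factor to recalibrate margins, rather than
# applying the worst-case guard band uniformly.
```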
Better understanding leads to better products. “If we can do real-time monitoring for the lifetime of the product, you will really see the aging phenomenon,” says Chang. “That will provide valuable feedback to the design stage, enabling better chips for the next generation. We can see thermal cycling running under a realistic scenario over different workloads. As a simulation vendor, how do we provide the multi-physics model, or reduced-order models? That can help to do a good correlation with real-time sensor data and provide valuable information for the whole ecosystem.”
And with systems houses becoming responsible for the product over the lifetime, there may be constant updates. “With security becoming a critical aspect of IC development, post-silicon analysis is going to change,” says Sergio Marchese, technical marketing manager for OneSpin. “ISO 21434, the automotive cybersecurity standard, has specific demands on how OEMs and their supply chain handle incident response, for example. Assessing the impact of newly discovered hardware and software vulnerabilities on systems, and identifying and verifying solutions, will become almost a routine task. The very nature of security verification, where unintended use case scenarios take center stage, means that static analysis techniques will have a significant role in implementing systematic, efficient incident response processes.”
Conclusion
On-chip monitoring was initially required for a handful of narrow use cases, but as technology has progressed and the value of analytics has become clear, the ways in which this data can be used seem almost endless. Everyone in the chain, from EDA vendors to chip developers to systems companies to the end users who deploy those systems, sees value in the data that could be provided. Many of those capabilities are becoming necessary to fulfill end-product requirements, or to deal with safety and security demands. Many issues still need to be resolved, such as who owns the data and who can gain access to it, and long term that may lead to some interesting new business models.