System State Challenges Widen

The state of a system is fundamental for performing many analysis and debug tasks, but understanding it and its context is a growing challenge.


Knowing the state of a system is essential for many analysis and debug tasks, but it’s becoming more difficult in heterogeneous systems that are crammed with an increasing array of features.

There is a limit as to how many things engineers can keep track of, and the complexity of today’s systems extends far beyond that. Hierarchy and abstraction are used to help focus on the important aspects, but even those have limitations. At the same time, functional state is no longer isolated from physical and environmental concerns. An increasing number of factors, such as the thermal state of the system, need to be considered when trying to explain how a system is behaving.

In theory, the functional state of a system is defined by the values held in the memory elements of a design, plus any changes to the system state that are in progress. In reality, that information is becoming almost useless. There is a register with the address for the current instruction being executed by a particular processor, but that does little to inform you about software state. Hundreds or thousands of other pieces of memory contribute to providing that information. While a software development environment and debugger can extract that information, it must be done for every processor in the system, as well as for all the other configuration information or state information that pertains to the non-processor elements.

Each time information is abstracted, it provides a little more context about system state. However, there are no system tools to provide the necessary overview, one that would allow you to zoom in and out of any area of interest. Such tools would need to work within simulation, test, or real silicon environments to fully understand why a particular chip behaves the way it does.

“Context is key when it comes to figuring out what’s going on inside of a system,” says Adam Arnesen, chief systems engineer for NI‘s Test & Measurement Group within Emerson. “That context is often hard to quantify. How do we track the context of the system over time so that we have visibility into how to reproduce that context?”

With an increasing number of physical effects impacting functionality, this problem no longer stops when silicon has been produced. “One buzzword these days is silent data errors, or silent data corruption,” says Geir Eide, director of product management for Tessent Embedded Analytics at Siemens EDA. “A silent data error is like a UFO bug in the sense that something weird happens and you don’t know what it is. And when you start looking for it, it goes away. The quest is to try to understand what’s causing these problems. Are they caused by test escapes? Are they caused by degradation? Or is it something else?”

All problems trigger a debug session. “Knowing the current state of a system almost never tells you everything because you always failed due to something that happened in the past,” says NI’s Arnesen. “A bug exists because someone didn’t think about some condition they should have dealt with. And in most cases, it is outside of the normal space where you’re going to think, at least for the hard ones. There will always be dumb bugs, but the really tough ones are almost always something outside of our immediate context.”

Beyond function
An increasing array of operating conditions can affect functional operation. “There are conditions where you are enforcing a set of conditions on a system to put it into a particular state,” says Arnesen. “I’ve set the configuration registers to the right values. I’ve pushed the right configuration onto the bus. I set the temperature in the temperature chamber to the right level. I set the voltages to the defined value, and everything’s set up. Those are forcing conditions, and they are traceable conditions. And then there’s the whole context of the rest of the known and unknown universe. Does the phase of the moon matter?”

Consider the operating environment for 5G or 6G wireless systems, for example. Understanding how such a system is operating often requires building a model of it.

“Sometimes you find that a static model is insufficient,” says Eva Ribes-Vilanova, system design and simulation product manager for Keysight. “You need a dynamic model that changes as you evolve through the life of the product. At the beginning you use system simulation models for the different parts in the system just to explore the possibilities of the designs. Eventually the designer selects the best model. This model is very abstract, very ideal. Then you describe or input more fidelity into the system. Moving through the development flow, we go from system design and simulation to prototyping. We continue to improve our initial design with measurements from the physical prototype. We constantly feed information back from reality, or from implementations of parts of the model, into the original model.”

Thermal can have a significant effect on a system’s behavior. “Once we have a good enough picture of how power distribution is spread across the chip, we can convert that information into the thermal effect of that power consumption,” says Suhail Saif, principal product manager for power analysis and reduction products at Ansys. “High power consumption in a small area would lead to a thermal hotspot, affecting not just the thermal sensors in that area, but also the performance of the transistors in and around that area. As the circuit is functioning, it is consuming power. As a side effect, it is dissipating heat. But that heat has a side effect back on the performance of the chip, as well. You have to model that effect as well while you’re measuring performance.”
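The power-heat feedback Saif describes can be sketched as a simple fixed-point iteration. This is a toy model, not a real thermal simulator; the thermal resistance and leakage-growth coefficients below are purely illustrative assumptions:

```python
# Toy electro-thermal feedback loop: power heats the die, heat raises
# leakage power, and the higher power heats the die further. Iterate
# until power and temperature are mutually consistent.
# All coefficients are illustrative, not from any real process model.

def settle_operating_point(p_dynamic_w=2.0, t_ambient_c=25.0,
                           theta_c_per_w=8.0,      # thermal resistance, degC/W
                           leak_w_at_25c=0.3,      # leakage at 25 degC
                           leak_growth_per_c=0.02, # ~2%/degC leakage growth
                           tol=1e-6, max_iter=100):
    """Iterate until power and temperature stop changing."""
    temp = t_ambient_c
    for _ in range(max_iter):
        leakage = leak_w_at_25c * (1 + leak_growth_per_c * (temp - 25.0))
        power = p_dynamic_w + leakage
        new_temp = t_ambient_c + theta_c_per_w * power
        if abs(new_temp - temp) < tol:
            return power, new_temp
        temp = new_temp
    return power, temp

power, temp = settle_operating_point()
print(f"settled at {power:.3f} W and {temp:.1f} degC")
```

The loop converges because the feedback gain (leakage growth times thermal resistance) is well below one here; real designs have to verify that the same is true of their operating points.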

Variation can have an impact on all of these factors, even in the best-designed systems. “Variation is an exponential,” says Mo Faisal, CEO of Movellus. “Variation goes up exponentially with lower voltage. It goes into OCV (on-chip variation), which becomes timing margin, which becomes Fmax, which goes back to Vmin. It’s a vicious cycle, and you have to keep going back and forth until you figure it out. Some margin is based on the technology and the variation, and other margin depends on the functions and different modes. So if you have P states, for example, like in a multi-processor, you have a mode where half the cores are on. Then you want to go to 100% utilization temporarily, so you’re turning those processors on, but you’re putting a lot of pressure on the supply network, causing a droop. That can impact state, even if in a transient manner.”
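The droop scenario Faisal describes can be reduced to a back-of-the-envelope check. The supply impedance and per-core current numbers below are made up for illustration; the point is only that a step in load current eats directly into the Vmin margin:

```python
# Toy supply-droop check: abruptly enabling more cores steps up current
# demand, and the power delivery network's impedance turns that step
# into a voltage droop. Numbers are illustrative only.

def supply_after_step(v_nominal=0.80, r_pdn_ohm=0.003,
                      i_per_core_a=5.0,
                      cores_on_before=4, cores_on_after=8):
    """Supply voltage right after a current step (resistive droop only)."""
    di = (cores_on_after - cores_on_before) * i_per_core_a
    droop = r_pdn_ohm * di
    return v_nominal - droop

v_min = 0.70
v = supply_after_step()
print(f"supply dips to {v:.3f} V (Vmin {v_min} V, margin {v - v_min:+.3f} V)")
```

A real transient also has inductive and capacitive components, which is why designers stagger core wake-up rather than switching everything at once.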

The big challenge here is that many factors become intertwined. “The challenge is to test software with the hardware,” says Keysight’s Ribes-Vilanova. “That means for software we need to have information on the performance of that software at different temperatures. In addition to the traditional variables that we use for electronic design, we also need to consider other physical variables that can affect the performance of our designs.”

That requires linking development stages together. “From design time all the way through to actual silicon, that data needs to be kept consistent,” says Arnesen. “It’s not enough to just say we will slap a bunch of temperature sensors everywhere in the chip. Those have to be considered during every phase of the process, so you get this notion of how is it going to fail, and the modes in which you observed that happening across the flow.”

Some of that information can be hard to keep correlated. “People started discovering that on-chip instrumentation can give you more flexibility,” says Siemens’ Eide. “They get thrown into these really complex systems, which usually have asynchronous sub-systems working together. All this happens in a very complex environment where there’s lots of other things going on, and problems that are hard to replicate. You want to understand what else is going on. What is the environment doing when this problem occurs?”

A system is needed to handle the complexity. “It’s easy to get bogged down in the minutiae of snooping a bus and looking at a huge waveform of data and trying to figure out where something went wrong and manually correlating it,” says Arnesen. “But if you can get a data set that’s well-structured, and has the notion of the state space, the conditions under which it was taken, and the right data collected from your system, then you can bring to bear all sorts of interesting analytics. You can even deploy artificial intelligence/machine learning to help you figure out correlations that maybe you don’t have time to figure out on your own.”

Arnesen explained the concept of well-structured data. “You may be running a thermal simulation of a system, and it ends up in some big file. Sometime later, first silicon comes back, and the validation engineer needs to know something about the thermal characteristics of the chip. They may need to know more than it is specified to work at 25°C. If they see a problem when they’re bringing it up under high temperature, what are the potential problem areas? That person can save a lot of time if they can look back to the simulation information and see that simulation had some issues in particular areas. Perhaps under a high thermal load on this part, there were timing failures on a particular bus. If you are now seeing that same thing in the lab, it can save a lot of time. The problem in a lot of the situations today is the domains where that data comes from don’t know how to communicate with each other. There’s information produced in one domain that can’t move across to the next domain. Everyone ends up re-learning the problem over and over again.”
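A minimal sketch of what such a shared record might look like. The schema below is hypothetical, field names invented for illustration; the idea is simply that simulation and lab tools write results, plus the forcing conditions under which they were taken, into one queryable form:

```python
# Sketch of a cross-domain result record, assuming a hypothetical shared
# schema that both simulation and lab bring-up tools write to.
from dataclasses import dataclass

@dataclass
class Observation:
    domain: str        # "simulation", "validation", "production"
    block: str         # design unit the result pertains to
    temp_c: float      # forcing condition: temperature
    voltage_v: float   # forcing condition: supply voltage
    result: str        # e.g. "pass", "timing_failure"

def prior_findings(records, block, temp_c, temp_tol=10.0):
    """Return earlier results for the same block near this temperature."""
    return [r for r in records
            if r.block == block and abs(r.temp_c - temp_c) <= temp_tol]

records = [
    Observation("simulation", "bus_a", 105.0, 0.72, "timing_failure"),
    Observation("simulation", "bus_a", 25.0, 0.75, "pass"),
]

# A validation engineer seeing a hot-temperature failure in the lab can
# check whether simulation already flagged the same block under heat.
print(prior_findings(records, "bus_a", 110.0))
```

The query is trivial; the hard part, as the article notes, is getting every domain to record its conditions in a form the others can read.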

Built-in monitors
Debug continues into real silicon. “It is an evolution of the discipline of design for debug,” says Eide. “You modify and enhance the silicon to add more visibility or more controllability. People have been putting little embedded logic analyzers and stuff like that on chips for a very long time. But just adding more visibility and piling on more data doesn’t really solve the problem, because a waveform doesn’t necessarily help you if you’re debugging embedded software.”

That data needs to be turned into information. “Products with built-in sensors are in the field and are constantly recording data,” says Ribes-Vilanova. “That can be sent back to the cloud for designers to analyze. That analysis is computationally intensive, making this a perfect example for AI to solve. It is a very complex problem.”

There is a balance between the amount of data collected from a chip and the quality of the data. “You need to be smarter in terms of what you capture in the trace,” says Eide. “For instance, look at the efficient trace standard for RISC-V. The idea is to only capture the jumps, and there are things you can optimize to basically create a trace that is as efficient as possible. Whether it’s a processor trace or other data you want to trace, it helps if you can define the conditions, or narrow down what you need to capture, rather than looking at everything that happens on a bus. The other thing is you’re running into bandwidth issues and memory issues. Rather than using JTAG interfaces, we have started using USB3 and PCIe to get data out. If you can use a fast interface, you can afford to send more out and have more visibility into the data.”
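The branch-only tracing idea Eide mentions can be shown with a toy compressor that records only discontinuities in the program counter and replays the sequential stretches later. This illustrates the principle, not the actual RISC-V E-Trace packet encoding:

```python
# Toy branch trace: emit only the start PC, each taken branch (from, to),
# and the end PC. Sequential runs are reconstructed offline.
INSN_SIZE = 4  # assume fixed-width instructions for simplicity

def compress(pcs):
    """Record only the discontinuities in a program-counter stream."""
    branches = [(prev, cur) for prev, cur in zip(pcs, pcs[1:])
                if cur != prev + INSN_SIZE]
    return pcs[0], branches, pcs[-1]

def decompress(start, branches, last):
    """Replay the sequential runs between the recorded branches."""
    pcs, pc = [], start
    for frm, to in branches:
        while pc != frm:          # sequential execution up to the branch
            pcs.append(pc)
            pc += INSN_SIZE
        pcs.append(frm)
        pc = to                   # take the branch
    while pc != last:             # final sequential run
        pcs.append(pc)
        pc += INSN_SIZE
    pcs.append(last)
    return pcs

trace = [0, 4, 8, 100, 104, 8, 12]    # two jumps: 8->100 and 104->8
start, branches, last = compress(trace)
print(branches)   # only these discontinuities need to leave the chip
```

Seven program-counter values shrink to two branch records plus the endpoints, which is the kind of reduction that makes trace bandwidth manageable.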

AI can then work on the data. “We did an interesting test for waveform anomaly detection,” says Arnesen. “We took hundreds of billions of waveforms — time series information that was captured from buses and analog signals from a bunch of chips. We told it to identify events, the characteristics of these waveforms, and when they happened. We built a profile based on that information so we could ask what led to a particular issue with a register. Maybe it was a digital error somewhere. Maybe there was a code error somewhere. Maybe it was a transistor that was poorly printed and out of spec. You need a very large amount of information to figure that out.”
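The system Arnesen describes used large-scale machine learning, but the core idea of flagging samples that deviate from a local baseline can be shown with a simple rolling z-score detector:

```python
# Minimal statistical flavor of waveform anomaly detection: flag samples
# that deviate strongly from the preceding window's statistics. Real
# deployments use far richer models; this only shows the principle.
import statistics

def find_anomalies(samples, window=20, threshold=4.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` samples."""
    hits = []
    for i in range(window, len(samples)):
        ref = samples[i - window:i]
        mu = statistics.fmean(ref)
        sigma = statistics.stdev(ref) or 1e-12  # guard against flat windows
        if abs(samples[i] - mu) / sigma > threshold:
            hits.append(i)
    return hits

# A clean repeating ramp with one glitch injected at index 50.
wave = [0.1 * (i % 10) for i in range(100)]
wave[50] = 5.0
print(find_anomalies(wave))
```

Note that once the glitch enters the reference window it inflates sigma, which is one reason production systems use more robust baselines than a plain rolling mean.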

AI can help here. “AI will not solve the problem for you, but it can be a valuable tool. You have to feed it the right set of data, and you have to ask the right questions. You have to learn to ask the right questions at the right level of breadth,” he says. “This can help you narrow down the state space that the human has to think about. It will identify things that may be anomalies, and you don’t have to scroll through endless piles of data trying to find something that’s out of your comfort zone.”

Debug strategies
Debug strategies are changing, and this requires domains to come together. “Traditionally, any type of instrument, whether it’s a test instrument or design for debug instrument, is typically triggered from an external debugger,” says Eide. “Or, you’re sitting with a debugger and doing things one step at a time. If you want to be able to leverage the visibility that you get from on-chip monitors, you need to be able to control these from the system itself. The embedded software needs to be able to communicate with those monitors, configuring trigger mechanisms and orchestrating data. The challenge is that the system also needs to be able to operate those monitors rather than just sitting in front of a debugger and talking to it.”

In addition, some test is becoming virtual. “A lot of semi companies are interested in moving more of the model of the entire test system to the left, into design and simulation,” says Arnesen. “They can write code to test the whole flow in simulation — all the way through from the test program that will eventually run against real hardware — through a set of models of the test system and the function generators, and all the way down to a model of the DUT and then back up. So that you can try and correlate that model of the test system, as well, with what you see in the real world.”

But data does need to be presented in the right context. “We see a need for having more on-chip monitoring to capture data that is meaningful in software terms,” says Eide. “Rather than just looking at waveforms, you want to be able to monitor bus transactions and communicate that in the context of other transactions involving a certain area of memory. We have to evolve existing design for debug concepts so you can monitor whatever type of problem you’re debugging, and have the flexibility to be able to set up events that capture data that is relevant.”

As a result, some long-trusted techniques may need to change. “I’m not convinced that single stepping is always the right approach,” says Arnesen. “It’s our bread and butter in both software and hardware debug. But if you can capture enough data, and then do interpolation across your data, you can think about it as single stepping, but you can let your asynchronous processors run and then post-correlate the data in time. That can include known gaps between the time fidelity of each model. It also gives you value as you move into the real world, because that same notion can then apply. If you have a structure that lets you capture that data at that level, and you can annotate it properly and align it, then you can ‘single step’ in real hardware, because you’re just walking through the same types of information captured.”
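The post-correlation approach Arnesen describes, letting asynchronous sources run freely and aligning their records afterward, can be sketched as merging timestamped event streams into one timeline that can then be stepped through offline:

```python
# Sketch of post-correlation instead of live single-stepping: each
# asynchronous source records timestamped events independently, and the
# streams are merged afterward into one globally ordered timeline.
import heapq

def merged_timeline(*streams):
    """Each stream is a list of (timestamp, source, event) tuples,
    already sorted by timestamp; return one globally ordered list."""
    return list(heapq.merge(*streams, key=lambda e: e[0]))

cpu0 = [(0.0, "cpu0", "boot"), (2.5, "cpu0", "irq")]
cpu1 = [(1.0, "cpu1", "boot"), (2.0, "cpu1", "dma_done")]

# "Single stepping" then becomes walking this merged log one entry at
# a time, with no need to halt either processor while it runs.
for ts, src, ev in merged_timeline(cpu0, cpu1):
    print(f"{ts:5.1f}  {src:5s}  {ev}")
```

In hardware the hard part is the timestamps themselves: the sources have independent clocks, so the merge is only as good as the cross-domain time alignment, which is the annotation step Arnesen refers to.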

Building knowledge
Many problems will hide in the corners. “Silent data errors are an example where the answer will most likely include a combination of software, through functional monitors, to executing tests in-system and environmental monitors,” says Eide. “All of these different types of instruments need to be able to work together to address that type of problem.”

Today, large portions of the development flow are silo-based, and this creates a significant amount of waste. Knowledge is gained in one area and stays there. “A consistent view of data is required across the flow,” says Arnesen. “From simulation, into validation, into the production line of your chip, they all ought to be done in a very standard, similar, or consistent way so that you can have these insights across the flow and save yourself some time.”

An increasing number of domains are being brought together today, from system design through detailed design, implementation, test, and post-silicon. They are built on different standards, different languages, and different abstractions, which is creating a growing number of barriers that need to be broken down. Information needs to flow between them, and operations like debug need to span multiple disciplines. The most fundamental tenet of debug is knowing the state of the system and how it got there, and that has to be within the context required to solve the problem.

Further Reading
Why It’s So Difficult To Ensure System Safety Over Time
Gaps in tools and uncertainties in methodologies leave the door open for unexpected failures.
