Taming Non-Predictable Systems

The non-predictable nature of most systems is not inherently bad, so long as it is understood and bounded — but that’s becoming a bigger challenge.


How predictable are semiconductor systems? The industry aims to create predictable systems, and yet when a carrot is dangled, offering the possibility of something faster, cheaper, or otherwise better, decision makers invariably decide that some degree of uncertainty is warranted. Understanding that uncertainty is the first step toward making informed decisions, but new tooling is required to assess the impact of some of the accuracy tradeoffs being made today.

It is important to start with some well-defined terms. Consider figure 1, which separates the notions of accuracy and precision. What we often strive for is accuracy with a bounded amount of imprecision. Inaccuracy is often the result of a systemic problem that may be fixable. Determinism, in this context, means that if the same inputs are provided to a system, we can expect the same output every time.

Fig 1: Accuracy and Precision. Source: Semiconductor Engineering

An example of inaccurate but precise behavior in semiconductors comes from manufacturing issues, where all devices may share a common error that makes timing different than expected. Imprecise and inaccurate results may come from the use of AI systems that are not fully trained. Accurate but imprecise results may come from software that causes unexpected scenarios to be run on the hardware. They may have been predictable, but they also were unexpected.
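The distinction can be made concrete with a few numbers. The following is a minimal sketch, not a measurement methodology. It treats accuracy as the mean error against a target (bias) and precision as the scatter around the mean (spread). The path delays and the 1.00 ns target are invented for illustration.

```python
import statistics

def accuracy_and_precision(measurements, target):
    """Return (bias, spread) for a list of measurements.

    bias   = mean minus target (small |bias| means accurate)
    spread = population standard deviation (small spread means precise)
    """
    mean = statistics.fmean(measurements)
    bias = mean - target
    spread = statistics.pstdev(measurements)
    return bias, spread

# Hypothetical path delays in ns, measured against a 1.00 ns target.
target_ns = 1.00
precise_but_inaccurate = [1.20, 1.21, 1.19, 1.20]  # shared systematic offset
accurate_but_imprecise = [0.80, 1.20, 0.90, 1.10]  # centered, but scattered

for name, data in [("precise/inaccurate", precise_but_inaccurate),
                   ("accurate/imprecise", accurate_but_imprecise)]:
    bias, spread = accuracy_and_precision(data, target_ns)
    print(f"{name}: bias={bias:+.3f} ns, spread={spread:.3f} ns")
```

A large bias with a small spread points to a systemic problem that may be fixable, while a large spread has to be bounded rather than corrected.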

If you have ever had to track down the root cause of a non-predictable problem, you know how frustrating it can be. Not only is it difficult to reproduce, but when the problem has been located, it is equally difficult to devise a test to show that it is consistently fixed. Everything that can be done to reduce the amount of non-determinism in systems should be done, and yet we also invite it. Sometimes the gains are worth it or the remedies are not practical.

There are two primary areas in which system design deals with uncertainty. Timing has been an area of imprecision since the beginning. More recently, uncertainty has become an important consideration for power, as well. AI systems now bring in a new area, in which the uncertainty lies in the accuracy of the results themselves.

Timing uncertainty
Timing uncertainty can be caused by several issues. “Building processor-based systems that execute predictably (or deterministically) has been an issue of intense focus in the industry for some considerable time,” says Gajinder Panesar, fellow at Siemens EDA. “There are two aspects to the problem. The first is that hardware architecture and caches are the kiss of death for determinism. The other is the programming paradigm. The two aspects are intertwined. For example, in the PicoChip architecture, each core in a multicore array could be coded independently, and communication between processes was handled over an any-to-any communications mesh. Importantly, the communications flows were defined at compile-time, and not run-time, so that the overall system would run predictably.”

Uncertainty is more important in some industries. “Determinism, and the verification of determinism, is critical in applications like automotive,” says Frank Schirrmeister, senior group director for solution marketing at Cadence. “With things like cache, some aspects become unpredictable in unknown ways. This is especially so when multiple unconnected tasks are operating in parallel. There are cases with timing-critical applications where things need to happen within a certain time, no matter what.”

Such systems are normally called real-time systems. “Real-time is a ‘stretchable’ term,” says Michael Frank, fellow and system architect at Arteris IP. “In general, it implies that a certain action is completed within a bounded time, with 100% probability, as opposed to engineering schedules. For most real-time systems, the definition is not that strict. Some systems are fine if the average is within a certain window that meets the requirement, such as for video decoding. Other cases may look to see if the deadline will only be missed with a certain low random probability. Those systems may replace a missing result by a prediction/interpolation, such as dropped audio samples.”
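The three interpretations of real-time described above lead to different pass/fail answers for the same set of measurements. The sketch below assumes a list of observed completion times. The function names, the frame times, and the 16.7 ms deadline are hypothetical.

```python
def meets_hard_deadline(latencies, deadline):
    """Strict real-time: every completion must be within the deadline."""
    return all(t <= deadline for t in latencies)

def meets_average(latencies, deadline):
    """Relaxed: only the average has to meet the deadline."""
    return sum(latencies) / len(latencies) <= deadline

def meets_miss_rate(latencies, deadline, max_miss_rate):
    """Relaxed: the fraction of missed deadlines must stay below a limit."""
    misses = sum(1 for t in latencies if t > deadline)
    return misses / len(latencies) <= max_miss_rate

# Hypothetical frame decode times in ms against a 16.7 ms frame budget.
frame_ms = [14.0, 15.5, 16.0, 21.0, 15.0, 14.5, 16.2, 15.8, 15.1, 14.9]
deadline_ms = 16.7

print("hard deadline:", meets_hard_deadline(frame_ms, deadline_ms))
print("average:      ", meets_average(frame_ms, deadline_ms))
print("miss rate 20%:", meets_miss_rate(frame_ms, deadline_ms, 0.20))
```

A single slow frame fails the strict definition, yet the same trace is acceptable under either of the relaxed ones, which is why the requirement has to be stated before a system can be verified against it.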

The design of a system is a set of tradeoffs, often between performance and power. Cache is an interesting example because it attempts to improve both performance and power by giving up timing predictability. In certain cases, it can worsen both metrics it attempted to improve.

“Caches are helpful in dealing with very large multiple-instruction-multiple-data (MIMD) programs with scant task partitioning,” says Siemens’ Panesar. “They not only cause timing problems, but can affect the functional behavior of the overall system. Cache missing and refill creates an unpredictable overhead and potential timing problems. In all but the simplest systems, it also becomes necessary to ensure coherency. Changes to data in a cache must be propagated (in the correct sequence) to other local cache resources. These processes inevitably have an impact on power and silicon area.”
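A toy model shows how a cache turns an unrelated task into a timing dependency. This is a sketch under simplifying assumptions: a direct-mapped cache with one word per line, and invented costs of 1 cycle per hit and 20 cycles per miss. It does not model any particular processor.

```python
class DirectMappedCache:
    """Toy direct-mapped cache, one word per line (illustrative only)."""

    def __init__(self, num_lines, hit_cycles=1, miss_cycles=20):
        self.tags = [None] * num_lines
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles

    def access(self, address):
        """Return the cycle cost of one access and update the cache state."""
        index = address % len(self.tags)
        tag = address // len(self.tags)
        if self.tags[index] == tag:
            return self.hit_cycles
        self.tags[index] = tag  # refill the line on a miss
        return self.miss_cycles

def run_task(cache, addresses):
    """Total cycles spent on memory accesses for one pass over addresses."""
    return sum(cache.access(a) for a in addresses)

loop_body = [0, 1, 2, 3]  # task A repeatedly reads the same four words

# Case 1: task A runs alone for four iterations.
cache = DirectMappedCache(num_lines=4)
alone = sum(run_task(cache, loop_body) for _ in range(4))

# Case 2: an unrelated task B runs between iterations and evicts A's lines.
cache = DirectMappedCache(num_lines=4)
shared = 0
for _ in range(4):
    shared += run_task(cache, loop_body)
    run_task(cache, [4, 5, 6, 7])  # maps to the same four cache lines

print("task A alone:      ", alone, "cycles")
print("task A with task B:", shared, "cycles")
```

Task A executes exactly the same accesses in both cases. Only the behavior of its neighbor changes, which is what makes the timing hard to bound.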

This creates a veritable verification problem. “Caches often need to be coherent across multiple processors,” says Cadence’s Schirrmeister. “The complexity of the coherence logic and protocols, and identifying amongst the different subsystems who has the latest entry, and the system then acting appropriately, is far from trivial. Companies spend a lot of time and money making sure this does not introduce errors.”

Caches aren’t the only problem area. Any contested resource on a chip that cannot be used by two processes simultaneously presents a challenge. Examples are arbitrated interconnects and memory controllers.

Non-determinism creates additional verification headaches. “Design and verification for non-deterministic (anything that does not have a 100% probability) behavior has always been a challenge,” says Arteris’ Frank. “Examples include asynchronous clock boundaries, schedulers, probabilistic network (queuing) models or approximate computation. One way to address this was randomizing stimuli and comparing the results to expected distributions instead of a fixed value. These types of systems are not really that new – they used to be called fuzzy logic, a term introduced by Dr. Lotfi Zadeh of the University of California at Berkeley in 1965.”
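The approach of randomizing stimuli and comparing results to an expected distribution can be sketched in a few lines. Everything here is an assumption made for illustration: the arbiter model stands in for a real design, and the trial count and tolerance are arbitrary. A production flow would more likely use a statistical test than a fixed tolerance.

```python
import random
from collections import Counter

def arbiter_model(requests, rng):
    """Hypothetical design under test: grants one active requester at random."""
    active = [i for i, requesting in enumerate(requests) if requesting]
    return rng.choice(active) if active else None

def grant_distribution(num_requesters, trials, seed):
    """Drive random request patterns and return each requester's grant share."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    grants = Counter()
    for _ in range(trials):
        requests = [rng.random() < 0.5 for _ in range(num_requesters)]
        winner = arbiter_model(requests, rng)
        if winner is not None:
            grants[winner] += 1
    total = sum(grants.values())
    return {i: grants[i] / total for i in range(num_requesters)}

def matches_expected(observed, expected, tolerance):
    """Pass if every observed share is within tolerance of the expected share."""
    return all(abs(observed[k] - expected[k]) <= tolerance for k in expected)

observed = grant_distribution(num_requesters=4, trials=20000, seed=1)
expected = {i: 0.25 for i in range(4)}  # a fair arbiter shares grants equally
print(observed)
print("fair within 2%:", matches_expected(observed, expected, tolerance=0.02))
```

No individual grant can be checked against a fixed value, but the distribution of grants can. The seed keeps the run reproducible, which matters when a failure has to be debugged.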

Some of these problems are important enough that verification solutions have been devised to identify and fix them. “Clock domain crossing (CDC) is a very pervasive failure mode in the sense that if you do not take care of it, then your device will have a small mean time between failures (MTBF),” says Prakash Narain, CEO of Real Intent. “Reset domain crossing (RDC) issues have also been present in designs for a long time, but they are not as pervasive, meaning the MTBF for RDC failures is much larger than CDC. For example, you have to reboot your device approximately every three weeks. The root cause could be RDC, but to correlate that back to an RDC failure is very difficult because the mean time to failure is so large. As reliability requirements increase, more people are becoming concerned about it.”

Functional uncertainty
While it is possible that timing uncertainty can lead to functional uncertainty, we are dealing with new kinds of uncertainty in systems today that utilize AI/ML. AI systems are tested for accuracy, but we should be a lot more concerned about their precision. The web is rife with examples of slight alterations to an image, some of which may be invisible to us, that result in wildly different interpretations from an AI system.

An AI system itself is predictable in the same way as any other von Neumann machine. You can predict the time that it takes to deal with a frame of data, given a particular set of coefficients. Every time you provide the same input data, it will behave the same. But anything that changes the coefficients will modify timing and power. Coefficients change every time you retrain the network, and the way they are mapped onto the hardware changes whenever the compiler is modified.

“If you can determine the computational and inter-process communication flow at compile time, you can often improve both predictability and power performance,” says Panesar. “Essentially you need to partition the processes on to available computing resources; run those processes; then communicate data; then reconfigure and start the process again. This approach is common in some AI/ML architectures today.”

But the biggest issue is that the results themselves are the output of a statistical process with unknown precision. “The pure mechanics of how the CNN/DNN works is just computational software,” says Schirrmeister. “It is just a bunch of MACs, and you can predict how fast all the coefficients will be predicted. Now you have the output, and you have the output in a determined time, but you have to put safeguards around it. Outliers are critical. Perhaps because of the training set, or not being fully trained, you have to determine if the result is an outlier. These need to be located to provide a certain level of predictability.”

How are outliers going to be identified and dealt with? “Results coming out of neural network components might require some type of ‘analog’ interpretation,” Frank explains. “I believe the more important question is in the context of a system verification, along the lines of:

For all x in a range [x_min … x_max], will the function f_NN(x) approximate f_target(x) well enough, and how do we verify that without visiting every data point and checking if the result is within maxError?

“This is a real hard problem,” he says. “Defining a metric to assess a functional approximation over a discrete sample space is hard and I’m not sure if a solution exists for all possible mappings. It will most likely require specification at the system behavioral level, similar to defining analog systems. The problem lies in the system transfer function — how to prove that a ‘machine learned’ approximation will match the real function for any input and, even more complex, in temporal behavior, if the network is a recursive network (RNN) and/or has memory.”
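The question Frank poses can be illustrated with a sampled check, which is exactly the kind of partial answer he warns about. In this sketch, `f_nn` is a stand-in polynomial rather than a trained network, `f_target` is a sine function chosen only for convenience, and the ranges, sample counts, and error limit are assumptions.

```python
import math
import random

def f_target(x):
    """Reference function the approximation is supposed to match."""
    return math.sin(x)

def f_nn(x):
    """Stand-in for a learned approximation (a polynomial, not a real network)."""
    return x - x**3 / 6 + x**5 / 120

def max_sampled_error(f_approx, f_ref, x_min, x_max, samples, seed=0):
    """Return (worst_x, worst_error) over randomly sampled points in the range."""
    rng = random.Random(seed)
    worst_x, worst_err = x_min, 0.0
    for _ in range(samples):
        x = rng.uniform(x_min, x_max)
        err = abs(f_approx(x) - f_ref(x))
        if err > worst_err:
            worst_x, worst_err = x, err
    return worst_x, worst_err

max_error = 1e-3
for lo, hi in [(-1.0, 1.0), (-3.0, 3.0)]:
    x, err = max_sampled_error(f_nn, f_target, lo, hi, samples=10000)
    verdict = "within" if err <= max_error else "exceeds"
    print(f"[{lo}, {hi}]: worst sampled error {err:.2e} at x={x:.3f} ({verdict} maxError)")
```

The approximation looks fine on the narrow range and fails on the wider one. A sampled check can reveal a violation, but passing it proves nothing about the points that were not visited, and it says nothing at all about temporal behavior in networks with memory.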

Analysis and predictability
Ironically, there are some systems that are purposely adding unpredictability as a way to increase security. “You don’t want people to be able to extract important data by looking at power profiles,” says Schirrmeister. “The addition of some unpredictable element can act as a way to hide this attack surface.”

In order to avoid unexpected outcomes, you have to understand the problem well enough to know if the decisions you are making are sound. “At the front end of the design and development pipeline we need to learn a lot more about statistics,” says Frank. “Statistical methods, such as randomization, simulated annealing, and approximations, have been in use for backend tools and verification for a while, and propagating them forward requires some thought. Applying these paradigms to digital circuits somehow strikes me to be similar to converting logic back into an analog representation.”

The new Accellera Portable Test and Stimulus Standard (PSS) and the tools that operate on it may provide assistance in some areas. “PSS gives you the ability to look through all the options that can happen and the sequences that could happen,” says Schirrmeister. “They become more complex as you add more context on top of that. You are using the constraint solver to identify all paths through the design for which a certain result can be achieved. There is a new gray area emerging between verification and performance analysis, because if there is too much delay in getting the result you may have bad results. You have to start doing verification where you consider ranges. You need to simulate all the variations in a way to get to the end result of a task. Then you can determine if you are in a safe range.”

A lot of this involves both hardware and software, making analysis more difficult. Formal verification struggles with the complexity of this type of problem.

Some companies are trying to resolve some of this uncertainty. “One company in Germany, Inchron, is looking at software timing, which is impacted by the hardware it executes on,” adds Schirrmeister. “It has notions of formal, where you go through all variations of what might happen. For example, what if the interrupt service routine fires, but you need to return with a result in a certain time? They have ways to predict that. This is for very short response timing-critical applications.”

Another path forward is through functional monitoring built into devices. “You can observe the behavior of an SoC in terms of bus transactions, instruction execution and hardware state,” says Panesar. “You can configure the monitoring infrastructure at run time to look at a variety of quite sophisticated measures of performance. Effectively, you are using a set of run-time assertions based on the behavior of the real silicon, just as a verification engineer might use design assertions earlier in the flow. This can alert you to unpredictable behavior and allow you to identify problems that may occur only after many hours or days of operation in the field.”

The increasing focus on 5G is adding to some of these concerns. “This kind of system requires a functional monitoring system to check that cycle or time budgets are being met,” adds Panesar. “When the embedded analytics infrastructure detects drift from ‘ideal’ performance, it alerts the software scheduler to remap the system and bring it back to satisfactory real-time performance. This functional monitoring data is also a powerful tool when a feedback loop exists to the compiler and other tools in the development toolchain. It may be possible to solve predictability problems by remapping resources via the compiler. Taking an even longer view, architectural decisions on future chip generations can be informed by predictability data gathered in the field.”
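The run-time budget check Panesar describes can be pictured as a small monitor that watches recent executions and raises a flag when too many exceed their cycle budget. This is only a software sketch of the idea, not a description of any vendor's embedded analytics hardware. The class name, the window size, and the thresholds are hypothetical.

```python
from collections import deque

class BudgetMonitor:
    """Flags drift when too many recent executions exceed their cycle budget."""

    def __init__(self, budget_cycles, window, max_violations):
        self.budget_cycles = budget_cycles
        self.max_violations = max_violations
        # Sliding window of booleans: True means that execution overran.
        self.recent = deque(maxlen=window)

    def record(self, measured_cycles):
        """Record one execution; return True if a remap should be requested."""
        self.recent.append(measured_cycles > self.budget_cycles)
        return sum(self.recent) > self.max_violations

# Hypothetical cycle counts for successive runs of one task.
monitor = BudgetMonitor(budget_cycles=1000, window=5, max_violations=1)
samples = [900, 950, 1100, 980, 1200, 1300]
alerts = [monitor.record(cycles) for cycles in samples]

for cycles, alert in zip(samples, alerts):
    print(f"{cycles} cycles ->", "request remap" if alert else "ok")
```

A single overrun is tolerated here, and only a sustained pattern triggers the alert. Where that threshold should sit depends on how strict the real-time requirement is.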

The costs associated with making a system completely predictable are untenable. Most designs are unaffected by some degree of non-predictability, so long as the magnitude of it is understood. As systems become larger and more interconnected, precision is becoming more of an issue, simply because it is not possible to do enough analysis to bound the problem.
