What Do Feedback Loops For AI/ML Devices Really Show?

Optimization removes some of the baseline measurements for chips, making comparisons much more difficult.


AI/ML is being designed into an increasing number of chips and systems, but predicting how those systems will behave once they’re in the field is, at best, an educated guess.

Typically, verification, validation, and testing of systems are done before devices reach the market, with an increasing amount of in-field data analysis for systems where reliability is potentially mission- or safety-critical. That can include cars, robots, military equipment, servers, and even smartphones and gaming systems. But the impact of intelligence on performance, power, and ultimately chip behavior is uncharted territory.

With semiconductors, predicting reliability is a combination of data analysis plus repeatability without failure, a confluence of math and science. The more often design-through-manufacturing processes are repeated with the same positive results, the higher the predicted reliability. It’s like comparing version 0.1 of a process to version 1.0.
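One common way to put a number on "repeatability without failure" is the zero-failure (success-run) bound, sketched below. This is an illustration rather than a method the article attributes to any vendor. The function name and the 95% confidence default are assumptions, and the formula only holds if every trial is independent and run under identical conditions.

```python
def demonstrated_reliability(successes: int, confidence: float = 0.95) -> float:
    """Lower bound on reliability after `successes` trials with zero failures.

    Success-run bound: R >= (1 - C) ** (1 / n).
    Assumes independent, identical trials (an assumption of this sketch).
    """
    if successes < 1:
        raise ValueError("need at least one successful trial")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return (1.0 - confidence) ** (1.0 / successes)


if __name__ == "__main__":
    # More failure-free repetitions push the demonstrated bound toward 1.
    for n in (10, 100, 1000):
        print(n, round(demonstrated_reliability(n), 4))
```

The bound rises only while each repetition is comparable to the last. Once devices in the field stop behaving identically, the identical-trial assumption no longer holds and the bound stops being meaningful.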

The problem with AI/ML systems is that once they are released into the field, not everything is repeatable, greatly increasing the level of uncertainty in feedback loops. The whole point of these systems is use-case customization. They can adjust to changes in the environment or different user preferences. And unlike traditional chips, these devices are increasingly heterogeneous designs with unique architectures. Put simply, there is little history against which to measure reliability, and the data used in those measurements is suspect.

There are several possible solutions to this problem, none of which is perfect. The first is to spend more resources testing how software/algorithm updates will affect intelligent systems over time. Because many systems will have to be updated over extended lifetimes of a decade or more, OEMs need to understand how systems that already have adapted to their environment or different use cases will be affected by these updates.

In the past, vendors would roll out one patch after another, sometimes even multiple times a week, in order to fix interactions they didn’t anticipate. But with AI systems, there is no single baseline for updates. That means either patches will have to be much better understood and more carefully rolled out, or systems will require a partial reset every time a patch is downloaded in order to make sure everything works as planned.

Second, systems need to be architected so that any components capable of optimizing themselves can do so only within acceptable boundaries. That means systems need to be designed not just for maximum performance and efficiency, but with carefully constructed pre-set limits. Those limits need to be well defined, because systems of systems may have multiple additive behaviors that can cause anything from erratic performance to uneven aging. Within a heterogeneous system, those kinds of changes are nearly impossible to keep track of, let alone account for.
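The idea of pre-set limits can be sketched in a few lines. The parameter names (`clock_mhz`, `core_voltage_v`), the numeric ranges, and the `apply_tuning` helper below are all hypothetical, chosen only to illustrate the pattern. A real design would derive its limits from characterization data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Limit:
    low: float
    high: float

    def clamp(self, value: float) -> float:
        # Force the value back inside [low, high].
        return max(self.low, min(self.high, value))


# Hypothetical design-time limits; the values are illustrative only.
LIMITS: Dict[str, Limit] = {
    "clock_mhz": Limit(400.0, 1800.0),
    "core_voltage_v": Limit(0.65, 0.95),
}


def apply_tuning(requested: Dict[str, float]) -> Tuple[Dict[str, float], List[str]]:
    """Clamp self-tuned settings to their limits and report which were clipped."""
    applied: Dict[str, float] = {}
    clipped: List[str] = []
    for name, value in requested.items():
        limit = LIMITS.get(name)
        if limit is None:
            # Refuse to apply anything that has no defined boundary.
            raise KeyError(f"no pre-set limit defined for {name!r}")
        safe = limit.clamp(value)
        if safe != value:
            clipped.append(name)
        applied[name] = safe
    return applied, clipped
```

Returning the list of clipped parameters alongside the applied values gives a monitoring layer a record of how often the optimizer pushes against its boundaries. That record is one way to keep additive behaviors visible.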

Third, systems will need to run regular checks, whether that involves external monitors and data or a combination of internal and external sensors. Equally important, there need to be enough knobs to turn to make sure that when problems do arise, they can be identified quickly and fixed. You can’t do a hard reboot of the logic system in a car on a highway, but you certainly can add enough checks into a system to be able to isolate a potential problem and get off the road safely.
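A minimal sketch of that check-and-isolate pattern follows. The `HealthMonitor` class, the failure threshold of three, and the two operating modes are assumptions made for illustration, not a description of any production automotive stack.

```python
from enum import Enum
from typing import Callable, Dict, List


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"  # e.g. reduced functionality while pulling over


class HealthMonitor:
    """Polls registered checks and isolates a subsystem after repeated failures."""

    def __init__(self, max_consecutive_failures: int = 3) -> None:
        self.max_failures = max_consecutive_failures
        self.checks: Dict[str, Callable[[], bool]] = {}
        self.failures: Dict[str, int] = {}
        self.isolated: List[str] = []

    def register(self, name: str, check: Callable[[], bool]) -> None:
        self.checks[name] = check
        self.failures[name] = 0

    def poll(self) -> Mode:
        for name, check in self.checks.items():
            if name in self.isolated:
                continue  # already fenced off; do not keep exercising it
            try:
                ok = check()
            except Exception:
                ok = False  # a crashing check counts as a failed check
            self.failures[name] = 0 if ok else self.failures[name] + 1
            if self.failures[name] >= self.max_failures:
                self.isolated.append(name)
        # Any isolated subsystem moves the whole system to the degraded mode.
        return Mode.DEGRADED if self.isolated else Mode.NORMAL
```

Requiring several consecutive failures before isolating a subsystem is a design choice that trades reaction time for fewer false alarms. A safety-critical system would likely tune that threshold per subsystem.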

The tech industry has done an impressive job in developing systems that allow technology to take over important but tedious functions currently done by people. But it also needs to understand how to control those systems when something goes wrong, because as anyone with enough history in technology well knows, no electronic system lasts forever.
