Interactions and optimization make it much harder to determine when a system will fail, or even what that failure really means.
AI can do many things, but how to ensure that it does the right things is anything but clear.
Much of this stems from the fact that AI/ML/DL systems are built to adapt and self-optimize. With properly adjusted weights, training algorithms can be used to make sure these systems don't stray too far from the starting point. But how to test for that, whether in the lab, in the fab, or in the field, is far from obvious.
Until recently, most of the work involving AI has focused on improving the performance and accuracy of these systems. They are expected to work within an acceptable distribution range, which essentially sets the parameters for "normal" behavior. It's similar to the engine temperature gauge in a car. Sometimes the engine runs hotter or colder than at other times, but as long as the needle stays in the green, there's no reason to pull over to the side of the road.
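As a rough illustration of what an "acceptable distribution range" check can look like, here is a minimal sketch in Python. It assumes a baseline mean and standard deviation gathered during validation and a stream of live model outputs; the names, numbers, and the three-sigma limit are illustrative, not any particular vendor's method.

```python
import numpy as np

def within_normal_range(live_outputs, baseline_mean, baseline_std, z_limit=3.0):
    """Return True if live outputs stay inside the 'green zone' set by a baseline distribution."""
    live_mean = np.mean(live_outputs)
    # Distance of the live mean from the baseline mean, in baseline standard deviations.
    z_score = abs(live_mean - baseline_mean) / baseline_std
    return z_score <= z_limit

# Illustrative numbers only: baseline statistics from validation, live outputs from the field.
baseline_mean, baseline_std = 0.72, 0.05
live_outputs = np.random.normal(loc=0.70, scale=0.06, size=1000)
print("Within normal range:", within_normal_range(live_outputs, baseline_mean, baseline_std))
```

The check is the software equivalent of the temperature gauge: it doesn't explain what the system is doing, it only flags when behavior drifts out of the green zone.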
The challenge with AI systems is that you can't really tell what's happening once they are switched on. You can measure the heat and the electrical activity, but you don't specifically know what's going on inside the package. There is a nearly infinite number of possible interactions and permutations, and all of that is opaque. And once a system goes live, there is no way to say for certain exactly how it will behave.
This uncertainty is compounded as more systems interact, and as machines learn from other machines. With multiple devices in the mix rather than a single one, the parameters of acceptable behavior become even murkier. The chip world is used to thinking about dependencies in fixed terms: "If this, then that." With AI, those interactions are fuzzy because they are distributions. So it becomes, "If roughly this, then probably that." And the lower the accuracy of one system, the lower the accuracy of both, and the less certain the outcome of any interaction.
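To see why accuracy compounds this way, consider a toy calculation with made-up numbers: if systems feed one another and their errors are roughly independent, the chance that the whole chain behaves as intended is approximately the product of the individual accuracies.

```python
# Illustrative accuracies for three interacting systems -- not measured values.
accuracies = [0.95, 0.92, 0.90]

# If each system's output feeds the next and errors are roughly independent,
# the chance the whole chain behaves as intended is about the product.
combined = 1.0
for acc in accuracies:
    combined *= acc

print(f"Chained accuracy: {combined:.3f}")  # ~0.787, lower than any single system
```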
Predicting behavior becomes still hazier as updates are pushed into these systems. Those updates can be anything from software and firmware in other parts of a system, which can alter performance, power, and thermal gradients, to new algorithms in the AI system or system of systems. So even if it's clear how these systems behave individually, the big challenge going forward will be how they behave together. Modifications to an algorithm can change the data flow within a system, stressing different parts differently and at different times.
In some cases, that's the point. If part of your phone stops working, you need to be able to re-route signals, much as ECC memory uses extra memory cells to keep systems behaving as expected. The difference is that with AI systems, the re-routing is less predictable. These systems are supposed to adapt to whatever is optimal for a particular user, device, or application, so each one will adjust itself for optimal performance or energy consumption based on internal variations.
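The contrast can be sketched in a few lines of Python. The unit names and telemetry values here are hypothetical; the point is only that an ECC-style repair always produces the same substitution, while an adaptive repair depends on runtime conditions and so can differ from device to device.

```python
import random

# ECC-style repair: a failed cell always maps to the same spare (hypothetical map).
SPARE_MAP = {"cell_12": "spare_0", "cell_47": "spare_1"}

def ecc_style_remap(failed_cell):
    # Same input, same output, on every device, every time.
    return SPARE_MAP[failed_cell]

def adaptive_remap(candidates, telemetry):
    # Pick whichever candidate block currently runs coolest and least loaded,
    # so two identical devices can end up routed differently.
    return min(candidates, key=lambda c: telemetry[c]["temp_c"] + 10 * telemetry[c]["load"])

# Hypothetical runtime telemetry for three candidate blocks.
telemetry = {u: {"temp_c": random.uniform(40, 80), "load": random.random()}
             for u in ("unit_a", "unit_b", "unit_c")}

print(ecc_style_remap("cell_12"))                                  # always "spare_0"
print(adaptive_remap(["unit_a", "unit_b", "unit_c"], telemetry))   # varies with conditions
```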
That makes it almost impossible to predict how a system will behave, and it makes it much harder to pinpoint problems once a system is in use. In theory, at least, each system should reach a sufficient level of performance and power and continue to behave within the parameters set by the design team. In practice, some circuits will age faster than others, and in certain use cases some parts of a system may be stressed more than others.
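One hedged sketch of what tracking that uneven stress might look like, assuming hypothetical per-block temperature sensors and a crude "hot hours" proxy for aging:

```python
from dataclasses import dataclass, field

@dataclass
class BlockMonitor:
    """Tracks one on-chip block using hypothetical temperature sensor readings."""
    name: str
    temp_limit_c: float = 95.0
    readings: list = field(default_factory=list)

    def record(self, temp_c):
        self.readings.append(temp_c)

    def hot_hours(self):
        # Hours spent above the thermal limit -- a crude proxy for accelerated aging.
        return sum(1 for t in self.readings if t > self.temp_limit_c)

# Two blocks on the same die, stressed very differently over a day (made-up readings).
gpu = BlockMonitor("gpu_cluster")
mem = BlockMonitor("memory_controller")
for hour in range(24):
    gpu.record(90 + (hour % 8))   # periodically exceeds 95C
    mem.record(70 + (hour % 5))   # never gets close to the limit

print(gpu.name, gpu.hot_hours(), "hot hours")   # ages faster
print(mem.name, mem.hot_hours(), "hot hours")
```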
The challenge going forward will be to build tools and systems that can both measure and predict all of these possible permutations, and to build sensors and analytics into systems that can at least give engineers some idea of how the hardware is behaving. The whole industry is pushing into very murky territory, and it would be nice to at least have a flashlight.