
Putting Limits On What AI Systems Can Do

Developing these systems is just part of the challenge. Making sure they only do what they’re supposed to do may be even harder.


New techniques and approaches are starting to be applied to AI and machine learning to ensure they function within acceptable parameters, only doing what they’re supposed to do.

Getting AI/ML/DL systems to work has been one of the biggest leaps in technology in recent years, but understanding how to control and optimize them as they adapt isn’t nearly as far along. These systems are generally opaque if a problem develops in the field. There is little or no visibility into how algorithms are utilized, or how weights that determine their behavior will change with a particular use case or interactions with other technology.

In fact, the European Union this week issued guidelines for AI — specifically including ML and automated decision-making systems — limiting the ability of these systems to act autonomously, requiring “secure and reliable systems software,” and requiring mechanisms for ensuring responsibility and accountability for “AI systems and their outcomes.”

The level of concern varies greatly depending on what the device is supposed to do, where it is used, and what it is connected to. For example, it may be a minor consideration in some consumer electronics applications, but a much greater concern in the network that controls them.

“The question is where is that AI,” said Sandro Cerato, senior vice president and CTO of the Power & Sensor Systems Business Unit at Infineon Technologies. “So there is AI in the cloud, and AI at the IoT edge. In the end, there are many AIs, and they have different functions depending upon where they are. I can now communicate with my vacuum cleaner, which is connected to a network, and there is one part of this network that is very intelligent.”

Safety- and mission-critical applications are a particular concern, especially as increasing levels of autonomy are added into cars, drones, and industrial robots, and as machine-to-machine learning is used to update those systems in the field.

“The challenge is the irreversibility, or black-box nature of neural networks,” said Ty Garibay, vice president of engineering at Mythic. “How did it make its decision? There’s also the challenge of the statistical nature of AI. It can be incorrect correctly. If you put an image of a cat on a particular neural network that has been trained to decide it’s a dog, then it will say that cat is a dog every time. That will be correct behavior for that neural network on that digital processor, but the wrong answer.”

Fixing these issues is difficult, and it’s made more difficult by the fact that AI systems are supposed to adapt. The challenge is to minimize any collateral damage caused by that adaptation.

In general, strategies for containing AI fall into the following areas:

  • Limited functionality. One of the most common strategies today for making sure AI systems don’t stray too far is to partition functionality among several different AI processors or systems, rather than relying on just one, with redundancy added for mission-critical or safety-critical functions.
  • Better simulation. Understanding early in the design cycle how devices will interact with other devices is one of the big concerns among companies developing these systems, but it requires specialized simulators with enough capacity to include multiple elements, which adds to the cost.
  • Real-time monitoring. Tracking when a device strays from its legal operation, whether due to sensor drift or a flaw in the algorithm or another piece of hardware, is yet another facet of increasing attention on chip-level and system-level reliability.
  • More visibility into algorithms. New approaches to writing algorithms have been under development for several years. So far, they are still a work in progress.

Limiting functionality has been an obvious first step for many companies. Xilinx’s smart network interface card can adapt to multiple implementations and connectivity options. Samsung’s HBM memory offers an AI chip that essentially tackles the same issue. And many EDA tools and chip manufacturing equipment have added some type of machine learning, a subset of AI, to identify patterns and potential problems.

The next phase will be more difficult, and it requires a much deeper understanding of how systems will be used and stressed, and how AI will adapt to various operating conditions and other variables. In some cases, this may require a digital twin or a level of redundancy similar to what the aerospace industry has been using for many years, where it compares results from three systems and uses the two that are closest.
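As a rough illustration of the three-system comparison described above, the Python sketch below picks the two closest of three redundant readings and averages them. The function name, the `tolerance` parameter, and the choice to average the agreeing pair are assumptions made for illustration, not a description of any specific aerospace implementation.

```python
from itertools import combinations


def vote_closest_pair(a: float, b: float, c: float, tolerance: float) -> float:
    """Return the mean of the two closest readings out of three.

    Raises ValueError if even the closest pair differs by more than
    `tolerance` (a hypothetical, application-specific limit).
    """
    # Compare every pair of channels and keep the pair with the smallest gap.
    x, y = min(combinations((a, b, c), 2), key=lambda pair: abs(pair[0] - pair[1]))
    if abs(x - y) > tolerance:
        raise ValueError("no two channels agree within tolerance")
    return (x + y) / 2


# Two channels roughly agree; the third is treated as the outlier.
print(vote_closest_pair(10.0, 10.5, 14.0, tolerance=1.0))  # prints 10.25
```

Raising an error when no pair agrees, rather than returning a best guess, reflects the idea that a redundant system should flag disagreement instead of hiding it.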

“The best way to control AI is to have a second system in place that acts like a safety control mechanism, like we do in hardware or software for functional safety today,” said Raik Brinkmann, CEO of OneSpin Solutions. “What are the things you want to protect against? What are the things you don’t want AI to do? What are the bad situations you want to catch? You want to try to mitigate those risks. You cannot fully control AI, because it’s too complex, but you can mitigate risks. And if we could come up with some methodologies and standards to address this, it would be helpful. That could include prepared scenarios that everyone would want to check for.”
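One way to picture the second system Brinkmann describes is a simple rule-based checker that sits between an AI model and the actuator it drives. The sketch below is a minimal example under stated assumptions: the `Envelope` limits, the `guarded_command` function, and the fallback value are all hypothetical names chosen here, and a real functional-safety mechanism would be far more involved.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Envelope:
    """Hypothetical operating limits defined independently of the AI model."""
    min_value: float
    max_value: float
    max_step: float  # largest allowed change between consecutive commands


def guarded_command(proposed: float, previous: float,
                    env: Envelope, fallback: float) -> Tuple[float, bool]:
    """Return (command, accepted).

    The AI's proposed command passes through only if it stays inside the
    envelope and does not jump too far from the previous command.
    Otherwise the predefined fallback is used and the rejection is reported.
    """
    in_range = env.min_value <= proposed <= env.max_value
    smooth = abs(proposed - previous) <= env.max_step
    if in_range and smooth:
        return proposed, True
    return fallback, False


env = Envelope(min_value=0.0, max_value=30.0, max_step=5.0)
print(guarded_command(12.0, 10.0, env, fallback=0.0))  # accepted
print(guarded_command(40.0, 10.0, env, fallback=0.0))  # rejected: out of range
```

The checker knows nothing about how the model reached its output. It only encodes the "bad situations you want to catch," which keeps it small enough to verify with conventional methods.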

Much of that is based on predictions about how the devices will be used. Chip design teams usually don’t have that information because they are designing to a spec, whether it’s an individual chip, a package, or a subsystem. But sometimes even vertically integrated systems vendors are surprised by the context in which their device operates because the technology around them is changing so quickly.

“There are assumptions you make during the design phase and you don’t know whether those assumptions will be right during the deployment phase,” said Anoop Saha, senior manager for strategy and business development at Siemens EDA. “When the iPhone was deployed, you didn’t know that it would be used to upload so many TikTok videos. That comes after a device is in production, and it completely changes the utilization. And that also requires some configurability, so you need to put that into an SoC, as well.”

Design considerations
Much of AI/ML is focused on optimizing a system for a particular use case or hardware configuration, improving performance and/or lowering power. But not every design team has access to the same pool of data, which makes this more difficult. In many competitive markets, and across many segments of the design-through-manufacturing chain, that data is highly proprietary, which makes designing and controlling AI more difficult.

“So there’s shrinking the nets down, and there’s formatting of the structure of the problems so that they map better to existing hardware, whether that’s a CPU, GPU or NPU,” said Rob Aitken, fellow and director of technology at Arm. “But whatever that hardware is, it will do certain operations better than others. Making your problem fit the hardware is always a good way to reduce power. Longer term, if there are a bunch of applications that don’t fit nicely into a particular architecture or set of architectures, how do we design better ones or newer ones? That’s where things like spatial compute come in. A bunch of it is constrained by, ‘Do you have the freedom to define the hardware for this project?’ Or are you using a particular hardware platform, in which case you need to adapt your system to work with its limitations and its capabilities. But if you’re in the position where you could define the hardware, then you have more degrees of freedom and you can do more stuff.”

The greater the access to all pieces of the system, the more data, and the better the results. But there is so much data, and so many possible interactions, that system-level simulation becomes essential. The challenge there is the size and complexity of some of these systems requiring simulation. On the other hand, doing this kind of simulation in pieces has limited value because it’s essential to understand as many interactions as possible across increasingly complex systems.

“It’s not just that these circuits are bigger and more complex,” said Hany Elhak, group director of product management and marketing at Synopsys. “It’s that now I have many different types of circuits that are part of that bigger system, and they need to be designed together. And then we need to have some common flow for these different design teams so they don’t have problems at the end of the design cycle when they try to connect these things together. They need to be working together from the beginning.”

AI adds another element to this. From the standpoint of chip design, an AI design, or a design that includes an AI element, needs to function like any other chip. What frequently isn’t understood is how these devices will behave after they adapt.

“For AI, you have many very similar elements that you produce on the chip to parallelize the execution, but the elements themselves are fairly simple compared to an out-of-order processor,” OneSpin’s Brinkmann said. “The processing element inside of an AI matrix is fairly small and easy to understand, and easy to verify, too. That’s something you can control and do today with the technology we have. The bigger question is what runs on it, and that’s software. That’s where the actual verification problem with AI starts.”

Scale only makes that problem worse. “Some of these AI chips are very big digital chips that need to be designed using foundation IPs, and they need to communicate with other blocks in the design, like high-speed I/O and memory interfaces, and all of that requires accurate and fast simulation,” said Synopsys’ Elhak. “To enable AI, simulators really need to do a lot more than what they’re doing today. You need to simulate all of this to validate these AI chips.”

Lifetime monitoring and metrics
That’s just the starting point. With AI, constant monitoring is necessary.

“There’s a real problem with models drifting for one reason or another. There’s a whole field of study around measuring the drift and finding mechanisms by which you can control the drift,” said Jeff Burns, director of AI Compute at IBM. “A lot of our key clients are in regulated industries. They need auditability for why key decisions were made. And because of who these clients are, that includes the entire AI lifecycle.”
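Drift can be measured in many ways. The sketch below uses one commonly cited metric, the population stability index (PSI), to compare the distribution of a model input seen in the field against a reference sample. This is an illustrative choice made here, not a method attributed to IBM, and the bin count, the smoothing constant `eps`, and any alert threshold are assumptions that would need tuning per application.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a reference and a live sample.

    Bins are equal-width over the reference range. Live values outside
    that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # all reference values identical; avoid division by zero

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = fractions(reference)
    live_frac = fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]  # simulated drift in the field data
print(psi(reference, reference))  # identical samples give 0.0
print(psi(reference, shifted))    # a shifted sample gives a much larger value
```

Logging a score like this alongside each model version is one simple way to build the kind of audit trail regulated users ask for, although it only flags that inputs have changed, not why.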

This requires data to be structured, organized, and sifted for patterns through some sort of loop-back system.

“The learning, re-learning, and re-training is done in the factory, and not at the edge,” said Brinkmann. “It may combine different input from different instances of the AI systems, but essentially you will have to have a digital twin platform to monitor that and have a clear understanding. Just imagine if you have a fleet of systems out there and you want to upgrade a sensor. Some systems may have a newer version, so they give you different data than the older ones. Getting that all back together into something that you can use requires that you keep track of what’s there. That’s configuration management on one level, but you also could think of that as a digital twin. You have a model of all of these systems in the factory, and you can simulate what will happen if you upload a new version before you actually do that.”

In effect, it is a reference point that moves as systems evolve. The challenge it addresses is how to provide some level of explainability for how AI is behaving.

“This is something that we’re working on at the moment,” said Dirk Mayer, department head for distributed data processing and control in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It’s quite an important topic because this is needed to get some things certified. If you want to use an artificial intelligence algorithm, to leave the prototype stage you need the acceptance of the customer. So you need a certified failure rate or something you have for other industrial systems. I’m not sure artificial intelligence algorithms at the moment meet those requirements. Also, you don’t have those algorithms from which you can predict whether they would meet the requirements, which is a transfer problem. You may have something for this application and for that specific machine that works. You can predict failures with a true positive rate and false negative rate. But when you transfer the algorithm to another machine, you have to analyze everything from scratch again.”
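The rates Mayer mentions come from comparing predicted failures against failures that actually occurred. The sketch below shows the arithmetic, with a hypothetical `failure_rates` helper and made-up labels. Running the same calculation on data from a second machine is what reveals whether a detector transfers.

```python
from typing import Sequence, Tuple


def failure_rates(actual: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (true positive rate, false negative rate) for failure prediction.

    TPR = TP / (TP + FN) and FNR = FN / (TP + FN), so the two sum to 1.
    """
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    positives = tp + fn
    if positives == 0:
        raise ValueError("no actual failures in the sample")
    return tp / positives, fn / positives


# Made-up labels for one machine: four real failures, three of them predicted.
actual = [True, True, True, True, False, False]
predicted = [True, True, True, False, False, True]
print(failure_rates(actual, predicted))  # prints (0.75, 0.25)
```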

Utilizing data
Another related challenge involves the growing number of factors from different disciplines that need to be considered in an AI system.

“Architectures keep changing all the time,” said Suhas Mitra, product marketing director for Tensilica AI products at Cadence. “There’s a lot of research happening. But is there any way to make this foolproof? The answer is not very clear. You need to understand all of this and have a software toolchain that can do all of this fast and efficiently. And then you need to go back to the hardware-software co-design methodology and ask, ‘If I utilize this new network, how do I need to change my hardware? What do I do now in terms of activation functions, and what happens if I do 3D versus 2D MAC arrays?’ All of these different things play a role, and the solution may depend on which domains you’re focused on.”

Those domains determine just how critical any changes can be, and there are two factors that need to be considered, Mitra said. “One is the time effect in terms of change. Different verticals have different demands. That all boils down to power, performance, area and bandwidth. You need to analyze the network and determine what improvements are required in the software tool chain to meet those demands, and then determine how we need to tweak the hardware. Or can we do different things in the hardware to exploit these things much better? Utilizing both axes is very challenging.”

That, in turn, has an impact on how these systems adapt, and the big problem here is there are so many possible permutations to keep track of, often spanning multiple engineering disciplines.

“Machine learning is more data engineering than algorithm optimization,” said Siemens’ Saha. “How do you sanitize your data? How do you figure out what data is important? And how do you compress the data so that only the important bits are there? That’s probably the core of machine learning. The amount of data that is collected right now across every domain is huge, and there is no way we can process that much data at speed or at rest.”

The key here is understanding what data to keep, what data to get rid of, and how that data may be changing due to changes in the AI system. That’s easier to do with speech or image recognition, where bad results are obviously wrong. But it’s much more difficult with data that isn’t connected to an immediate result. Accuracy isn’t always obvious, and things like sensor drift and their impact on results may require much deeper understanding of how these systems can change.

“What we do in speech, and hearing the speech, and vision is we go by the limitations of our eyes and ears,” Saha said. “What does our eye care about? Based on that, what are the key features? And what are the things our ears are sensitive to? But as we expand the use cases of machine learning across different domains, that becomes harder and an interesting challenge.”


Fig. 1: Trend line showing increasing computational capabilities of AI chips per watt and over time. Source: IBM

The future
While there has been much hype about intelligence everywhere, for critical functions companies are proceeding cautiously. The bulk of the work so far has been focused more on subsets of AI than a push toward fully autonomous machines.

“As a grad student, I had an office next to someone who became one of the leading professors in AI in Europe, who has since retired,” said Simon Davidmann, CEO of Imperas Software. “His conclusion was that it’s going to take forever to get real intelligence. Machine learning is a different thing. Machine learning helps with data analysis and predicts things better. I love the fact that my car puts the brakes on a little bit if the car slows down in front of me. That’s a learned thing, but it’s not really intelligence. One of our customers in the automotive industry uses simulation to analyze the speed of your car. They can look at the ABS data in a car wheel and tell you if your tires are going down. Is that intelligence? Or is it just analysis using a lot of clever algorithms?”

That distinction has a big impact on the problems these systems are addressing. “Some of these machine learning things are tolerant of failure on a small scale, like the bit flips in automotive concerning caches,” Davidmann said. “You’d have thought by now that we know how to do this or that. We don’t. But things are exploding right now, and it will be a very interesting period for the electronics industry.”

For AI/ML design, the challenges are really just beginning.

