When AI Goes Awry

So far there are no tools and no clear methodology for eliminating bugs. That would require understanding what an AI bug actually is.


The race is on to develop intelligent systems that can drive cars, diagnose and treat complex medical conditions, and even train other machines.

The problem is that no one is quite sure how to diagnose latent or less-obvious flaws in these systems—or better yet, to prevent them from occurring in the first place. While machines can do some things very well, it’s still up to humans to devise programs to train them and observe them, and that system is far from perfect.

“We don’t have a good answer yet,” said Jeff Welser, vice president and lab director at IBM Research Almaden.

He’s not alone. While artificial intelligence, deep learning and machine learning are being adopted across multiple industries, including semiconductor design and manufacturing, the focus has been on how to use this technology rather than what happens when something goes awry.

“Debugging is an open area of research,” said Norman Chang, chief technologist at ANSYS. “That problem is not solved.”

At least part of the problem is that no one is entirely sure what happens once a device is trained, particularly with deep learning and AI and various types of neural networks.

“Debugging is based on understanding,” said Steven Woo, vice president of enterprise solutions technology and distinguished inventor at Rambus. “There’s a lot to learn about how the brain hones in, so it remains a challenge to debug in the classical sense because you need to understand when misclassification happens. We need to move more to an ‘I don’t know’ type of classification.”
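
One way to picture the "I don't know" classification Woo describes is a reject option: the classifier abstains whenever its top score falls below a cutoff. The sketch below is illustrative only. The function names, the labels, and the 0.8 threshold are assumptions, and softmax scores are often poorly calibrated, so a production system would likely need something more careful.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_reject(logits, labels, threshold=0.8):
    """Return (label, confidence), or abstain when confidence is low.

    The threshold is a hypothetical value chosen for illustration.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "I don't know", probs[best]
    return labels[best], probs[best]

labels = ["cat", "dog", "fox"]
confident = classify_with_reject([5.0, 0.0, 0.0], labels)
uncertain = classify_with_reject([1.0, 0.9, 0.8], labels)
```

Here the first input produces one dominant score and gets a label, while the second produces three nearly equal scores and is rejected rather than misclassified.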

This is a long way from some of the scenarios depicted in science fiction, where machines take over the world. A faulty algorithm may result in an unexpected error somewhere down the line. If it involves a functional safety system, it may cause harm, but in other cases it may simply generate annoying behavior in a machine. What’s different with artificial intelligence (AI), deep learning (DL) and machine learning (ML) is that fixing those errors can’t be achieved just by applying a software patch. Moreover, those errors may not show up for months or years, or until there is a series of interactions with other devices.

“If you’re training a network, the attraction is that you can make it faster and more accurate,” said Gordon Cooper, product marketing manager for Synopsys’ Embedded Vision Processor family. “Once you train a network and something goes wrong, there is only a certain amount you can trace back to a line of code. Now it becomes a trickier problem to debug, and it’s one that can’t necessarily be avoided ahead of time.”

What is good enough?
An underlying theme in the semiconductor world is, ‘What is good enough?’ The answer varies greatly from one market segment to another, and from one application to another. It may even vary from one function to another in the same device. For example, an error in a game on a smartphone may be annoying, and may require a reboot, but if you can’t make a phone call you’ll probably replace the phone. With industrial equipment, the technology may be directly tied to revenue, so a part might be replaced during planned maintenance rather than waiting for a failure.

For AI, deep learning and machine learning, no such metrics exist. Inferencing results are mathematical distributions, not fixed numbers or behaviors.

“The big question is how is it right or wrong, and how does that compare to a human,” said Mike Gianfagna, vice president of marketing at eSilicon. “If it’s better than a human, is it good enough? That may be something we will never conclusively prove. All of these are a function of training data, and in general the more training data you have, the closer you get to perfection. This is a lot different from the past, where you were only concerned about whether algorithms and wiring were correct.”

This is one place where problems can creep in. While there is an immense amount of data in volume manufacturing, there is far less on the design side.

“For us, every chip is so unique that we’re only dealing with a couple hundred systems, so the volume of input data is small,” said Ty Garibay, CTO at ArterisIP. “And this stuff is a black box. How do you deal with something that you’ve never dealt with before, particularly with issues involving biases and ethics? You need a lot more training data for that.”

Even the perceptions of what constitutes a bug are different for AI/DL/ML.

“The definition of a bug changes because the capability of the algorithm evolves in the field, and the algorithm is statistical rather than deterministic,” said Yosinori Watanabe, senior architect in Cadence’s System & Verification Group. “Sometimes, one may not be able to isolate a particular output you obtain from an algorithm of this kind as a bug, because it is based on an evolving probability distribution captured in the algorithm.”

This can be avoided by setting a clear boundary condition of acceptable behavior of the algorithm up front, said Watanabe. However, understanding those boundary conditions isn’t always so simple, in part because the algorithms themselves are in a constant state of refinement and in part because those algorithms are being used for a wide variety of applications.
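
As a rough illustration of what a boundary condition set up front might look like, the sketch below checks a stream of model outputs against a fixed envelope that does not change even as the model itself evolves. The `Boundary` fields and the example limits are hypothetical; the sources quoted here do not describe a specific mechanism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Boundary:
    """Acceptable-behavior envelope, fixed before deployment (hypothetical)."""
    low: float       # smallest acceptable output
    high: float      # largest acceptable output
    max_step: float  # largest acceptable change between consecutive outputs

def check_outputs(outputs, boundary):
    """Return (index, reason) pairs for outputs that violate the boundary."""
    violations = []
    previous = None
    for index, value in enumerate(outputs):
        if not (boundary.low <= value <= boundary.high):
            violations.append((index, "out of range"))
        elif previous is not None and abs(value - previous) > boundary.max_step:
            violations.append((index, "step too large"))
        previous = value
    return violations

envelope = Boundary(low=0.0, high=1.0, max_step=0.5)
flagged = check_outputs([0.1, 0.2, 0.9, 1.5], envelope)
```

The check says nothing about why the model produced a given value. It only separates outputs that fall inside the agreed envelope from those that do not, which is the part that can be verified without understanding the model's internals.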

Understanding the unknowns
One of the starting points in debugging AI/ML/DL is to delineate what you do and don’t understand.

This is simpler in machine learning than in deep learning, both of which fit under the AI umbrella, because the algorithms themselves are simpler. Deep learning builds a data representation from multiple layers of matrix operations, where each layer uses the output of the previous layer as input. Machine learning, in contrast, uses algorithms developed for a specific task.

“With deep learning, it’s more difficult to understand the decision-making process,” said Michael Schuldenfrei, CTO of Optimal+. “In a production environment, you’re trying to understand what went wrong. You can explain the model that the machine learning algorithm came from and do a lot of work comparing different algorithms, but the answer still may be different across different products. On Product A, Random Forest may work well, while on Product B, another algorithm or some combination works better. But machine learning is no good when there is not lots of data. Another area that’s problematic is when you have a lot of independent variables that are changing.”

And this is where much of the research is focused today.

“An AI system may look at a dog and identify it as a small dog or a certain type of dog,” said IBM’s Welser. “What you need to know is what characteristics did it pick up on. There may be five or six characteristics that a machine has identified. Are they the right characteristics? Or is there overemphasis of one characteristic over another? This all comes back to what are people good at versus machines.”

The chain of events leading up to that decision is well understood, but the decision-making process is not.

“There is this whole line of explainable AI, which is that you put some data into the system, and out pops an answer,” said Rob Aitken, an Arm fellow. “It doesn’t necessarily explain to you the precise chain of reasoning that led to the answer but it says, ‘Here are some properties of your input data that strongly influence the fact that this answer came out this way.’ Even being able to do that is helpful for a variety of contexts in that if we give AI programs or machine learning algorithms more control over making decisions, then it helps if they can explain why. Okay, you didn’t get the loan for your car, this is the particular piece of your data that flagged that. There is a flip side on security on that too. There are some attacks on machine learning algorithms that by asking the algorithm to give you responses on given sets of data, by playing with that data you can infer what its training set was. So you can learn some allegedly confidential things about the training set by selecting your queries.”
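
A minimal sketch of the kind of explanation Aitken describes is occlusion-style attribution: replace one input at a time with a neutral baseline and measure how much the output moves. The loan-scoring model, its weights, and the feature names below are invented for illustration and stand in for a black box. Real explainability tools use more robust methods.

```python
def occlusion_influence(model, sample, baseline):
    """Estimate each feature's influence on the model output.

    For every feature, swap in the baseline value and record how far
    the output shifts relative to the unmodified sample.
    """
    reference = model(sample)
    influence = {}
    for name in sample:
        perturbed = dict(sample)
        perturbed[name] = baseline[name]
        influence[name] = reference - model(perturbed)
    return influence

def toy_loan_model(x):
    """Hypothetical scoring function; higher means more likely to approve."""
    return 0.5 * x["income"] - 0.8 * x["debt_ratio"] + 0.1 * x["years_employed"]

applicant = {"income": 0.2, "debt_ratio": 0.9, "years_employed": 0.5}
typical = {"income": 0.5, "debt_ratio": 0.5, "years_employed": 0.5}

influence = occlusion_influence(toy_loan_model, applicant, typical)
# The most negative entry is the feature that hurt the score the most.
main_factor = min(influence, key=influence.get)
```

This does not reveal the chain of reasoning inside the model. It only ranks which inputs pushed the answer hardest, which is the limited but useful explanation described in the quote above.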

Training data bias plays a key role here, as well.

“It’s a big challenge in medical data because in some areas there’s actually disagreement among the experts on how to label something, so you wind up having to develop algorithms that are tolerant of noise in the labeling,” said Aitken. “We kind of know algorithmically what it’s doing and we observe that it’s telling us stuff that seems to be useful. But at the same time we also have demonstrated to ourselves that whatever biases went into the input set pop right out the output. Is that an example of intelligence or is that just an example of inference abuse, or something we don’t know yet?”

What works, what doesn’t
Once bugs are identified, the actual process of getting rid of them isn’t clear, either.

“One way to address this is to come at it from more of a conventional side, such as support systems and optimizing memory bandwidth,” said Rupert Baines, CEO of UltraSoC. “But no one knows how these systems actually work. How do you configure a black box? You don’t know what to look for. This may be a case where you need machine learning to debug machine learning. You need to train a supervisor to train these systems and identify what’s good and bad.”

Small variations in training data can spread, too. “The data used for training one machine might be produced by another machine,” said Cadence’s Watanabe. “The latter machine might implement different algorithms, or it might be a different instance that implements the same algorithm. For example, two machines, both of which implement an algorithm to play the game of Go, might play with each other, so that each machine produces data to be used by the other for training. The principle of debugging remains the same as above, since each machine’s behavior is verified against the boundary condition of acceptable behavior respectively.”

Another approach is to keep the application of AI/DL/ML narrow enough that it can be constantly refined internally. “We started with TensorFlow algorithms and quickly found out they were not adequate, so we moved over to Random Forest,” said Sundari Mitra, CEO of NetSpeed Systems. “But then what do you do? Today we do analysis and we are able to change our methodology. But how do you do that in a deep learning that is already fabricated?”

Progress so far
To make matters even more confusing, all of these systems are based on training algorithms that are in an almost constant state of flux. As they are used in real-world applications, problems show up. From there the training data is revised, and the inferencing systems are analyzed and tested to see how those changes affect behavior.

“That data involves not only how the testbench and the stimulus generation has behaved, but it also involves how the design has behaved,” said Mark Olen, a product marketing manager at Mentor, a Siemens Business. “It knows that in order to generate a good set of tests I want to do a lot of different things, and it knows that if I’ve tried to present a certain set of stimulus to the device and I’ve done it 1,000 times over the course of the day, over my simulation farm, and I’ve always gotten the same result, it knows to not do that again because I’m going to get the same result, so it has to do something different. It’s actually an application of some methods that are pretty similar to what we would call formal techniques, but it’s not formal verification in the pure sense of the way we think about property checking and assertion-based verification. It’s formal in terms of formal mathematics.”
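
The simplest version of the idea Olen describes, not repeating a stimulus whose result is already known, can be sketched as a generator that remembers what it has tried. This is a deliberately naive illustration and not the method used in any commercial tool: the function name, the fixed-width integer stimulus, and the random search are all assumptions.

```python
import random

def generate_novel_stimuli(count, width, seen, rng):
    """Return `count` stimuli of `width` bits that are not in `seen`.

    `seen` is updated in place and is assumed to hold only values that
    fit in `width` bits, so later calls also avoid earlier stimuli.
    """
    space = 1 << width
    if count > space - len(seen):
        raise ValueError("not enough untried stimuli left")
    fresh = []
    while len(fresh) < count:
        candidate = rng.getrandbits(width)
        if candidate not in seen:  # skip anything already exercised
            seen.add(candidate)
            fresh.append(candidate)
    return fresh

already_tried = {0, 1, 2}
new_tests = generate_novel_stimuli(5, 3, already_tried, random.Random(0))
```

A real flow would track coverage of design behavior rather than raw input values, but the principle is the same: spend simulation time only where the outcome is not already known.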

Olen noted that leading-edge companies have been working on this for some time. “We haven’t seen anything commercially yet, but you can imagine the Bell Labs type of customers. There are a handful of customers that have long since been on the forefront of some of this technology—developing it for themselves, not necessarily for commercial purposes,” he said.

The path forward
For years, debugging AI was put on the back burner while hand-written algorithms were developed and tested by universities and research houses. In the past year, all of that has changed. Machine learning, deep learning and AI are everywhere, and the technology is being used more widely even within systems where just last year it was being tested.

That will have to change, and quickly. The idea behind some of these applications is that AI can be used for training other systems and improving quality and reliability in manufacturing, but that only works if the training data itself is bug-free. And at this point, no one is quite sure.

— Ann Steffora Mutschler contributed to this report.
