Gaps In The AI Debug Process

Verification and debug of AI is a multi-level problem with several stakeholders, each with different tools and responsibilities.


When an AI algorithm is deployed in the field and gives an unexpected result, it’s often not clear whether that result is correct.

So what happened? Was it wrong? And if so, what caused the error? These are often not simple questions to answer. Moreover, as with all verification problems, the only way to get to the root cause is to break the problem down into manageable pieces.

The semiconductor industry has faced similar problems in the past. Software runs on hardware, and the software must assume the hardware will never make an error. Similarly, when mapping hardware onto an FPGA, it is expected that the FPGA fabric will never make a mistake. In both cases, the underlying execution platforms have been verified to an extent where they built trust over time, and the amount of verification performed on them in some cases is monumental.

Chips that include AI often are designed using large numbers of heterogeneous processors that are not as well verified. Frequently, they use interconnect and memory architectures that are novel. These chips often deploy redundancy to get over issues of yield and perform large and complex scheduling tasks. In addition, they utilize compilers/mappers that take algorithms developed in the cloud and manipulate them to enable inference to be deployed on arbitrary hardware platforms.

Unlike a compiler for software, the AI compiler/mapper is not a lossless process. The code at its output does not behave identically to the code that went in. And at the top of the tree, AI itself is an inexact process that may yield a completely different output when faced with extremely small changes in its input.

It is often separate teams that deal with each of these stages. “There are at least three or four different stakeholders involved,” says Nick Ni, director of product marketing for AI, software and ecosystem at Xilinx. “All of them have different care-abouts in terms of debugging. Starting from the very top with AI scientists, their job is agnostic about hardware. Then there is inference. Now you start putting a snapshot of the AI scientist’s model onto something to deploy, and you have to debug that. Here you at least have a correct answer reference. You are debugging against the AI scientist’s results. ‘If this is the input dataset, then these are the expected results.’ When you have all the precision and accuracy checks done, and it looks like it is going to work on a TPU, or GPU, or FPGAs, then you put it on actual hardware and see if the bits are coming back correctly.”

But there are several meanings for correctness. "Generally speaking, a neural network provides both an answer and a probability," says Steve Steele, director of product marketing for the Machine Learning Group at Arm. "One can never know if the 'right answer' was returned. If the result is 94.6% probability that 'a person is in the crosswalk', how do you know you are right or wrong? On the other hand, tuning ML performance is more straightforward. If you are only getting 2 inferences per second, but projection says I should be getting 15 inferences/sec, tools can provide that level of insight."

Learning from the past
Debug starts when something goes wrong in the field, or an act of verification yields an unexpected result. Usually, step one is to determine if the problem is in the design, in the test, or in the specification itself. Verification in a controlled environment is always easier because it is deterministic and repeatable. However, it still can be difficult to determine when something went wrong, which is very different from when the problem became observable.

Software means that multiple stakeholders are involved. “You never verify a processor by booting Linux on it,” says Simon Davidmann, CEO for Imperas Software. “You start at the smallest level and test it. When we are building a model of a processor, you test every single instruction in great detail. This is white-box testing. Then you build up the next level and you run a small program on it, then another program, and you slowly build confidence layer after layer. When you run Linux, you get to 2 billion instructions pretty quickly on hardware and you cannot debug that. With AI, you are probably talking about 100 billion instructions. There is no way to understand it by looking at that level.”
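Davidmann's bottom-up approach can be sketched with a toy instruction-set model. The encoding, register names, and `execute` function below are invented for illustration, not Imperas code; the point is white-box testing of one instruction in isolation, long before any program runs.

```python
# Toy instruction-set model: white-box test of one instruction at a time.
# The instruction tuple format and register file here are hypothetical;
# a real ISA model (RISC-V, Arm) is far more detailed.

MASK32 = 0xFFFFFFFF  # model a 32-bit register file

def execute(regs, instr):
    """Execute a single decoded instruction against the register state."""
    op, rd, rs1, rs2 = instr
    if op == "add":
        regs[rd] = (regs[rs1] + regs[rs2]) & MASK32  # wrap modulo 2**32
    else:
        raise ValueError("unimplemented op: " + op)
    return regs

# Exercise the corner cases of this one instruction in great detail
# before building the next layer of confidence on top of it:
regs = {"x1": 7, "x2": MASK32, "x3": 0}
execute(regs, ("add", "x3", "x1", "x2"))
assert regs["x3"] == 6  # 7 + (2**32 - 1) wraps to 6
```

Only after every instruction passes such directed tests does it make sense to run small programs, then larger ones, layer after layer.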

Thankfully, commercial CPUs and GPUs, and the tool chains they use, are extremely well tested. "Mapping an AI algorithm onto a GPU, assuming the software stack and GPU are well supported, is almost trivial," says Russell Klein, HLS platform program director for Siemens EDA. "A single line of Python does the trick, with tf.device('/gpu:0'). Therefore, it is feasible that on an application processor running embedded Linux with a full Python/TensorFlow integration on a well-supported platform, simply specifying the device will make it 'just work'. Nvidia and Arm will do the work needed to have their processors well supported by all the popular AI frameworks. TPUs have a similar use model. Users will develop and debug using the CPU, and then deploy on either the GPU or TPU."

But not all processors are that well supported. “The reason you buy an Arm core is because they spent 10 billion cycles testing it,” says Imperas’ Davidmann. “All you have to do is integrate it. But many people today are choosing to develop custom processors, such as those based on the RISC-V architecture, and they are less likely to have received the same level of verification.”

Going from single core to multiple heterogeneous cores is not simple. "Certainly, GPUs, TPUs, and many-core processor arrays are viable approaches to accelerating inferencing without developing any custom hardware," says Siemens' Klein. "Inferencing algorithms are embarrassingly parallel, so they are good candidates for many-core platforms. Multi-core systems have traditionally been very hard to debug, and desktop application debug has really been focused on debugging single-threaded programs. While gdb and Eclipse — popular desktop debug tools — have added some capabilities for multi-core, in my opinion they leave a lot to be desired."

Put simply, design tools need to keep pace with changes in chip architectures. “Users need debuggers that provide a non-intrusive, concurrent view into any number of processors (limited only by compute resources) in a simulated design,” continues Klein. “The design can be simulated with ISA processor models (AFM, QEMU, Spike), or RTL running in simulation or emulation. An important attribute is the amount of slop between the time a stop is requested and the time you can look into the core. With most debuggers there is a non-deterministic delay between the time each core is halted for debug, so you don’t get a consistent view across the cores.”

Seeking determinism
AI adds a new level to the debug problem, which is non-determinism. “AI is all about probabilities, and it’s not precise,” says Davidmann. “And that is before we consider quantization. An algorithm as developed may predict with 98% probability that a given picture is of a cat when using 32-bit precision. If we quantize that down to 16-bit, we may get 96% probability. When driving a car, I don’t need to know that it’s a cat or another animal that looks just like a cat. It is good enough to know it’s a cat and not a person or a tree because then I make different decisions.”
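The effect Davidmann describes can be illustrated numerically. This is a hedged sketch with invented logits and a naive fixed-point rounding scheme, not any production quantizer: coarsening precision shifts the output probabilities slightly but, ideally, leaves the decision (the top class) unchanged.

```python
import math

def softmax(xs):
    """Convert raw scores (logits) into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def quantize(x, frac_bits):
    """Naive symmetric fixed-point rounding to `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

# Hypothetical logits for classes [cat, dog, person]
logits = [4.173, 2.958, -1.002]
p_full = softmax(logits)                              # full precision
p_quant = softmax([quantize(x, 4) for x in logits])   # coarsely quantized

# The probabilities drift a little, but the argmax — the decision the
# system acts on — is unchanged, which is often "good enough."
assert p_full.index(max(p_full)) == p_quant.index(max(p_quant))
```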

That is an important distinction. "An AI accelerator is often being custom-developed for a specific inferencing application," says Klein. "A couple years ago I counted several hundred AI accelerator startups — some building IP and some building discrete devices — but all aiming to outperform the TPUs from Google and the GPUs from Nvidia. Today, the gold rush seems to be in AI deployment platforms. OctoML just got $85 million in new funding. This morning, through my LinkedIn feed, I was introduced to three more companies that claim to take an inferencing algorithm from 'any ML framework' to 'any target hardware'. Here, I think the correctness of the translation is the responsibility of the AI deployment software provider, and any debug facilities should come from the AI accelerator provider."

It is important to understand where the non-determinism is. “Once debugged and running on the device, the execution of the algorithm/network will be consistent,” says Arm’s Steele. “However, it’s possible the network itself may have an unexpected behavior stimulated by unforeseen inputs and inadequate training of the network. For that reason, good system design will have placed guard bands to check the output of the network is within expected limits.”
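The guard bands Steele describes might look like the following sketch. The function name and thresholds are assumptions for illustration, not an Arm API: the idea is simply to sanity-check a network's output before the system acts on it.

```python
def guard_band_ok(probs, min_top=0.6, tol=1e-3):
    """Check that a network's output is within expected limits.

    Rejects outputs that are malformed (not a valid probability
    distribution) or too uncertain to act on. The `min_top` and `tol`
    thresholds are illustrative, not from any real deployment.
    """
    if not probs:
        return False
    if any(p < 0.0 or p > 1.0 for p in probs):
        return False                      # not valid probabilities
    if abs(sum(probs) - 1.0) > tol:
        return False                      # distribution does not sum to 1
    return max(probs) >= min_top          # confident enough to act on

# A confident, well-formed output passes; an ambiguous one is flagged
# so the system can fall back to a conservative default action.
assert guard_band_ok([0.946, 0.044, 0.010])
assert not guard_band_ok([0.40, 0.30, 0.30])
```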

While it would be nice if that were true, many complex systems have failed because of issues like clock domain crossing, or timing races. These problems do not exhibit themselves in a deterministic manner.

Domain specificity can lead to the improvement of algorithms. "AI scientists are in continuous debug mode, or more like research, where they continuously train on different data sets, trying to find new corner cases and make the model more accurate for them," says Xilinx's Ni. "In a way, they debug or maybe validate the data sets. They typically come up with their own way of measuring whether it's correct or not, whether it has the necessary picture-recognition accuracy or acceptable transcription, or things like that. There remains a lot of secret sauce. There are not a lot of things that are standardized on what is right or wrong. And this is really where it gets tricky, because at this level debug is not about providing the correct answer. When we talk about FPGA debugging, or even C, C++ debugging, there is a correct answer, and you are regularly checking against it. But here there isn't, and sometimes it is a moving target."

AI algorithm development is hardware-agnostic, and there is a big divide between that and inference on targeted hardware. "Mapping an AI algorithm to specific hardware is a big challenge, and a big portion of debugging should be carried out early in the design process by considering hardware constraints in a model-based way," says Benjamin Prautsch, group manager of advanced mixed-signal automation at Fraunhofer IIS' Engineering of Adaptive Systems Division. "Rather than mapping an algorithm to some hardware and then debugging the hardware, problems should be identified early using the model. Supporting tools might be IDEs that include hardware-induced constraints, making AI development hardware-aware from the start. Additionally, both the test environment of the model and the actual hardware should be co-designed in order to enable good comparison of model and hardware, which will likely ease debugging in case something goes wrong."

The step from algorithm to hardware remains a bit like black magic. “When you take a model and try to deploy into FPGAs or ASICs, in almost 100% of cases you can’t just deploy it as is,” says Ni. “At a minimum, you have to change precision. Even with GPUs, you may have to change to bfloat from float32. There is a level of debug here, which is to check if your answers are still reasonably correct. It doesn’t have to be exactly correct, but reasonably correct after tuning to that particular architectural precision. Sometimes you may merge layers. When you’re deploying to the actual hardware, you often deploy shortcuts, like Layer A B C can be combined into Layer D, and execute those together. You save memory, latency, etc. When you start doing that, it again makes debug harder.”
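The layer-merging shortcut Ni mentions can be shown in miniature. Here two 1-D affine "layers" are fused into one; real fusion operates on convolutions, batch-norm parameters, and the like, and all names below are invented for illustration.

```python
def affine(a, b):
    """A toy 1-D 'layer': y = a*x + b."""
    return lambda x: a * x + b

# Two layers executed back-to-back...
layer_a = affine(2.0, 1.0)
layer_b = affine(-0.5, 3.0)

# ...can be fused into a single equivalent layer by composing the
# coefficients: b(a(x)) = (-0.5*2.0)*x + (-0.5*1.0 + 3.0).
# This saves a pass over memory and reduces latency.
fused = affine(-0.5 * 2.0, -0.5 * 1.0 + 3.0)

# The fused layer must still match the original pair numerically —
# this equivalence check is exactly the debug step Ni describes.
for x in (-3.0, 0.0, 4.0):
    assert abs(layer_b(layer_a(x)) - fused(x)) < 1e-9
```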

But those layers also make the process somewhat easier. "Neural networks are very complex, but they are made up of layers, each of which is rather simple," says Klein. "Each layer will perform a function like a convolution, a matrix multiplication, or a max pooling operation. For every implementation of the neural network, you need to be able to capture the input arrays and output arrays for each layer. You also need to be able to automatically compare the input arrays and output arrays from different representations and flag any differences. This requires a bit of programming up front, but it will save a lot of time in the long run. With this in place, anytime you run an inference in C++ or Verilog, you can run the same inference on the original network in the ML framework and compare the results. I integrate the different representations into a single environment, so the comparison always takes place, and any errors are flagged immediately. This means using the DPI interface on the Verilog simulator or emulation environment as a bridge between the Verilog and any C++ or Python representations."
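A minimal version of the per-layer comparison Klein describes might look like this, assuming each representation has dumped its layer inputs and outputs as flat lists. The helper names are invented; a real harness would compare multi-dimensional arrays with framework-appropriate tolerances.

```python
def close(a, b, tol):
    """Element-wise comparison with a tolerance for precision differences."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

def first_divergence(ref_layers, dut_layers, tol=1e-3):
    """Walk captured (input, output) pairs per layer from two
    representations; return (layer index, kind) of the first mismatch,
    or None if they agree everywhere within tolerance."""
    for i, ((ref_in, ref_out), (dut_in, dut_out)) in enumerate(
            zip(ref_layers, dut_layers)):
        if not close(ref_in, dut_in, tol):
            return i, "input"    # data transfer between layers went wrong
        if not close(ref_out, dut_out, tol):
            return i, "output"   # the layer's own computation diverged
    return None

# Reference dumps (e.g., from the ML framework) vs. device dumps
# (e.g., from C++/Verilog via DPI) — hypothetical values:
ref = [([1.0, 2.0], [2.0, 4.0]), ([2.0, 4.0], [4.0, 8.0])]
dut = [([1.0, 2.0], [2.0, 4.0]), ([2.0, 4.0], [4.0, 9.5])]
assert first_divergence(ref, dut) == (1, "output")
```

Distinguishing "input" from "output" mismatches is the key: the former points at data movement between layers, the latter at the layer's implementation itself.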

This is something the frameworks are beginning to support. “The good news is that most of the frameworks support that level of debug,” says Ni. “In TensorFlow, there’s something called Eager mode. (TensorFlow’s Eager execution is an imperative programming environment that evaluates operations immediately, without building graphs. Operations return concrete values instead of constructing a computational graph to run later.) You execute layer by layer, you put the inputs in, and you get the outputs right away. Every layer can be debugged to see if this answer coming back is exactly what you expected from the reference.”

As always, doing debug earlier is better. “This enables you to immediately see the first point of divergence in the different versions of the inference,” adds Klein. “It is then easy to see which layer is failing. The failure will either be that the inputs were different, or the inputs were the same, but the output diverged. The former is a problem of data transfer between the layers. The latter is a problem in the functioning of the layer itself. Knowing the type and location of the failure makes fixing it rather straightforward.”

Building the process
Verification always has been about building a process that limits the possibility for bug escapes. "There is a maturing that has to happen in the AI space," says Ni. "The existing approach is to say that my system has been debugged and seems to be correct across these datasets, inputs, and calibration measures, so it's ready to put into the real world to see what sticks."

We all know how that ends. “Finding bugs in the field means your existing process is flawed,” says Davidmann. “If it fails when 2 billion things are involved, it means that you weren’t testing it properly when you had 200 million things, or 20 million things, or 2 million things, or 200,000 things. You test it at the small level and make sure you’ve got it right. Sometimes that means finding the right abstractions.”

Others agree. “When verifying an inference engine, you have to start with a high-level model of the system,” adds Davidmann. “This should be developed long before anyone touches RTL. You have to fully verify every one of the tiny kernels. They are the fundamental building blocks that tie the hardware into the software. Once they run in the high-level model of the hardware, you can run those on the RTL as well.”
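Verifying each tiny kernel against a golden model, as Davidmann suggests, reduces to equivalence checks like the sketch below. A ReLU kernel is used as a stand-in (real kernels are convolutions, poolings, and so on), and both function names are invented for illustration.

```python
def relu_golden(xs):
    """Reference (golden) model of the kernel."""
    return [x if x > 0.0 else 0.0 for x in xs]

def relu_dut(xs):
    """Stand-in for the implementation under test — first the
    high-level model of the hardware, later the RTL driven through
    the same comparison harness."""
    return [max(0.0, x) for x in xs]

# Directed tests at the kernel level, long before any full network
# runs: zeros, negatives, positives, and extreme values.
for vec in ([0.0], [-1.0, 1.0], [-1e9, 1e9], [0.5, -0.5, 0.0]):
    assert relu_dut(vec) == relu_golden(vec)
```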

While AI systems may be larger than processors of the past, most of the problems are the same. Verification does not start when a bug is found in the field. It requires a process to systematically ensure that each step in the refinement of the algorithm or hardware, or each transformation of the data, can be independently verified. What makes AI more difficult is that the industry has yet to define consistent metrics that ensure the end-to-end process has reached a defined level of quality. That is made more difficult because the starting point itself is non-deterministic.

The first step with any debug problem is to locate the root cause, and this should start by systematically comparing results from the highest level down to the point at which a divergence becomes apparent. The good news is that tools are emerging to make that easier.
