What engineers can learn from ER doctors.
A while ago, I went to the ER with a friend who suddenly felt numbness in his face. He felt okay otherwise (and he turned out to be fine), but better safe than sorry.
While the doctor examined him, I noticed that before tracing the problem itself, she asked a few questions to rule out familiar conditions that can manifest in similar ways. Only then, after each “okay, it is not that… good,” did she look at the full picture to conclude what the problem might be.
Why am I telling you all this? Because there is no fundamental difference between how this doctor reached her diagnosis and our day-to-day bug hunting. Maybe we can learn something from her.
In my last blog, I discussed the long and tedious verification convergence phase. Every leader of a complex SoC project knows how long and painful it is to converge the verification effort toward the end of the project, and how hard it is to reach a low enough risk level for tapeout.
The term “debugging” means “taking out the bugs,” i.e., performing root-cause analysis, pinpointing the problem, and fixing it.
However, if we look at the verification engineer’s process from a failing test or regression to a passing one, debugging is only part of the job. Many times it is not efficient to trace back from the failure to the root cause, at least not before we understand the scenario. Human beings think in “stories,” and because of that, the first thing to do when approaching a failing test is to ask what the test was doing. This problem intensified, and was somewhat overlooked, when the industry switched from directed tests to random verification. It can be even worse for full-chip failures, where many different testbenches, subroutines, and parallel threads execute to create a random and complex scenario. We call this process of understanding the story “diagnostics.”
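To make the idea concrete, here is a minimal sketch of that diagnostics step in plain Python (this is an illustration, not Cogita): reconstructing the story of a random test from its simulation log before any root-cause work begins. The log format and the field names (time, thread, sequence, transaction) are hypothetical; a real testbench would have its own.

```python
# A sketch of the "diagnostics" step: before chasing the root cause,
# reconstruct what the random test actually did.
# Assumed (hypothetical) log line format:
#   @1200 [dma_seq] burst_seq: WRITE addr=0x40 len=8
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r"@(?P<time>\d+)\s+\[(?P<thread>\w+)\]\s+(?P<seq>\w+):\s+(?P<txn>.+)"
)

def reconstruct_story(log_path):
    """Group transactions per parallel thread so the scenario reads as a story."""
    story = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line.strip())
            if match:
                story[match["thread"]].append(
                    (int(match["time"]), match["seq"], match["txn"])
                )
    return story

if __name__ == "__main__":
    # "sim.log" is a placeholder path for your simulator's output.
    for thread, events in reconstruct_story("sim.log").items():
        print(f"--- {thread} ---")
        for time, seq, txn in events:
            print(f"  @{time:<8} {seq}: {txn}")
```

Even a crude summary like this answers “what was the test doing?” per thread, which is the question that has to be settled before anyone opens the waveform viewer.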
Interestingly enough, proper debugging techniques are hardly ever taught. There are plenty of UVM and formal verification courses, but what do you do when the test fails? In this vast domain, it seems, we leave engineers to fend for themselves, without guidelines or methods.
At Vtool, we use Cogita, our diagnosis and debugging platform, to teach engineers how to perform a systematic and structured analysis process.
The first step is always to answer the question, “What happened?” When running a random test, it is often not clear what the test performed, or tried to perform. This step, done properly and with the right tools, can save hours of chasing ghosts. When the question remains unanswered, it wastes not only the verification engineer’s time but often the RTL designer’s time as well, because the suspected “RTL bug” is reported without an explanation of the scenario in question.
To perform this step properly, one needs a control panel or diagnostics tool that tells the story in a few simple images. Textual output will not do the job; it is too long and too complex.
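As an illustration of what “a few simple images” can mean, here is a hedged sketch that renders a reconstructed story as a one-glance timeline of parallel threads, using matplotlib. The thread names and time spans are invented for the example; in practice they would come from the parsed log above.

```python
# A sketch of a "control panel" view: the scenario as one image
# instead of pages of text. All data below is made up for illustration.
import matplotlib.pyplot as plt

# (start_time, duration) spans of activity per parallel thread -- hypothetical.
threads = {
    "cpu_seq": [(0, 40), (60, 30)],
    "dma_seq": [(10, 70)],
    "irq_seq": [(45, 5), (95, 5)],
}

fig, ax = plt.subplots(figsize=(8, 2.5))
for row, (name, spans) in enumerate(threads.items()):
    ax.broken_barh(spans, (row - 0.3, 0.6))  # one horizontal lane per thread
ax.set_yticks(range(len(threads)))
ax.set_yticklabels(threads.keys())
ax.set_xlabel("simulation time")
ax.set_title("What happened? The scenario at a glance")
plt.tight_layout()
plt.show()
```

One such picture shows overlaps and gaps between threads immediately, where the same information in text form would run to hundreds of lines.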
Only after answering the “what happened” question can step-by-step debugging begin. Here, too, one needs help asking the right questions (or raising useful hypotheses) and getting answers quickly and efficiently. Because debugging is a lonely job, the process often diverges simply because there is no control or guidance over the debugging steps. Here as well, Cogita serves as a guiding platform that helps improve verification engineering skills.
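To show what “asking the right questions” can look like in practice, here is a small, hypothetical sketch of a rule-out loop in the spirit of the ER doctor: cheap, familiar suspicions are checked first, and every answer is recorded so the process cannot silently diverge. The checks and the trace fields are placeholders, not Cogita’s API.

```python
# A sketch of structured, hypothesis-driven debugging: like the doctor's
# rule-out questions, likely and cheap explanations are checked first.
# The predicates and trace fields below are illustrative placeholders.

def reset_deasserted(trace):        # cheapest, most common suspect first
    return trace.get("reset_cycles", 0) > 0

def config_matches_spec(trace):
    return trace.get("cfg_ok", False)

def no_protocol_violation(trace):
    return not trace.get("proto_errors", [])

HYPOTHESES = [
    ("DUT was never taken out of reset", reset_deasserted),
    ("test configured the DUT incorrectly", config_matches_spec),
    ("interface protocol was violated", no_protocol_violation),
]

def triage(trace):
    """Walk the rule-out list; return the first hypothesis that survives."""
    for suspicion, is_ruled_out in HYPOTHESES:
        ruled_out = is_ruled_out(trace)
        print(f"{'ruled out' if ruled_out else 'SUSPECT  '}: {suspicion}")
        if not ruled_out:
            return suspicion
    return None  # none of the usual suspects -- start deeper root-cause analysis

triage({"reset_cycles": 12, "cfg_ok": True, "proto_errors": ["hazard @1200"]})
```

The value is not in the code itself but in the discipline it encodes: an ordered list of suspicions, an explicit answer to each, and a written trail of “okay, it is not that” before the deep dive starts.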
The ER doctor performed an efficient diagnosis because she had learned a method. We verification engineers should develop such methods and spread them throughout the verification community.