Finding one bug should be a hint to search for more of them in the same area.
After analyzing bugs on several generations of CPUs, I came to the conclusion that “bugs fly in squadrons.” In other words, when a bug is found in a given area of the design, the probability that there are other bugs with similar conditions, in the same area of the design, is quite high.
Finding a CPU bug is always satisfying, but it should not be an end in itself. If we accept that bugs fly not alone but in groups, or squadrons, then finding one bug should prompt the processor verification team to search for more of them in the same area.
Here is a scenario. A random test finds a bug after thousands of hours of testing. We could ask ourselves: how did it find this bug? The answer is likely to be a combination of events that had not been encountered before. Another question could be: why did the random test find this bug only now? Most likely because of an external modification: a changed test parameter, an RTL modification, or a simulator modification, for example.
With this new, rare bug found, we know that we have a more capable testbench that can now test a new area of the design. However, we also learn that, before the testbench was improved, that area of the design was not stressed. If bugs fly in squadrons, we now have a new area of the design to explore further in search of more bugs. How are we going to improve our verification methodology?
To improve our testbench and hit these bugs, we can add checkers and assertions, and we can add tests. Let’s focus on testing.
To enlarge the scope so that we are confident we will hit these bugs, we use smart-random testing. Reproducing the bug with a directed test hits only that exact bug. However, we said that bugs fly in groups, and the probability that other bugs exist in the same area, under similar conditions, is high, so we need a wider scope than a directed test offers. Purely random testing is not as useful here either, because we already have an idea of what we want to target, following the squadron pattern.
Let’s assume that the bug was found on a particular RISC-V instruction. Can we improve our testing by increasing the probability of having this instruction tested? At first glance, probably, because statistically we get more failures exposing the same bug. However, most bugs are found through a combination of rare events: a stalled pipeline, a full FIFO, or some other microarchitectural implementation detail. Standard testbenches can easily tune the probability of an instruction by simply changing a test parameter. But filling a FIFO cannot be requested directly through a test parameter. It takes a combination of other independent parameters (such as delays) to make the FIFO full more often.
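To illustrate why FIFO fullness is an indirect effect, here is a minimal Python sketch of a toy FIFO. It is an assumption-laden model, not a real testbench: the function `fifo_full_fraction` and the knobs `push_prob` and `pop_prob` are hypothetical stand-ins for producer and consumer delay parameters. There is no "make the FIFO full" knob; fullness only emerges from how the two delay knobs are combined.

```python
import random


def fifo_full_fraction(push_prob, pop_prob, depth=4, cycles=20_000, seed=0):
    """Return the fraction of cycles on which a toy FIFO is full.

    push_prob and pop_prob are hypothetical stand-ins for producer and
    consumer delay knobs; nothing here models a real design.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    occupancy = 0
    full_cycles = 0
    for _ in range(cycles):
        # The consumer drains one entry unless it is delayed this cycle.
        if occupancy > 0 and rng.random() < pop_prob:
            occupancy -= 1
        # The producer pushes one entry if there is room and it is not delayed.
        if occupancy < depth and rng.random() < push_prob:
            occupancy += 1
        if occupancy == depth:
            full_cycles += 1
    return full_cycles / cycles


# Balanced delays: the FIFO is full only part of the time.
balanced = fifo_full_fraction(push_prob=0.5, pop_prob=0.5)
# Fast producer, slow consumer: the FIFO is full far more often.
pressured = fifo_full_fraction(push_prob=0.9, pop_prob=0.3)

print(f"balanced:  {balanced:.2%} of cycles full")
print(f"pressured: {pressured:.2%} of cycles full")
```

In this toy model, changing either knob alone shifts the result only partly; it is the combination of a fast producer and a slow consumer that keeps the FIFO full.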
Using smart-random testing in our verification methodology allows us to be both targeted and broad enough to efficiently find more bugs in this newly discovered area. It consists of tuning the test so that the other events that trigger the bug occur more often. In other words, it means adjusting several parameters of the test, not just one. It may seem more time-consuming, but this methodology is very effective at improving the quality of our testing.
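The effect of adjusting several parameters rather than one can be sketched with a small Monte Carlo model in Python. Everything here is an illustrative assumption: the `Knobs` fields, the probabilities, and the idea that the bug needs three independent events to coincide (the suspect instruction, a full FIFO, and a stalled pipeline). Real events are rarely independent, and a full FIFO would itself be obtained through delay settings.

```python
import random
from dataclasses import dataclass


@dataclass
class Knobs:
    """Hypothetical per-test event probabilities (illustrative values only)."""
    instr_prob: float      # chance the test exercises the suspect instruction
    fifo_full_prob: float  # chance the FIFO is full (reached indirectly, via delays)
    stall_prob: float      # chance the pipeline is stalled


def hit_rate(knobs, trials=100_000, seed=1):
    """Estimate how often all three events coincide in one test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw all three events on every trial so that runs sharing a seed
        # consume the same random sequence and stay comparable.
        instr = rng.random() < knobs.instr_prob
        fifo_full = rng.random() < knobs.fifo_full_prob
        stalled = rng.random() < knobs.stall_prob
        if instr and fifo_full and stalled:
            hits += 1
    return hits / trials


baseline = hit_rate(Knobs(instr_prob=0.05, fifo_full_prob=0.02, stall_prob=0.05))
# Tuning one parameter: only the instruction is made more frequent.
instr_only = hit_rate(Knobs(instr_prob=0.50, fifo_full_prob=0.02, stall_prob=0.05))
# Smart-random: every contributing event is made more frequent.
smart = hit_rate(Knobs(instr_prob=0.50, fifo_full_prob=0.30, stall_prob=0.30))

print(f"baseline:         {baseline:.4%}")
print(f"instruction only: {instr_only:.4%}")
print(f"smart-random:     {smart:.4%}")
```

Under these assumptions, raising only the instruction probability helps, but the coincidence stays rare because the other two events remain unlikely. Raising all three is what makes the triggering condition common enough to explore the area around the original bug.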
Improving testbenches by following bug squadrons, and killing each of them during product development, is key.