ML And UVM Share Same Flaws

Metrics for determining completeness are both incomplete and biased.


A number of people must be scratching their heads over what UVM and machine learning (ML) have in common, such that they can be described as having the same flaws. In both cases, it is a flaw of omission in some sense.

Let’s start with ML, and in particular, object recognition. A decade ago, AlexNet, coupled with GPUs, managed to beat all of the object detection systems that relied on traditional techniques. From that day forward, steady improvements have been made. Models have become larger and more complex, and their results have surpassed the accuracy of humans. A similar story can be told about verification, which used to rely on directed testing. Then the introduction of constrained random test pattern generation, followed by SystemVerilog and UVM, allowed for a greater degree of coverage.
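The contrast between directed and constrained random stimulus can be sketched in a few lines of Python. This is a toy illustration only, not a real UVM flow (which would use SystemVerilog constraints and sequences); the bus transaction, its constraints, and all names here are hypothetical.

```python
import random

# A directed test enumerates the handful of cases the author thought of
# for a hypothetical (address, burst length) bus transaction.
DIRECTED_TESTS = [
    (0x0000, 1),   # lowest address, smallest burst
    (0xFFFC, 1),   # highest aligned address
    (0x0000, 16),  # maximum burst length
]

def constrained_random_txn(rng):
    """Constrained random: any legal transaction, not just anticipated ones.
    Constraints: word-aligned address; burst must not cross the 64 KB limit."""
    length = rng.randint(1, 16)
    addr = rng.randrange(0, 0x10000 - 4 * length, 4)
    return addr, length

rng = random.Random(1234)  # fixed seed so the run is reproducible
randomized = [constrained_random_txn(rng) for _ in range(1000)]

# Every generated transaction satisfies the constraints...
assert all(a % 4 == 0 and a + 4 * n <= 0x10000 for a, n in randomized)
# ...but nothing in the flow tells us when we have generated *enough*.
```

The random generator reaches legal corners nobody wrote a directed test for, which is exactly the benefit the article describes, and it comes with exactly the problem the next paragraph raises: there is no built-in notion of completion.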

But there are two problems. The first problem for ML is that you never know if you have trained the network with enough images, or whether each of those images is useful. This is a point it shares with constrained random test pattern generation. Neither has a way to define completion, and the patterns used for functional verification, just like the image sets used for ML, are far from efficient. A lot of time and money is being spent on defining good data sets for ML that are both complete and free of bias, but most of the time those efforts fail.

Maybe you think functional stimulus is free of bias, but that also is not true. It is biased toward the problems each team is aware of, problems that recently tripped them up, or areas that are new, not to mention those problems that are most likely to cause a respin. There are almost certainly very few tests that look for performance issues, power issues, security issues, safety issues, and many other things that could more easily be swept under the rug.

The notion of completeness has been tackled in functional verification by the inclusion of coverage, but those metrics are both incomplete and biased. They are incomplete because attaining 100% functional coverage tells you nothing about how well verified your design is. It only shows progress toward an arbitrary goal. I have been told many times that functional coverage is often skewed toward those issues that are easy to find, rather than being truly representative of the state space of the design.
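Why 100% coverage is progress toward an arbitrary goal can be shown with a small Python sketch. The bin names below are invented for illustration; in a real flow these would be SystemVerilog covergroup bins chosen by the verification team.

```python
# A coverage model is just the set of bins someone chose to write down.
# Hitting all of them measures progress against that list, not against
# the design's real state space.
chosen_bins = {"addr_low", "addr_high", "burst_1", "burst_max"}
hits = {"addr_low", "addr_high", "burst_1", "burst_max"}

coverage = 100.0 * len(hits & chosen_bins) / len(chosen_bins)
print(f"functional coverage: {coverage:.1f}%")  # 100.0%

# Score the same stimulus against a larger, hypothetical bin list that
# includes cases nobody wrote down, and the picture changes.
fuller_bins = chosen_bins | {"burst_max_x_addr_high", "back_to_back_writes"}
coverage = 100.0 * len(hits & fuller_bins) / len(fuller_bins)
print(f"against the fuller model: {coverage:.1f}%")  # 66.7%
```

The number reported depends entirely on which bins were written, which is the bias the paragraph above describes.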

Object recognition, on the other hand, has no real metrics at all. Data sets are measured by size, without any real measure of quality. For example, autonomous driving needs to be able to recognize street signs. In 2013, a database was created that contained about 900 images with 1,200 traffic signs, captured during various weather conditions and times of day. By 2016, another database had grown to about 100,000 images at much higher resolution. But is it complete? Who knows?

How much better are the detection systems based on that new database? As Steven Teig, CEO of Perceive, pointed out in his DAC 2022 keynote, they remain quite easy to fool just by the injection of some noise. And that is just one form of adversarial attack; they often get it completely wrong on their own.

And this brings us to the second and biggest flaw they have. ML is inherently untrustworthy. Cars have been known to crash into parked emergency vehicles or come to an emergency stop for no good reason, thus causing accidents. How can this happen when they are supposedly better than a human at distinguishing between every imaginable breed of dog? It’s because they have no notion of importance or criticality. Every picture, every recognition, is treated as an equal. And while they are, on average, better than a human, humans are very unlikely to make these same kinds of catastrophic mistakes. Can anyone point to what was wrong with the model, or what image should have been in the training set? No. So ML remains inherently untrustworthy.

How does that relate to functional verification? We also have no way to measure importance or severity. Every cover point is treated as an equal. Teams will stop when they reach 99.99% coverage, but what if that one last cover point is actually a fatal flaw? Just because the random verification didn’t manage to activate it doesn’t make it less important. Yes, it is possible that through a verification plan, cover points may be added over time, such that core levels of functionality are verified before going on to more obscure functionality. But I am unaware of any company that has a formalized methodology for this.
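One direction the argument suggests is weighting cover points by severity rather than counting each as one. The sketch below is purely hypothetical: no standard coverage flow works this way today, and the cover points and weights are invented for illustration.

```python
# Each cover point carries a severity weight instead of counting as 1.
# A missed safety-critical point then dominates the closure number.
cover_points = {
    "basic_read":       {"weight": 1,  "hit": True},
    "basic_write":      {"weight": 1,  "hit": True},
    "ecc_double_error": {"weight": 50, "hit": False},  # the "one last point"
}

def flat_coverage(cps):
    """Conventional closure: every point counts equally."""
    return 100.0 * sum(c["hit"] for c in cps.values()) / len(cps)

def weighted_coverage(cps):
    """Severity-weighted closure: missing a critical point hurts."""
    total = sum(c["weight"] for c in cps.values())
    hit = sum(c["weight"] for c in cps.values() if c["hit"])
    return 100.0 * hit / total

print(f"flat:     {flat_coverage(cover_points):.1f}%")      # 66.7%
print(f"weighted: {weighted_coverage(cover_points):.1f}%")  # 3.8%
```

The flat metric says the team is two-thirds done; the weighted one says they have barely started, because the one unhit point is the one that matters.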

Both ML and UVM are unable to define what is important, what is critical, and that is why ML is untrustworthy — and why verification managers are unable to sleep at night. Using use-case scenarios is a step in the right direction, because it allows us to define what is important, but that is essentially taking us a step back into directed testing. It was constrained random that allowed many more test cases to be created and to get the design into states that nobody may have thought to create a test for, but in doing that we lost the notions of focus, importance, and efficiency.

It is time to rethink completion and importance in both functional verification and machine learning, and along with that to put in place a methodology that defines closure efficiently.


Lewis Sternberg says:

Glad to see you’re still in the game, Brian. Informative, insightful, and always a good read 🙂
