Bug Escapes And The Definition Of Done

What needs to change in verification to improve the rate of first silicon success.


National Semiconductor’s design center (NSTA) in Herzliya was the place where I fell in love with chip verification. I joined the team in 1999, still during my BSc, and met a group of innovators with a passion for creating great ASICs and for improving, at all costs, the way we made them.

It was a fast-moving learning experience for me, on both the verification engineering and verification management sides of things. Most of my managerial skills I learned from Limor, who is still a dear friend today.

One day, I saw for the first time the famous Bugs-Per-Week (BPW) chart. BPW counts the bugs found by the verification team each week, in order to help determine the chip’s readiness for tapeout.

Managers expect the trend to show more and more bugs as the team gets up to speed, then fewer and fewer as the design gets cleaner. Different companies use different thresholds on the chart as a checklist item for tapeout.
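As a minimal sketch of the metric itself, the following buckets bug-report dates into ISO weeks and applies a naive moving-average readiness check. The data, the window size, and the function names are all hypothetical, and real BPW tracking would come from a bug-tracking system, not a hand-written list.

```python
from collections import Counter
from datetime import date

def bugs_per_week(bug_dates):
    """Bucket bug-report dates into ISO (year, week) pairs and return the
    weekly counts in chronological order. Note: weeks with zero new bugs
    are simply skipped in this sketch."""
    weeks = Counter(d.isocalendar()[:2] for d in bug_dates)
    return [weeks[w] for w in sorted(weeks)]

def trend(counts, window=3):
    """Naive tapeout-readiness signal: compare the average of the last
    `window` weeks against the `window` weeks before that."""
    if len(counts) < 2 * window:
        return "not enough data"
    early = sum(counts[-2 * window:-window]) / window
    late = sum(counts[-window:]) / window
    return "declining" if late < early else "rising or flat"

# Hypothetical bug-report dates for illustration.
reports = [date(2024, 1, 2), date(2024, 1, 4), date(2024, 1, 10),
           date(2024, 1, 11), date(2024, 1, 12), date(2024, 1, 18),
           date(2024, 1, 25), date(2024, 2, 1), date(2024, 2, 15)]

print(bugs_per_week(reports))          # weekly counts
print(trend(bugs_per_week(reports)))   # "declining"
```

The weakness the article points at is visible right in the code: `trend` only sees bugs that were found. A declining curve from a tired team and a declining curve from a clean design are indistinguishable here.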

You know how sometimes something looks off to you at first glance, but everyone around you treats it as a trivial fact, and with time it starts making sense to you as well? And every once in a while you ask yourself: does this really make sense?

Well, I remember myself saying to Limor, “You know, what we really should be counting is the bugs left in the design, not the ones we already found. What if the declining trend of BPW is because the verification team simply got tired and had a hard time finding these stubborn bugs?”

Bug escapes

Every two years, Harry Foster, chief scientist at Mentor, works with Wilson Research to conduct a verification study. The chart below is stunning.

Despite billions of dollars invested in EDA tools R&D and tens of billions in verification labor, only 30% to 50% of ASIC designs are functional enough to make their way to production after the first attempt. And that doesn’t even mean they had no bugs; it means their bugs were not nasty enough to shut the party down. Despite all our efforts, bugs escape and make their way into silicon, again and again.

Designs are getting larger and we as an EDA community must come up with better ideas.

If, despite our efforts, we are first-time-right only 30% of the time, there can be two reasons:

  1. We do not have enough time to complete a proper verification process.
  2. Our definition of done is not good enough.

To tackle the first cause, we have to move faster with verification. In my previous article, I’m Almost Done, I spoke about how to converge faster on the last mile of verification, taking it from 95% to 100%, and how Vtool’s Cogita can help.

But 100% of what?

And that leads me to the second cause — the definition of done.

As of now, our definition of done in digital functional verification is based on code and functional coverage. While these are important metrics, reaching 100% in both is hardly a guarantee of bug-free silicon. It would be interesting to count how many of those 70% of buggy first silicons had 100% code and functional coverage before tapeout. In my experience, quite a lot.

Post-100% coverage bug hunting

There are several limitations with functional coverage:

  1. It is not objective. Not only is it only as good as the verification engineer’s definition of it, it is also very hard to truly review and assess its completeness.
  2. Functional coverage does not properly indicate the distribution of the different scenarios. There is the ‘hit count’ property, but most people simply ignore it because it is hard to analyze and follow.
  3. It is very hard to define time-based scenarios and it is even harder to detect them by looking at coverage results.

Oftentimes, we send waveforms to the designer just to check whether the different tests are really steered toward generating the important scenarios.

At Vtool, we add another layer on top of coverage for the definition of done. We do it at two levels.

The first is visual analysis. By looking at different attributes of a test over time, engineers develop intuition about whether their scenarios are sufficient.

The following example shows the distribution of packet FIFO occupancy over a full regression. The write and read interfaces work independently, and the FIFO’s occupancy depends on their rates, as well as on the packet size.

The occupancy was defined as a coverage item with proper ranges, and was crossed with other attributes such as packet size.

After a few months of effort, the block reached 100% code and functional coverage. However, visualizing the test results with Cogita showed something like this:

In this case, the engineer could see that all tests follow similar trends and that some basic scenarios are missing: for example, driving the FIFO to overflow, then emptying it, then overflowing it again.

When he created this scenario, he found a bug that would otherwise have found its way into silicon, and there you go, another one bites the dust.
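To make the missed scenario concrete, here is a toy model of the situation, assuming independent per-cycle write and read probabilities (the depth, rates, and function names are all invented for illustration). The checker walks an occupancy trace and asks whether the overflow, then empty, then overflow-again sequence ever occurred.

```python
import random

def occupancy_trace(depth=64, steps=500, wr_rate=0.6, rd_rate=0.5, seed=0):
    """Toy FIFO model: each cycle, the write interface pushes with
    probability wr_rate (unless full) and the read interface pops with
    probability rd_rate (unless empty). Returns the occupancy per cycle."""
    rng = random.Random(seed)
    occ, trace = 0, []
    for _ in range(steps):
        if rng.random() < wr_rate and occ < depth:
            occ += 1
        if rng.random() < rd_rate and occ > 0:
            occ -= 1
        trace.append(occ)
    return trace

def hits_overflow_empty_overflow(trace, depth=64):
    """Did the trace reach full, then drain to empty, then fill again?"""
    phase = 0  # 0: waiting for full, 1: waiting for empty, 2: full again
    for occ in trace:
        if phase == 0 and occ == depth:
            phase = 1
        elif phase == 1 and occ == 0:
            phase = 2
        elif phase == 2 and occ == depth:
            return True
    return False
```

With a write rate slightly above the read rate, the occupancy in this model drifts toward full and rarely drains back to empty, which mirrors a regression in which every test follows a similar trend and the full-empty-full sequence never happens, even at 100% coverage on the occupancy ranges.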

The second level uses classification algorithms. Cogita can classify test attributes into groups that correlate with the failure or success of scenarios. This provides tremendous help in debugging failing scenarios. More importantly, a classification rate that is close to, but not exactly, 100% reveals counterexamples.

For example, the algorithm tells you something along these lines:

When the block is configured to mode A and the Ethernet port is active above 1Gbps for a time window larger than 2 us, and the DMA is accessing the shared memory — then in 99.85% of the cases, the test fails. And here are the two counterexamples.
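A minimal sketch of what such a rule does with test data might look like the following. This is not Cogita’s algorithm; it simply evaluates a hand-written rule over a table of test records with hypothetical attribute names mirroring the example above, and surfaces the passing outliers.

```python
def counterexamples(tests, rule):
    """Among tests matching `rule`, return the failure rate and the
    passing tests that contradict the rule (the counterexamples)."""
    matching = [t for t in tests if rule(t)]
    fails = [t for t in matching if t["result"] == "FAIL"]
    passes = [t for t in matching if t["result"] == "PASS"]
    rate = len(fails) / len(matching) if matching else 0.0
    return rate, passes

# Hypothetical test records mirroring the rule in the text.
tests = (
    [{"mode": "A", "eth_gbps": 2.5, "window_us": 3.0,
      "dma_active": True, "result": "FAIL"} for _ in range(8)]
    + [{"mode": "A", "eth_gbps": 2.5, "window_us": 3.0,
        "dma_active": True, "result": "PASS"} for _ in range(2)]
    + [{"mode": "B", "eth_gbps": 0.1, "window_us": 0.5,
        "dma_active": False, "result": "PASS"} for _ in range(10)]
)

rule = lambda t: (t["mode"] == "A" and t["eth_gbps"] > 1.0
                  and t["window_us"] > 2.0 and t["dma_active"])

rate, outliers = counterexamples(tests, rule)
print(f"failure rate under rule: {rate:.0%}, counterexamples: {len(outliers)}")
```

The two records in `outliers` are exactly the tests worth a manual look: they matched every condition of a near-perfect failure rule, yet passed anyway.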

One must ask oneself: what was so great about these two counterexamples that made the test pass?

Well, in many cases the answer is, “Nothing is that great. The checkers missed it.”

For the time being, we cannot count the bugs that are left in the design and take a free ride to the fab. We can, however, find more of them after we hit 100% coverage and have our ASIC first-time-right.
