Gaps In Verification Metrics

Experts from Arm, Intel, Nvidia and AMD look at what’s missing from verification data and how to improve it.


As design complexity has exploded, the verification effort has likewise grown exponentially, with many different types of verification being applied to different classes of design.

A recent panel discussion with leading chipmakers examined this topic in an effort to shed light on design health and quality, how to measure the success of verification, how to know when verification is complete, and whether teams are on the right track with their metrics, among other things.

The theme posed to the panel at the Design Automation Conference was whether conventional verification metrics are running out of steam. Alan Hunter, senior principal design engineer at Arm, quipped that Betteridge’s law of headlines gives the answer: no. “Any headline that has a question mark in it, the answer is always no.”

However, he contends that verification metrics are running out of steam. “One of the big issues we see internally is that it’s very difficult, as designs get more and more complex, to reason what the coverage model should look like. We often end up doing extremely detailed coverage models for parts of the design, and maybe too-simple coverage models for other parts of the design. It’s very difficult to get the balance between what is a good coverage model that tells you something about what you’re verifying, and just doing too much because you have to do coverage. It’s really hard to balance where the right point is of doing too much functional coverage and too little. It’s also very difficult to review that coverage model often.”

One reason is that the person who wrote the coverage model is usually the one who best understands it. “As we’ve grown in our CPU design teams, we’ve ended up having to move more and more people into various, smaller units of the design,” Hunter said. “There are more people now who understand the very detailed micro-architectural levels of the model, so it’s interesting that we can now have people to help review that stuff. Traditionally, it’s been hard to do that. It’s really difficult to reason about stimulus quality as well. Sometimes from functional coverage results you can say, ‘I’ve hit this point X number of times,’ but unless you can see that happening over a succession of timestamps inside your simulation, it’s very difficult to reason about stimulus quality.”

To account for this, Hunter said, Arm has invested in a new technique, statistical coverage, that allows for a time element to be added to the coverage model. “We actually create a bunch of new style functional coverage that gives us more insight into what is going on.”

Then, at an IP level, he pointed out that Arm extensively uses functional coverage and code coverage. “Obviously it’s something we have to do. It’s a requirement for all our projects internally. In our experience, it is actually hard to extract like super detailed meaningful information about the stimulus quality, and it’s really hard to get insight into how stressful you’re pushing the unit or the sub-unit or the super-sub-unit of the design and so we created this statistical coverage flow; it gives us much more insight into how the DUT is being stimulated. We write very detailed micro-architectural probes that collect a whole bunch of time information as well as other interesting information about the microarchitecture in the design. We can use all of this to help us push into the design corners of where we know the design is going to be. An added benefit of this is that we get useful ML training data as well that we’re putting into our data lake, which gives us a lot of information about what’s actually happening with cluster usage and what we’re doing on the design itself,” he continued.

Fig. 1: Critical elements in design. Source: Arm

Further, Hunter noted that Arm works closely with designers to identify extremely detailed, critical parts of the design. “The designers know best where they’re likely to have issues with the design so we write these very lightweight custom probes that collect a bunch of information about what’s happening in the design. You can then start to plot things about what is happening in terms of transaction sizes, how much data is actually being pushed through the design. In this case, you can see that we do some of the very low level fill levels, but we really do nothing in the later parts where we’re trying to get all the design queues filled up, and the design is fairly bogged down. We can run this with all of our unit level test benches, and we have a graphing system that allows us to pull all of this data out directly, which is then pushed into our main data lake. Often, we can either do a tableau type back end to visualize the data, and we do some other things as well. We have some very lightweight graphing stuff that we just keep it locally to just work on the data directly and see what’s going on.”
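To make the idea of adding a time element to coverage concrete, here is a minimal Python sketch. It is not Arm’s flow, which the panel did not describe in detail. The probe format (a list of timestamp and fill-level pairs), the function name fill_level_stats, and the sample data are all assumptions made for illustration. The sketch contrasts a plain hit count per fill level with a time-aware measure: the longest run of consecutive samples in which the queue stayed full.

```python
from collections import Counter

def fill_level_stats(samples, depth):
    """Summarize hypothetical queue-probe samples.

    samples: list of (timestamp, fill_level) pairs, ordered by time.
    depth:   maximum fill level of the queue.
    Returns (histogram, longest_full_run).
    """
    # Conventional view: how many times was each fill level observed?
    hits = Counter(level for _, level in samples)
    histogram = [hits.get(level, 0) for level in range(depth + 1)]

    # Time-aware view: how long did the queue stay full without a break?
    longest_full_run = 0
    current_run = 0
    for _, level in samples:
        current_run = current_run + 1 if level == depth else 0
        longest_full_run = max(longest_full_run, current_run)

    return histogram, longest_full_run

# Illustrative data only: a 4-entry queue sampled every 10 time units.
samples = [(0, 0), (10, 1), (20, 2), (30, 1), (40, 4), (50, 4), (60, 3), (70, 4)]
histogram, longest_full_run = fill_level_stats(samples, depth=4)
print(histogram)         # hit count per fill level 0..4
print(longest_full_run)  # longest sustained period at full depth, in samples
```

In this made-up trace the full level is hit three times, which a hit-count bin would report as covered, yet the queue never stays full for more than two consecutive samples. That is the kind of gap in stimulus quality a time-based view can expose.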

And because they’re synthesizable lumps, he said, this data is included in all of Arm’s emulation flows to extract a whole bunch of information about what’s going on inside the emulator when OSes are put in, and how stressful that is. “It turns out, it’s not very stressful at all. It’s really interesting to actually visualize what’s going on when you start to do some of these things such as add hardware to support, like virtualization and things like that. It also gives insight as to how efficiently things can be moved off virtual interfaces, and the like. That has been really helpful to give additional performance metrics over and above the verification metrics that we want internally.”

Hunter noted that statistical coverage doesn’t eliminate the requirement for good functional coverage. But this approach has enabled Arm’s design teams to expose critical bugs in design, particularly when things are very full and weird data starts coming back from external systems. “At the system level, we also try to shape stimulus to really effectively stress the system in realistic ways that people are going to use it. Are conventional metrics running out of steam? In isolation, I think they are, yes. We don’t get enough information just from code coverage and functional coverage. But for us, adding the statistical coverage has really helped in closing out the verification process.”

Maruthy Vedam, senior director of engineering in the System Integration & Validation group at Intel, followed Hunter with a discussion about how verification metrics play a huge role in maintaining quality. He said the answer to whether conventional metrics are running out of steam isn’t so simple.

Fig. 2: Types of metrics. Source: Intel

Vedam pointed to Intel’s new programmable acceleration card with a stack that goes along with it. “These application layers, and this layered stack, make it easier for people to go in and program these complex hardware engines. If you look at different types of metrics that are needed, one option is to look at this bottom up. You look at it from a component point of view, a core point of view, an IP point of view, or an API point of view. Typically when you do that, you tend to have metrics that are structural, functional, unit-level, or component-level. You look at it as a particular feature, coded or not, and whether that’s been validated. You look at it as a particular API being enabled and whether that’s working. It’s getting the right inputs and outputs and whether the interrupts are working. These metrics tend to be common and similar, irrespective of the product you’re building at the end of the day. These are the building blocks on top of which this product is built and they need to be really solid.”

He observed that many engineering teams come into a design review meeting and assume that all of this is working. “It’s almost taken for granted that these are there and for these metrics, they’re very critical and they’re usually common across most of these products. But if you start taking a top-down view of things, this is when we start looking at things like the system level, the platform level, the use cases. This is what distinguishes one product from another. For example, in a programmable acceleration card, you might want to enable some level of programmability and some amount of algorithms to be transferred from your main CPU into this particular card. You start looking at those use cases and start identifying the relevant traffic. You start understanding the hardware software relationship that triggers that particular workload or use case and what specifically needs to be validated to ensure a quality product.”

Fig. 3: Integration view. Source: Intel

Use cases add a level of complexity, and the biggest challenge Intel faces is the integration view of these use cases: which IPs need to interact to get a use case to work, which fabric or bus system needs to come into play, and which system flows, reset flows, and power management flows need to be enabled. And that’s purely on the hardware side. On the software side, there are BIOS, firmware, drivers and OSes, all of which interact with the hardware to get the use case to work.

“Especially when you drive for a higher level of quality pre-silicon, sometimes it’s not absolutely possible to run these use cases in real time, so you want to break those down into what is meaningful and what gives us high confidence that this use case will be working when the silicon comes back, and how you are able to run something at speed,” he said. “If any of these ingredients is not at the right quality or is broken, it’s going to slow you down, it’s going to make the use case not actually perform, and you’re going to have challenges in terms of ensuring you get that level of high confidence that a particular use case can be enabled when your silicon or the platform comes back. Looking at these internal quality metrics in addition to the external quality metrics about use cases and so on, and creating that combination becomes very critical as we move forward.”

Next up was Anshuman Nadkarni, who manages the CPU and Tegra SoC verification teams at Nvidia. “When we talk about metrics, it’s a measurement of things over time, and we use various things like regression health, coverage, code coverage, functional coverage to individually measure each of these components,” Nadkarni said. “One of the things that has been a learning for Nvidia is that oftentimes there are varied metrics, and you want to create decisions based on those metrics. The right way to do it is to create Key Performance Indicators, where you look at what the trend over time is, what you expect to get accomplished, and based on that, modify your strategy plan that could be putting more effort on certain aspects, putting less effort on other aspects. KPIs are as important as metrics, and become even more important as we get close to closure.”

Fig. 4: Metrics vs. KPIs. Source: Nvidia

When it comes to the various categories of metrics, Nvidia usually focuses on what has been done to bang on the design sufficiently to be ready for tapeout, Nadkarni explained. “There are other aspects, as well. There are project management aspects as to whether we are on track, and these may not be the same metrics. There would be a lot of overlap in project management versus health of the design metrics, but they tend to have different folks asking questions about it. Another category of metrics relates to continuous improvement. As we go through the design, we figure there are more bugs in a certain area. You want to have metrics that basically tell you, ‘Hey, you need to focus here.’ Then, output action metrics, which are similar to continuous improvement metrics where these focus more on the project being done, and figuring out what was done differently. Again, sometimes there’s an overlap, but they have to be treated as discrete, different categories.”

Aggregation of metrics can be quite helpful, and is something Nvidia has dabbled with quite a bit, he said. “We had all kinds of varied types of metrics. How do we create one metric that represents what the functional quality of the design is? The way we do it is to take regressions, coverage, whatever set of components are needed for the metric, then apply weighted scores to each of these components. The weights may change over time based on what the significance of that component is, so something like effort may have a higher weight earlier in the project rather than later in the project, or regression health might have more value later in the project versus earlier in the project. We have mechanisms to aggregate metrics together, and over various projects we compare metrics. Although this is in terms of metrics we’ve created, based on our learnings, KPIs that tell us at some phase in the project what we could expect the overall health of the model to be.”
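The weighted aggregation Nadkarni describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Nvidia’s actual algorithm. The component names, the linear weight schedule in phase_weights, and all scores are hypothetical. Each component is assumed to be normalized to the range 0 to 1, and the aggregate is a weighted average whose weights shift as the project progresses.

```python
def phase_weights(progress):
    """Hypothetical weight schedule; progress runs from 0.0 (start) to 1.0 (tapeout).

    Effort matters more early, regression health matters more late,
    and coverage keeps a constant weight. The linear shape is an assumption.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    return {
        "effort": 1.0 - progress,
        "regression_health": progress,
        "coverage": 0.5,
    }

def aggregate_score(components, weights):
    """Weighted average of normalized component scores (each in 0..1)."""
    total_weight = sum(weights[name] for name in components)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(components[name] * weights[name] for name in components)
    return weighted_sum / total_weight

# Illustrative scores only.
components = {"effort": 0.9, "regression_health": 0.5, "coverage": 0.7}
early = aggregate_score(components, phase_weights(0.0))
late = aggregate_score(components, phase_weights(1.0))
print(f"early-project score: {early:.3f}")
print(f"late-project score:  {late:.3f}")
```

With the same underlying scores, the aggregate drops late in the project because the weak regression health now dominates. This mirrors the point in the quote that the significance of each component changes over the life of a project.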

Metrics and KPIs can also be used as guidelines to suggest when verification might be done, Nadkarni said.

“However, it still comes down to gut feel and experience. What have we done relative to prior projects, how metrics look to those times, and how that correlates with the kinds of bugs found, what the health over time of the design is, regression health, integration and unit testing — sometimes you end up finding more bugs at integration levels where you’re trying to run a use case as the model has indicated. You have a top down approach to this as well. You have use cases and you find the machine getting into deadlock with a particular scenario so you’d want to focus on unit level testing of those. It varies over time the metrics that you use to decide when you’re done,” he noted.

Prototyping, emulation and formal verification have played a role here too, Nadkarni continued. “With formal verification becoming more and more tractable over time, I expect that formal verification metrics will play a key role in terms of sign-off, which in the past relied on using workflow as an add-on to what we do. But here, formal can replace significant levels of the design with C to RTL equivalence checking and other technologies. A lot of the design is actually verified using that so I expect formal to play a key role. Prototyping, emulation, running software on your designs is something that tells us about the level of integration testing where it gives us a sense of where the design is at, are we finding too many bugs in one part of the IP. It means that you need to focus more on it. Then, there is an aggregation metric—tweaking, improving the aggregation algorithm over time, it’s something that also helps us with that.”

Metrics is one area where the EDA industry has fallen short, Nadkarni asserted. “The EDA industry needs to pull out the metrics capabilities and build metrics away from the simulator and try to make something that’s generic because I don’t think we’d ever be in a situation where we will be sourcing all of our tools from one vendor. We will always have multiple different vendors. Having the EDA industry recognize that and build metrics tools that helped with easier aggregation of metrics, apply machine learning on predicting where the problems will be.”

Another area is software development processes. “This whole concept of functional coverage and trying to get convergence based on functional coverage is more of an RTL hardware design methodology kind of flow, and applying to the software industry because nowadays we, Arm, Intel included, we all tend to think of things in terms of systems. So the software, the drivers are part of the solution as much as the design itself. Having the EDA industry apply hardware design principles and mechanical mechanisms and metrics over to the software side would be extremely helpful and valuable to us,” he added.

The final panelist was Farhan Rahman, chief engineer for augmented reality at Advanced Micro Devices, who focused solely on RTL to make the point about how big “that little space of the core RTL is, let alone, if you just have the whole system and remaining stuff, it’s really scary.”

Case in point: Processor complexity is increasing 40% year over year, he said.

Fig. 5: Processor core complexity. Source: AMD

“This is not performance. It’s the number of cores getting integrated over time. Back in the 2005/2006 timeframe, everybody used to say single core or dual core. Now you talk about multicores and there is no limit for multicores. I’ve heard 28 cores, 32 cores. Then there are threads. It used to just be single-threaded, multi-threaded. Now there are 64 threads, 128 threads. When does it end? It doesn’t. The memory channels and the bandwidths are increasing leaps and bounds to support all of this multithreadedness,” Rahman said.

This results in an explosion of verification data.

“Just from the RTL perspective, only on the processor side, the number of builds per week when I worked at Motorola way back when, we used to have only a couple of testbenches. That was about 15 years ago. Fast forward to now, and 8 to 10 models are pretty standard, and that’s probably on the low side. Regressions per week is 400K to 500K sims. That’s a lot of stimulus and simulations.”

Fig. 6: Dealing with RTL bugs. Source: AMD

When it comes to preserving regression data for a certain number of weeks, many times the engineering team is not able to get through the debug process, he noted. “Let’s say you’ve got 100 signatures, you have 10 debuggers, and you could get through, say, 60% of the signatures debug. But what about that 40%? If you have to keep them around for a 28-day window, the debuggers could go there and debug them, and mark them as debugged or properly dispositioned as an RTL bug or a verification bug. It’s a lot of challenges, so you’ve got to preserve that. Typically I’ve seen 2.5 to 3 terabytes just to preserve that. For functional coverage bins, it’s about 1.5 million bins. That’s not a small number. Failing test cases, data per test case could be 1 gigabyte. And let’s say you’re running 500,000 sims (sims are equivalent to test cases) and you have like 2,000 failures. It adds up pretty quickly. Task burndown is yet another database where approximately 1K tasks are there usually.”
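The figures in the quote can be checked with some back-of-the-envelope arithmetic. The short Python sketch below uses only the numbers Rahman cites. The variable names are illustrative, and treating 1 terabyte as 1,000 gigabytes is an assumption.

```python
# Debug backlog: 100 failure signatures, 60% debugged.
signatures = 100
debugged_percent = 60
undebugged_signatures = signatures * (100 - debugged_percent) // 100

# Failure data: 2,000 failing sims out of 500,000, up to 1 GB each.
sims_per_regression = 500_000
failures = 2_000
gb_per_failure = 1
failure_data_tb = failures * gb_per_failure / 1_000  # assuming 1 TB = 1,000 GB
failure_rate_percent = failures / sims_per_regression * 100

print(f"signatures still to debug: {undebugged_signatures}")
print(f"failure data to preserve:  {failure_data_tb} TB")
print(f"failure rate:              {failure_rate_percent:.1f}%")
```

Even a failure rate well under 1% yields roughly 2 terabytes of failing-test data if each failure really does produce 1 gigabyte, which is in the same range as the 2.5 to 3 terabytes he reports having to preserve.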

So what comes next? “I am a very strong believer that we are not ready to throw away everything that we’ve built over the last 40 years in processor design,” Rahman said. “The traditional metrics have their place. But what we need to do is augment with big data and machine learning algorithms to give us the prediction to transform our guts into real things so that we know that if I fixed this bug, there won’t be another set of cousin bugs lurking around in there.”
