How To Optimize Verification

There’s no such thing as a perfect strategy, but much can be improved.

The rate of improvement in verification tools and methodologies has been nothing short of staggering, but that has created new kinds of problems for verification teams.

Over the past 20 years, verification has transformed from a single language (Verilog) and a single tool (the simulator) to many languages (testbench, assertion, coverage, and constraint languages), many tools (simulators, emulators, rapid prototyping, formal, focused apps), and multiple abstractions (system, transactional, RTL, gate). The recently introduced Portable Stimulus standard (PSS) may also initiate another transformation in the verification process.

While the number of verification engineers has increased over that time period, verification is no longer a task that can be handed over to junior engineers. It requires a broad range of skills and languages that span multiple conceptual domains. In fact, it is unlikely any single engineer can master everything, making some degree of specialization necessary. Constant learning and adapting to changes in the verification landscape is a necessity.

However, verification managers have a relatively fixed budget in terms of people and money. As a result, they must carefully choose where, how, and when to deploy resources to minimize the chance of bug escapes. Ultimately, verification is an infinite challenge, but nobody has infinite time or resources. It would be nice if there were a formula that could be used across the industry, but every product is different, and each team has to contend with different economic, time, and technology forces.

“We have so many tools and technologies,” says Tran Nguyen, engineering services director for Arm. “Ultimately what we care about is the time to market, and we want to do a design every 10 months. That is basically the same time that it took to do a small design, but now there are hundreds of millions of gates and we need to do everything in the same timeframe. This means you have to increase the throughput. You can run a lot of cycles, but if any of those cycles is less useful or did not focus on the right things, then it is wasted. How can you know which cycles are effective? What does it mean to be effective?”

That’s not always obvious at the outset. “Perhaps you look at how many cycles found bugs,” Nguyen says. “But what about looking at it the other way – the cycles that are not there to find bugs but to confirm that the design is right? You can have more technology, faster technology, but you have to run the right thing. You cannot just scale a single solution by adding more cycles.”

Each company has its own approach, and it often varies even between teams within a company. Raju Kothandaraman, graphics hardware engineering director at Intel, points to four metrics at his company—build, resource utilization, how fast things can run in certain environments, and time to root cause, which is essentially the debug portion.

“As verification engineers we pay close attention to these four metrics,” Kothandaraman says. “How fast, in RTL, are the hardware or software fixes getting into the system, which is what we call the build? How effectively are the resources utilized with respect to how fast the jobs are getting launched to the compute servers? And then, if we have efficient testbenches, which [ones] are running things effectively, using the resources properly, and benchmarking against the rest of the industry? Finally, the most challenging part of verification is the debug throughput in terms of how fast you can narrow down on a bug. We measure all of these through the entire cycle. It requires getting the right mindset into the people. People need to be savvy about the different verticals and continuously drive towards making the lifecycle faster.”
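To make those four metrics concrete, the sketch below shows one way a team might roll regression-run records up into build latency, resource utilization, run throughput, and time to root cause. It is a minimal illustration, not Intel's actual tooling; the record fields and metric names are hypothetical.

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class RegressionRun:
        build_minutes: float     # time for an RTL or software fix to reach a runnable model ("the build")
        queue_minutes: float     # time the job waited before a compute server picked it up
        run_minutes: float       # wall-clock simulation or emulation time
        bugs_root_caused: int    # bugs taken all the way to root cause from this run
        debug_hours: float       # engineer time spent narrowing down those bugs

    def summarize(runs: list[RegressionRun]) -> dict[str, float]:
        """Roll a batch of runs up into the four throughput metrics (illustrative only)."""
        wall_minutes = sum(r.queue_minutes + r.run_minutes for r in runs)  # assumes jobs are serialized
        busy_minutes = sum(r.run_minutes for r in runs)
        total_bugs = sum(r.bugs_root_caused for r in runs)
        return {
            "avg_build_minutes": mean(r.build_minutes for r in runs),
            "resource_utilization": busy_minutes / wall_minutes,
            "runs_per_day": len(runs) / (wall_minutes / (60 * 24)),
            "debug_hours_per_bug": sum(r.debug_hours for r in runs) / max(total_bugs, 1),
        }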

Dale Chang, GPU emulation manager for platform and methodology at Samsung, focuses heavily on emulation. “Throughput is basically how many jobs you can submit to the emulators, with a reasonable iteration, so that the debug team is always kept busy. The output of this defines how efficiently we can find RTL bugs. There are a lot of things in this chain of high-level goals, which means that from an emulation infrastructure perspective you need to build infrastructure that is efficient—your run infrastructure is efficient, your testbench is generated in a way that you are not wasting cycles, and you also need to generate enough automatic scripting methods so that you can triage your failures and rerun using the debug methods. In emulation, the build time is significant. The runtime in batch also takes time, and queuing takes time, so you want to make sure that your engineers are not sitting idle waiting for the output. We have to work with IT, with the CAD teams, and also with the EDA industry to speed up and increase the efficiency of the process.”
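Chang's chain of build, queue, batch run, and triage time reduces to simple arithmetic. The numbers below are hypothetical and only meant to show why shaving any leg of that loop, or adding parallel jobs, directly buys the debug team more iterations per day.

    def iterations_per_day(build_hrs: float, queue_hrs: float, run_hrs: float,
                           triage_hrs: float, parallel_jobs: int = 1) -> float:
        """One debug iteration = rebuild the emulation model, wait in the queue,
        run the batch job, then triage the failures and queue the rerun."""
        loop_hrs = build_hrs + queue_hrs + run_hrs + triage_hrs
        return 24.0 / loop_hrs * parallel_jobs

    # Illustrative numbers only: a 4-hour build, 2 hours of queuing, a 3-hour batch
    # run, and 3 hours of triage gives 2 iterations per day ...
    print(iterations_per_day(4, 2, 3, 3))   # 2.0
    # ... while trimming the build and queue time lifts it to roughly 2.7.
    print(iterations_per_day(2, 1, 3, 3))   # ~2.67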

Metrics for success
How to measure efficiency isn’t always obvious, but it is important.

“I am a big believer that you cannot optimize what you can’t measure,” says Paul Cunningham, corporate vice president and general manager for the System Verification Group of Cadence. “The concept of throughput is very intuitive, but we have to make it measurable. There is no one way. It is not a simple thing. There is a notion of raw throughput. This is a basic measure that you might liken to the performance of the car rather than the performance of the person driving the car. Cycles per second would be one [metric], or time to waveform, or capacity. At the end of the day, you can have the most amazing, high-performance Ferrari, but if the person driving it is no good you will still not get much performance. So that is where you have to go to higher-order metrics, such as the time to root cause of bugs, or how many bugs can you root cause per person per day. Or, how much does it cost you to root cause a bug and how much compute resource is consumed. These are also important. You have to create a tree or a layered stack of these metrics and look at them all. You cannot say one is more important than another. They are just different. It is a mindset.”
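Cunningham's distinction between raw throughput and higher-order metrics can be captured with equally simple bookkeeping. The sketch below is a hypothetical example of two of the higher-order numbers he mentions, cost per root-caused bug and bugs root-caused per engineer-day; the rates and figures are placeholders, not data from Cadence.

    def cost_per_root_caused_bug(compute_core_hours: float, core_hour_rate: float,
                                 engineer_hours: float, engineer_rate: float,
                                 bugs_root_caused: int) -> float:
        """Combine compute spend (the 'car') with engineer time (the 'driver')."""
        total_cost = compute_core_hours * core_hour_rate + engineer_hours * engineer_rate
        return total_cost / bugs_root_caused

    def bugs_per_engineer_day(bugs_root_caused: int, engineer_days: float) -> float:
        """Debug productivity, independent of raw cycles per second."""
        return bugs_root_caused / engineer_days

    # Placeholder rates: 10,000 core-hours at $0.05/hr plus 40 engineer-hours at $100/hr
    # to root-cause 5 bugs works out to $900 per bug, at 1 bug per engineer-day.
    print(cost_per_root_caused_bug(10_000, 0.05, 40, 100, 5))   # 900.0
    print(bugs_per_engineer_day(5, 5.0))                        # 1.0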

An added complication is that verification is no longer just about functionality. “We have a couple of critical metrics,” says Samsung’s Chang. “One is for functional, the other is for performance and power. You have to make sure the GPU is up to the task. You also have to make sure your GPU will run with the kernel and with the driver, making sure everything works together. The kernel and the driver play a critical role in scheduling and dispatching jobs and handling the results. A lot of bugs happen in the communication between the driver, the kernel, the firmware, and the RTL. We actually develop multiple different testbenches to satisfy each verification team’s needs. For running the kernel, we need a hybrid environment.”

Many products have to run multiple applications. “One thing that we adopt is essentially understanding the workload nature as part of our architecture definition,” adds Intel’s Kothandaraman. “The design and validation teams are working very closely with each other to make sure that the requirements are understood from the beginning. That adds a lot of value, defining what types of capabilities we want in our lower level validation, as well as scaling all the way to silicon.”

Thinking long-term also can have benefits. “Interoperability is part of efficiency and throughput,” says Arm’s Nguyen. “You use several technologies because each has strengths and weaknesses. In order to be able to jump between them, you should be able to re-use as much of the testbench as possible. Also, between projects, you usually go from one product to another and you can re-use a lot. So reuse of infrastructure, of testbenches, of testcases – all of this contributes. Verification engineers are scarce, and we need to re-use them as much as possible, and that means automation and a reduction in uncertainty. This is key to getting good throughput.”

Levels of abstraction
Another variable in the process is abstraction. “There are some things that you may do at the transistor level or at a unit level, other stuff that happens at gate and RTL, and there is some stuff that has to happen at the software level,” says Cadence’s Cunningham. “There are different levels of abstraction, and leveraging that is really important to getting the best kind of throughput. Abstractions are tools, and you have to use the right tool for the right job. You don’t want to take the Ferrari off road.”

But that often comes with constraints. “For software development, they usually develop on their virtual testbench model, then jump onto emulation with the real RTL model,” says Nguyen. “But that comes very late because the RTL stabilizes fairly late in the game. Usually the software team will not use unstable RTL. That means the bring-up cycle is critical. Very short. You have to deliver a lot of things in that timeframe.”

This is a significant driver for the industry. “If you look at the data from IBS, the rate of growth in spend on software exceeds the rate of growth in hardware spend,” says Cunningham. “You want to shift left and try to do as much of the software validation and bring-up as you can pre-silicon. So there is a lot of pressure and value to drive that. Hardware validation and software bring-up are running concurrently.”

As tools change and new tools become available, verification teams also need to be aware of the impact that changes may have. “Collaboration with EDA is important to ensure they have an understanding of where your roadmap is going, while not quite revealing it,” says Kothandaraman. “You have to tell them about your challenges and what you expect to see in the future. We also need to understand the newer tools that are coming. This collaboration with EDA is by far the best way to gain on throughput. You have to be open and honest with the EDA vendors and make sure they have the attitude that they are part of your team. As an example, we always have to consider whether we should use more formal or more emulation.”

Intel is hardly alone in this. “Formal is a great example,” says Cunningham. “There is a ton of stuff we do right now with dynamic verification that could be done more efficiently with formal. It is still fairly early days for formal, and there is more that we can do. It is the right tool for certain problems. Now we are introducing things like PSS for writing testbenches at a higher level. We have to understand what is going to happen. Sometimes we need to change the driver, sometimes we need to change the car. Nothing is fixed, and so long as we have that one-team attitude, we will drive throughput improvements as fast as we can.”

Conclusion
Verification tools and methodologies will continue to evolve at a rapid pace, which means that every verification manager has to constantly assess the effectiveness of their processes. It also means that verification engineers have to be prepared for continuous re-education if they are to remain productive.

The bottom line: There has never been a more exciting time to be in verification, with PSS poised to open a new chapter in its evolution.


