Having real comparisons will go a long way toward helping chipmakers better understand what they are buying and why.
Thinking back on DAC 2015 in San Francisco earlier this month, I am happy that at least some of my predictions came true: there was clearly a trend toward making verification smarter. However, one thing struck me while listening to all the discussions on connecting engines, what Jim Hogan called the continuum of verification engines (COVE) and what we at Cadence call the System Development Suite: the discussions were very abstract, and there were no fair comparisons of the engines. So we in EDA need a new metric, a new benchmark, which I would call “verification computing efficiency,” or VCE.
In principle, COVE is pretty straightforward. Connect and combine the different pre-silicon dynamic execution engines: virtual prototyping, RTL simulation, acceleration and emulation, and FPGA-based prototyping. Use them for what they do best. At the highest level, things look straightforward, and I talked about some of these choices in a previous article called “The Agony Of Hardware-Assisted Development Choices”. Just compare and make your choices based on the different key characteristics: execution speed, capacity, marginal cost per unit, porting effort, hardware debug, software debug, hardware bring-up time, time of availability, accuracy, and system connections.
The challenge is the combination of these characteristics and how they fulfill what the user actually wants to achieve. Do you want the lowest power consumption? Well, as Jim Hogan pointed out here, the intrinsic power consumption, while it certainly matters, is actually only part of the equation. To understand the total power consumption, users need to understand how often new RTL needs to be mapped (i.e., requiring compile times), how often they run in debug mode vs. pure regressions, how long the actual execution of the tests takes, and how much debug data is needed and transferred. The intrinsic power consumption, just like the actual execution speed of a platform, matters much less by itself than the combination of the different steps a user wants to take. It’s like a car: its gas mileage or energy consumption by itself doesn’t matter much if the tank is not big enough or the user needs to reach destinations far away.
AMD’s Alex Starr presented a great example of a “debug loop” in emulation back in 2012.
It is obvious that finding the right trigger mechanisms and channeling out the right set of debug data is crucial, as is getting timely access to the emulation resource, since everything is queued. Now imagine this across a large number of projects accessing the resource; I have seen companies with well above 20 parallel projects using emulation. Payloads of different lengths also need to be executed. Imagine a queue of 500 smaller jobs of 8 MG (million gates) for IP verification, 300 medium-sized jobs of 64 MG for sub-system verification, and 200 large jobs of 256 MG or bigger for SoC verification, all with different execution lengths. Users may have a choice of running the queue of jobs in a simulation farm, on an emulator, or in an FPGA-based prototype.
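To make this concrete, here is a minimal sketch in Python of that hypothetical job mix. The job counts and gate counts are the ones from the example above; the class and variable names are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class JobClass:
    """One class of verification jobs waiting in the queue."""
    name: str
    count: int         # number of jobs of this class
    size_mgates: int   # design size in millions of gates (MG)

# The hypothetical queue from the text: IP, sub-system, and SoC jobs.
job_queue = [
    JobClass("IP verification",         count=500, size_mgates=8),
    JobClass("Sub-system verification", count=300, size_mgates=64),
    JobClass("SoC verification",        count=200, size_mgates=256),
]

total_jobs = sum(j.count for j in job_queue)
total_capacity_mg = sum(j.count * j.size_mgates for j in job_queue)
print(f"{total_jobs} jobs, {total_capacity_mg} MG of design to schedule")
```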
This is one of the places where the new VCE metric is needed. Four steps matter to the user: build, allocate, run, and debug (a simple model combining all four is sketched after the list below).
1. Build: For each job, how fast can the user compile it to create an executable that can then be pushed into the queue? Compile time for hardware-assisted verification counts here. In emulation, things are automated, and for processor-based emulation users compile at 70 MG per hour, getting results fast. For simulation, the process is similarly fast. For FPGA-based prototyping, the manual optimization needed to achieve the highest speeds can take much longer. One large EDA company claims its flow gets it down to five days, but there are no public examples of that, and the fastest I have seen in the past was a four-week bring-up for manually optimized FPGA-based prototyping, albeit with very stringent coding restrictions. Here is a video of Hitachi outlining how they use an emulation-adjacent flow to bring up the Protium platform fast, essentially within days. The benefit of fast bring-up is offset by speeds between 3 MHz and 10 MHz, not quite as fast as with manual optimization.
2. Allocate: This one is new and important. How can I map my jobs onto the compute resource? For simulation farms, I am mostly limited by the number of workstations and the memory footprint. Emulation allows multiple users, but the devil lies in the details. For smaller jobs, the fine granularity and larger number of parallel jobs really tip the balance here toward processor-based emulation in the Palladium platform. There is an example graphic of such an allocation of jobs in one of my previous blogs, called “How Many Cycles are Needed to Verify ARM’s big.LITTLE on Palladium XP?”
3. Run: This is the actual execution of the job. Like the speed of a car, by itself it doesn’t say much. Does the higher speed of FPGA-based prototyping make up for the slower bring-up time and the fact that only one job can be mapped into the system? Sometimes, but not often. That’s probably why FPGA-based prototyping is mainly used for software development, where designs are stable, and less for hardware verification; users are later in the cycle but run faster. For FPGA-based emulation, allegedly faster than processor-based emulation, users have to look carefully at how many jobs can be executed in parallel. And in simulation farms the limit is really the availability of server capacity and memory footprint.
4. Debug: Here it comes down to being able to efficiently trigger on and trace the debug data for analysis. FPGA-based prototyping and FPGA-based emulation slow down drastically when debug is switched on, which can completely negate their speed advantage in the debug-rich phases when RTL is less mature. It all depends on how much debug is needed, i.e., when in the project the user is running the verification queue set up above. In addition, the way data is extracted from the system determines how much debug data is actually visible. I detailed some of the impact of trace depth here in the past. The bottom line for debug: simulation is the “golden reference,” and users need to assess carefully how much the data generation slows down simulation. With processor-based emulation, debug works in a simulation-like fashion. For FPGA-based systems, the slowdown and the accessibility of debug data need to be considered. Again, FPGA-based prototyping works great for the software development side, but for hardware debug it is much more limited than simulation and emulation.
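As referenced above, here is a rough sketch of how the four steps could be combined into a wall-clock estimate for a single job. Only the 70 MG/hour compile rate comes from the discussion above; the cycle count, effective clock rate, queue wait, and debug-slowdown figures are assumptions a user would replace with values measured in their own environment.

```python
def job_time_hours(size_mgates: float,
                   cycles: float,
                   compile_mg_per_hour: float,
                   run_mhz: float,
                   queue_wait_hours: float,
                   debug_fraction: float,
                   debug_slowdown: float) -> dict:
    """Rough wall-clock model for one job across the four steps.

    Only the 70 MG/hour compile rate is taken from the text; every other
    parameter is a placeholder to be measured per environment.
    """
    build = size_mgates / compile_mg_per_hour     # compile time (hours)
    allocate = queue_wait_hours                   # time waiting for the resource
    run = cycles / (run_mhz * 1e6) / 3600.0       # execution time, cycles -> hours
    # Debug-heavy runs execute slower; model that as extra time on the
    # fraction of runs that need full visibility.
    debug = run * debug_fraction * (debug_slowdown - 1.0)
    return {"build": build, "allocate": allocate, "run": run, "debug": debug}

# Illustrative only: a 64 MG sub-system job running 1e11 cycles on an
# emulator-like resource at an assumed 1 MHz effective speed.
times = job_time_hours(size_mgates=64, cycles=1e11, compile_mg_per_hour=70,
                       run_mhz=1.0, queue_wait_hours=2.0,
                       debug_fraction=0.2, debug_slowdown=3.0)
print({step: round(hours, 1) for step, hours in times.items()})
```

The same model can be evaluated per platform (simulation farm, emulator, FPGA-based prototype) with different parameter sets, and then summed over the whole queue rather than a single job.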
Bottom line: verification computing efficiency (VCE) is really impacted by the build, allocate, and debug steps. An efficiency of 100% would mean always using all available verification resources with no losses, and making debug data available without missing debug windows. Combined with the actual execution speed of the run step, users can then calculate how efficiently a queue of verification jobs can be handled.
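One way to read that definition, again purely as an illustration rather than a formal specification of the metric, is as the share of total wall-clock time that a job (or a whole queue) spends actually executing tests:

```python
def vce(build_h: float, allocate_h: float, run_h: float, debug_overhead_h: float) -> float:
    """Share of total wall-clock time spent actually executing tests.

    This is just one way to turn the idea into a number, not a formal
    definition of the metric.
    """
    total = build_h + allocate_h + run_h + debug_overhead_h
    return run_h / total if total > 0 else 0.0

# With the illustrative per-step times from the sketch above (all assumed):
print(f"VCE = {vce(build_h=0.9, allocate_h=2.0, run_h=27.8, debug_overhead_h=11.1):.0%}")
# -> roughly 67%: about a third of the wall-clock time is lost to build,
#    allocation waits, and debug overhead.
```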
The future for new levels of benchmarking looks pretty bright. More to come in the next couple of months!