Simulation: Go Parallel Or Go Home

What happens when you can’t count on ever-faster processors to improve simulation performance.


Although complemented by other valuable technologies, functional simulation remains at the heart of semiconductor verification. Every chip project still develops a testbench, usually compliant with the Universal Verification Methodology (UVM), and a large test suite. Constrained-random stimulus generation has largely replaced hand-crafted tests, but at the expense of much more simulation time. Surveys by Synopsys in recent years consistently show “verification took longer than expected” as one of the top reasons for delayed tape-out and “simulation runtime performance” as one of the biggest challenges to reducing verification time. For years, users could count on ever-faster processors to improve simulation performance. They can’t count on this anymore, due to deep submicron effects, the weakening of Moore’s law and slower adoption of new process nodes.

Since the demand for more speed continues unabated, most chip vendors are turning to parallelism rather than faster clock speeds to increase performance. Over the last ten years, the number of cores in high-end processors has grown from a few to dozens and even hundreds. An operating system can easily run independent tasks in parallel on these cores, but this does little to speed up a given application. A single job such as a simulation can run on multiple processors only if it is architected and coded for parallelism. This must be done cleverly; simply breaking a chip design into big pieces and simulating those pieces in parallel results in load imbalance and communication overhead that prevent any performance gain. There is also overhead whenever the parallel engine is not native to the simulator; VCS avoids this by using a single engine for both serial and parallel simulation. Many applications instead require fine-grained parallelism, which breaks the program down into many small tasks that communicate through the shared memory common in multi-core processors.
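To make the fine-grained idea concrete, here is a minimal conceptual sketch in C++ (an illustration only, not VCS internals): a small pool of worker threads pulls thousands of tiny tasks from a shared queue in shared memory, so no thread is left holding one oversized partition. The task structure and worker count are invented for the example.

```cpp
// Conceptual sketch only, not VCS internals: a fixed pool of worker threads
// pulls thousands of small tasks from a shared queue in shared memory.
// Because each task is tiny, no thread is stuck with one oversized partition,
// which is the load-balancing point of fine-grained parallelism.
#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task { int id; int cost; };   // stand-in for one simulation micro-task

int main() {
    std::queue<Task> tasks;
    for (int i = 0; i < 10000; ++i)  // many small tasks of uneven size
        tasks.push({i, i % 7});

    std::mutex m;                    // protects the shared task queue
    std::atomic<long> checksum{0};
    const unsigned kWorkers = 4;     // illustrative worker count

    auto worker = [&] {
        for (;;) {
            Task t;
            {
                std::lock_guard<std::mutex> lock(m);
                if (tasks.empty()) return;      // no work left for this thread
                t = tasks.front();
                tasks.pop();
            }
            long local = 0;                     // "evaluate" the micro-task
            for (int k = 0; k < t.cost * 1000; ++k) local += k;
            checksum += local;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < kWorkers; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    std::printf("ran %d micro-tasks on %u workers (checksum %ld)\n",
                10000, kWorkers, checksum.load());
}
```

Because the tasks are small and pulled on demand, a slow task cannot idle the other cores for long, in contrast to a coarse partition of the design into a few large pieces.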

Synopsys has moved simulation into an entirely new generation with the addition of fine-grained parallelism (FGP) features in the VCS Functional Verification Solution. FGP technology is designed to fully utilize available resources on x86 servers by decomposing the design model being verified into massively parallel micro-tasks and events. This optimizes simulation performance for the target processor architecture in terms of task scheduling, load balancing, cache efficiency and memory usage. This technology is flexible enough to accommodate multi-core processors and the emerging generation of “many-core” processors. One important aspect of this flexibility is that FGP is enabled when the design is compiled, but the number of cores (and parallel tasks) is specified at runtime. Thus, it is not necessary to re-compile for every possible processor configuration on which simulation may be run.
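The "enable at compile time, choose the core count at run time" flexibility can be pictured with a simple analogy, again in plain C++ rather than the actual VCS interface: one compiled binary sizes its worker pool from a runtime argument, so the same build adapts to machines with different core counts. The argument handling here is invented for illustration.

```cpp
// Analogy only, not the actual VCS command line: one compiled binary sizes
// its worker pool from a runtime argument (e.g. "./sim 16"), so the same
// build adapts to machines with different core counts without recompiling.
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    // Degree of parallelism is decided at run time, not compiled in.
    int requested = (argc > 1) ? std::atoi(argv[1]) : 0;
    unsigned workers = requested > 0 ? static_cast<unsigned>(requested)
                                     : std::thread::hardware_concurrency();
    if (workers == 0) workers = 1;   // hardware_concurrency() may report 0

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([i] { std::printf("worker %u running\n", i); });
    for (auto& t : pool) t.join();

    std::printf("same binary, %u workers, no recompile\n", workers);
    return 0;
}
```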


Figure 1: VCS supports a wide range of processor designs.

The result of the FGP technology in VCS is breakthrough gains in simulation performance. Speeding up the slowest tests in a regression suite can considerably reduce turnaround time (TAT), accelerating the overall verification process and ultimately pulling in tape-out. VCS FGP can speed up long-running tests by 3-5X for RTL designs, and by up to 10X for gate-level simulations. The speedup can be even greater in simulations of design-for-test (DFT) features such as scan. These performance gains have been observed in real-world VCS projects spanning a wide range of end applications. However, some types of designs benefit more from FGP than others. The technology is most likely to be effective on:

  • Low-power RTL designs (2-3X)
  • Networking RTL designs (3X)
  • Multicore CPU RTL designs (3-4X)
  • Graphics RTL designs (5-7X)
  • Gate-level netlists (5-10X)
  • Scan designs (10-30X)

On the other hand, some types of simulations offer less opportunity for speedup via parallelism. When the testbench dominates the runtime, or when there is a high level of communication via PLI/DPI with other processes, FGP has limited impact. Unsurprisingly, interactive simulation jobs and short tests also benefit less. Another factor affecting speedup is the amount of information being dumped from simulation for debug purposes: in general, the more detail dumped, the lower the performance. VCS mitigates this debug overhead by dumping the database information to FSDB files in parallel; using multiple cores to dump data reduces the impact and makes better use of the numerous cores in modern processors.
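The parallel-dump idea can be sketched conceptually as follows; this is not the FSDB writer API, and the file names, record format and writer count are made up for illustration. The simulation thread hands debug records to several writer threads, each appending to its own shard, so I/O overlaps with simulation instead of serializing it.

```cpp
// Conceptual sketch, not the FSDB writer API: debug records are handed to
// several writer threads, each appending to its own file shard, so waveform
// dumping overlaps with simulation instead of stalling it on one I/O stream.
#include <condition_variable>
#include <cstdio>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct DumpQueue {
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

int main() {
    const unsigned kWriters = 2;               // illustrative writer count
    std::vector<DumpQueue> queues(kWriters);
    std::vector<std::thread> writers;

    for (unsigned w = 0; w < kWriters; ++w)
        writers.emplace_back([&, w] {
            std::ofstream out("dump_shard_" + std::to_string(w) + ".txt");
            std::unique_lock<std::mutex> lock(queues[w].m);
            for (;;) {
                queues[w].cv.wait(lock,
                    [&] { return !queues[w].q.empty() || queues[w].done; });
                while (!queues[w].q.empty()) {   // drain pending records
                    out << queues[w].q.front() << '\n';
                    queues[w].q.pop();
                }
                if (queues[w].done) return;
            }
        });

    // "Simulation" thread: generate records and round-robin them to writers.
    for (int t = 0; t < 1000; ++t) {
        unsigned w = t % kWriters;
        {
            std::lock_guard<std::mutex> lock(queues[w].m);
            queues[w].q.push("time " + std::to_string(t) + ": signal values...");
        }
        queues[w].cv.notify_one();
    }
    for (auto& dq : queues) {                    // signal completion
        { std::lock_guard<std::mutex> lock(dq.m); dq.done = true; }
        dq.cv.notify_one();
    }
    for (auto& th : writers) th.join();
    std::printf("wrote %u dump shards in parallel\n", kWriters);
}
```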


Figure 2: Parallelizing long-running tests reduces simulation turnaround time.

There are other factors that may slow simulation, and VCS includes Simprofile, a versatile tool for identifying performance bottlenecks. It shows how and where simulation time is being consumed, down to the granularity of a single line of design or testbench code. Users can speed up both single-core and FGP simulation runs by reducing the impact of the most time-consuming constructs. The goal is to run the regression suite faster, reduce turnaround time for verification of new code and bug fixes, and tape out earlier. FGP and the other advanced features in VCS make this possible by maximizing the use of today’s multi-core and many-core processors. A new generation of simulation has arrived.

For more information, download the “VCS Fine-Grained Parallelism Simulation Performance Technology” white paper.


