Scaling Simulation

Has simulation performance stagnated, and what is the industry doing to correct it?


Without functional simulation the semiconductor industry would not be where it is today, but some in the industry contend it hasn't received the attention and research it deserves, causing performance to stagnate. Others disagree, noting that design sizes have grown by orders of magnitude while design times have shrunk, evidence, they say, that simulation remains a suitable tool for the job.

During a recent DVCon panel, Serge Leef, program manager in the Microsystems Technology Office at DARPA, kicked open a hornets’ nest when he claimed that simulator performance basically has stalled since 1988. “We can’t expect processor clock rates to keep increasing, and thus simulation performance has stalled. We need to rethink the simulation kernel so that it maps better on to the cloud, so that it can scale linearly with the addition of instances, and so that we can use HPC strategies.”
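The kernel Leef wants rethought is, at its core, the classic discrete-event loop. A minimal sketch in Python (illustrative only; real HDL kernels add delta cycles, sensitivity lists, and value-change callbacks on top of this loop):

```python
import heapq

class EventKernel:
    """Minimal discrete-event simulation kernel (illustrative sketch only)."""

    def __init__(self):
        self._queue = []   # heap of (time, seq, callback)
        self._seq = 0      # tie-breaker for events at the same time
        self.now = 0

    def schedule(self, delay, callback):
        """Queue a callback to fire `delay` time units from now."""
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self, until):
        """Pop and execute events in time order up to `until`."""
        while self._queue and self._queue[0][0] <= until:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()

# Toy use: a signal that toggles every 5 time units, recording edge times.
kernel = EventKernel()
edges = []

def toggle():
    edges.append(kernel.now)
    kernel.schedule(5, toggle)

kernel.schedule(0, toggle)
kernel.run(until=20)
print(edges)  # [0, 5, 10, 15, 20]
```

The loop is inherently serial: a single global time wheel orders every event, which is one reason this architecture does not map naturally onto many loosely coupled cloud instances.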

Industry approaches to the problem are dependent on the verification task being performed. For a debug run, you want a single simulation to get to the point of failure as quickly as possible. For a regression run, you want to know as quickly as possible if something has changed across many runs.

For a single run, you can exploit fine-grained parallelism in the design itself, where concurrent activity is processed by multiple cores. For a regression run that contains many individual simulations, you can distribute the operation across many machines. Verification tasks such as regression have received a lot more attention recently. AI/ML techniques are being introduced that optimize the runs to ensure the fastest time to a significant result.

General-purpose computers are not ideally structured for simulation. Simulation requires very large memories, where the access patterns truly are random. This means that cache is ineffective and, in some cases, can slow down a simulator. Because of this, it’s not surprising that hardware-assisted verification appeared early on. Today, it commands a large fraction of the total dollars spent on verification.

Simulator performance may be the wrong focus. “Ultimately, designers are looking for productivity,” says Anika Malhotra, senior product marketing manager for the Verification Group at Synopsys. “They are looking for faster turnaround time for a debug run, or full regression, or for faster coverage convergence. This goes beyond just simulation. The availability of the cloud makes solutions available to an increasing number of customers.”

Today, simulators are a lot more complex than they used to be. “There are a lot of specialized engines in a simulator, meaning that when we talk about performance, we talk about many aspects of it,” says Amit Dua, senior product engineering group director at Cadence. “It’s about regression throughput, it’s about RTL performance, it’s about testbench performance, it’s about how you’re solving the constraints and covering the state space, it’s about gate-level simulations. Simulator engines have to be smart and be optimized for each of these. Advanced, specialized simulation engines complement single-core performance.”

Simon Davidmann, CEO of Imperas Software, has been focused on simulation for more than 35 years and has contributed to many of the advances made in that time. “In 1988, Verilog-XL had just been released. That was 10X to 20X faster at gate level than the other simulators. VCS in 1993 was 10X to 40X faster than Verilog-XL, and 50X to 100X faster than some VHDL simulators. SystemVerilog introduced the bit type, enabling two-state simulation. That provided another order of magnitude performance increase over four-state logic. And then there are abstractions in terms of hierarchy and interfaces and bundling. Design abstraction has not improved since that time. We tried to introduce state machine languages and protocol languages, which would have improved simulation, but they weren’t practical from a usability and evolution point of view.”
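Davidmann's two-state point can be illustrated with a toy sketch (not any vendor's implementation): four-state logic values (0, 1, X, Z) require table lookups and X-propagation rules, while two-state values map directly onto native machine operations.

```python
# Illustrative four-state AND truth table: 0 dominates, and any other
# combination involving X (unknown) or Z (high impedance) yields X.
FOUR_STATE_AND = {
    ('0', '0'): '0', ('0', '1'): '0', ('0', 'X'): '0', ('0', 'Z'): '0',
    ('1', '0'): '0', ('1', '1'): '1', ('1', 'X'): 'X', ('1', 'Z'): 'X',
    ('X', '0'): '0', ('X', '1'): 'X', ('X', 'X'): 'X', ('X', 'Z'): 'X',
    ('Z', '0'): '0', ('Z', '1'): 'X', ('Z', 'X'): 'X', ('Z', 'Z'): 'X',
}

def and4(a, b):
    """Four-state AND: a table lookup per bit, with X-propagation."""
    return FOUR_STATE_AND[(a, b)]

def and2(a, b):
    """Two-state AND: a single native bitwise operation, no lookup."""
    return a & b

print(and4('1', 'X'))  # 'X' -- the unknown propagates
print(and2(1, 1))      # 1
```

Collapsing to two states lets a simulator pack whole vectors into machine words and use native bitwise instructions, which is where much of the order-of-magnitude gain comes from.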

While the basic gate-level simulator hasn’t changed much since 1988, the requirements that the world places on simulation have moved on. There have been many new algorithms, technologies, and products that have addressed other issues.

Making a run faster
If you are performing a simulation for the purpose of debug, it is likely you want that simulation to run as fast as possible. “If you have a design that can be parallelized, high performance on a single simulation is achievable,” says Synopsys’ Malhotra. “For example, if you have a piece of the design that is replicated eight times, they can be simulated in parallel, and you can get 8X throughput. Performance increases are dependent upon if there is enough activity happening in parallel. That means activity that is not dependent on other activity, and there’s no ordering within the design. Fine-grained parallelism works on a single machine with multiple cores. If you are on the same machine, the communications overhead is very low. But if you want to run on multiple machines, which is what the cloud gives you, you need to think about distributed solutions.”

Theoretical parallelization is not always achieved. “There are multicore simulators in the market from major vendors, but these don’t scale beyond the cores on a single chip,” says Doug Letcher, president and CEO of Metrics. “Even there, the gain is limited and design-dependent. It is certainly not linear scaling.”

An increasing number of designs may be able to exploit parallelism. “Amdahl’s Law gets you if there’s a tight mesh between everything,” says Imperas’ Davidmann. “In an AI processor, if you’ve got one processor doing a whole convolution plane, and you’ve got a lot of those, you can run them all in parallel. And they’re very loosely connected just with data frames, so there’s no tight synchronous co-operation there.”
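Davidmann's observation is Amdahl's Law: with parallel fraction p on n cores, speedup = 1 / ((1 − p) + p/n). A quick worked example shows why a tight mesh caps the gain while loosely coupled engines keep scaling:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# A tightly meshed design where only half the activity parallelizes
# tops out below 2X no matter how many cores are thrown at it...
print(round(amdahl_speedup(0.5, 64), 2))   # 1.97

# ...while loosely coupled convolution engines (say 95% parallel
# activity, an assumed figure) scale much further on the same 64 cores.
print(round(amdahl_speedup(0.95, 64), 2))  # 15.42
```

The serial residue, the clocking, synchronization, and shared-state activity that cannot be split up, is what limits fine-grained multicore simulation on most designs.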

Simulators are able to detect design styles. “We continue to add optimizations into the kernel for new modeling styles, new flops and latches, and new library cells — especially with power-aware cells coming in,” says Cadence’s Dua. “There’s a lot of designs that have a lot of replication, especially in the modern AI chips. The kernel needs to be smart and detect replications in the design. Then we can do sharing optimizations. These are necessary to keep the speed and the memory requirements for simulation jobs of these large SoCs under control. These are engine-level optimizations that are continually happening in the simulation products.”

There are other types of simulation where specific optimization can be added. “If you have a lot of activity in a given simulation time step, there is an opportunity to parallelize that,” adds Dua. “These high-activity tests and DFT simulations are a perfect example where we do see simulations speeding up by 3X to 5X.”

Are orders of magnitude speed-up possible? “There is a long history of academic research that has developed alternative kernel architectures,” says Metrics’ Letcher. “These haven’t seen any commercial success, and there are still challenges limiting the gain on real-world designs. The industry has so much invested in the SystemVerilog kernel that switching to something incompatible would be difficult to justify without a game-changing leap in performance, which isn’t apparent so far.”

A lot of the time, the value is not in the simulator itself but in what surrounds it. “Simulators are just calculating what your logic gates’ outputs are in terms of the inputs,” says Philippe Luc, director of verification for Codasip. “That job is fairly easy, and it is completely generic. Debugging with waveforms, with trace analysis, viewers with schematics — these are examples of capabilities that commercial tools provide, and add significant value. Debugging UVM is even more complex. As for the methodology itself, full random is too random, and directed test is too direct. It is only by considering the hardware we are designing that we know what the interesting sequences are, and where we want to concentrate our verification efforts.”

Accelerating verification
While single simulator performance is important for some tasks, most time spent in simulators is performing tasks such as regression. “When we talk about simulation using high-performance computing (HPC), and you’re trying to run thousands of simulations, the question is how fast those simulations can finish,” says Malhotra. “This is very different than asking how long it takes for a single simulation job to finish. Moving simulation to the cloud does require some enhancements to the infrastructure and enhancements to the tool to be able to support running at such huge capacity. The focus needs to be about how you speed up simulation workflows on the cloud.”

Dua agrees. “It’s about finding bugs in the least time and at the lowest cost. When we start thinking from that point of view, you can come up with a bunch of solutions to improve regression throughput. We can manage how to send your jobs to the appropriate servers in the cloud, and there are many more opportunities in this area.”

In today’s verification methodology, an important component is the constraint solver. “Your regression may comprise 1,000 tests and achieve 60% or 70% coverage,” says Malhotra. “You’re not able to find out why things were not covered because you need more insight or visibility into the stimulus distribution. You may have over-constrained something. Within the tool there is understanding, and there is learning. The tools can automatically identify the test, or the distributions will help you determine the right test, which will help you improve coverage or reduce the turnaround time.”

One current area of development is to maximize the effectiveness of a verification task by ensuring the most important scenarios are included. “When you talk about regression throughput, you could use machine learning techniques where you learn iteratively over an entire simulation regression,” says Dua. “Then you analyze patterns in the verification environment, which guide the simulation engine to basically achieve the matching coverage with reduced simulation cycles.”

Machine learning techniques will either enable you to achieve better coverage or achieve the same coverage in fewer cycles. This directly impacts verification productivity without any improvement in simulator performance.
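As a rough illustration of coverage-driven regression trimming (a deliberately simple stand-in for the vendors' ML techniques, which the article does not detail), even a greedy set-cover pass over per-test coverage data from a prior regression ranks tests by marginal coverage gain and drops redundant ones. The test names and coverage bins below are hypothetical:

```python
def select_tests(coverage_by_test, budget=None):
    """Greedy set-cover: repeatedly pick the test that adds the most
    not-yet-covered bins, stopping when no test adds new coverage
    (or when an optional test-count budget is reached)."""
    covered = set()
    order = []
    remaining = dict(coverage_by_test)
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # every remaining test is redundant
        covered |= gain
        order.append(best)
        del remaining[best]
        if budget and len(order) >= budget:
            break
    return order, covered

# Hypothetical per-test coverage bins harvested from a previous regression:
cov = {
    "t_reset":  {"rst", "init"},
    "t_burst":  {"init", "burst", "fifo_full"},
    "t_random": {"rst", "burst", "err"},
    "t_dup":    {"rst", "init"},   # adds nothing once others have run
}
order, covered = select_tests(cov)
print(order)
```

Two of the four tests reach full coverage here; real ML-based approaches go further by learning which stimulus patterns produce new coverage, but the payoff is the same: matching coverage with fewer simulation cycles.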

Getting assistance
Simulation is not an appropriate technology for all aspects of functional verification. Verification methodologies have evolved to identify some of these, such as clock-domain crossing issues, and apply dedicated formal engines.

Dedicated engines do a better job, and relieve the burden placed on simulation. “Simulation will always miss corner-case bugs,” says Rob van Blommestein, head of marketing at OneSpin Solutions. “Even if multiple simulation runs are employed, bugs can escape. Formal not only provides exhaustive verification to ensure that critical corner-case bugs are detected, but also allows for comprehensive coverage of the entire design. Using formal early in the design and verification process will relieve pressure on the simulation runs, so that simulation can focus on non-corner-case issues.”

The industry always has been aware that general-purpose computing is not particularly suitable for simulation. “In 1988 there was already a simulation accelerator from IBM,” says Davidmann. “Simulation acceleration has always been necessary for the largest chips. I don’t think any large piece of silicon could be built without emulators today. These are just simulators using an FPGA or custom processor. We can’t simulate 2 billion-gate designs in software, so you use hardware.”

FPGAs have been used for simulation acceleration almost since the day they were invented. Xilinx devices powered most of the early hardware boxes and can still be found in some of them today. Hardware prototyping has become increasingly important recently, and these systems, too, are powered by Xilinx devices. While early emulators were difficult to use, modern tool chains mean that users often are unaware their simulations are running on dedicated hardware engines. No cloud solution is likely to come close to the performance of these devices, but each solution excels at certain tasks.

Emulation is not a universal solution. “Emulation is great for its intended purpose of early software development and system testing,” says Letcher. “It is not so great for accelerating DV simulation — at least not in the current emulators, due to limits on what can be run efficiently. At best, emulators are an expensive and difficult-to-use solution. Alternative forms of hardware-accelerated simulation are possible.”

The industry also has developed intermediate solutions and hybrids of them. “Some people are looking for something in-between,” says Malhotra. “You are not totally in the simulation world, but you’re not in the emulation world. You are in the middle where you’re able to achieve, maybe not 1,000X, but possibly 100X to 500X. That is where part of your design is in simulation, whereas the other part is running on an emulator.”

One area where that technique is prevalent is with processor models. “We simulate a processor at the instruction level, where we don’t even worry about the microarchitecture of it,” says Davidmann. “We simulate it the way that the software sees it, so we can run production binaries in exactly the same way their hardware will run them. Today, using abstraction, I can boot Linux in 10 seconds in a simulator on my laptop. If I tried that in 1988 using a Verilog simulator, that would take about 10 months.”

Industry success
Could the industry have done more? Possibly. “Back in 1988, we were just approaching the million-transistor count for chips,” says Dua. “Today, there are tens of billions of transistors. Transistor count may not be the only metric, especially for functional verification that predominantly happens at RTL. But it gives us an idea that design complexity has increased two to three orders of magnitude since 1988, and the tapeout periods have shrunk since then.”

Others agree. “In the ’80s many chips were being re-spun because of bugs,” says Malhotra. “Today, RTL issues are being uncovered very early in the design process. What people want to see going forward is full visibility into the testbench. They want to have faster root cause analysis. This is where AI and ML will help you debug or find the root cause in faster time.”

Is there a better way? “If it’s a new algorithm you need, money isn’t necessarily the way to do it,” says Davidmann. “What’s needed is a lot of research, 90% of which will be interesting but of no use practically. Somebody might come up with something that will change the way things could be done, but I don’t believe it will happen indigenously. EDA breakthroughs are done by researchers in universities, spinning out and turning into companies.”

But few universities are interested in functional simulation these days, and it is prestige that inspires research, not money.

“Moore’s Law means there’s always a challenge, and I’d say the fact that we’re getting silicon out that works is amazing,” concludes Davidmann. “It means simulation is doing a good job. If we didn’t have simulation, where would these things be?”


