GPUs May Speed Up EDA Algorithms

Old, sequential EDA algorithms can’t keep pace with design complexity. New architectures are being explored.


The sequential EDA algorithms of old cannot keep pace with increasing design complexity, which is driving the industry to look at parallelism and other computational architectures such as the graphics processing unit (GPU).

A 10X or 20X speedup for gate-level simulations means that a test that runs today in a week will run in less than a day, and a test that runs today in a month will run in a few days.

“This reduction in the simulation run-time has a significant impact on the overall time-to-market of a silicon product,” said Srinivas Kodiyalam, ISV (HPC) applications manager at NVIDIA. “In some cases it enables you to run gate-level simulations on large and complex designs that would otherwise be impossible or infeasible to run. It also improves the quality of the product by allowing greater coverage of complex designs, as well as saving IT costs for semiconductor companies by enabling software-based accelerators.”
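The run-time arithmetic behind the 10X and 20X figures above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is made up for this example, and a 30-day month is an assumption.

```python
def accelerated_days(baseline_days: float, speedup: float) -> float:
    """Run time in days after dividing the baseline run time by the speedup factor."""
    return baseline_days / speedup


# Assumption: "a week" is 7 days and "a month" is 30 days.
for label, days in (("one week", 7), ("one month", 30)):
    for speedup in (10, 20):
        result = accelerated_days(days, speedup)
        print(f"{label} at {speedup}X: {result:.2f} days")
```

Under these assumptions, a week-long test drops below one day at either speedup, and a month-long test drops to between one and a half and three days.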

As such, GPUs can be ideal for certain EDA tasks. Because EDA data sets are too large to fit into cache and the EDA application’s access pattern is somewhat random, memory bandwidth becomes a bottleneck. “GPUs are perfectly suited for data-parallel algorithms with huge datasets, as they provide memory bandwidth and floating-point performance that are several factors faster than the latest CPUs,” he said.

For example, with NVIDIA Kepler GPUs there are several thousand processing cores organized in SIMD groups that allow the engineer to launch several million short-lived independent threads that don’t have to communicate with each other, he said. “Instead of optimizing for the latency of the single thread, optimization is for throughput—the number of threads that can be processed in specific time duration.”
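The property Kodiyalam describes, many short-lived threads that never communicate, can be pictured with a small Python sketch. It is not GPU code and does not reflect how any particular simulator works. The gate table, the work-item layout, and the `launch` helper are all hypothetical, and the loop runs sequentially on the CPU purely to show that each work item depends only on its own inputs.

```python
# Hypothetical gate library: each operation needs only its own two inputs.
GATE_OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def eval_gate(work_item):
    """Body of one short-lived 'thread': read own inputs, return own output."""
    gate_type, a, b = work_item
    return GATE_OPS[gate_type](a, b)


def launch(work_items):
    """Stand-in for a GPU launch. No item reads another item's result,
    so the items could be processed in any order or all at once."""
    return [eval_gate(item) for item in work_items]


# Assumed example data: (gate type, input a, input b).
work = [("AND", 1, 1), ("OR", 0, 0), ("XOR", 1, 0), ("AND", 1, 0)]
print(launch(work))
```

Because no work item waits on another, the useful measure is how many items finish per unit of time rather than how quickly any single one completes, which is the throughput focus described in the quote.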

A paradigm shift
“GPUs are a paradigm shift that gives engineers the opportunity to step back instead of saying, ‘How do we map this algorithm onto this brand new architecture? It doesn’t really fit very well the way it is,’” said John Croix, senior solutions architect at Cadence. “What if you take a step back and ask, ‘Can you redesign the algorithms to still solve the same problems but do it in a fundamentally different way?’ I’ve always thought that to make great strides, you have to have a revolutionary approach not an evolutionary approach. Evolutionary gets you 10% to 20%; revolutionary will get you 5X to 10X. The cool thing about GPUs is they make you look at the problem from a different perspective, trying to not force fit an algorithm onto a new architecture, but instead looking into the architecture and saying, ‘How can I take advantage of that? What types of algorithms really run well on this?’ And if you take a step back and work from out of the box, looking in and go back to fundamentals, sometimes you find new ways to solve a problem that you never would have thought of before.”

IT matters
From the perspective of how GPUs could impact the space between design and manufacturing, Juan Rey, senior director of engineering for Calibre at Mentor Graphics, pointed out that the type of EDA/CAD algorithms that he cares the most about are those used at the interface between design and manufacturing. They deal primarily with the layout that has been created to generate the masks for sending the design to the semiconductor foundry.

“There is a large difference in the type of infrastructure required to run software in this area if the software is used for design purposes versus when it is used on the semiconductor manufacturing floor,” Rey said. “Even though the algorithms need to be common, because there should be a common language between the two communities, the requirements are very different. Essentially the design community needs to verify that things are going to work. And to put it in general, broad terms, the semiconductor manufacturing facility needs to make sure that the design is going to have a good yield.”

This alone creates a big difference in scale between the two modes of operation. Verification of a very large chip on the design side typically uses tens to hundreds of cores, whereas the hardware on the semiconductor manufacturing floor runs to several thousand cores. Likewise, designers tend to have either workstations or servers, whereas the manufacturing floor needs clusters of servers to run its applications.

Mentor has been making sure its software works in a highly parallel way since early on because the transition started happening more than 10 years ago, Rey said. “The software needed to be multi-threaded first, such that it could work on shared memory processors. Then it had to become distributed and then work on distributed systems. At some point we had to pay attention to even hardware acceleration techniques to satisfy the semiconductor manufacturing needs. That happened a few years ago when we had to do hardware acceleration using Cell processors on top of having clusters of distributed x86-type of architectures.”

But as general-purpose multicore and many-core architectures came into use, the hardware and software community caught up with general computing applications. It was no longer necessary to do a lot of specific hardware acceleration to satisfy the needs of customers.

Still, the company keeps doing research and investigation for all of its algorithms, Rey stressed. “We are seeing there are envelopes of cost and power for running these systems. For that reason, we have been investigating GPUs for many years. As a matter of fact, we started investigating GPU usage at the same time we started investigating Cell processors. At a certain point several years ago, we decided to go with the Cell processor, but we keep an eye on [GPUs] and keep doing research on the applications that run well on GPUs. We keep a good level of attention on where the GPU community is going.”

In that vein, he noted that while some algorithms are naturally well-suited to acceleration on GPGPUs (general-purpose computing on graphics processing units), the EDA community needs to pay close attention to what percentage of the complete application’s computing that algorithm accounts for. “To give an example related to this, in some cases a simulation component may take 80% of the total computing that is required, and it is easier to port into GPU than the remaining 20%. When you are in a situation like that, you know that the maximum you can get is 4x to 5x acceleration on the complete system.”
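The ceiling Rey describes matches what is commonly known as Amdahl’s law: if only a fraction of the work is accelerated, the untouched remainder bounds the overall gain. The sketch below is a minimal illustration of that formula, not something taken from Mentor’s tools. The function names and the sample GPU speedups of 10X and 20X are assumptions chosen for the example.

```python
def amdahl_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on overall speedup as the component speedup grows without limit."""
    return 1.0 / (1.0 - accelerated_fraction)


# Rey's example: the simulation component is 80% of the total computing.
for gpu_speedup in (10, 20):  # assumed component speedups
    overall = amdahl_speedup(0.8, gpu_speedup)
    print(f"GPU part {gpu_speedup}X faster -> whole run {overall:.2f}X faster")

print(f"Ceiling with 80% accelerated: {amdahl_limit(0.8):.2f}X")
```

With 80% of the work on the GPU, the overall gain can approach but never exceed 5X, however fast the accelerated portion becomes. Under this formula the component itself has to run about 16X faster before the whole application reaches 4X, which is consistent with the 4x to 5x range in the quote.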

A second aspect to heed is whether that will be enough to justify customers making the IT investment required to support that type of hardware system, put it into production, and get the support needed to perform that activity. Rey admitted that this is not very clear, and that it depends on the particularities of each activity. In some cases it may be justifiable to make that investment; in others it may not. “Essentially what happens is that remaining 20% of the algorithms that are not easy to port into a GPU or to accelerate into a GPU may be a major roadblock for being able to justify the total turnaround time that is required.”

Croix added that the more that can be put onto the GPU, the better off you’re going to be, but the GPU has a few fundamental limitations. “One is that it doesn’t have the same sort of cache that the CPU has, and there is certain code that is inherently scalar…It’s absolutely the wrong tool for small problems. But if you put a CPU onto the GPU for those serial portions of code that you can’t make parallel, and you wrap it with the same type of caches and whatnot that we have today on our normal microprocessors, then you can eliminate a lot of that data transfer that really kills the GPU performance.”

Interestingly, Kodiyalam pointed out that EDA tools such as RocketSim or VCS solve the Verilog/SystemVerilog simulator’s bottleneck challenge by offloading the most time-consuming calculations to an ultra-fast GPU-based engine. “In this hybrid computing model, the GPU serves as an accelerator to the CPU to offload the CPU and to increase computational efficiency. In order to exploit this hybrid computing model and the massively parallel GPU architecture, application software will need to be redesigned.”

He added that Rocketick and Synopsys engineers along with NVIDIA engineers have been working together over the past few years on the use of GPUs to accelerate both the RTL and gate-level simulations in the respective EDA tools. These tools work from within the familiar simulator environment and run alongside the existing testbench, eliminating ramp-up time while providing bit-precise results.