How has the migration to multi-core architectures affected the EDA industry and what does this tell us about the way in which the rest of the industry is handling the transition?
Until recently, EDA software rode the coattails of increasing processor performance as part of its drive to keep delivering faster and more powerful development software to the people designing, among other things, the next generation of faster processors. It was a fortuitous ride. Around the turn of the century, with the migration to multi-core computing systems, all of that changed. To improve performance, software had to make effective use of multi-core architectures, and that required re-architecting it. How well has the EDA industry dealt with this change, a change similar to the one the entire software industry is facing? Semiconductor Engineering asked the industry for its report card.
There are three primary ways in which multi-core can be applied. The first is to change the algorithms so that they exploit the underlying hardware architecture. The second is to change the hardware on which the software runs. And the third is to recast the problem into one that can be solved in an easier manner. EDA is attempting all three of these solutions and, interestingly, all of them are being used to help solve the verification problem.
Change the Algorithm
Chris Rowen, a Cadence fellow, starts the discussion in a positive manner. “Happily there is a fair bit of the EDA world that is inherently parallel.” Many of these applications migrated to multi-core architectures fairly early on, and quite often their use of multi-core goes unnoticed.
The biggest limiter is often not the algorithm itself, but the memory system. “One thing that has reduced the pace at which parallelism has been adopted in EDA tools is memory bottlenecks,” says Pranav Ashar, chief technology officer at Real Intent. “Fine-grain multi-threading quickly triggers latency bottlenecks, and for the typical SoC benchmark these limits are reached rather rapidly.”
However, Ashar encourages continued development. “Even with the known limitations of current multi-core architectures, EDA companies must push ahead with parallelizing their software. An 8x speed-up on a 32-core processor is better than no speed-up.”
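Ashar’s point is essentially Amdahl’s law: any serial or memory-bound fraction of the workload caps the achievable speedup, no matter how many cores are available. The minimal sketch below (plain C++, standard library only) simply tabulates that bound; with just 10% of the run time left serial, 32 cores deliver roughly 8x, which is the kind of result he describes.

```cpp
#include <cstdio>

// Amdahl's law: if a fraction s of the run time cannot be parallelized,
// the best possible speedup on n cores is 1 / (s + (1 - s) / n).
static double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main() {
    const double serial_fractions[] = {0.01, 0.05, 0.10};
    const int core_counts[] = {4, 8, 16, 32};

    printf("serial ");
    for (int n : core_counts) printf("%8d cores", n);
    printf("\n");

    for (double s : serial_fractions) {
        printf("%5.0f%% ", s * 100.0);
        for (int n : core_counts) printf("%13.1fx", amdahl(s, n));
        printf("\n");
    }
    return 0;
}
```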
Michael Sanie, senior director of verification marketing at Synopsys, talks about problems associated with speeding up a simulator. “We can divide a design and have each core simulate a partition. This technology has been around for a while. The result is design-dependent. If each core is operating independently, you will get a very good speedup. If there are a lot of interdependencies, there will be little speedup.”
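As a rough illustration of why the result is design-dependent, here is a toy cycle-based sketch (not Synopsys’ implementation) in which partitions are evaluated on separate threads each cycle and then synchronize to exchange boundary signals. The more cross-partition nets there are, the larger the serial exchange phase and the smaller the speedup.

```cpp
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Toy cycle-based simulation split across partitions. Each simulated cycle,
// every partition is evaluated on its own thread, then all threads join so
// that boundary signals can be exchanged before the next cycle. That join and
// exchange step is the serial portion, and it grows with the number of
// cross-partition nets, which is why the speedup is design-dependent.

struct Partition {
    int boundary_in  = 0;   // value driven by a neighbouring partition
    int boundary_out = 0;   // value this partition drives to its neighbour
    long work        = 0;   // stand-in for internal gate evaluation

    void evaluate_cycle() {
        // Placeholder for evaluating the gates inside this partition.
        for (int i = 0; i < 100000; ++i) work += (boundary_in + i) % 7;
        boundary_out = boundary_in + 1;
    }
};

int main() {
    std::vector<Partition> parts(4);          // design split into 4 partitions

    for (int cycle = 0; cycle < 1000; ++cycle) {
        // Parallel phase: evaluate every partition concurrently.
        std::vector<std::thread> workers;
        for (auto &p : parts)
            workers.emplace_back([&p] { p.evaluate_cycle(); });
        for (auto &t : workers) t.join();

        // Serial phase: exchange boundary signals (a ring, for illustration).
        for (std::size_t i = 0; i < parts.size(); ++i)
            parts[(i + 1) % parts.size()].boundary_in = parts[i].boundary_out;
    }

    printf("partition 0 accumulated work token: %ld\n", parts[0].work);
    return 0;
}
```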
More cores definitely help. “It is easier to break out different test cases and run each test case on different CPUs,” says Cadence’s Rowen. “This is a whole lot easier, and in theory we should be able to distribute a stream of requests across millions of CPUs. There are some practical limitations to this, not the least of which is that our customers care even more about data security than they do about their Internet searches.”
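The test-level parallelism Rowen describes is “embarrassingly parallel”: every test case is independent, so jobs can simply be pulled from a queue by however many workers are available. The hypothetical sketch below uses threads on one machine; a real regression flow would dispatch each job to a farm or cloud machine instead, and the test names here are invented.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Regression-level parallelism: independent test cases are pulled from a
// shared queue by worker threads. In a real flow each "run" would launch a
// simulator process (or a job on a compute farm); here it is just a stub.

int main() {
    const std::vector<std::string> tests = {
        "smoke_boot", "dma_stress", "cache_coherency", "pcie_link",
        "usb_enum", "power_seq", "interrupt_storm", "random_regress_001"};

    std::atomic<std::size_t> next{0};         // index of the next unclaimed test
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 4;            // fall back if the count is unknown

    auto worker = [&](unsigned id) {
        for (;;) {
            std::size_t i = next.fetch_add(1);  // claim the next test atomically
            if (i >= tests.size()) return;
            // Stand-in for launching the actual simulation job.
            printf("worker %u running test '%s'\n", id, tests[i].c_str());
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(worker, w);
    for (auto &t : pool) t.join();
    return 0;
}
```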
David Kelf, vice president of marketing for OneSpin Solutions, sees simulation as a hopeless problem. “It requires a very large number of test vectors simply to get the SoC into the right state, an approach that is rapidly becoming impractical. Automated solutions are now replacing dynamic simulation testing, leveraging state-space exploration that allows questions to be asked of the communication fabric and the IP on it.”
Back in May of 2013, OneSpin looked at ways to solve the security problem by exploiting some unique aspects of formal verification. Formal splits the problem into a number of small pieces, each of which can be worked on individually, with the results brought together for display to the user. It is also rare for a formal tool to work on the complete design at any one time.
The approach carved out a problem to be solved and extracted the pieces of the design necessary. It then used an internal mapping table to obfuscate the design. OneSpin’s CEO said there was no way to reconstruct the design from the individual pieces. In addition, a mathematical abstraction of the design was made before the snippet was encrypted and sent off to the cloud to be worked on by a formal solver. In the cloud, the abstracted, obfuscated design fragment was processed, and when the results were ready they were encrypted again before being sent back to the local machine.
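The details of OneSpin’s scheme are not spelled out here, but the mapping-table idea can be sketched as follows: signal names in the exported fragment are replaced with opaque identifiers, and only the local machine keeps the table needed to translate solver results back. All names and structure in this sketch are illustrative assumptions, not OneSpin’s actual format, and a real flow would also abstract and encrypt the fragment.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Name obfuscation before shipping a design fragment off-site: every signal
// name is replaced with an opaque identifier, and only the local machine
// keeps the table needed to translate results back.

class NameMap {
    std::map<std::string, std::string> forward_;   // real name -> opaque id
    std::map<std::string, std::string> reverse_;   // opaque id -> real name
    int counter_ = 0;

public:
    std::string obfuscate(const std::string &real) {
        auto it = forward_.find(real);
        if (it != forward_.end()) return it->second;
        std::string id = "n" + std::to_string(counter_++);
        forward_[real] = id;
        reverse_[id] = real;
        return id;
    }
    std::string restore(const std::string &id) const {
        auto it = reverse_.find(id);
        return it == reverse_.end() ? id : it->second;
    }
};

int main() {
    NameMap table;
    const std::vector<std::string> fragment = {
        "cpu0.lsu.store_buffer_full", "noc.router3.grant", "ddr_ctrl.refresh_req"};

    // What leaves the building: opaque names only.
    for (const auto &sig : fragment)
        printf("export %s as %s\n", sig.c_str(), table.obfuscate(sig).c_str());

    // A result coming back from the cloud solver, mapped back locally.
    printf("counterexample on %s\n", table.restore("n1").c_str());
    return 0;
}
```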
Sanie describes another approach to application-level multi-core. “If you are running simulation and, alongside it, doing checking, debug and other management tasks, each of those can be placed on a separate core. At that point they are fairly independent, but the piece that takes the longest, which is usually the simulation, will dominate, and all of the others basically happen for free. Simulation is very memory intensive, and when you have the design partitioned across four or more cores, memory access becomes the bottleneck. This is really why design-level partitioning doesn’t give you that much.”
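A minimal sketch of this task-level split, assuming three independent jobs launched with std::async: wall-clock time converges on the slowest task (the simulation), so the companion tasks effectively come for free, just as Sanie describes. The task names and durations are stand-ins.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Task-level parallelism: the simulator and its companion tasks each get a
// core. Wall-clock time approaches that of the slowest task (usually the
// simulation itself), so the others effectively come for free.

static void busy(const char *name, int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));  // stand-in work
    printf("%s done after %d ms\n", name, ms);
}

int main() {
    auto t0 = std::chrono::steady_clock::now();

    auto sim   = std::async(std::launch::async, busy, "simulation",        800);
    auto check = std::async(std::launch::async, busy, "checking",          200);
    auto cover = std::async(std::launch::async, busy, "coverage analysis", 150);

    sim.get(); check.get(); cover.get();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    printf("total wall time: %lld ms (close to the 800 ms simulation alone)\n",
           (long long)ms);
    return 0;
}
```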
Change the Hardware
Almost since the introduction of simulation, companies have tried to design hardware that would speed up the process. An early example was the IBM Yorktown Simulation Engine, a special-purpose, highly parallel programmable machine for gate-level simulation built in 1985. It was estimated that it would be capable of simulating two million gates at a speed of over three billion gate simulations per second, or of simulating an IBM 3081 processor at a rate of 370 instructions per second. That’s a far cry from the levels now achievable.
Today, a more common approach makes use of FPGAs, or similar types of structures, to build an equivalent circuit that can be executed. This we know as emulation, and emulators are quickly taking over many of the larger simulation tasks. But emulators tend to be large and expensive. Even prototyping platforms, which are significantly cheaper, have some issues when it comes to getting the design running reliably in the early stages of development. Simulation tends to be a better approach when the design has not stabilized.
What people want is a low-cost approach that runs on standard hardware, and many machines already contain a possible solution.
“There are other approaches such as using a GPU,” says Sanie. “This makes the problem more memory scalable. Each GPU has many processors and each core can operate on one, two or ten gates, or one line of RTL. It is perhaps easier to partition the design into 10,000 pieces than to divide it into four. This is because the GPU is structured differently from a CPU and memory handling is different.”
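A real implementation would be written as a CUDA or OpenCL kernel; the sketch below only mimics the data-parallel structure with ordinary C++ threads. The point it illustrates is the one Sanie makes: all gates within a logic level are mutually independent, so the level can be split into thousands of tiny work items rather than four large partitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// GPU-style data parallelism sketched with CPU threads: every gate within a
// logic level is independent of its peers, so the whole level can be
// evaluated by many fine-grained workers. On a GPU, each "chunk" below would
// be a block of threads, with one gate (or a handful) per thread.

struct Gate { int a, b; std::uint8_t op; };         // input nets and operation

int main() {
    const std::size_t num_gates = 100000;
    std::vector<Gate> gates(num_gates);
    std::vector<std::uint8_t> value = {0, 1};       // two primary input nets
    for (std::size_t g = 0; g < num_gates; ++g)
        gates[g] = {int(g % 2), int((g + 1) % 2), std::uint8_t(g % 2)};

    std::vector<std::uint8_t> next(num_gates);      // outputs of this level

    auto eval_chunk = [&](std::size_t begin, std::size_t end) {
        for (std::size_t g = begin; g < end; ++g) {
            const Gate &gt = gates[g];
            next[g] = gt.op ? (value[gt.a] & value[gt.b])    // AND
                            : (value[gt.a] | value[gt.b]);   // OR
        }
    };

    const unsigned chunks = 8;                      // a GPU would use thousands
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < chunks; ++c)
        workers.emplace_back(eval_chunk, c * num_gates / chunks,
                             (c + 1) * num_gates / chunks);
    for (auto &t : workers) t.join();

    printf("gate 0 output: %d, gate 1 output: %d\n", next[0], next[1]);
    return 0;
}
```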
Aveek Sarkar, vice president for Ansys-Apache, agrees. “We use a mix of CPU and GPU architectures to accelerate our algorithms. For mechanical and meshing problems, computational fluid dynamics, or electromagnetic problems, we can benefit from the architecture of the GPU. We need to look at which architecture benefits us the most. The GPU communication structure is different and the number of threads that can be run is high. The problem with GPU-centric approaches is the memory. The amount of memory available to you is limited.”
There is another downside to this approach that makes it difficult for the general software population to make use of the GPU. “We have to partner closely with the GPU providers to do this,” admits Sarkar. “We also have specialized groups who are tasked to enable this.”
Synopsys sees the same problems and notes that there are no standardized tool chains available today.
Change the Problem
If you can’t improve simulator performance enough, or cannot make use of other available hardware, there is a third approach: change the problem. That is a viable approach being used by many companies today.
“We need to have a divide-and-conquer approach,” points out Johannes Stahl, director of product marketing for virtual prototyping at Synopsys. “If you try to optimize a system by putting hardware together, executing it in an emulator, and exploring the different architectures and the application software, you will never get done. You would be too late. So at an earlier stage of the design project you have to abstract, and make decisions based on that.”
Abstraction is the key to a successful virtual prototype, which can be created and deployed very early in the development cycle, before the RTL has been designed and before emulation even becomes a viable alternative. Virtual prototypes can run at a sizable fraction of real-time speed and enable many aspects of system architecture exploration, early hardware verification, the design and optimization of power profiles, low-level software and driver development, and many other functions that are becoming necessary in today’s complex SoCs.
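To make the abstraction concrete, here is a deliberately tiny, hypothetical loosely-timed model in plain C++ (a production virtual prototype would typically be built with SystemC/TLM): each memory access simply advances an approximate notion of time instead of being simulated cycle by cycle, which is what makes such models fast enough for early software development. The latencies and names are invented for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Loosely timed, transaction-level modelling in miniature: instead of
// simulating every bus cycle, each read or write just advances a notion of
// simulated time by an approximate latency. This is the abstraction that
// lets a virtual prototype exist long before the RTL does.

struct Timing { std::uint64_t ns = 0; };            // simulated time, not cycles

class AbstractMemory {
    std::uint8_t mem_[4096] = {};
public:
    std::uint8_t read(std::uint32_t addr, Timing &t) {
        t.ns += 60;                                 // assumed read latency
        return mem_[addr % 4096];
    }
    void write(std::uint32_t addr, std::uint8_t v, Timing &t) {
        t.ns += 40;                                 // assumed write latency
        mem_[addr % 4096] = v;
    }
};

int main() {
    AbstractMemory ddr;
    Timing t;

    // A tiny "driver" routine running against the abstract model.
    for (std::uint32_t i = 0; i < 1000; ++i) ddr.write(i, std::uint8_t(i), t);
    std::uint32_t sum = 0;
    for (std::uint32_t i = 0; i < 1000; ++i) sum += ddr.read(i, t);

    printf("ran %u transactions in %llu simulated ns, checksum %u\n",
           2000u, (unsigned long long)t.ns, sum);
    return 0;
}
```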
Jon McDonald, strategic program manager at Mentor Graphics, explains some of the benefits of the virtual prototype. “By providing virtual platforms, the system designers can make more informed decisions based on the requirements of the system, the software developers can optimize their code for the constraints of the system, and the hardware developers can be confident of the implementation requirements for a successful system.”
Providers of multi-core platforms probably have a larger problem than EDA companies. “The software stacks, even if coming from open source, have to be modified,” points out Stahl. And Real Intent’s Ashar adds a final problem that affects all multi-core software: “It is challenging to write bug-free multi-threaded programs. Revamping a large existing code base to become multi-threading friendly is a nontrivial reengineering undertaking.”
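Ashar’s last point is easy to demonstrate. The sketch below shows the classic retrofit hazard: a shared counter incremented from two threads is a data race when it is a plain int, and correct when it is a std::atomic. Finding every such shared access in a large legacy code base is the nontrivial part.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// The pitfall Ashar alludes to: two threads incrementing a shared counter.
// The plain 'int' version is a data race (undefined behaviour, and in
// practice lost updates); the std::atomic version is correct.

int              racy_count = 0;
std::atomic<int> atomic_count{0};

static void bump() {
    for (int i = 0; i < 1000000; ++i) {
        ++racy_count;      // unsynchronized read-modify-write: a bug
        ++atomic_count;    // atomic read-modify-write: safe
    }
}

int main() {
    std::thread t1(bump), t2(bump);
    t1.join(); t2.join();
    printf("racy:   %d (often less than 2000000)\n", racy_count);
    printf("atomic: %d (always 2000000)\n", atomic_count.load());
    return 0;
}
```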
Rather than asking the industry to assess itself, we should ask the customers.
I’m sorry, but to be frank the EDA industry has dropped the ball on multi-core processors. The Pentium D and Athlon 64 X2 were released in 2005. It’s 2014. Quick, raise your hand if your software supports running on more than one processor. Now put it down if it isn’t 2x faster on 2 CPUs. Now, for those of you with your hand still up in the air and who are being honest… put your hand down if it isn’t near 4x faster on 4 CPUs (because those 2005 machines were 2-CPU, dual-core, so 4 CPUs)… and finally, put your hand down if you charge more for 4 CPUs than for 1 CPU.
Those of you with your hands down, shame on you. You are 9 years behind.
Now… I just went to Dell’s website and for around $8,500 I can buy a blade with two 12-core processors, with hyperthreading support. Ya, this doesn’t count the memory upgrade or the blade… but I’m also not asking for volume discounts or anything like that either. This is a reasonable mainstream machine.
The industry goal should be 20x faster on a 24-core machine, and maybe 30x faster if I enable hyperthreading. And this should be using a single base license. This is the goal TODAY. If this is your goal for 2-4 years from now, scale up to what Intel or Arm will have then.
Quick… let’s see responses naming any tool that even comes close to what I just said.
I was just thinking the same thing. Why the hell do we have to pay extra for multi-core support? It’s just good programming practice these days and should be included in the base software package. Every high-end machine I’ve used in the past six or so years has had at least 4 physical cores. Give me a break.
I agree with you, and thank you for the comment. Most of the time, customers are not allowed to provide specific information without violating their non-disclosure agreements. I hope that, as we grow, we will have the resources to perform full industry surveys. Until then we will just have to ask vendors to self-assess. I know from the days when I did performance reviews that people tend to be quite honest when you ask them, and I hope the same holds true at the company level.