Experts at the Table, part 1: Are we looking to solve the wrong problems? Where does parallelization work best, and why?
Semiconductor Engineering sat down to talk about parallelization efforts within EDA with Andrea Casotto, chief scientist for Altair; Adam Sherer, product management group director in the System & Verification Group of Cadence; Harry Foster, chief scientist for Mentor, a Siemens Business; Vladislav Palfy, global manager for applications engineering at OneSpin; Vigyan Singhal, chief Oski for Oski Technology; and Bill Mullen, senior director for R&D at ANSYS. What follows are excerpts of that conversation.
SE: Why is parallelization so difficult?
Foster: I have a background in parallelization that goes back to supercomputers and high-end servers in the ’90s. We used hundreds of processors, and we were trying to take advantage of them in developing different types of EDA solutions. There were some key lessons learned. Two factors still hold true today that explain the limitations of parallelism. First is the inherently serial portion of the code, which is accounted for by Amdahl’s Law. Second is the overhead associated with parallelism. That has improved since the ’90s, but it cannot be dismissed.
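As a rough illustration of the two factors Foster cites, here is a minimal sketch in Python (illustrative only, with made-up numbers) of how the serial fraction and a per-worker coordination overhead bound the achievable speedup:

```python
# Minimal sketch (not from the discussion): Amdahl's Law with an added
# per-worker overhead term, standing in for the two limiting factors.

def speedup(serial_fraction, workers, overhead_per_worker=0.0):
    """Estimated speedup for a workload with the given serial fraction.

    serial_fraction: portion of runtime that cannot be parallelized (0..1)
    overhead_per_worker: coordination/communication cost per worker,
      expressed as a fraction of the original single-threaded runtime.
    """
    parallel_fraction = 1.0 - serial_fraction
    new_time = (serial_fraction
                + parallel_fraction / workers
                + overhead_per_worker * workers)
    return 1.0 / new_time

# With 5% serial code and a small per-worker overhead, 100 workers give
# roughly 14X, and 1,000 workers actually do worse than 100: the serial
# portion and the accumulated overhead dominate.
for n in (10, 100, 1000):
    print(n, round(speedup(0.05, n, overhead_per_worker=0.0001), 1))
```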
Sherer: One of my first projects was working with an analog simulator. It was a parallel, distributed set of processes, and we built a mixed-signal environment that had to pass data, so decoupling was critical. What we see now, within our own engines and with digital simulation, is that algorithms conceived for serial execution do not lend themselves very well to simply being ported to a multi-processor environment. We attempted that four or five years ago. We tried to take the algorithm, pull it apart, decouple the chunks and then distribute them. We had cool names for it, but since we didn’t fundamentally change the algorithm, there were hard limits on the parallelism we could achieve.
Singhal: Parallelization is really cool and powerful, but it is extremely hard to implement and get right. The reason it is so hard is the same reason hardware design and verification is so much harder than software design and verification. Hardware design is essentially one massively parallel program in which every block is a parallel process. That makes design hard, debug hard and synchronization hard, and together they add a layer of complexity that is hard to manage in software. You have to decide at what level you want to parallelize things. If you go very fine-grained, with threads and synchronization, you have to worry about overhead, debug and reproducibility. If you go coarse-grained, you have to pick the right granularity and make sure the parallel pieces are balanced. That adds a layer of complexity, which we are used to when designing hardware, but in software it is very hard.
Palfy: There are some problems that are very well suited to parallelization. In the formal world that we come from, there are tasks that you can automate. These can be separated quite neatly, and you can get good performance. But what we try to do is be smart about what we parallelize, because sometimes, instead of brute-force parallelization, you want some data analysis that helps you choose the correct way to solve the problem. That can eliminate the need for parallelization altogether.
Mullen: Partitioning is really important for distributed computing. You cannot be too fine, but you can’t be too coarse, either. They have to be relatively balanced so that you get efficient usage of all of the processors and really get scalability to thousands of cores.
Casotto: We can go to 20,000 cores to solve a problem. We have found a lot of parallelism, not at the application level, but at the workflow level. We have been working on this for the past 20 years, trying to educate people that there is low-hanging fruit in the parallel execution of work across tens of thousands of machines. Today, our customers do run tens of thousands of simulators, and each one may be single-threaded, but the net effect is that we are actually taking advantage of the available resources.
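A minimal sketch of the workflow-level parallelism Casotto describes: fan independent, single-threaded simulation jobs out over a pool of workers. The `run_simulation` command and the test list are hypothetical placeholders; a production flow would hand these jobs to a grid scheduler rather than a local process pool.

```python
# Workflow-level parallelism: each job is serial on its own; the speedup
# comes from running many independent jobs at once. 'run_simulation' is a
# hypothetical stand-in for a single-threaded simulator invocation.

import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

TESTS = [f"test_{i:04d}" for i in range(10_000)]  # independent testcases

def run_one(test_name):
    result = subprocess.run(["run_simulation", "--test", test_name],
                            capture_output=True, text=True)
    return test_name, result.returncode

if __name__ == "__main__":
    failures = []
    with ProcessPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(run_one, t) for t in TESTS]
        for future in as_completed(futures):
            name, rc = future.result()
            if rc != 0:
                failures.append(name)
    print(f"{len(failures)} of {len(TESTS)} tests failed")
```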
SE: Is dealing with parallelism in EDA and hardware design inherently more difficult than for the generic software industry?
Foster: I don’t think so. When you look at simulation, what criteria are required to be successful with parallelism? There are three things. First, the design itself is a key component of this. The design must have a high degree of concurrent activity going on. That is required for partitioning. Second, you have to have a balanced partition. If it is out of balance, it doesn’t matter how much parallelism is going on. Third, and possibly the most important, you have to have low inter-partition communication. We all have marketing slides that claim we get 10X improvement through parallelism, and yet the reality is that you can find another example where you actually do worse than single-threaded. It is so highly dependent on the design. It is the same challenge across multiple domains, at least in my experience.
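A small sketch, with illustrative numbers only, of why Foster’s second and third criteria matter: the wall-clock time of a partitioned run is set by the slowest partition plus the inter-partition communication, so an unbalanced or chatty partitioning can end up slower than the single-threaded run it was meant to beat.

```python
# Illustrative numbers only: all partitions must finish and exchange data
# at every synchronization point, so the slowest partition plus the
# communication cost sets the wall-clock time.

def parallel_time(partition_work, comm_cost_per_sync, syncs):
    return max(partition_work) + comm_cost_per_sync * syncs

balanced   = [25, 25, 25, 25]   # 100 units of work, evenly split
unbalanced = [70, 10, 10, 10]   # same total work, one hot partition

print(parallel_time(balanced,   comm_cost_per_sync=0.1, syncs=50))  # ~30
print(parallel_time(unbalanced, comm_cost_per_sync=0.1, syncs=50))  # ~75
print(parallel_time(balanced,   comm_cost_per_sync=2.0, syncs=50))  # ~125, worse than serial
```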
Sherer: I will echo that. There are similarities with cycle simulation, which the industry went through in 1997. We claimed it was 1,000X. By the end of the year it was 3X. You wondered why, and it wasn’t so much about the engines. They were good cycle engines. The trouble we found was that we would run with a customer and they would say it was a cycle-ready design. When we looked at it, they would say, except for this one block that was built 20 years ago, which was hand-coded and delay-dependent, but everything else is cycle. You can easily end up with a gain of less than 1X. With that said, if we look at designs today, we see highly asynchronous designs. Around the late ’90s it was single-clock designs, or designs based on a single fast clock. Customers could not just port timing from one to the next, so they started to move the time dependency out of the design and it became cycle-ready. But at 40nm we went to asynchronous timing domains.
Foster: A lot of them do that for power requirements.
Sherer: But that broke some of the parallelism, which now has to be stitched back together. We will probably see another wave of design change.
Singhal: Parallelization is no harder or easier for EDA. You just have to be careful about which problems you are trying to parallelize.
SE: A design has already told you where all of the parallelism is. The designers have done the work for you. That is not the case with software.
Singhal: Signal integrity is a parallelizable problem, and there are tons of problems that are. If you have thousands of tests, you can run each of them on a different machine. Or if you have thousands of properties, you can run each of them on its own. In 1993, there was a formal proof that any verification problem, a single verification problem, is PSPACE-hard, and PSPACE-hard is harder than NP-complete. NP-complete means that if you have an infinitely parallel system, then the problem can be finished in polynomial time. But PSPACE-hard means that even with infinite parallelism you still cannot finish the problem in polynomial time. If that is the case, then a single verification problem, which has been shown to be PSPACE-hard, will never be parallelizable. There is nothing that you can do, ever. Think about floorplanning and placement. The way we have tackled it is to break the problem into smaller parts. High-level floorplanning divides up the blocks and decides where they go, then you do lower-level floorplanning, then place and route. The same goes for verification. If you can break up a system-level verification problem into sub-components and specify the contracts on the blocks, then each block can say, at an abstract level, what it promises to everybody else. If I can take those promises, those contracts, and verify the composition of those contracts, then the problem is solvable.
Palfy: Then you are basically looking at an integration problem, not a parallelization problem.
Singhal: Yes. Then the verification problem becomes parallelizable because you have changed the problem. But unless you do that, you are stuck with a very hard theoretical problem.
Sherer: All of the comments that have been made so far are engine-based, but you are right. Verification is bigger than any single engine, and the state space is so large that unless you apply some form of parallelism to break the problem, there is no solution.
Foster: Divide and conquer.
Singhal: Yes, but that means you have to know how to divide. The problem has to be dividable, and then the results have to be integrated.
Mullen: Sometimes it is very difficult to divide. In my product, we are dealing with a circuit that has tens of billions of nodes, and we are trying to get accuracy that requires flat analysis. So we have to produce a circuit simulation result that includes power integrity and voltage drop, and partition and parallelize it while getting the same accuracy as if we were solving a single matrix. It is not a harder space than other software domains, but the EDA domain knowledge has to be applied very carefully to decide how to solve things and give users what they need in terms of accuracy. You take advantage of things such as the layout and matrix techniques that allow you to parallelize efficiently.
Casotto: You said it was about the same as other problems, and verification has its own complexity. But we also have other types of problems, like analog simulation, which is hard because you have signals that go everywhere, and that is hard to partition. We also have some bad habits. We are the dinosaurs of automation. We started a long time ago, maybe before others, and back then it was easy when you wanted to speed up your simulator. You just waited for the next generation of processor and you got a factor of 2X, which would have taken a lot of effort to achieve with partitioning. Now there is a plateau in performance, so the motivation to break out of the sequential model is getting stronger. It will become easier to partition because the chips are so large, so there will be ways to break them down easily, and we need to be looking not only for the factors of 2X, 3X and 4X, but also for the factors of 100X.
Palfy: But the problem is, are they repeatable? We are dealing with the same types of problems, and what we are looking into, which is really interesting, is that we can utilize machine learning now and learn more about the ginormous amount of data that we actually get from collaborating, working on and verifying these designs. The goal is not just to be faster using brute force, but to find ways to be smarter about how we deal with the problem. Include some heuristics and data analysis. Tackle it in a smarter way so that we are no longer dinosaurs.
Casotto: Yes, we have a bad habit. Every simulator has to be compatible with HSPICE. If it’s off by something in the seventh digit, then it is not acceptable. Maybe that has constrained innovation, and it limits the adoption of new technologies. The business model and the licensing of parallel applications also affect adoption. A customer had a fancy new analog simulator that was parallel, but the licenses to run in parallel were expensive. They did a quick tradeoff between benefit and license cost, and they determined that four threads was the limit. Because of the business model, they were limiting their adoption of parallelism to just four threads. That is a long way from 20,000.
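A back-of-the-envelope version of that tradeoff, with hypothetical numbers only (a 10% serial fraction and a flat per-thread license cost, neither of which comes from the discussion): the marginal speedup bought by each additional license shrinks quickly, which is how a customer can rationally stop at a small thread count.

```python
# Hypothetical numbers only: speedup per license under Amdahl's Law.
# The 10% serial fraction and the flat per-thread license cost are
# assumptions for illustration, not figures from the discussion.

def amdahl_speedup(serial_fraction, threads):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

prev = amdahl_speedup(0.10, 1)
for threads in (2, 4, 8, 16):
    s = amdahl_speedup(0.10, threads)
    added_licenses = threads // 2           # licenses added since the last step
    marginal = (s - prev) / added_licenses  # extra speedup per added license
    print(f"{threads:2d} threads: {s:4.2f}X total, "
          f"{marginal:4.2f}X gained per extra license")
    prev = s
```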
Sherer: There is a coarse grain of parallelism that is simply more simulations, more runs. We see simulation license counts growing 1.5X to 2X per customer per year. The consumption is almost insatiable, on both the analog and digital sides. But the question is, when you start running into internal machine limits and you go to the Cloud, is that just an extension of the available hardware? We are churning through the same simulations, the same results, again and again. We are repaving the same road. With parallelism and the ability to spawn off all of that parallel simulation, it doesn’t necessarily get you any better verification. It gets you more cycles, more licenses.
Foster: Right, there is no correlation between cycles and verification.
Sherer: Just the availability of parallelism isn’t itself an answer.