Experts at the Table, part 2: What can be done instead? And why are companies reluctant to do more in the cloud?
Semiconductor Engineering sat down to talk about parallelization efforts within EDA with Andrea Casotto, chief scientist for Altair; Adam Sherer, product management group director in the System & Verification Group of Cadence; Harry Foster, chief scientist for Mentor, a Siemens Business; Vladislav Palfy, global manager for applications engineering at OneSpin; Vigyan Singhal, chief Oski for Oski Technology; and Bill Mullen, senior director for R&D at ANSYS. What follows are excerpts of that conversation. To view part one, click here.
SE: Emulation employs massive parallelization. Why have we not been more successful applying thousands of machines to simulation?
Sherer: We have been successful with thousands of machines.
Mullen: Is that running separate simulations?
Sherer: Yes.
Casotto: You want one simulation on 10,000 cores? Why not run 10,000 simulations?
Foster: It comes back to fine-grained versus coarse-grained. Sure, we are throwing lots of simulations out there on multiple machines, all running their own test. When you get to fine-grained, it goes back to the three criteria.
SE: An emulator doesn’t care if it is balanced.
Casotto: That is why emulation is disruptive. It is the way out, and it uses a totally different approach to simulating chips using hardware instead of software. The communication between the elements has a different cost. I cannot imagine breaking down a chip simulation into 10,000 tiny simulations and managing the communication between the pieces.
Mullen: And even if you could do it, it is not scalable and the power efficiency is going to be terrible compared to running them serially.
Sherer: Rocketick targeted GPUs. That is effectively what you are saying about tens of thousands of compute elements. The difficulty is that the GPU is not purpose-built for simulation the way that an emulator is. The ability to divide the computational elements and find partitions that would fit into the compute environment of the GPU is very expensive to do, and GPU architectures change from generation to generation. So building a consistent piece of software to track them was challenging. There is an alternate challenge in just running on an x86 or Arm multi-processor environment, but that gets back to the coarse-grained/fine-grained calculation. Can you come up with software algorithms that find that? What a user expects to see with 8 CPUs is 8X performance, and what they don't calculate into that is that each of them has separate memory or bus subsystems and different levels of communication between them.
Foster: And there is still a serial portion of the code that is required, that you cannot parallelize.
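[The cap that a serial portion places on parallel speedup is Amdahl's law. As a minimal illustrative sketch (the function name and numbers are our own, not the panelists'), even a 10% serial fraction keeps 8 CPUs well short of the 8X a user expects:]

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Ideal speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# With a 10% serial portion, adding cores saturates below 10X:
for cores in (2, 8, 64, 10_000):
    print(cores, round(amdahl_speedup(0.10, cores), 2))
# 8 cores yield roughly 4.7X, and even 10,000 cores stay under 10X.
```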
Sherer: In an emulation environment that will run fast, but taking that and trying to compile it to run in simulation across multiple processors may not yield a performance improvement.
Singhal: You can parallelize tests or assertions, but one simulation – I don’t know what to do with that.
Palfy: You mentioned cloud computing. There are a lot of companies that cannot afford their own farms. We have a cloud solution that we implemented to safely go into the cloud and to consume whatever compute power they needed, but it raises another question, which is security. We guarantee that data cannot be intercepted and cannot be interpreted, and people are willing to keep their finances online, their company data, but if you ask them to upload an encoded verification problem, they get very worried. There are select people who do it, but it is difficult.
Singhal: It is more sensitive than financial information.
Palfy: People can have my credit card, but not my FIFO.
SE: Emulation was a paradigm shift in tackling a problem, and in the past there were hardware accelerators for things such as layout and other applications. But is it too expensive to look at these market opportunities today?
Mullen: Verilog simulation has a very broad market, and everyone needs it. It is not changing that rapidly, so you can have an emulator whose architecture will live for a while. But if you want specialized hardware for place-and-route or a circuit simulator, it seems as if the ROI is not there. You also have the same capacity issues. How big do you make it?
Foster: A long time back in my career, we used to create our own tools. You would always run into a case where you had a class of designs for which you would not see improvement by using parallel techniques.
Casotto: The customers know the limitations of the simulators. They know that something may take two hours or eight hours. They organize their design around those limitations. The methodologies become more complex instead of being simple and that is why we can help them manage those complex methodologies and the dependencies that are involved. If they want to get their design done overnight, which is a nice goal, then they find a way.
Sherer: That can be both the chicken and the egg. Engineering teams may narrow their test because the compute environment can only simulate so much, but now the test becomes almost useless because it is too narrow and not testing interoperability in the chip or all of the functions.
Foster: The goal is to be able to start debug in the morning when you get there.
Sherer: Exactly. So we say you need to rebuild your test and move to emulation. So, it is not chicken or egg, it is chicken AND egg. It is a missionary sell and takes time. Once sold, then you get additional benefits because now they are freed from the perceived limitations. They can change their test environment. They can test different things. They are open to new ideas.
Foster: That has always been the case. The more cycles you get, the more you consume.
Palfy: Don’t you think that going higher in abstraction would also help reduce the problem you are trying to solve? We have SystemC and people can verify that with simulation, with formal. It reduces the problems down the road.
Sherer: Properly managed, yes. A key concept is that you don’t run formal on what you ran in simulation. It has to be ‘instead of’. You segment the problem so that you can take advantage of formal, and then don’t consume those cycles. You will use them someplace else. It’s both the horizontal distribution and parallelism there. Abstraction is also a value as long as you take advantage of it and cover different things, rather than just repeat the same things.
SE: How much do people do speculative execution of tests knowing that they will throw a lot of it away? When you run serially, you know there is no point running subsequent tests.
Palfy: At a certain point they would be, but we trade off between doing things in parallel with brute force and doing them in a smarter way. If you have a smarter way, then you don't have to parallelize as much. The goal is to find the bug as soon as possible. If it means that you wasted some cycles, then that is better than doing it one by one and finding out much later. You want to shift left. If it means you find the bug quicker, it is worth the waste of resources.
Mullen: I have been dealing with signoff and turnaround times. If you can do something overnight then that is a typical cycle, so people run as much as they can overnight and come in in the morning to do analysis. If you cut that to 4 hours or 2 hours through parallelism, then they can do more turns and improve the design much quicker. That is the goal. A 10% improvement doesn't buy much, but if you can do 2X it increases the number of design turns a day. You still have to look at the overall flow, because there are a lot of different tools in the flow and you have to get data in and out. It is not just one engine running quickly.
Foster: You can always have the case where you are doing serial simulations and you find a bug in the last one, while if you ran that in parallel you would have saved a lot. What has changed is the wait time.
Sherer: A common effect in both digital and analog is that someone has checked in bad code, so if your farm is seeing 10% failures, you are better off seeing them in the first half hour than at the end serially. That does mean that smoke tests are good, along with other precheck activities.
Casotto: Everyone agrees on workload management. There is another low-hanging fruit in terms of reducing the waste of simulations. If you have run everything last night and you make a small change, it is nice if you have the data analysis available to know which parts that small change affects. So instead of rerunning everything, it would be possible and nice to just rerun the subset affected by the change.
Palfy: Any formal run that provides a full proof is good. Simulation can be wasted. Formal tells you if it is good or not. But you still need to have proper management of the code.
SE: Could we create better tools if the hardware became more aware of parallelization?
Sherer: I do not think there has been enough of a spotlight put on parallelizing the hardware algorithms themselves—rethinking the algorithms that we are putting into the chips so they are decoupled, so that they can run independently. There are inherent advantages in terms of performance and compute speed that will come, but a lot of the systems we see today are very much packet-based communication, linear data-to-data communications, even in designs with hundreds of IP components. There is a need to shift some of the design thinking. Will that lend itself better to emulation and fast simulation? I am not sure what this does to analog, but we as an industry need to drive this. We have been struggling with this concept, as well. I come back to cycle simulation, and the industry moved to it under pressure, but who do we tap? A customer? They are not going to lead the industry with their design technique and give away IP.
Foster: I agree, but the challenge is that in any design—Arm, Intel, whatever—there is so much legacy, and every design is an iteration on that legacy. People are not willing to throw stuff out. But that presents part of the problem. We cannot rethink it until we can push the button and do a reset on this legacy.
Palfy: That may be possible for some niche industries. If you have new safety standards in place, the legacy code just doesn’t cut it. So they have to re-do it, and that is a good opportunity to come in with a new methodology.
Sherer: Automotive is the worst-case scenario. There are IP blocks that were designed, they are sound, they are 10 years old. They are not being replaced, and it will be another 10 years. Now you wrap safety around them, but those IP blocks are not going away because they are trusted in use.
Palfy: Trusted in use sounds like simulation against formal. The fact that it did not crash so far doesn’t mean it won’t.
Sherer: The pressure will come from the newcomers in the industry. They have new IP, and this will upset the cart. But the concept of legacy through semiconductor and the environment built around it is a challenge. There are no easy answers. Rerunning everything in SPICE? When do you trust high-level models? How do you trust them and characterize them? We have to answer some of those questions. We need abstraction. We have seen a resurgence in gate-level simulation based on fear.
Casotto: Design teams know the limitations of the tools and they adapt to them.
Sherer: There is a virtuous circle between design and tools.
Palfy: Yes, it is the customers that push the limits that force us to invent new stuff.