The Good And Bad Of Chip Design On Cloud

Designing chips with the help of the cloud is progressing, but users still want greater flexibility in tools licensing and other issues.


Semiconductor Engineering sat down to talk about how the shift toward chip design on cloud has sped up, whether the benefits of cloud are being realized in chip design, and some of the most pressing challenges to chip design on cloud today, with Philip Steinke, fellow, CAD infrastructure and physical design at AMD; Mahesh Turaga, vice president of business development for cloud at Cadence Design Systems; Richard Ho, vice president of hardware engineering at Lightmatter; Craig Johnson, vice president of cloud solutions at Siemens Digital Industries Software; and Rob Aitken, fellow at Synopsys. What follows are excerpts of that conversation, which was held in front of a live audience at the Design Automation Conference. Part two of this discussion is here.

SE: The shift toward chip design on cloud is accelerating. Business models are getting worked out, and workloads are better understood, as evidenced by some key partnerships between the semiconductor ecosystem’s biggest players. That said, from the user perspective, what are the top benefits you’ve experienced by adopting chip design on cloud? Is the promise of cloud paying off?

Steinke: At AMD we’ve gone with a hybrid multi-cloud strategy. The top benefit we’ve seen from adopting cloud would be an extension of our compute infrastructure, giving us a little more flexibility on obtaining compute when we need it, and fitting with our project cycles and roadmap. We’re also really curious to see how we explore differentiated solutions on the cloud. There are a couple of spots where cloud infrastructure can bring in things that either we have chosen not to deploy or that would be difficult, such as really high-speed networks or some different models of storage. Those are things we continue to explore to see what kind of value they can bring beyond what we do with our regular workloads.

Ho: As a fabless startup semiconductor company, being on cloud was essential. It meant that we didn’t have to build our own infrastructure in order to get going and launch a design. But the advantages go much deeper. The ability to flex and increase resources, especially for things like verification, was critical for us. We didn’t have to size our infrastructure for one workload; we could size it as we needed it. And as we got to our peak load, when we were getting close to tape-out, the cloud was able to scale with us. This outsourcing of the entire infrastructure is valuable, both in terms of capacity and in terms of performance. We have access to the latest CPU cores, so we don’t have to keep upgrading our own on-premises CPUs to keep up, and we can take advantage of the cloud operator’s upgrades occurring automatically in the background. Also, having a very minimal IT department on staff, and being able to rely on the cloud’s security and high uptime, has been one of the key ingredients in getting to fast tape-outs and being able to move forward. The tools being on [cloud], and the range of tools you can get there, have been super valuable as well. And that’s also an area where I have feedback about what you could do better.

(L-R): Richard Ho, Lightmatter; Rob Aitken, Synopsys; Mahesh Turaga, Cadence; Ann Mutschler, Semiconductor Engineering; Craig Johnson, Siemens Digital Industries Software; Philip Steinke, AMD. Source: Sashi Obilisetty/Synopsys

SE: From the perspective of the EDA tool provider, what are you seeing with the adoption of cloud in terms of benefits?

Johnson: The cloud infrastructure’s availability to service dynamic requirements is really high. Another one that’s a little bit surprising is the availability of hardware. That wasn’t the case five years ago. The lead times due to supply chain issues coming out of COVID were such that even pretty large companies had longer wait times than normal. Being able to leverage cloud infrastructure for that has been something that comes up again and again.

Turaga: We’re hearing about a lot of business benefits from our customers — improvements in engineering productivity, increased innovation, faster time to market. One example involves benchmarking Cadence tools on Arm-based servers in the cloud, where the customer was able to reduce time to market by two months. We also hear about happy engineers. That’s another thing that we all love. If you’re not waiting for jobs to run, with zero queue times, everyone is happy. That improves productivity, as well as overall throughput. You’re able to do more.

Aitken: Another benefit is the ability to control and monitor. As the administrator of your cloud, you can keep track of what the users are doing; it’s not the old days where everybody’s stuck waiting in the queue. You know which projects need which level of compute at a given time, so it’s very helpful for administering the facilities, as well.

SE: What are the biggest challenges with chip design on cloud that need to be addressed?

Ho: One of the issues that needs to be addressed is very low latency for the startup of large jobs. Sometimes what happens is that when you have a certain number of virtual machines (VMs) in the cloud, and you want to, say, run 100,000 simulation jobs, spinning up those extra VMs actually takes a long time and turns into a bit of a problem. There are things we can do on the infrastructure side to keep VMs ready and manage that, so we get low-latency spin-up of large volumes. The other big problem is that it’s not just about the infrastructure; it’s about the licenses. [EDA players] are still licensing the way they did 10 years ago, with three-year contracts. In the cloud you need to be flexible with us. You need to allow us to use licenses anywhere when we’re at a low point in the project. And then when we go to peak, you need to be able to get us the licenses quickly, just have them available, and let us dispatch them immediately so that we’re not held back and constrained by that resource. That’s one of the things that is quite troublesome right now. We have to plan not only for the machines, but also for the licenses. It would be great if it were all seamless.
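The low-latency spin-up problem Ho describes is commonly tackled with a pre-warmed pool of instances: boot a buffer of VMs ahead of demand so a burst of jobs can start immediately, and replenish the pool in the background. Here is a minimal sketch of the idea; the `WarmPool` class and `boot_vm` callback are hypothetical stand-ins, not any cloud provider’s actual API:

```python
import collections
import itertools

class WarmPool:
    """Keep a buffer of pre-booted VMs so large job bursts start with low latency.

    Hypothetical sketch: `boot_vm` stands in for a slow cloud-provider boot call.
    """

    def __init__(self, target_size, boot_vm):
        self._target = target_size
        self._boot = boot_vm
        self._ready = collections.deque()
        self.replenish()

    def replenish(self):
        # Top the pool back up to its target size (in practice, done asynchronously).
        while len(self._ready) < self._target:
            self._ready.append(self._boot())

    def acquire(self, n):
        """Hand out n VMs, drawing from the warm pool first and cold-booting only the overflow."""
        vms = []
        for _ in range(n):
            vms.append(self._ready.popleft() if self._ready else self._boot())
        self.replenish()
        return vms

# Usage: a counter stands in for real (slow) provider boot calls.
_ids = itertools.count()
pool = WarmPool(target_size=4, boot_vm=lambda: f"vm-{next(_ids)}")
burst = pool.acquire(6)  # 4 VMs come from the warm pool instantly, 2 are cold-booted
```

The tradeoff is paying for idle warm capacity in exchange for burst latency, which is why sizing the pool to the project cycle matters.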

Steinke: The cloud’s got all of this compute capacity distributed worldwide, but we’re still running in a ’90s-style data center, where algorithms look to run on a low number of CPU cores and look for some sort of POSIX storage really close to all of that compute. There’s only so much you can expand out. What cloud brings in terms of modern compute infrastructure is a real global distributed network, where you’ve got a massive backbone that each of these cloud providers has built spanning different geographies, and object storage-based systems that are meant for making that data available in different locations. I would love to be able to take advantage of compute in Antarctica or Timbuktu, where maybe someone else doesn’t need it, where it’s got the lowest spot pricing, and my data can get there in a reasonable amount of time. But to get to that, we’re going to need tooling that can work in this kind of distributed environment and understand how to get at the data when we need it, without having to keep the whole workload up and running everywhere all the time. We also need to be able to scale out to that number of CPUs to really accelerate the turnaround time on these large jobs.

Ho: A lot of the tool flows today assume POSIX-style shared NFS storage, and it’s expensive and time-consuming to move data from one cloud to another. That’s a big problem. A lot of EDA tools today assume you have shared data storage, and we’ve got to move away from that.
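One common workaround for the shared-NFS assumption is an explicit stage-in/stage-out wrapper around each tool run: copy only the inputs a job needs from object storage to node-local scratch, run the tool against a plain local file system, then push the results back. A minimal sketch, with a dict standing in for a real object store such as S3 (all names here are hypothetical):

```python
import pathlib
import tempfile

def run_staged(job, inputs, object_store, tool):
    """Stage inputs from an object store to local scratch, run the tool, stage results back."""
    with tempfile.TemporaryDirectory() as scratch:
        scratch = pathlib.Path(scratch)
        for key in inputs:                      # stage-in: only the objects this job needs
            (scratch / key).write_bytes(object_store[key])
        result = tool(scratch)                  # the tool sees an ordinary local directory
        object_store[f"{job}/result"] = result  # stage-out: publish the result
    return result

# Usage: a toy "tool" that concatenates its input files, sorted by content.
store = {"netlist.v": b"module top;", "constraints.sdc": b"create_clock"}
out = run_staged(
    "job42", ["netlist.v", "constraints.sdc"], store,
    tool=lambda d: b" ".join(sorted(p.read_bytes() for p in d.iterdir())),
)
```

Because each job declares its inputs up front, the scheduler can place it wherever compute is cheap and prefetch the data there, rather than requiring every node to mount one shared file system.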

Aitken: That’s easier said than done for some algorithms. There are some types of workloads, like simulation or otherwise, that are fairly easy to move to a cloud compute model. And then there are others, like place-and-route, that are just going to be miserable because the algorithm itself was developed back in the era of somebody sitting at a workstation, typing to interact with a file system. The structure of the solution is such that the ability to scale that algorithm to operate on a data center using modern file systems and communication literally makes no sense. So there’s work to be done by the research community to figure out how you design a new place-and-route, or some of these other local-data-intensive algorithms that can map to a cloud environment. Effectively, if you were going to start from scratch, if no one had ever seen this problem up until this minute, you would probably come up with a different way of approaching it. But after 30 or 40 years of playing in that space, moving is hard.

Johnson: For most of these issues, like the licensing, and even the storage and the availability of compute, there are always three constraints. You’ve got economic constraints, technical constraints, and operational constraints, and generally there are different stakeholders at the companies who represent those different factions. From a technical perspective it might be possible to figure out a solution that doesn’t work economically or operationally. Some of the delay in addressing these things, which we certainly have been talking about for a while, boils down to finding the best common denominator across that combination. We agree that these are common issues on the EDA side.

Turaga: I agree these are some of the challenges we have today, and we’re working on solving some of them. We have been offering flexible licensing models for a while now. On the data issues, we’re still figuring out what’s the right amount of data that needs to be available at the right time, in the right place. That’s a challenge. There are industry solutions that have flex cache, and IBM has some open-source solutions available that address some of the data sync issues and make the data transfer seamless between on-prem and cloud. One thing that’s still an issue, which Rob was pointing out, is that some workloads are better suited to this than others, depending on their specific data requirements. With verification, for example, it’s more a matter of just sending the right jobs to the cloud and getting the results back. For huge data jobs like implementation or multi-physics analysis, you have to adopt different strategies. A hybrid tool approach is another area we are looking at, so that a customer, from within the comfort of the on-prem tool, can send only the data that’s required and then bring back only the results they need.

View part two of this discussion: Tradeoffs Between On-Premise And On-Cloud Design.
