Rocky Road To Designing Chips In The Cloud

The EDA ecosystem is in the early stages of pivoting to the cloud, but not everything will move there.


EDA is moving to the cloud in fits and starts as tool vendors sort out complex financial models and tradeoffs while recognizing a potentially big new opportunity to provide unlimited processing capacity using a pay-as-you-go approach.

By all accounts, a tremendous amount of tire-kicking is happening now as EDA vendors and users delve into the how and why of moving to the cloud for chip design and verification. Not all tools are available in the cloud today, and companies that have their own data centers are likely to continue using them. But EDA in the cloud is adding new opportunities for tools companies, as well as established chipmakers and startups looking to experiment with different architectures and designs.

“If you can access larger pools of resources, you can dramatically improve turn time and get a design out faster,” said Michael White, senior director, physical verification product management at Siemens EDA. “For example, with on-prem, you may get one turn every 24 hours, but using the cloud you can do three physical verification runs a day. And you’re able to provide continuous feedback for designers to fix things. By accessing much bigger pools of resources, that productivity gain gives a real time to market advantage. If you’re the first one to market, you are going to be able to command an average sales price premium over your competitors.”
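The arithmetic behind that claim is easy to sketch. Below is a minimal model, assuming a physical verification run that parallelizes well across a rented pool of machines; the 24-hour base runtime and the 95% parallel fraction are illustrative assumptions, not Siemens figures.

```python
# A rough Amdahl's-law sketch of how a bigger cloud pool shortens turn time.
def turn_time_hours(base_hours: float, cores: int,
                    parallel_fraction: float = 0.95) -> float:
    """Estimate wall-clock time on `cores` cores for a mostly parallel run."""
    serial = base_hours * (1.0 - parallel_fraction)
    parallel = base_hours * parallel_fraction / cores
    return serial + parallel

for cores in (1, 4, 16, 64):
    t = turn_time_hours(24.0, cores)
    print(f"{cores:>3} cores: {t:5.1f} h per run -> {24.0 / t:4.1f} runs/day")
```

Under these assumptions, roughly quadrupling the pool already turns one run per day into three, which matches the magnitude of the gain White describes.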

Fig. 1: Rising complexity and number of checks required at each node greatly increases the amount of computation required in chip design. Source: Siemens EDA

As an industry, EDA has been relatively slow to make this shift. In the past, that was largely due to security concerns over IP in highly competitive markets. But cloud security has improved to the point where it is considered better than that of on-premise data centers, a pattern already borne out in other industries, including retail, banking, and manufacturing. Gartner predicts that by 2025 about 80% of data centers will move to the cloud to take advantage of the processing elasticity and fault tolerance.

Chip design has been slow to make that jump, but several factors are pushing the industry toward doing more in the cloud:

  • Systems companies such as Google, Amazon, and Facebook, as well as automotive OEMs, are now developing their own chips, and many are not as heavily invested in tools and on-premise data centers as long-time chipmakers.
  • Chips are getting bigger and more complex, driven by more domain-specific designs using custom architectures, each with unique dependencies and a widening expectation of longer lifetimes. Massive simulations, emulation, and prototyping are considered vital to getting reliable chips out the door in a compressed market window, and even the largest chipmakers with huge on-premise compute resources need extra compute cycles for limited amounts of time.
  • More companies from across the globe, including many startups, are developing smaller chips, or chiplets, as well as new interfaces, new types of memory, and experimenting with a variety of interconnect schemes. For these companies to be competitive, they need to focus limited resources on engineering rather than setting up on-premise emulation and simulation farms.

While the largest EDA companies have experimented with cloud-based services for the past decade, uptake has been slow. In fact, it wasn’t until TSMC announced support for EDA in the cloud that many chipmakers even began to take cloud-based design seriously.

“Once key foundries began endorsing the cloud, we started to see a gradual shift happen in the EDA industry,” said Mahesh Turaga, vice president, cloud business development at Cadence. “Security as a whole has also become less of a concern, since the cloud has proven to be a highly secure environment. While adoption of the cloud is being led by start-ups and small and medium businesses in the EDA space and beyond, there’s a mind-shift among larger EDA companies, which are actively working to figure out ways to move many EDA workflows to the cloud.”

Lingering concerns
Even with the cloud, some hesitancy remains. Richard Ho, principal engineer and a director at Google, noted that although he’s in the unique situation of working on his own company’s cloud, security has still been a major concern because of the nature of the project and some of the rules and regulations around it.

“If I was working with my own data center, in a company that was not Google, for example, I would have to be very confident in the security setup of my IT versus the cloud, and this applies to most of the public clouds. But at Google there are a bunch of people who are security-focused. There are teams of people whose job is to secure the cloud for all the customers, including myself. I would prefer to have a set of dedicated people whose job is to research security, look for zero-day exploits, and constantly monitor all the different open ports or other things that people have done in configurations. I feel safer being in a cloud than on an on-prem network, where I would have to rely on my own IT and would likely have fewer people working on security than a public cloud does.”

Still, there are challenges to making this work. Ho explained that when he joined Google, he thought that because the company has data centers, having an infrastructure for its chip design compute would be simple. “It turned out that Google doesn’t like having third-party black-box executables running on the same servers as Gmail, for example, so infrastructure turned out to be a bit of an issue for our chip design team. One of the first things we thought about was moving to the cloud, so we’ve been on a journey to move our chip design processes onto the cloud for several years now, and I can say that our TPUs are designed on the cloud. We do use GCP, our own cloud service.”

There are several key advantages the cloud offers, even for those with nearly unlimited on-premise capacity.

“As process technology heads toward 3nm and below, the computing infrastructure requirements increase by multiple orders of magnitude,” Turaga said. “That necessitates even faster computing at scale, which is possible only in the cloud. Companies find that the performance of current EDA tools increases by an order of magnitude due to the availability of advanced high-performance servers on the cloud, and engineering productivity can be increased significantly while simultaneously reducing costs and project timelines. In addition, the increasing need for global collaboration among chip design teams, and the need for innovation at lightning speed, makes the case for why companies large and small are seriously looking at shifting their chip design workloads to the cloud.”

Others report similar trends driving cloud services. “The main problem customers are facing today is that they want to tackle more complex problems, like complex equations and complex solving, and they are running into hardware limitations,” said Thomas Lejeune, product marketing manager at Ansys Cloud. “For almost 75% of users [Ansys] surveyed, the problem is coming from the hardware. They are working with a very limited number of cores on their computers, and even if they have the best software in the world, if they are not able to compute the problem because of hardware limitations, it makes no sense.”

Fig. 2: Survey results showing limitations of growth due to simulation time. Source: Ansys surveys, based on interviews with 750+ IT managers, engineers, and C-level executives.

Ansys’ stance is that removing the hardware barrier allows simulation throughput to increase. Lejeune said that managing its cloud for users is the ideal scenario, because Ansys knows its solvers best and therefore can tune them directly in the cloud.

Just as important as understanding when to use the cloud is understanding when not to use it, as well as what should be processed in the cloud and what should be processed on-premise.

“Go to the cloud when you run out of hardware on the ground, and you need to do something,” said Bob Lefferts, a fellow at Synopsys. Still, real-world answers are not so straightforward.

“The problem is how to get the data there,” Lefferts said. “Then, you pay when you bring it back. Another option is to put everything on the cloud, so you don’t pay for the up and down, but you’ve got to pay for it being up there, and that’s more expensive. It’s a tradeoff. If I have, for example, a giant workload that’s going to come and go, buying the machines on the ground is not wise. But if I have to get the data up there to run just this one job for the next three weeks, and then it’s going to come back down to zero, I’ve got to transfer everything up and back. That is definitely a real part of figuring out what’s going on. And, fortunately or unfortunately, the way that cloud service providers work is they charge by the pint of blood, and they don’t necessarily care where it came from or who drew the blood. They just say, ‘This is how many pints you owe me.’ You don’t know where it came from, so the whole cost accounting has been a rather interesting challenge, and not just with the cloud provider. It’s also a challenge internally to have those cost controls in place, because we’re not used to paying for hardware. Normally we would say, ‘I need 1,000 cores for three weeks, but we don’t have them.’ Now you can get them, but how are you going to pay for them? The flexibility brings an interesting cost accounting challenge.”
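Lefferts’ tradeoff can be framed as a back-of-the-envelope cost model: shuttle the dataset up and down per job, or keep it resident in cloud storage. The sketch below uses made-up rates rather than any provider’s actual pricing, and it ignores compute costs entirely; the crossover point depends on data size, job frequency, and storage tier.

```python
# Illustrative monthly cost of moving data per job vs. keeping it resident.
def shuttle_cost(dataset_gb: float, jobs_per_month: int,
                 egress_per_gb: float = 0.09) -> float:
    # Uploads are often free; the bill comes from egressing results per job.
    return dataset_gb * jobs_per_month * egress_per_gb

def resident_cost(dataset_gb: float,
                  hot_storage_per_gb_month: float = 0.30) -> float:
    # EDA workloads typically need high-performance file storage, not cold
    # object storage, so the placeholder rate here is deliberately higher.
    return dataset_gb * hot_storage_per_gb_month

data_gb = 500.0
for jobs in (1, 4, 12):
    print(f"{jobs:>2} jobs/month: shuttle ${shuttle_cost(data_gb, jobs):7.2f} "
          f"vs resident ${resident_cost(data_gb):7.2f}")
```

Under these placeholder rates, shuttling wins for occasional jobs and residency wins once runs become frequent, which is exactly the come-and-go workload distinction Lefferts draws.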

For this reason, Ho and his team worked extensively on the infrastructure, because they saw exactly the same problem. “Where do you store your data? How much ingress, how much egress do you want to use? We designed our system and our infrastructure to hold our repos on-prem, in our local areas that are within the corporate network. We move the data up to the cloud to do the processing, and we try as hard as possible not to egress much data, so if you want to egress reports, you don’t egress entire databases. You either leave the data in the cloud, or delete it and regenerate it as you need to. We spent a fair amount of time thinking about that in order to make it cost-effective for ourselves, as well. If you’re a customer thinking about this, those are important things to consider.”
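A hypothetical sketch of the disposition policy Ho describes: egress only small reports, and for bulky artifacts decide between leaving them in the cloud and deleting-then-regenerating them. The thresholds and artifact names below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    name: str
    size_gb: float
    regen_hours: float  # compute time to rebuild it from the on-prem repo

EGRESS_LIMIT_GB = 1.0   # illustrative: only small reports come back down

def disposition(a: Artifact) -> str:
    if a.size_gb <= EGRESS_LIMIT_GB:
        return "egress"                  # reports and summaries
    if a.regen_hours < 2.0:
        return "delete-and-regenerate"   # cheap to rebuild on demand
    return "keep-in-cloud"               # too big to move, too slow to rebuild

for a in (Artifact("run_report.txt", 0.01, 0.1),
          Artifact("waves.fsdb", 80.0, 1.5),
          Artifact("signoff.db", 400.0, 12.0)):
    print(f"{a.name:16} -> {disposition(a)}")
```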

Still, for some applications it doesn’t make sense. Simon Davidmann, CEO of Imperas, said that depending on the application, if there’s too much data moving around, it’s best to keep most of it in one data center. “When we check code in, we want to know in 20 minutes that it’s good,” he said. “We don’t allow people to check code in until all the tests have run, which is different from how most people work. They check in the code, then find a failure, and are told about it. We don’t. We built our own way of doing it that just doesn’t work in the cloud. If it did, we would use the cloud, and it would save us maintenance on the machines and updating the software, etc.”

Additionally, workflow considerations become important when moving compute to the cloud. “Engineers will have a little change in their workflow when working on the cloud. Even though it is an extension of the data center, if you have to get data to the cloud, it does cause a little bit more latency and so you have to work that into the workflow, get people used to it, and train for it as well. Those are some of the on-the-ground details that change as you make the migration to cloud,” Ho explained.

Eric Chesters, a fellow at AMD, said the first thing he thought about when approaching the cloud was the compute nodes. “That’s what we build and that’s what we’re trying to sell, but very quickly the mantra became, ‘It’s all about the data, understanding that data,'” he said. “Especially with flows that have been built or run on-prem for 20 years, they assume they’ve got instant access to any piece of data in the whole data center. So even just identifying what is a piece of work, what data it is actually pulling from, which file systems the tools are retrieving the data from, what a particular task is using — there are a lot of challenges there, including understanding what to move up, how to manage it, what the lifespan is, and what to pull back. And there’s a range of answers to those questions. Some of the workflows for VCS, or for simulation, are very amenable, because we push the models up into the cloud and run the tests, some in bulk regression mode, and we get pass/fail on the seed and a signature, with potentially a very small amount of data coming out, because any debug of that failure is done offline, and waveforms will be collected by rerunning, etc. But then you get the question of coverage information. If I’m collecting coverage, what’s the lifespan of that coverage? How do I manage combining it, and if I’m also still running on-prem, the merging of those things? So data is really the thing we spent the most time thinking about initially, as well as in the longer term as we start looking to cost-optimize and optimize the user experience. Where’s that data coming from, where’s it going to, how is it being held? That’s going to be what we’re talking about for the next few years.”
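The regression pattern Chesters outlines, returning only pass/fail plus a compact signature while waveforms stay in the cloud, can be sketched roughly as follows. The structures and names here are illustrative assumptions, not AMD’s actual flow.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    test: str
    seed: int
    passed: bool
    signature: str   # short failure fingerprint: kilobytes, not gigabytes

def regression_summary(results: list[TestResult]) -> dict:
    failures = [r for r in results if not r.passed]
    return {
        "total": len(results),
        "failed": len(failures),
        # Only this compact summary leaves the cloud; waveforms stay behind
        # and are recreated by rerunning the failing (test, seed) pairs.
        "rerun": [(r.test, r.seed, r.signature) for r in failures],
    }

print(regression_summary([
    TestResult("smoke", 1, True, ""),
    TestResult("dma_stress", 7, False, "0xBEEF"),
]))
```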

Before the user community moves en masse to the cloud, there are a number of wrinkles to smooth out.

Cadence’s Turaga noted that customers are reacting differently to the idea of moving chip design to the cloud. “Some have very ambitious goals to move the bulk of their EDA workloads to the cloud and reduce on-prem compute footprints to take advantage of the latest and greatest cloud computing technology available, which increases engineering productivity, reduces costs, and speeds time-to-market. Others are taking a measured approach. In general, the trend among start-ups and small- and medium-sized customers has been to move to the cloud as much as possible. The industry still has work to do to persuade on-prem customers with nearly unlimited compute capacity to move to the cloud. The key factors that will drive the decision for such customers to leverage the scalability of the cloud will be the affordability of licenses for short-term needs, significant advances in the performance of cloud-based servers, and the availability of new licensing models on the cloud.”

Before EDA in the cloud is widely adopted, immediate next steps include growing customer confidence, suggested Megan Wachs, vice president of engineering at SiFive. “In order to give confidence to customers, using the cloud and industry-standard tools that give customers access to machines through the cloud is the right way to do that. We wouldn’t necessarily want to roll our own web servers and host our own things there, because we’re standing on the shoulders of all the technology that is out there, with companies providing security and these sorts of platforms. We can just put our interesting and relevant chip design stuff on top of that, and not have to hire an entire team to do security. If I was a customer, I would be more confident knowing that we were using real cloud providers and not hiring our own team to do that sort of thing. On the flip side, as a consumer of things like technology libraries, when we want to do our own performance analyses, so far we generally have done that through our own on-prem servers, or things that are not in the cloud.”

Paying for EDA-in-the-cloud
Licensing adds a whole other set of issues, and no one has firmly worked this out yet.

Aldec, for example, has released its multi-FPGA partitioning software on AWS in order to figure out what works best. The company plans to offer its simulation tools at some point in the future, as well, but expects that to take some time because of the size of the tool.

“We made it simple with a monthly time-based license,” said Louie De Luna, director of marketing at Aldec. “We’re starting with that. It’s not easy. We need users to start paying for it if the price makes sense for them. There’s no specialized pricing. The users still need to pay for the hardware they’re using on AWS. We made it simple because when we tried to explain different pricing options to users, it didn’t make sense to them. They’re still used to perpetual or time-based licenses. So we gave them a path to try the software. It’s still a bring-your-own-license model for AWS. Basically the customers purchase the license from us and install it there.”

Others are moving in this direction, as well. “For startups, this is good because they don’t need all of that infrastructure in place,” said Zdenek Prikryl, Codasip’s CTO. “For them, the challenge is to nail down the scope with all the different options. This also can accelerate adoption, because if you have something in the cloud, it’s quite easy to get access to it. For example, we would like to see more adoption by universities, and cloud is a good choice because you can add in as many students as you want. You do have to manage the infrastructure, but you’d have to do that, anyway. And it’s not just on the EDA or tools side. It’s also happening on the hardware side with FPGAs in the cloud. This is going to bring in new business models for EDA, where you just pay as you go.”

This is the approach taken by Metrics Design Automation. “There are two things to keep in mind with this kind of pay-by-the-minute model,” said Joe Costello, executive chairman at Metrics Design Automation. “It’s a slow build. It takes a while before you get your revenue up, because usage builds up gradually. You don’t get that big-hit sugar high of a gigantic upfront license fee. But after you get going, it becomes incredibly predictable and sustainable. People do, net-net, pay less than they do today for the same amount of simulation. That’s the key phrase. For the same amount of simulation they pay less, because they don’t actually use it 24/7.”
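The economics behind that claim come down to utilization. The sketch below uses invented prices for an annual seat and a metered rate; under these assumptions, metering wins whenever the tool sits idle most of the year.

```python
# Made-up numbers illustrating the pay-per-minute vs. yearly-license tradeoff.
YEARLY_LICENSE = 30_000.0      # hypothetical annual cost of one seat
PER_MINUTE = 0.12              # hypothetical metered rate

MINUTES_PER_YEAR = 365 * 24 * 60
for utilization in (0.10, 0.30, 0.60):
    metered = MINUTES_PER_YEAR * utilization * PER_MINUTE
    print(f"{utilization:4.0%} utilization: metered ${metered:10,.2f} "
          f"vs yearly ${YEARLY_LICENSE:10,.2f}")
```

With these placeholder rates the crossover sits near 50% utilization, so a license that is busy only during regressions costs far less when metered, which is Costello’s point.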

Hagai Arbel, CEO of Vtool, agreed. “Why not pay per use? Whether the use is minutes of actual usage or actual actions, maybe it can be tied to how many megabytes or gigabytes of logs you are reading in. Maybe it can be some kind of usage within the tool itself. Whatever way it goes, there are definitely more efficient ways to license EDA tools in the cloud than a yearly floating license. For example, tools like simulators are classic pay-per-minute kinds of things. Why would I buy X simulation licenses per year and not pay per usage? When I have regressions to run and I need not one license but 2,000 licenses, but only for an hour, why not enable that?”

Further, Arbel believes cloud-based EDA will provide even better margins for tool providers. “I would gladly pay a nice percentage of our total usage to anyone that will enable it. Why can’t they resell it? And why can’t I resell something else? And why can’t we all use some common infrastructure as a cloud to do all of that? Companies like Google, for example, proved this a very long time ago. They knew that if the value was there, then the business model would also be there. When you provide the right platform, and when you allow innovation to lead, you will benefit from that as a commercial company selling your technology — not only startups but the big EDA vendors also.”

Conclusion
Growing complexity is creating a number of problems, but not all of them need to be addressed at the same exact moment. This is a key reason why a cloud approach is gaining traction.

One feature that is really important for electronics users is the ability to call on the cloud only when it’s needed, Ansys’ Lejeune said. “When you need peak computational power, you will have the best configuration, and when the workload doesn’t need the full power, you will be able to reduce the machine and the core type, so that at the end you can fine-tune the cost/performance ratio. Being fast is great, but if it costs you a lot, it’s not good. If we look at the benefits of cloud, it’s increased simulation throughput while only paying for the hardware and software used. That’s a good way to change the expense, and move from CapEx to OpEx. For that, we have some metrics about return on investment and total cost of ownership. In the end, what matters for our users is to focus on engineering, not maintaining the cluster and doing IT tasks. It’s a way to keep the focus on the things that matter.”
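A minimal sketch of that fine-tuning, assuming a simple tier catalog: pick the cheapest machine class that still meets the job’s deadline, rather than sizing for the annual peak. The tier names, core counts, and hourly rates are invented for illustration.

```python
# Pick the cheapest machine tier that can still finish before the deadline.
TIERS = {  # ordered cheapest-first; dicts preserve insertion order
    "small":  {"cores": 8,   "usd_per_hour": 0.40},
    "medium": {"cores": 32,  "usd_per_hour": 1.60},
    "large":  {"cores": 128, "usd_per_hour": 6.40},
}

def pick_tier(core_hours: float, deadline_hours: float) -> str:
    """Return the cheapest tier whose wall-clock time fits the deadline."""
    for name, t in TIERS.items():
        if core_hours / t["cores"] <= deadline_hours:
            return name
    return "large"   # fall back to peak power for the tightest deadlines

print(pick_tier(core_hours=64,  deadline_hours=12))  # -> small
print(pick_tier(core_hours=512, deadline_hours=16))  # -> medium
```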

Or as Vtool’s Arbel put it, “The world is going to cloud. EDA has to go there, too.”

—Ed Sperling contributed to this report.


