EDA In The Cloud

Leveraging massive amounts of compute resources to reduce time-to-market.

Interest in the use of third-party public clouds in conjunction with electronic design automation (EDA) applications has never been higher. Back in February, Ed Sperling and I did a video interview (embedded below) to discuss EDA and cloud computing. This article follows up on that interview, providing some additional insight into why and how the integrated circuit (IC) industry reached this point at just this moment in time.

We are now at a nexus where computational demand, technology, and enhanced security make it both possible and necessary to leverage the cloud to get your next IC design to market. How did the cloud become a viable option, and how can your company use it to expand your compute options at both established nodes (e.g., 45nm to 28nm) and leading-edge technologies (e.g., 7nm to 5nm)? Read on…

Why cloud now?
Historically, there were two primary reasons why the EDA industry was lukewarm about the idea of cloud computing. The first, and perhaps most significant, was the concern over intellectual property (IP) security. Losing control of their IP could have an enormous impact on a company’s ability to maintain their competitive advantage. Secondly, most companies had few problems managing their compute requirements with on-site hardware, so there wasn’t a compelling need to look for an alternative.

A lot has changed in the last few years. First and foremost, there has been a relentless growth in semiconductor computational needs, arising from multiple dimensions. Following the progression of Moore’s Law, each node brings more shapes, nets, and transistors on which to perform computation, with 2x more transistors per node, more fill, etc.

Simultaneously, check-driven growth in compute demand adds to this overall growth in computation. At the 5nm node, DRC is not just DRC. It’s DRC, equation-based DRC, multi-patterning, and pattern matching. LVS is not just LVS; it’s LVS plus advanced device checks, such as electrostatic discharge (ESD) checking and reliability checking. Fill no longer means simply adding a few geometries; design companies are now using techniques like cell-based fill to maximize and optimize fill. Not only has the number of checks gone up, but the types of checks have expanded, and the number of check operations has grown drastically, so the overall compute is much, much larger than it has ever been (Figure 1). At 7nm, the total check count in a DRC deck was roughly 10,000. At 5nm, that count will likely be on the order of 13,000.


Figure 1. Growth in both new verification technology and the number and complexity of rule checks contributes to an increase in computational requirements.

In fact, node-over-node compute growth has remained relatively constant: each node generates an increase in compute demand of 20% to 30%. To achieve reasonable turnaround times in the early 2000s, an IC company might have used tens of cores. Now, at the most advanced nodes, that number is typically in the hundreds. Design teams are more frequently experiencing long, unproductive waits for adequate hardware.
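To get a feel for how that growth compounds, here is a minimal back-of-the-envelope sketch; the starting core count and the number of node transitions are illustrative assumptions, not measured data:

```python
# A minimal sketch of how 20-30% node-over-node compute growth compounds.
# The starting point (~40 cores) and eight node transitions are assumptions
# chosen to match the "tens of cores in the early 2000s" description above.

def projected_cores(base_cores: float, growth_per_node: float, nodes: int) -> float:
    """Compound a fixed per-node growth rate over a number of node transitions."""
    return base_cores * (1 + growth_per_node) ** nodes

for growth in (0.20, 0.30):
    cores = projected_cores(base_cores=40, growth_per_node=growth, nodes=8)
    print(f"{growth:.0%} per node over 8 transitions: ~{cores:.0f} cores")
# 20%: ~172 cores; 30%: ~326 cores -- tens of cores becomes hundreds.
```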

As for those IP security issues? Cloud providers were finally convinced of the IC industry’s need for stronger security measures for both their technology and physical resources, to the point where the security the cloud can now provide for IP data is arguably better than any individual IC company can offer.

Of course, there are other factors that may influence a company’s adoption of EDA in the cloud, such as the introduction of new lithography technology, innovations in transistor design, and the use of new packaging technologies, but none of them appear as though they will materially change the long-term growth trends as far as check/operation count and total compute time [1,2].

What is the value of cloud computing (or, how can it help me)?
This is the fundamental question IC companies must answer. Initially, companies thought that cloud computing would be cheaper, because they wouldn’t have to buy and maintain all the machines they needed to meet those maximum computing requirements—they could simply add whatever extra capacity they needed at just the right time. However, evidence from early adopters of EDA in the cloud doesn’t show the true value being lower cost of ownership. The real value to these companies has turned out to be a shorter time to market. With the option to access much larger resource pools in the cloud, they now have the ability to compress the total time it takes to complete physical verification, simulation, fill, etc., so they can actually get designs out to the market faster.

If it’s no longer a question of getting timely access to local resources, the question then becomes “How do we leverage the massive amount of computing we can access in the cloud to minimize our total cycle time to market?”

Resource usage
The cost of ownership when using cloud resources is a multi-faceted calculation. Cloud compute is more expensive than on-premise resources, so companies must first determine whether they are using their on-premise resources in the most effective and efficient manner. If their utilization of on-premise computing is relatively low, they may simply need to improve their internal allocation process. If their local utilization is extremely high, but they’re meeting nearly all of their schedules, the cloud may only make sense in those very limited situations where they have big spikes in demand.
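As a rough illustration of that reasoning, the sketch below summarizes a hypothetical hourly core-demand trace against fixed on-premise capacity; the sample numbers and the 50% utilization threshold are assumptions for illustration only:

```python
# A hedged sketch of the utilization reasoning above: given a (hypothetical)
# hourly core-demand trace for an on-premise farm, report average utilization
# and how often demand spikes past capacity. Thresholds are illustrative.

def summarize_utilization(demand_cores, capacity_cores):
    """Return average utilization and the fraction of hours demand exceeds capacity."""
    avg = sum(demand_cores) / (len(demand_cores) * capacity_cores)
    overflow = sum(1 for d in demand_cores if d > capacity_cores)
    return avg, overflow / len(demand_cores)

demand = [300, 450, 900, 1200, 700, 400, 1500, 600]  # hypothetical hourly samples
avg_util, spike_frac = summarize_utilization(demand, capacity_cores=1000)

if avg_util < 0.5:
    print("Low average utilization: improve internal allocation first")
elif spike_frac > 0:
    print(f"High utilization with spikes ({spike_frac:.0%} of hours): "
          "cloud bursting may pay off")
else:
    print("High, steady utilization with no spikes: on-premise likely sufficient")
```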

Once they have a good understanding of their current use patterns, they can start to identify conditions where cloud computing may be practical and beneficial. For example, moving the same type of design to a new node results in an increase in the total amount of compute required, which may exceed available in-house resources (Figure 2).


Figure 2. Moving the same design type to a new node drives up the need for computing resources.

Companies also typically have multiple project teams, all competing for the same resources. As long as those resource demands are spread out, so that the peak demand of one team falls into the demand “valleys” of the other design teams, in-house resources are sufficient. However, when one team’s schedule slips, periods of high utilization (e.g., tapeout) often end up overlapping, at which point resource demand exceeds supply (Figure 3). At those moments, companies can take advantage of the cloud to get access to more resources.


Figure 3. (Left) As long as competing resource demands don’t exceed supply, companies can use in-house resources. (Right) When peak usage overlaps, companies may need to turn to external resources.
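The sketch below illustrates the same point with hypothetical weekly demand numbers: two staggered peaks fit within local capacity, but a one-week schedule slip stacks them on top of each other:

```python
# Sum per-team core demand over time and flag periods where the combined
# demand exceeds in-house supply. All demand figures are hypothetical.

team_a   = [200, 800, 300, 200]   # team A peaks in week 2 (e.g., tapeout)
team_b   = [300, 250, 300, 800]   # team B peaks in week 4
slipped  = [250, 800, 300, 300]   # team B's peak slips back to week 2
capacity = 1100                   # assumed in-house core capacity

def check(a_demand, b_demand, label):
    for week, (a, b) in enumerate(zip(a_demand, b_demand), start=1):
        total = a + b
        status = "burst to cloud" if total > capacity else "in-house OK"
        print(f"{label:10s} week {week}: {total:4d} cores -> {status}")

check(team_a, team_b, "staggered")    # every week fits in-house
check(team_a, slipped, "overlapped")  # week 2 now exceeds local supply
```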

Economies of scale
EDA software performance is another factor. For example, the Calibre platform is designed to scale to extremely large numbers of resources, whether those resources are internal or in the cloud. Companies might choose to use cloud resources with Calibre tools to scale far beyond their internal availability, which can significantly compress verification runtimes and allow the company to tape out a design much faster. The calculation then becomes the value of moving from, say, two design iterations per day to four. With the relentless pressure on time to market, planning ahead to add cloud resources to get to tapeout faster may be financially desirable.
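A simple way to frame that calculation is iterations per day as a function of single-run wall-clock time; the runtimes and core counts below are illustrative assumptions, not Calibre benchmarks:

```python
# Back-of-the-envelope framing of the 2x -> 4x iterations/day value:
# if scaling out compresses a single verification run, how many full
# iterations fit in a 24-hour day? Numbers are illustrative only.

def iterations_per_day(run_hours: float) -> float:
    return 24.0 / run_hours

for cores, run_hours in [(500, 12.0), (2000, 6.0)]:  # hypothetical scaling
    print(f"{cores:5d} cores, {run_hours:4.1f} h/run -> "
          f"{iterations_per_day(run_hours):.1f} iterations/day")
# Halving the runtime by scaling out doubles the daily design iterations,
# which is the time-to-market value described above.
```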

EDA support
Not all EDA suppliers implement EDA in the cloud the same way. For example, two of the three largest EDA companies provide an integrated solution that is only available within their cloud. Because most fabless companies aren’t looking to move all their EDA compute to the cloud—just that small percentage of surge demand, or when tapeout speed is the highest priority—understanding your options is crucial.

Not all cloud providers have the same experience in enabling high-performance computing applications like EDA. Running EDA applications in the cloud is very different from running a word processor or spreadsheet program in the cloud. Our experience is that the “out of the box” cloud experience is not usually the most efficient use of those cloud resources. Scaling efficiently to thousands of CPUs/cores takes the right setup.

At Mentor, a Siemens Business, the Calibre platform has supported distributed and cloud computing for more than ten years. We’ve used Calibre applications in the cloud for over five years in our own research and development to access large hardware pools (up to 10K+ cores). Because of our extensive experience with cloud computing and the Calibre platform, we have a unique ability to focus on ease of use, efficiency, and cost effectiveness, rather than risky major re-architecture efforts to improve scaling. We can use the same Calibre engines and licensing, and deliver the same great performance, whether a company is using a private or public cloud service (Figure 4).


Figure 4. The Calibre configuration for cloud computing was developed based on years of actual experience.

Best practices
All of that experience also puts us in a unique position to provide best practices guidance to help companies decide how to communicate with the cloud, how and when to move data out to the cloud, and how to assemble jobs in the cloud [3,4]. Companies must structure their processes to move data without unnecessarily long lags. They should scale the amount of compute they run in the cloud as a function of factors such as the process node, the type of physical verification being performed, and the size of the die. They also must consider how to best configure their resources. For some types of EDA applications, companies will want their license server on-premise so teams can use the same set of licenses on-premise and in the cloud, while for other types of EDA applications, they’re better off locating their license server in the cloud.
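To make the sizing idea concrete, here is a purely hypothetical heuristic; the baseline core counts, check-type weights, and die-area scaling are invented for illustration and do not come from any Calibre documentation:

```python
# A toy sizing rule illustrating how cloud core requests might scale with
# node, check type, and die area. All coefficients are hypothetical.

BASE_CORES = {"28nm": 100, "7nm": 500, "5nm": 1000}    # assumed per-node baselines
CHECK_FACTOR = {"DRC": 1.0, "LVS": 0.5, "fill": 0.75}  # assumed relative weights

def suggested_cores(node: str, check: str, die_mm2: float) -> int:
    """Baseline per node, weighted by check type and scaled by die area."""
    return int(BASE_CORES[node] * CHECK_FACTOR[check] * max(die_mm2 / 100.0, 1.0))

print(suggested_cores("7nm", "DRC", die_mm2=400))  # -> 2000 (illustrative)
```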

All of that information drives both the amount of cloud compute they should be using, and the benefit they ultimately derive from cloud computing.

Real-world results
What could those benefits look like? Figure 5 shows one possibility. In our recent collaboration with Microsoft Azure, Advanced Micro Devices, Inc. (AMD), and Taiwan Semiconductor Manufacturing Company, Ltd. (TSMC) on a full-reticle 7nm design, AMD was able to increase DRC runs from approximately 1-1.5 design iterations a day (17 hours on 500 cores per iteration) up to 3 iterations per day (8 hours on 4,000 CPUs per iteration). Even better, the memory used started small at <500GB, and stayed flat, even as usage expanded out to 4K cores. This result is especially important because memory size on cloud servers is a key driver of cloud cost, and the ability to stay on servers with relatively modest memory helps control costs.


Figure 5. Using cloud computing resources, AMD was able to more than double their daily DRC iterations on a 7nm production design.

In addition, by evaluating AMD’s environment, the use of Microsoft Azure’s resources, and the TSMC decks, we were able to identify optimizations in all three of those areas that enabled AMD to further reduce the runtime on 2,000 cores from 9.5 hours to 6.7 hours. That’s a savings of nearly three hours using the same resources, just through optimization.
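The numbers above translate directly into iterations per day and core-hours per run, as this small sanity-check sketch shows (it only recomputes the figures already reported in this section):

```python
# Recompute iterations/day and core-hours/run from the numbers reported above:
# 17 h on 500 cores, 8 h on 4,000 cores, and the 9.5 h -> 6.7 h optimization
# on 2,000 cores.

runs = {
    "baseline (500 cores)":      (500, 17.0),
    "cloud burst (4,000 cores)": (4000, 8.0),
    "2,000 cores, unoptimized":  (2000, 9.5),
    "2,000 cores, optimized":    (2000, 6.7),
}

for label, (cores, hours) in runs.items():
    print(f"{label:26s}: {24 / hours:.1f} iterations/day, "
          f"{cores * hours:7.0f} core-hours/run")
# 17 h -> ~1.4 iterations/day; 8 h -> 3.0 iterations/day, matching the text.
```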

Summary
In a decade’s time, cloud computing for EDA will be as ubiquitous as the foundry is today for IC wafer manufacturing. As of today, however, most companies are still performing most of their compute on-premise, and trying to figure out what percentage of their work makes sense to move to the cloud. Understanding the options, the benefits, and the costs of cloud EDA can help them make the decision that best supports their needs and resources. Once the decision is made to adopt cloud computing, the availability of proven technology, flexible use models, and guidance through best practices can help companies obtain the maximum benefit for their investment.

References
[1] M. LaPedus, “Multi-Patterning EUV Vs. High-NA EUV,” SemiEngineering, Dec. 4, 2019. https://semiengineering.com/multi-patterning-euv-vs-high-na-euv/

[2] M. LaPedus, “What Transistors Will Look Like At 5nm,” SemiEngineering, Aug. 18, 2016. https://semiengineering.com/going-to-gate-all-around-fets/

[3] J. Ferguson, “New approaches to physical verification closure and cloud computing come to the rescue in the EUV era,” Mentor, a Siemens Business, March 2020.

[4] O. El-Sewefy, “Calibre in the cloud: Unlocking massive scaling and cost efficiencies,” Mentor, a Siemens Business, July 2019.


