Avoid hitting the on-premises resource wall as compute requirements rise.
By Michael White, Siemens EDA, in technical collaboration with Peeyush Tugnawat, Google Cloud, and Philip Steinke, AMD
At DAC 2022, Google Cloud, AMD, and Calibre Design Solutions presented an EDA in the cloud solution that enables companies to access virtually unlimited compute resources when and as needed to optimize their design and verification flows. If your company is considering adding cloud computing to your resource options, this summary of their discussion can help you learn how to take full advantage of the cloud for your EDA workloads.
What process technology node are you working at right now? 40nm? 20nm? 5nm? How big is your next design? How much functionality have you managed to fit into it? And how long is it going to take you to get it through verification and signoff?
Regardless of what your next technology node or design happens to be, computation requirements are constantly expanding, both in sheer numbers and in the rate of change node over node. If you find yourself struggling to acquire, access, and maintain the on-premises resources you need to get to tapeout on time to intercept the market window for your designs, you’re not alone. Design companies everywhere are hitting the on-premises resource wall.
Fortunately, there is a proven solution at hand. Leveraging the nearly infinite resource pool of the latest technology servers in the cloud in conjunction with electronic design automation (EDA) software tools that are ready for cloud processing, design companies can readily and easily access the hardware resources they need to complete more design iterations per day, shortening time to tapeout and/or giving design teams time for that extra analysis/optimization to ensure they deliver the best possible design to market.
As every company in the semiconductor industry is painfully aware, both computing time and resources for integrated circuit (IC) design and verification are, quite simply, headed up and to the right, aggressively so. Each new node adds more and different types of shapes, nets, and transistors on which to perform computation, with transistor counts roughly doubling node over node, along with more and different fill shapes, and so on.
Adding even more pressure, the growing number of required checks drives compute demand still higher. Historically, the industry saw node-over-node growth in check and operation counts in the 20-30% range. With the introduction of the 7nm node, this increase became dramatically higher, as foundries were forced to create more restrictive design rules to ensure designs would still be manufacturable with high yield (figure 1).
Fig. 1: Check count growth was relatively predictable until the 7nm node, when it rose dramatically in response to more restrictive design rules necessary for high yield manufacturability.
But wait, there’s more! In addition to check/operation count growth in traditional design rule checking (DRC) and layout vs. schematic (LVS) verification, physical verification (PV), circuit verification (CV), and design for manufacturing (DFM) functionality is now significantly more comprehensive, as many of today’s designs must not only be manufacturable with high yield, but must also provide new levels of reliability and performance for growing industries such as mobile devices, transportation, medical devices, 5G communications, etc. (figure 2).
Fig. 2: Verification requirements for a design at 130nm compared to 3nm.
Even if your company stays at the same technology node for multiple designs, you can’t escape this compute growth. Consumers always expect more from the next generation of their electronics, whether that means new features, longer lifetimes, faster performance, or lower prices. In fact, whatever they can think of, they expect, and whatever they can’t think of, they want to be surprised with. For design companies, that translates to more functionality, better performance, quicker yield ramp, faster operation, lower power, higher reliability… in other words, an almost endless list of innovations and improvements. The types of IC designs being produced at established technology nodes (e.g., 28nm and higher) now are very different and far more complex than what was released a decade or so earlier, when those nodes were leading-edge. Likewise, their compute requirements are also very different and far greater.
As a consequence of this compute growth, companies that historically might have used tens or maybe a few hundreds of cores to achieve an overnight turnaround time for larger batch PV/CV jobs must now use many hundreds to a few thousand cores to maintain that same turnaround time for their largest full-reticle and advanced-node designs. In addition, this compute growth, together with factors such as larger netlists, also drives the need for higher-memory servers.
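The core-count arithmetic behind this trend can be sketched in a few lines. The workload numbers below are invented for illustration (they are not Calibre or foundry benchmarks), and the parallel-efficiency factor is a simplifying assumption:

```python
import math

def cores_needed(compute_core_hours: float, turnaround_hours: float,
                 parallel_efficiency: float = 0.85) -> int:
    """Cores required to finish a job of `compute_core_hours` within
    `turnaround_hours`, assuming imperfect parallel scaling."""
    return math.ceil(compute_core_hours / (turnaround_hours * parallel_efficiency))

# A hypothetical job of 1,200 core-hours fits overnight (12 h) on ~118 cores.
legacy = cores_needed(1_200, 12)

# Two nodes later, with transistor counts and check counts compounding,
# the same design style might need ~7,500 core-hours: now ~736 cores.
advanced = cores_needed(1_200 * 2.5 * 2.5, 12)
print(legacy, advanced)
```

Even with these made-up figures, the shape of the problem is clear: holding turnaround time constant forces the core count to grow in lockstep with the total compute of the job.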
Even if you have sufficient hardware on paper, ensuring easy, timely access for everyone is becoming a very real challenge. While smart companies always try to maximize the utilization of their on-premises resources, that strategy can make it harder for design teams to gain timely access to sufficient on-site hardware to achieve their desired turnaround times. When hundreds to thousands of cores are needed, there can often be long, unproductive waits for adequate hardware access.
Another resource challenge has arisen in recent years—actually being able to buy hardware. Supply chain delays are real and have made it increasingly difficult for companies to purchase the hardware they need for their next series of designs or next technology node on a reasonable, or even predictable, schedule.
This growth in overall compute, coupled with the challenges of acquiring and maintaining on-premises resources, is sufficiently large that many, if not most, companies are actively planning to supplement their on-premises resources with cloud computing in a hybrid cloud structure. The strategic goal is to leverage the large incremental hardware resource pool available in the cloud for the compute surges that occur during their biggest compute applications.
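The hybrid surge strategy described above can be reduced to a simple placement rule: satisfy each job from the on-premises pool first, and burst only the remainder to the cloud. This is a minimal sketch with hypothetical pool sizes and job demands, not any particular scheduler's implementation:

```python
# Minimal sketch of a hybrid-cloud surge policy: each job draws from the
# on-premises pool first and bursts the remainder to cloud capacity.
# Core counts are hypothetical.

def place_job(cores_requested: int, onprem_free: int) -> dict:
    """Split a job's core request between on-prem and cloud resources."""
    onprem = min(cores_requested, onprem_free)
    return {"onprem": onprem, "cloud": cores_requested - onprem}

print(place_job(400, 1000))   # fits on-prem entirely
print(place_job(2500, 1000))  # surge: 1000 on-prem + 1500 from the cloud
```

In practice a real scheduler would also weigh data-transfer time, licensing, and cost, but the strategic intent is exactly this: on-premises hardware stays fully utilized while the cloud absorbs the peaks.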
There is another factor at play in the decision to invest in cloud computing. Historically, semiconductor companies were reluctant to consider cloud computing due to concerns over intellectual property (IP) security. Semiconductor IP, whether models, schematics, layouts, process information, or foundry rule decks and design checks, constitutes a company's crown jewels. The information contained in that IP creates a company's market differentiation, so losing control of it is a significant threat to a company's ability to maintain its competitive advantage.
Aware of the industry's concerns, cloud providers focused on building stronger security measures for both their technology and physical resources. As a result, the security they now provide for IP data is arguably better than what any individual semiconductor company can offer on its own. With this emphasis on security, the IC industry, and most importantly the foundries, is opening up to using cloud computing with its data, including rule decks, standard IP, and other confidential and proprietary information.
The public cloud provides access to resource pools far larger than any single semiconductor company could assemble on its own, but the benefits go well beyond hybrid cloud surge computing alone.
As a cloud provider, Google Cloud provides a wide portfolio of different virtual machine (VM) server types that extend well beyond what would be possible within any on-premises data center, no matter how large the company budget. This accessibility provides companies with the flexibility to add resources tailored to specific applications (e.g., VMs that are “right-sized” for memory, core counts, etc.) during design implementation, physical verification, circuit verification, reliability/ESD checking, and so on.
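"Right-sizing" can be illustrated as a selection over a VM catalog: pick the smallest shape that satisfies a workload's core and memory needs. The shapes below loosely mirror public C2D naming, but the catalog, function, and selection policy are all a hypothetical sketch, not an actual Google Cloud API or price list:

```python
# Illustrative "right-sizing": pick the smallest VM shape from a small
# hypothetical catalog that meets a workload's core and memory needs.

CATALOG = [
    # (name, vCPUs, memory_GB) -- illustrative shapes, not an official list
    ("c2d-standard-32", 32, 128),
    ("c2d-standard-56", 56, 224),
    ("c2d-highmem-32", 32, 256),
    ("c2d-highmem-56", 56, 448),
]

def right_size(need_vcpus: int, need_mem_gb: int):
    """Return the first catalog shape that satisfies both requirements,
    preferring fewer vCPUs, then less memory."""
    for name, vcpus, mem in sorted(CATALOG, key=lambda s: (s[1], s[2])):
        if vcpus >= need_vcpus and mem >= need_mem_gb:
            return name
    return None  # no single shape fits; shard the job across instances

print(right_size(48, 200))   # compute-heavy job -> a standard shape
print(right_size(24, 240))   # memory-heavy job -> a highmem shape
```

The point of the sketch is that different Calibre workloads (DRC vs. LVS vs. reliability checking) have different core-to-memory profiles, and a broad VM portfolio lets each job land on hardware shaped for it rather than on a one-size-fits-all server.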
In addition, cloud hardware and service providers constantly add the latest generation of servers to their portfolios, continuously providing optimized performance options. AMD is a very real example of this ongoing infusion of next-generation servers that deliver compelling performance benefits. Cloud providers are also able to minimize the impact of supply chain challenges because they purchase hardware continually and at very large scale. With Google Cloud's ongoing investment in new servers, its access to the latest technology servers has been largely unaffected.
At Siemens Digital Industries Software, the Calibre nmPlatform has supported distributed and cloud computing for over ten years. The Calibre nmPlatform consists of multiple high-performance computing (HPC) applications that scale to thousands of CPUs while processing files ranging from gigabytes to terabytes in size. The Calibre research and development (R&D) team has run Calibre applications in the cloud for over five years in our own R&D flows to access very large hardware pools (10K+ cores).
With this extensive experience with both cloud computing and the Calibre nmPlatform comes a unique ability to focus on ease of use and efficiency to improve scaling, rather than attempting to rely on risky major re-architecture efforts. Calibre tools use the same Calibre engines and licensing, and deliver the same performance, whether a company uses a private or public cloud service.
Siemens has partnered with AMD for many years to support both their internal migration to cloud surge compute for their CPU/GPU design activities and the performance characterization and optimization of the AMD-powered VMs offered by cloud providers. AMD verifies their cutting-edge products in the cloud using Calibre tools to create compelling CPU/GPU offerings for the cloud. AMD is a real-world success story in leveraging the cloud to accelerate their design processes and close design cycles.
We also collaborated with Google Cloud to identify a reference architecture and right-sized VMs for the Calibre workloads the industry has shown the most interest in moving to the cloud—Calibre PERC reliability verification and Calibre nmDRC physical verification. Both of these Calibre applications have been validated on AMD-powered Google Cloud VMs, including compute-optimized (C2D) VMs based on 3rd-Gen AMD EPYC processors. In doing so, our goal was to not only help companies maintain current turnaround times in the face of mounting compute, but also to enable them to leverage the solutions from Google Cloud, AMD, and Calibre Design Solutions to achieve more design iterations per day—shortening time to tapeout while improving design quality.
Cloud processing provides design companies an opportunity to reduce time to market and speed up innovation. Core Calibre technology has been cloud-ready for years. As design companies increasingly look to leverage cloud capacity for faster turnaround times on their designs, they can be confident that running Calibre in the cloud will provide the same sign-off verification results they know and trust, while enabling them to adjust their resource usage to best fit their business requirements and market demands.
A more detailed discussion can be found in the technical paper, Google, AMD, and Siemens demonstrate the power of the cloud for EDA.