Building Complex Chips That Last Longer

Experts at the Table: Using multiphysics and lots of computation to improve reliability and resiliency in advanced nodes and packages.


Semiconductor Engineering sat down to talk about design challenges in advanced packages and nodes with John Lee, vice president and general manager for semiconductors at Ansys; Shankar Krishnamoorthy, general manager of Synopsys’ Design Group; Simon Burke, distinguished engineer at Xilinx; and Andrew Kahng, professor of CSE and ECE at UC San Diego. This discussion was held at the Ansys IDEAS conference.

Left to right: John Lee, vice president and general manager for semiconductors at Ansys; Shankar Krishnamoorthy, general manager of Synopsys’ Design Group; Simon Burke, distinguished engineer at Xilinx; and Andrew Kahng, professor of CSE and ECE at UC San Diego.

SE: There will be at least several more process nodes, according to leading-edge foundry road maps. What sorts of problems do we expect to see, and how, as an industry, are we going to solve them?

Lee: One of the key problems we see is the convergence of multiphysics. With the increased finFET density, there’s much more concentration of power consumption. That, in turn, leads to an increase in IR drop, thermal effects, and also has an impact on timing. So that convergence of multiphysics is a challenge, but it’s also an opportunity.
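To make the voltage-timing coupling Lee describes concrete, here is a minimal Python sketch of how supply droop from IR drop stretches gate delay, using the classic alpha-power-law delay model. The nominal voltage, threshold voltage, and exponent below are illustrative assumptions, not numbers from the discussion.

    # Illustrative sketch (not from the discussion): the alpha-power-law delay model,
    # showing how IR drop on the supply rail stretches gate delay.
    # All constants are hypothetical, chosen only to show the trend.

    V_NOM = 0.75   # nominal supply voltage (V), assumed
    V_TH = 0.30    # effective threshold voltage (V), assumed
    ALPHA = 1.3    # velocity-saturation exponent, assumed

    def relative_delay(v_supply):
        """Gate delay relative to nominal, per the alpha-power law: t ~ V / (V - Vth)^alpha."""
        nominal = V_NOM / (V_NOM - V_TH) ** ALPHA
        return (v_supply / (v_supply - V_TH) ** ALPHA) / nominal

    for droop_mv in (0, 25, 50, 75):
        v = V_NOM - droop_mv / 1000.0
        print(f"{droop_mv:3d} mV IR drop -> delay x{relative_delay(v):.3f}")

Even this toy model shows the multiplicative effect Lee points to: as the rail droops, every gate along a timing path slows down at once.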

Krishnamoorthy: This is a golden age for innovation across the full stack, from materials and process all the way through standard-cell architectures, through EDA, and up the chain into the software layer. The industry is striving to get to a 1,000X performance improvement. We really see tremendous opportunities across the full stack to achieve that goal, and all these innovations need to come together and multiply with each other to deliver that big gain.

Burke: Moore’s Law has not stopped, but it is slowing down a bit. We still see the ability to take advantage of new process nodes, but we aren’t seeing the scaling benefits that we used to in the past. EM (electromigration) and thermal are becoming bigger issues with new process nodes. And because Moore’s Law isn’t shrinking the dies as much as we would like, we are going down other technology avenues. That gives us bigger devices, which are difficult to assemble at the system level. There are more challenges, and it’s becoming more complicated to get the next generation out the door.

Kahng: Recent announcements point to the need for systematic vetting of technology definitions up front, because schedule slips hurt not only what’s going on domestically, but the world’s ecosystem. Foundries, in particular, need to promise aggressive ROI benefits and schedules in order to get investment and commitments from customers. But when this slips, it’s a problem. So how do we solve that? How do we get predictable and scalable design methodologies? And then, how do we extend the scope of design automation and learning, and otherwise reduce effort and stay on schedule to achieve a value trajectory?

SE: In the past, chips developed at the most advanced nodes were expected to last a couple of years in phones, and up to 4 years in data centers. Now, in the data center, they’re supposed to last 7 years, and in automotive the target is 18 years. How do we achieve that kind of longevity without performance degradation?

Kahng: We’ve dealt with this for a long time, and there are traditional approaches for monitoring activity — sort of viability by construction, in design and in use — that will still be relevant. And there may be other help. If you’re only doing video transcoding 24 x 7 in a well-controlled data center environment, or if you’re doing HPC at liquid nitrogen temperatures, there may be fewer corners and smaller guard bands. That can help the design and its longevity, as well.

Krishnamoorthy: In addition to safety and security, we are seeing resilience as a new objective that design teams are optimizing for. A big part of that resilience shows up in aging — being able to measure and optimize design robustness. There is a lot of work going on across the industry to make designs more robust and resilient to device aging. But the story doesn’t really stop there. It goes all the way to in-field operation, embedding monitors and sensors and having adaptive behavior when the monitors and sensors reflect degrading performance. And then, you optimize the software stack to react to that. We really see this as a full lifecycle problem, from design to manufacturing and bring-up, all the way into the field. It’s becoming a first-order issue, especially with data centers and automotive and IoT.

Burke: From an FPGA perspective, we’ve had market segments that have required long lifetimes for a long time. When you get into defense or just embedded in general, those lifetimes are easily 10-plus years. We are seeing new markets, and automotive is a good example where they’re using more electronics with the same requirements [as mechanical parts]. But a long lifetime has been something we’ve been dealing with in the FPGA business for a while. We just see that requirement expanding across different market spaces.

Lee: There are a lot of methods and technologies out there for high-reliability electronics. What we’re able to do now is also tie in increased usage of physics-based simulation, and make that available through various techniques. Computing is much more available than it was 10 years ago, and as we look further out with the cloud, there are new opportunities for accelerated compute. Then we tie in some of the advances mathematically with AI and ML — smart Monte Carlo — and we’re able to do much better predictive modeling of what will happen over the lifetimes of operation of electronics. So I feel bullish that we have all the tools ready to address those concerns across the applications you mentioned.
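As a rough illustration of the "smart Monte Carlo" predictive modeling Lee mentions, here is a minimal Python sketch that samples operating conditions and runs them through a simplified Arrhenius-style aging model. The model form and every constant are illustrative assumptions; only the 7-year data-center target comes from the discussion.

    # Hypothetical sketch of Monte Carlo lifetime estimation. The Arrhenius-style
    # aging model and all constants are illustrative assumptions, not a real
    # reliability model.
    import math
    import random

    K_BOLTZMANN = 8.617e-5   # eV/K
    E_ACT = 0.7              # activation energy (eV), assumed
    A_SCALE = 1e-9           # scale factor (years), assumed
    TARGET_YEARS = 7.0       # data-center lifetime target from the discussion

    def time_to_failure(temp_c, v_stress):
        """Simplified Arrhenius-style TTF: longer at lower temperature and lower voltage stress."""
        temp_k = temp_c + 273.15
        return A_SCALE * math.exp(E_ACT / (K_BOLTZMANN * temp_k)) / (v_stress ** 2)

    random.seed(0)
    trials = 100_000
    failures = 0
    for _ in range(trials):
        temp = random.gauss(85.0, 10.0)   # junction temperature (C), assumed distribution
        volt = random.gauss(0.75, 0.02)   # supply voltage (V), assumed distribution
        if time_to_failure(temp, volt) < TARGET_YEARS:
            failures += 1

    print(f"Estimated failure probability within {TARGET_YEARS} years: {failures / trials:.4%}")

The "smart" variants Lee alludes to would bias the sampling toward the rare stress conditions that actually cause failures, rather than drawing uniformly as this naive loop does.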

SE: Is resilience like dynamic partitioning, where you say a block is not working as well as it should, so therefore data or computing needs to be re-routed?

Krishnamoorthy: Yes, and a lot of the cloud service providers already have been doing this for a while. It’s really just reaching the next level. There’s this notion of tracking the performance of the entire system’s operation over the course of weeks and months and years, and as you see the performance metrics start to degrade, the software stack above it automatically adapts its expectations of that system. So just because the frequency dropped, or just because there’s increased temperature throttling, it doesn’t mean you have to invalidate that particular socket. You can still keep using it, but for a different set of applications. This notion of adaptability, where you read the data off the silicon and then adaptively use the underlying hardware, will really take this to the next level. Having a spectacular failure is no longer an option. There’s this notion of graceful degradation, prolonging mean time to failure. These are first-order issues. The exciting opportunity is getting the silicon itself to participate in that with monitors, sensors, and other types of readouts, and then having the firmware and the embedded software layers above it collaborate with those readouts. We have an opportunity here to create very adaptive systems, which is essential for these types of applications.
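A minimal sketch of the graceful-degradation policy Krishnamoorthy describes might look like the following Python fragment, which maps degraded monitor readouts to lighter workload classes rather than retiring the socket. The telemetry fields, thresholds, and workload classes are hypothetical.

    # Hypothetical sketch of graceful degradation: reclassify a socket from in-field
    # monitor data instead of invalidating it. All fields and thresholds are assumed.
    from dataclasses import dataclass

    @dataclass
    class MonitorReadout:
        max_freq_ghz: float   # highest sustainable clock reported by on-die monitors
        throttle_pct: float   # share of time spent thermally throttled

    def classify_socket(readout, nominal_freq_ghz=3.0):
        """Map degraded telemetry to a workload class rather than retiring the socket."""
        freq_margin = readout.max_freq_ghz / nominal_freq_ghz
        if freq_margin >= 0.95 and readout.throttle_pct < 5:
            return "latency-critical"   # full-performance services
        if freq_margin >= 0.85 and readout.throttle_pct < 15:
            return "batch"              # throughput jobs tolerant of lower clocks
        if freq_margin >= 0.70:
            return "background"         # low-priority work only
        return "retire"                 # schedule replacement

    print(classify_socket(MonitorReadout(max_freq_ghz=2.7, throttle_pct=12)))  # -> "batch"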

SE: How has the FPGA world been dealing with this?

Burke: Obviously FPGAs recompile, but we don’t change that compilation based upon what the silicon is doing. Instead, we add margin into the designs, and redundancy in terms of some physical design aspects, to make the lifetime longer and increase the reliability over time. But a function is a function, from beginning to end. We haven’t moved into the dynamic reconfiguration of it. We do support dynamic reconfiguration as an architectural feature of our designs, but that’s more intended for functionality. If somebody wants to take an FPGA that’s running and reconfigure part of it to do something else, you could use that to work around some issues. But we don’t turn off parts of the chip when they stop working. We try to extend the lifetime of each part of the chip. That’s the current approach.

SE: Can all of this be simulated to figure out what’s going to fail within a complex chip or package?

Lee: Computationally, it’s a big challenge, and we’re working with academia and our partners to solve it. We’re focused on physics-based simulation, and we think there’s a tremendous opportunity to shift left and bring that intelligence faster and earlier in the design cycle. There’s a lot of richness by cross-fertilizing our computational physics with design technologies.

SE: Can the current tools keep up?

Kahng: Niche technology, integration, or architectural contexts always have been served late, if not under-served. Tools have co-evolved with what the leading customers demand — Xilinx, Qualcomm, Apple, Nvidia, AMD, Intel, and the foundries. So one question might be whether this history of co-evolution has somehow missed important pieces of what is today the long tail. Those will be important applications tomorrow, such as memory and AI accelerators for these 3D and 2.5D integrations. I understand that EDA companies and their partners are working on this.

Lee: One of the promising areas we see for research and collaboration involves the on-die sensors, and what I would call a digital twin, which is a data-driven model of the in-field operation of a semiconductor part. We can augment that with the physics-based simulation models we have. So a hybrid AI system that’s taking data-driven, and also simulation-driven, predictions is the way to ensure the maximum lifetime of electronics out in the field. It’s not an either/or. It’s a combination. That’s an example of breaking down some of the silos that exist between the data sources, and having a more open and extensible platform-based approach will help solve these problems.
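One way to picture the hybrid digital-twin idea Lee outlines is the sketch below, which blends a physics-based aging prediction with measured sensor drift to estimate remaining life. The linear aging model, the blend weight, and the drift budget are all illustrative assumptions.

    # Hypothetical sketch of a hybrid digital twin: blend a physics-based aging
    # prediction with measured on-die drift. Model form and constants are assumed.

    def physics_predicted_drift(years_in_field, drift_per_year_pct=0.8):
        """Assumed pre-silicon aging model: delay drift grows roughly linearly with time."""
        return drift_per_year_pct * years_in_field

    def blended_drift(years_in_field, measured_drift_pct, weight_measured=0.6):
        """Weighted blend of data-driven and simulation-driven estimates (not either/or)."""
        predicted = physics_predicted_drift(years_in_field)
        return weight_measured * measured_drift_pct + (1 - weight_measured) * predicted

    def remaining_life_years(years_in_field, measured_drift_pct, drift_budget_pct=7.0):
        """Extrapolate the blended drift rate out to an assumed end-of-life drift budget."""
        drift = blended_drift(years_in_field, measured_drift_pct)
        rate = drift / max(years_in_field, 1e-6)
        return max((drift_budget_pct - drift) / rate, 0.0)

    print(f"Estimated remaining life: {remaining_life_years(3.0, measured_drift_pct=3.1):.1f} years")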

SE: In the past, most chips developed for automotive applications were developed at older process nodes. Now we’re looking at 7nm and 5nm chips being used in extreme environments. How does the chip industry deal with this?

Krishnamoorthy: A corner-based methodology that has been used in the past is one way, and many customers are doing that today. Routinely, automotive chips are being signed off at 200 or 300-plus corners. But then with all the recent advances in statistical variation, we really have a fantastic opportunity to work on a multitude of issues like power integrity and design robustness. The ability to multiply corner-based analysis with statistical modeling and power integrity, integrated deeply into every aspect of the analysis, is how pretty much all the top-tier companies are doing it. That also opens the door to things like cloud acceleration. When you are trying to compute that many points in that overall parameter space, accelerating with the cloud is a crucial element of workflow acceleration. You really have to intersect static timing analysis with signal integrity and design robustness, and then do it across a very large set of operating points.
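To give a sense of the scale of multi-corner signoff Krishnamoorthy describes, here is a minimal Python sketch that enumerates process/voltage/temperature/extraction corners and fans the analyses out across workers. The corner axes and the analyze_corner() stub are placeholders, not a real signoff flow.

    # Hypothetical sketch of fanning a multi-corner analysis out across workers,
    # in the spirit of the 200- to 300-plus-corner flows described above.
    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    PROCESSES = ["ss", "tt", "ff"]
    VOLTAGES = [0.675, 0.75, 0.825]              # V, assumed
    TEMPS = [-40, 25, 125, 150]                  # C, automotive-style range, assumed
    EXTRACTION = ["cworst", "cbest", "rcworst", "rcbest", "typical"]

    def analyze_corner(corner):
        """Placeholder for one STA/IR/robustness run at a single operating point."""
        process, vdd, temp, rc = corner
        return f"{process}/{vdd}V/{temp}C/{rc}: ok"   # a real flow would return slack, IR margins, etc.

    if __name__ == "__main__":
        corners = list(product(PROCESSES, VOLTAGES, TEMPS, EXTRACTION))
        print(f"{len(corners)} corners to analyze")   # 3 * 3 * 4 * 5 = 180 operating points
        with ProcessPoolExecutor() as pool:           # stands in for fanning out to cloud machines
            results = list(pool.map(analyze_corner, corners))
        print(results[0])                             # a real flow would aggregate worst-case results

Even this toy combination of four axes produces 180 operating points, which is why cloud fan-out becomes part of the workflow rather than an optimization.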

Lee: If you look at extreme environments like automotive, aerospace and defense, or 5G base stations and microcells, the common theme for all of those is thermal analysis. It’s extremely important to have a thermal workflow that comprehends the electronic cooling and the ambient environment, but you also need to go to the nanometer level and analyze specific devices computationally. To do that, brute force is not possible. But if you use advanced techniques, which are hierarchical and multi-scale, it is possible, and that’s a brand new world of simulation capabilities that we’ve developed for these thermal methodologies. Then, we need to tie that into thermal sensors on a chip. It’s taking measured data from the chip, and adapting the performance of that electronic system so it does not overstress itself. There are tools out there that can be combined for really detailed physics-based simulation across a multitude of corners, from the system level all the way down to the nanometer level. But to make that useful, you want to modify the design so that in operation you can have a safely functioning device for many years.
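The hierarchical, multi-scale idea Lee refers to can be sketched, very loosely, as a coarse tile-level temperature estimate followed by refinement of only the hottest tile. The thermal-resistance constants and power maps below are illustrative assumptions, not a description of any vendor's solver.

    # Hypothetical sketch of a hierarchical, multi-scale thermal estimate: a coarse
    # tile-level pass over a power map, then refinement of only the hottest tile.
    # The thermal-resistance constants and power maps are illustrative assumptions.

    T_AMBIENT = 45.0   # C, assumed ambient around the package
    R_TILE = 8.0       # C per W, assumed coarse tile-to-ambient thermal resistance
    R_SUBTILE = 30.0   # C per W, assumed local self-heating resistance at fine scale

    coarse_power = [   # W per tile for a 2 x 2 floorplan, assumed
        [1.5, 0.4],
        [0.7, 2.6],
    ]

    # Pass 1: coarse, chip-level estimate.
    coarse_temp = [[T_AMBIENT + R_TILE * p for p in row] for row in coarse_power]
    hot_i, hot_j = max(
        ((i, j) for i in range(2) for j in range(2)),
        key=lambda ij: coarse_temp[ij[0]][ij[1]],
    )
    print(f"Hottest tile ({hot_i},{hot_j}) at ~{coarse_temp[hot_i][hot_j]:.1f} C")

    # Pass 2: refine only the hottest tile, using its coarse temperature as the boundary
    # and adding local self-heating for an assumed uneven split of its power.
    sub_power = [0.3, 0.5, 0.6, 1.2]   # W per sub-tile, assumed hotspot concentration
    sub_temp = [coarse_temp[hot_i][hot_j] + R_SUBTILE * p for p in sub_power]
    print(f"Refined hotspot peak: ~{max(sub_temp):.1f} C")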

Burke: From a solutions perspective — especially for markets like automotive, which is growing quickly at the moment — a lot more computing is required than we’ve seen in those markets in the past, and the packaging is much more difficult. It’s in an area that sees extreme temperatures, the cooling solutions aren’t optimal, and the compute requirements are going up. Those systems are typically multi-die, with extreme thermal and mechanical stresses. Those kinds of markets are really pushing the boundary on what we need to do to make a working, reliable system that can last for 10 or 20 years.

Kahng: And what you’re hearing in many comments is this trend of HPC converging with mobile and information and communications technology, whether it’s for your home or your car or your laptop. The demands on silicon are growing, even though we are having increased difficulty in scaling things like power delivery networks, thermal, and power integrity.

Lee: As chip companies design systems today, they use a large amount of emulation, and they have a large amount of simulation data available to them. The challenge is how to take that simulation data and tie it into an effective dynamic thermal management policy. It means really shifting left in analyzing the thermal effects from the emulation data as early as possible, and then tying that back into physical design. We’re really targeting these problems more broadly and earlier than we’ve been able to in the past.
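As a rough sketch of shifting thermal analysis left from emulation data, the following Python fragment converts per-block toggle activity into dynamic power (P = a * C * V^2 * f) and flags blocks that exceed an assumed per-block budget. The block names, capacitances, activity factors, and budget are hypothetical.

    # Hypothetical sketch of shift-left thermal screening: turn per-block toggle
    # activity from an emulation trace into dynamic power and flag budget violations
    # before physical design closes. All values below are assumed.

    V_DD = 0.75              # V, assumed
    FREQ_HZ = 2.0e9          # Hz, assumed
    POWER_BUDGET_W = 2.0     # per-block budget feeding the thermal policy, assumed

    blocks = {               # block name -> (switched capacitance in F, toggle activity factor)
        "npu_mac_array": (4.0e-9, 0.45),
        "l2_cache":      (2.5e-9, 0.12),
        "video_codec":   (1.8e-9, 0.30),
    }

    for name, (cap, activity) in blocks.items():
        power = activity * cap * V_DD ** 2 * FREQ_HZ   # P = a * C * V^2 * f
        flag = "REVIEW" if power > POWER_BUDGET_W else "ok"
        print(f"{name:14s} {power:5.2f} W  {flag}")

Flagged blocks would then feed back into floorplanning and the dynamic thermal management policy, which is the physical-design tie-in Lee describes.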


