Improving semiconductor reliability is critical for automotive, data center, and AI systems.
Experts at the Table: Semiconductor Engineering sat down to discuss reliability of chips in the context of safety- and mission-critical systems, as well as increasing utilization due to an explosion in AI data, with Steve Pateras, vice president of marketing and business development at Synopsys; Noam Brousard, vice president of solutions engineering at proteanTecs; Harry Foster, chief verification scientist at Siemens EDA; and Jerome Toublanc, high-tech solutions product manager at Ansys. What follows are excerpts of that conversation. To view part one of this discussion, click here.
L-R: Siemens’ Foster; Synopsys’ Pateras; proteanTecs’ Brousard; Ansys’ Toublanc.
SE: Product and technology cycles are accelerating to the point where it may not even be useful to do a post-mortem because it takes too long to determine the sources of silent data corruption, AI hallucinations, or propagation of bad data. How do we address that?
Pateras: Active monitoring is one of the ways to help us understand what’s going on. If you’re monitoring continuously, you can correlate a functional error with the specific chip it occurred on. Did you see physical or structural anomalies in that chip? You can make those correlations and feed them into diagnostic tools.
Brousard: The implications of SDCs are too grave to go without post-mortems, but the problem is how to do them effectively. One of the biggest issues with silent data corruption is the inability to pinpoint the hardware error at its source before it propagates through a system and causes exponentially greater damage. The current best-known mechanisms fall short of identifying these source events. By providing high-coverage visibility into the physical mechanisms that are precursors to SDCs, and monitoring them over time, we could mitigate the physical failures before they actually occur and morph into SDCs. Given the right coverage of these physical parameters, we also could record where and when they degraded to the point of failure, to inform future products. This is a paradigm shift, from the current approach of finding and mitigating an error quickly after it occurs and before it develops into an SDC, to proactively monitoring a semiconductor’s health and mitigating once that health degrades below some critical level.
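To make the precursor-monitoring idea concrete, here is a minimal sketch. The parameter name, thresholds, and data structure are illustrative assumptions, not any vendor’s actual mechanism.

```python
# Hypothetical sketch of precursor-based SDC mitigation: sample a physical
# health parameter over time and flag the part before it degrades to failure.
from dataclasses import dataclass, field

@dataclass
class HealthMonitor:
    """Tracks one physical parameter (e.g., timing margin in ps) over time."""
    name: str
    failure_threshold: float          # value at which real failures are expected
    warning_fraction: float = 1.2     # flag when within 20% of the threshold
    history: list = field(default_factory=list)

    def sample(self, value: float, timestamp: float) -> bool:
        """Record a reading; return True if the part should be mitigated now."""
        self.history.append((timestamp, value))
        # Act *before* the physical degradation becomes a silent data error.
        return value <= self.failure_threshold * self.warning_fraction

# Example: slack on a critical path eroding with age and temperature.
monitor = HealthMonitor(name="critical_path_slack_ps", failure_threshold=10.0)
for t, slack in [(0, 42.0), (1000, 25.0), (2000, 11.5)]:
    if monitor.sample(slack, t):
        print(f"t={t}: slack {slack} ps near failure -- retire or derate this core")
```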
Foster: That’s the critical aspect of doing a digital twin. You need to bring data from the field back into manufacturing. But those models are not trivial.
Toublanc: If you have a way to monitor this, then you want to simulate your chip in situ with a specific profile, a specific mission. This is what we call a digital twin. Sometimes people wonder why you would want to simulate that. The reason is to predict reliability issues. You need to understand which piece of the product eventually might fail. This is not just about a chip or the transistor level. Eventually it will be a complete system. You want to be able to understand which components might fail first under certain thermal conditions, or in a certain environment, and then anticipate maintenance. Simulation is a useful practice for that.
SE: One of the issues here is that margin can no longer be added for generalized redundancy. It’s now externalized, because it won’t fit in the chip anymore due to reticle limits. So we’re no longer able to fail over inside a chip without sacrificing performance or power. What’s the impact of that?
Pateras: It’s all about performance. If you want to maximize performance you need to reduce your margins, which can lower your reliability.
Toublanc: Yes, and sometimes those are opposing constraints.
Brousard: It’s true that some protection strategies employ external chip- or system-level redundancy, since baking redundancy into a chip consumes valuable real estate that could be used for more functionality. But that’s an expensive approach that doesn’t scale well. Another approach could be to monitor the active component’s health over time so it does not fail unexpectedly. In this approach, you could reduce the amount of redundancy needed and mitigate a failure before it actually happens. For instance, swap a server blade proactively or divert traffic to a healthier component. Of course, the mechanism that monitors the health must have enough reliability coverage to be dependable. If a monitoring system has a small enough footprint, it can be implemented in a chip without hindering its scaling, and without sacrificing reliability.
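As an illustration of diverting traffic based on component health, here is a short sketch. The 0-to-1 health score, the 0.7 threshold, and the blade data structure are invented for the example, not a specific product’s interface.

```python
# Illustrative sketch of health-aware job routing: send work to the healthiest
# blade and skip parts whose in-chip monitors report degradation.
def route_job(job, blades):
    """blades: list of dicts with a 0..1 'health' score from in-chip monitors."""
    healthy = [b for b in blades if b["health"] >= 0.7]   # assumed threshold
    if not healthy:
        raise RuntimeError("no healthy blade available -- fall back to redundancy")
    target = max(healthy, key=lambda b: b["health"])
    target["queue"].append(job)
    return target["id"]

blades = [
    {"id": "blade-0", "health": 0.93, "queue": []},
    {"id": "blade-1", "health": 0.61, "queue": []},   # degrading -- skipped
]
print(route_job("training-step-42", blades))   # -> blade-0
```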
Foster: Yes, and you cannot design for worst case.
Pateras: If you look at voltage scaling, in the past this was done very coarsely. Now, if you want to optimize performance or power, you want to be able to do it at a much more fine-grained level. But if you do that incorrectly, you’re going to affect reliability. So you need to be monitoring very closely, and keep that monitoring in sync with your scaling, so that you’re getting the best of both worlds.
Brousard: Everyone wants to operate on the lowest margins, but the exponential increase in demand for more compute power means these minimum margins may be eaten up very quickly, compromising reliability. Ironically, these applications are the ones with the highest reliability requirements. On the other hand, accommodating worst-case scenarios means that most of the time we are over-provisioning and leaving performance on the table. One approach would be to dynamically adjust the operating point, the voltage and frequency, so margins are always kept at a minimum but still at a safe level. But to truly squeeze maximum power/performance, these should be optimized not only per process variation and silicon aging, but also per instantaneous workload. This can be achieved by monitoring performance margins in a device and adjusting voltage and frequency so that the margins are always at the desired level. If margins do become critically low, such a mechanism must readjust quickly to maintain reliability — in other words, workload-aware power customization with a safety net.
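A minimal control-loop sketch of that “safety net” idea follows. The margin units, step sizes, and setpoints are assumptions chosen for illustration only.

```python
# Sketch of workload-aware operating-point adjustment: keep the measured timing
# margin near a small target, and back off hard if it becomes critical.
def adjust_operating_point(measured_margin_ps, vdd_mv, freq_mhz,
                           target_margin_ps=15.0, critical_margin_ps=5.0):
    """Return a new (voltage, frequency) pair based on the measured margin."""
    if measured_margin_ps < critical_margin_ps:
        # Safety net: restore margin immediately at the cost of performance.
        return vdd_mv + 25, freq_mhz - 100
    if measured_margin_ps < target_margin_ps:
        return vdd_mv + 10, freq_mhz          # small voltage bump
    if measured_margin_ps > 2 * target_margin_ps:
        return vdd_mv - 10, freq_mhz          # reclaim wasted guardband
    return vdd_mv, freq_mhz                   # already near the sweet spot

print(adjust_operating_point(3.2, vdd_mv=750, freq_mhz=2400))   # -> (775, 2300)
```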
SE: How much of this is tied to the fact that some of these devices are going to be used for longer periods of time, and in some cases under more extreme conditions than in the past? We used to look at reliability from the standpoint of data centers, or mobile devices that fit in your pocket, where the thermals are similar. But now the ambient temperature can have wide swings, such as in a car.
Pateras: There was an automotive case study done in Phoenix last July, when the temperature climbed above 110°F for 10 days in a row. We did a study looking at the effects of that 10-day period on the longevity of semiconductors, and it had a huge effect on their lifecycle. Now that chips are in cars, it becomes even more critical to properly manage those chips in those kinds of environments.
Foster: And there is so much more semiconductor content in cars today.
Pateras: The new Porsche 911 has up to 7,000 chips.
Brousard: To quote a cliché, automobiles are becoming data centers on wheels. In both cases — automobiles and data centers — temperature is a top concern for the most advanced geometries. They’re expected to run for longer periods of time, and the workloads they execute are expected to become progressively more challenging over the years. That will create more stress, but they’re still running on chips designed years ago. This translates into even higher temperatures, which goes back to a previous point. The shift has to be from designing upfront for any worst-case circumstance that may come, to monitoring and adapting for optimal performance and an extended lifetime.
SE: You don’t necessarily know how it’s going to be used, but with AI it’s even worse because design teams aren’t worried about power conservation. Their priority is getting to results faster, and that creates more rapid aging in the form of electromigration, heat dissipation issues that cannot be fixed in the design, and a need for load balancing across the entire data center. What impact does this have over time?
Foster: When I was designing computers, we were not concerned about power. We were focused on performance, and the way these things were evolving, the lifetime of supercomputers was about two years. That’s changed today. People are still not thinking about power as a concern, but these devices need to last longer.
Brousard: We see them extending that cycle out to 5 or 6 years, even up to 10 years. Historically, designers were more concerned with performance than with power, reliability, or remaining lifetime, but that was because you could get away with it. Nowadays, power isn’t just about cost and the data center’s power budget. Cooling data centers is becoming a major problem, power usage effectiveness is being highly scrutinized from a sustainability and regulatory point of view, and of course there’s the remaining useful lifetime.
Pateras: We’re seeing a lot of our customers asking how we measure remaining useful life (RUL). So a bunch of the analytics we’re developing are based on monitoring and looking at trends in circuit performance in the data path, for example, where you need to understand when things are going to start failing. There are a whole bunch of factors. RUL is a big concern now.
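One simple way such trend-based RUL analytics could work is to extrapolate a degrading parameter to a failure threshold, as in the hedged sketch below. The numbers and the linear-fit approach are purely illustrative.

```python
# Sketch of RUL estimation from monitored trends: fit the erosion of a timing
# margin over operating hours and extrapolate to the failure threshold.
import numpy as np

def estimate_rul(hours, margin_ps, failure_margin_ps=5.0):
    """Linear fit of margin vs. operating hours; return hours until threshold."""
    slope, intercept = np.polyfit(hours, margin_ps, 1)
    if slope >= 0:
        return float("inf")                    # no degradation trend observed
    t_fail = (failure_margin_ps - intercept) / slope
    return max(t_fail - hours[-1], 0.0)

hours = np.array([0, 2000, 4000, 6000])
margin = np.array([40.0, 35.5, 31.2, 26.8])    # slowly eroding timing margin
print(f"estimated remaining useful life: {estimate_rul(hours, margin):.0f} h")
```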
SE: Is it all about heat?
Toublanc: Yes, and that’s a big topic. People don’t necessarily understand the complexity at the beginning. They do understand this is something they need to get under control from day zero with some type of structure. Otherwise, you cannot control it. And people have different motivations. Reliability is clearly one side of it, where they need to understand power, thermal, and thermal stress. This is pure multi-physics. But they also want to understand the tradeoff from a material point of view, and eventually from a consumption point of view. They also really need to understand it from a global architecture viewpoint, and this is pretty new, because when you talk about architecture people don’t think about physics. But now they have to simulate it. In the past they were using tables and rules of thumb, and they assumed it was going to be okay. But now, if they really want to squeeze the margin and do this big bespoke silicon for their market, they really have to take temperature into account from the very early days. And this isn’t even about chiplets. These are big AI chips that consume a lot of energy. This is a key concern, and unfortunately they don’t have a lot of ways to get it under control. With chiplets, they realize that, depending on what they do, it could completely change the temperature. So that’s something they have to understand before they take the first step. Thermal is in the middle of every discussion about chiplets.
SE: Thermal is like some gradient blob that moves across the chip, and it’s time-related, right?
Toublanc: Yes, it’s time-related and it’s area-related. And you have to think in 3D for modular design. You have to think about how to manage your power over time. It doesn’t mean I want to simulate a few clock cycles. I need to understand how I can run my chip at full speed and for how long before I need to cool it down. If I reduce the power, is it really cooling down, and can I put it back to full speed again? That’s the kind of understanding we need based on thermal, because thermal is impacting reliability.
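The “how long at full speed before I must cool down” question can be framed with a very rough lumped thermal model, as in the sketch below. The thermal resistance, capacitance, power, and temperature limit are made-up numbers for illustration, not a real package model.

```python
# One-node (lumped RC) thermal sketch: step dT/dt = (P - (T - Tamb)/Rth) / Cth
# and report when the die temperature would hit a throttling limit.
def simulate(power_w, seconds, t_start=45.0, t_amb=35.0, r_th=0.15, c_th=60.0,
             t_limit=95.0, dt=1.0):
    """Return (final temperature, seconds until the limit was first reached)."""
    temp, hit = t_start, None
    for s in range(int(seconds / dt)):
        temp += dt * (power_w - (temp - t_amb) / r_th) / c_th
        if hit is None and temp >= t_limit:
            hit = (s + 1) * dt
    return temp, hit

temp, limit_at = simulate(power_w=600, seconds=300)
print(f"after full-power burst: {temp:.1f} C, limit reached at t={limit_at}s")
```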
SE: How are we going to test this? And do you know exactly what you’re testing and verifying?
Pateras: Monitor placement is critical, and the size of the monitors is critical. There’s always a tradeoff between accuracy and size, and sometimes error is introduced when you can’t place the sensors exactly where you want them, like near hotspots. We use thermal analysis for that, and simulation tools help us place those temperature sensors correctly. That information is critical.
Brousard: The eventual goal is to make sure your chip will operate properly given its intrinsic and operational conditions. To really get a clear picture of this, it’s necessary to extend your visibility beyond just temperature and look at the actual circuit logic under these circumstances, not only during testing but also in actual field operation. We’re going to monitor the margin to failure of the logic paths that most limit performance scaling, because those are the ones most susceptible to failing. For example, temperature changes the propagation characteristics in the chip, and that can cause a low-slack path to fail. During testing and verification, it’s important to capture not only failures, but also devices with low marginality that might fail early in the field despite passing testing.
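The selection step implied here, picking the most limiting paths to instrument, might look something like the sketch below. The data structure and the temperature-derating value are assumptions for illustration.

```python
# Sketch: from static timing results, choose the paths with the least slack
# (after an assumed hot-temperature derating) as candidates for in-field monitors.
def pick_paths_to_monitor(paths, n=3, hot_derate_ps=4.0):
    """paths: list of (name, slack_ps) at nominal temperature."""
    derated = [(name, slack - hot_derate_ps) for name, slack in paths]
    return sorted(derated, key=lambda p: p[1])[:n]

sta_paths = [("core0/alu", 9.5), ("l2/tag_cmp", 14.2), ("noc/arb", 7.8),
             ("fpu/mul", 22.0)]
print(pick_paths_to_monitor(sta_paths))   # tightest paths after derating
```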
SE: How does this affect verification, which typically looks backward at what’s been designed? As these chips are used in more critical applications, we’re starting to see time-related functionality.
Foster: With safety, which is an obvious example, we’re putting structures in the design to make it essentially fault-tolerant. That could be redundancy. It could be ECC. It could be a number of different techniques. And when something does fail, we’re trying to identify what’s causing it while the chip is running, and then see if we can either correct it or, at a minimum, acknowledge that we’ve got an issue. So essentially we’re putting monitors in there to do that.
Pateras: That’s an interesting tradeoff. Historically, functional safety relied on redundancy — TMR (triple modular redundancy) or lock-step — but this was hugely expensive because it doubled or tripled the overhead. ISO 26262 is being updated to allow monitoring, so instead of having all this redundancy, there will be monitoring of what’s going on so these systems can make predictions. It’s a more complex tradeoff that we never had before.
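For readers unfamiliar with the traditional mechanism, a TMR majority voter reduces to a few lines, which also makes clear why it roughly triples the hardware: three copies of the computation for every result. This is a generic illustration, not any specific safety implementation.

```python
# Minimal TMR voter: three redundant results, majority wins; a single faulty
# copy is masked, while disagreement across all three is flagged as uncorrectable.
from collections import Counter

def tmr_vote(a, b, c):
    """Return the majority result of three redundant computations, or raise."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority -- uncorrectable fault")
    return value

print(hex(tmr_vote(0x5A, 0x5A, 0x7F)))   # faulty third copy is outvoted -> 0x5a
```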
SE: How much of this now requires a more federated approach that goes well beyond just semiconductors?
Pateras: With automotive, it’s all about software-defined vehicles now, and the complexity is on the software side. Our goal now is to figure out how to link this physical monitoring, and the data analytics that come with it, into the overall infrastructure of the car, which is software-based. And these software stacks are complex and varied. There’s AUTOSAR, Elektrobit, new players like Aurora Labs, and all of them are trying to provide reliability on the software side. But ultimately the software has to understand the cost of operating correctly, and one of the inputs, which was not the case in the past, is going to be semiconductor performance or reliability. So our goal is to provide that information to that system and let it handle it. We’re going to provide information about what’s happening on the chips through APIs to that software system, and then it’s up to the software system to make use of that information to create a whole view of the car.
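A sketch of what such a chip-health report handed up to the vehicle software stack might contain is shown below. The field names and values are invented for illustration; this is not AUTOSAR or any vendor’s actual API.

```python
# Hypothetical telemetry snapshot a chip could expose through an API so the
# software-defined-vehicle stack can act on semiconductor health.
import json, time

def chip_health_report(chip_id):
    """Assemble a health snapshot the software layer can consume."""
    return {
        "chip_id": chip_id,
        "timestamp": time.time(),
        "junction_temp_c": 87.4,        # would come from on-die sensors
        "timing_margin_ps": 12.1,       # from in-field path monitors
        "estimated_rul_hours": 9200,    # from trend analytics
        "status": "degrading",          # ok | degrading | critical
    }

# The vehicle software could poll this and, for example, migrate a workload
# or schedule service when status turns "critical".
print(json.dumps(chip_health_report("soc-3"), indent=2))
```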
Foster: And that’s essentially what a digital twin is. You have different models. Sometimes you need more resolution, sometimes you don’t, but at least you need to be able to communicate and pass along information those systems can use.