How to identify the causes of failures before they happen, and how latent defects and rising complexity can impact reliability and security after years of use.
Experts at the Table: Semiconductor Engineering sat down to discuss the causes of chip failures, how to respond to them, and how that can change over time, with Steve Pateras, vice president of marketing and business development at Synopsys; Noam Brousard, vice president of solutions engineering at proteanTecs; Harry Foster, chief verification scientist at Siemens EDA; and Jerome Toublanc, high-tech solutions product manager at Ansys. What follows are excerpts of that conversation. Parts one and two of this discussion are linked at the end of this article.
L-R: Siemens’ Foster; Synopsys’ Pateras; proteanTecs’ Brousard; Ansys’ Toublanc.
SE: There is so much information coming from so many different directions that you don’t necessarily know what you’re supposed to be watching. Is it just anomalies of behavior, which can trigger an unexpected reaction because the use cases are different? Or, is something actually broken? And how do you determine that?
Foster: First, you’ve got to find the outliers and make sense of what they are. Maybe it’s a different use case. But how do you understand that use case in advance?
Brousard: If you take silent data corruption as an example, it is like a butterfly effect, where an error happening in Brazil causes a much more severe failure ‘tornado’ down the line in Texas. It is crucial to catch the failure at the source. Since there are so many internal factors and external scenarios and conditions that could have caused the error, you need to look at the lowest common denominator, the physical layer — the circuits themselves — to sense an impending failure. High coverage visibility into the degradation of timing slack while a device is running its intended mission not only provides a heads up of an impending failure, but also a snapshot of when and under what circumstances the failure may erupt, which is valuable information for learning moving forward.
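As a rough illustration of the slack-degradation monitoring Brousard describes, the sketch below flags when measured timing margin trends toward a warning floor and records the operating conditions alongside the alert. The data structure, thresholds, and the check_degradation helper are hypothetical, not proteanTecs’ implementation.

```python
# A minimal sketch (not proteanTecs' actual product) of flagging timing-slack
# degradation from in-field margin readings, capturing the operating conditions
# alongside each alert for later root-cause learning.
from dataclasses import dataclass

@dataclass
class SlackSample:
    slack_ps: float      # measured timing margin, picoseconds
    voltage_v: float     # supply voltage when sampled
    temp_c: float        # junction temperature when sampled
    workload: str        # label for the active mission scenario

def check_degradation(history: list[SlackSample], warn_ps: float = 20.0):
    """Return an alert dict if slack is trending toward the warning floor."""
    if len(history) < 2:
        return None
    first, last = history[0], history[-1]
    drift = first.slack_ps - last.slack_ps          # positive drift = margin lost
    if last.slack_ps - drift <= warn_ps:            # naive one-step extrapolation
        return {
            "predicted_slack_ps": last.slack_ps - drift,
            "conditions": (last.voltage_v, last.temp_c, last.workload),
        }
    return None

samples = [SlackSample(85, 0.75, 60, "inference"),
           SlackSample(62, 0.74, 78, "inference"),
           SlackSample(41, 0.73, 92, "inference")]
print(check_degradation(samples))   # alert: margin is eroding under this workload
```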
Pateras: If you’re monitoring all the time, you can then go back to examine a functional failure and cross-correlate. But I don’t think you can predict a use case in the sense of, ‘Okay, what kind of failure am I looking for?’ It’s all about efficient monitoring. Loads of data are being generated. The big question is how you efficiently sift through that data and provide useful metadata up the chain.
Toublanc: The amount of data to collect and process is indeed a challenge. AI on the edge will of course help to interpret the information. But this requires proper training of the meta-model, which implies lots of data and time. That challenge is actually raising interest in simulated data. Multiphysics simulation can generate many metrics under a very large set of conditions to anticipate reliability, and that simulated data helps with sensitivity analysis and with proper training of the meta-models.
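To make the meta-model idea concrete, the sketch below fits a surrogate model on a synthetic sweep standing in for simulator output, then perturbs one input at a time for a quick sensitivity check. The inputs, the response, and the use of scikit-learn are assumptions for illustration, not Ansys tooling.

```python
# A hedged sketch of the "meta-model" idea: fit a surrogate on simulated
# reliability metrics so sensitivity can be explored without rerunning the
# full multiphysics solver. The data is a synthetic stand-in, not real output.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Simulated sweep: columns are temperature (C), supply voltage (V), toggle rate.
conditions = rng.uniform([40, 0.65, 0.1], [125, 0.85, 0.9], size=(500, 3))
# Stand-in "simulator" response: degradation-driven lifetime estimate (a.u.).
lifetime = (100 - 0.5 * conditions[:, 0] + 30 * conditions[:, 1]
            - 20 * conditions[:, 2] + rng.normal(0, 1.0, 500))

surrogate = GradientBoostingRegressor().fit(conditions, lifetime)

# Sensitivity check: perturb one input at a time around a nominal corner.
nominal = np.array([[85.0, 0.75, 0.5]])
for i, name in enumerate(["temp", "vdd", "toggle"]):
    bumped = nominal.copy()
    bumped[0, i] *= 1.05
    delta = surrogate.predict(bumped)[0] - surrogate.predict(nominal)[0]
    print(f"{name}: +5% input -> {delta:+.2f} lifetime units")
```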
SE: And this gets even more complicated. When you find a problem and you adjust for it, maybe re-routing signals because some of the paths may have closed off due to electromigration, now you have a whole new set of elements that you have to monitor differently than what you were doing before. Ten years is a long time for logic circuits to function correctly.
Brousard: It’s hard to predict what would be most important to monitor, so you would want high coverage of all critical or potentially critical circuitry. On the other hand, you have to be intelligent about where you place the monitors. We use advanced ML algorithms to optimize monitor placement, customized to the specific design being monitored for high parametric coverage without incurring significant PPA penalties.
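The trade-off Brousard describes, coverage versus PPA cost, can be shown with a simple greedy heuristic that picks monitor sites under an area budget. This is illustrative only; proteanTecs’ actual placement flow is ML-driven and proprietary, and the site names and costs below are invented.

```python
# Illustrative only: a greedy coverage heuristic for choosing monitor sites
# under an area budget, showing the coverage-vs-cost trade-off described above.
def place_monitors(candidates, area_budget):
    """candidates: {site: (covered_paths_set, area_cost)} -> chosen sites."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_gain = None, 0
        for site, (paths, cost) in candidates.items():
            if site in chosen or spent + cost > area_budget:
                continue
            gain = len(paths - covered) / cost      # new paths covered per unit area
            if gain > best_gain:
                best, best_gain = site, gain
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= candidates[best][0]
        spent += candidates[best][1]

sites = {"A": ({1, 2, 3}, 1.0), "B": ({3, 4}, 0.5), "C": ({5, 6, 7, 8}, 2.0)}
print(place_monitors(sites, area_budget=2.5))   # picks the best coverage per area
```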
Pateras: The monitoring has to be at a frequency that’s high enough for you to catch issues. But there also needs to be some intelligence on-chip that goes through this information on the fly and decides whether it’s something that needs to be communicated upward. We have a hierarchy of capability, from monitors, to controllers that sift through the initial monitor data, to on-chip firmware that does some metadata analysis, to actual full analytics applications. So it goes up this hierarchy. If something happens, it eventually surfaces as an alert that says, ‘Oops, something is going on and you have to replace it.’
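A rough sketch of that escalation hierarchy: raw monitor reads are filtered on-chip, summarized by firmware, and only anomalous summaries escalate to the analytics layer as an alert. The stage names, thresholds, and values are invented for illustration, not a Synopsys architecture.

```python
# Sketch of the monitor -> controller -> firmware -> analytics escalation chain.
def controller_filter(raw_reads, noise_floor=0.02):
    """First-level sifting: drop readings within the noise floor."""
    return [r for r in raw_reads if abs(r) > noise_floor]

def firmware_summarize(filtered):
    """Second level: reduce the stream to compact metadata."""
    if not filtered:
        return None
    return {"count": len(filtered), "worst": max(filtered, key=abs)}

def analytics_decide(summary, escalate_at=0.25):
    """Top level: decide whether the host needs to act (e.g., schedule replacement)."""
    if summary and abs(summary["worst"]) >= escalate_at:
        return f"ALERT: worst excursion {summary['worst']:+.2f}, investigate/replace"
    return "OK"

reads = [0.01, -0.03, 0.31, 0.02, -0.05]       # e.g., normalized margin excursions
print(analytics_decide(firmware_summarize(controller_filter(reads))))
```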
Toublanc: I agree. Zero risk does not exist over a 10-year lifecycle. The intrinsic design of the chips could remain optimal for even longer under specific pre-defined usage conditions. But there is no way to fully guarantee those conditions, because there are so many external parameters. The ultimate goal is to be able to anticipate a potential issue long enough in advance to have time to adjust the system, without any discontinuity of operations or services and, of course, at the lowest possible cost.
SE: Security is now a reliability issue. With automotive and data centers, we’re dealing with mission-critical and safety-critical applications. Are designs being done with security in mind in these applications, or is it still an afterthought?
Foster: We are seeing a shift. People are designing with security in mind, and they are verifying safety aspects of the chip in the context of security. So it is being considered. We’re still taking baby steps. We have quite a ways to go. But it is something people are proactively thinking about together today, not separating the two.
Pateras: Monitoring can help there, as well. If you monitor the voltage lines in your chip, you can see attacks on the chip. And you can monitor data buses, looking for trends in the data.
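As a hedged illustration of that idea, and not any vendor’s product flow, the snippet below treats sharp supply excursions as a possible fault-injection attempt and flags them. The nominal voltage, tolerance, and trace values are assumed.

```python
# Flag supply-rail samples that deviate sharply from nominal as possible
# voltage-glitch (fault-injection) attempts.
def detect_glitches(vdd_trace, nominal=0.75, tol=0.05):
    """Return indices where the supply deviates beyond tolerance from nominal."""
    return [i for i, v in enumerate(vdd_trace) if abs(v - nominal) > tol]

trace = [0.751, 0.749, 0.62, 0.748, 0.90, 0.750]   # two suspicious excursions
suspects = detect_glitches(trace)
if suspects:
    print(f"possible voltage-glitch attack at samples {suspects}")
```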
Toublanc: Similar to physical integrity matters, we definitely see a ‘shift-left’ trend for security. This is not only about safety or cybersecurity, but also about hardware security. There is more sensitive data hard-coded within chips that needs to be protected against hackers. Simulation used to be a complement to measurements, used to understand why a chip is at risk. Now, simulation is used before manufacturing and during the design cycles. That gives the opportunity to optimize countermeasures while designing. Security concerns are changing designs.
SE: The hackers have gotten really smart about this. They’re not sending all the data at once.
Pateras: Security is always a leapfrogging game. There are always people trying to break in, and people defending against them. But there are ways to mitigate those risks.
Foster: The challenge from a verification perspective is that in the ’90s we went through what was called the orthogonalization of concerns. We separated the way we view the physical aspect of the design from the functional, and so on. All of a sudden, those two worlds have come back together, and it’s really complicated. It was easier when I didn’t have to worry about anything physical, just functional.
Toublanc: This is an endless cat-and-mouse game between hackers and security engineers. It is pushing for new countermeasures, new techniques, new cryptographic schemes. Post-quantum cryptography will take this game to the next level.
SE: It was also easier when you had it on one chip. Now you have chiplets going into a chassis, and each one of those is a potential risk. So do you put monitors in every single chip that comes through?
Pateras: We’ve always had this contradiction between quality and security. DFT and test, by the way, are the biggest risk factors for a chip. You’re adding all this visibility and accessibility to the chip for testing, and even for monitoring. You’re using all these portions of the chip for getting information out.
Foster: That’s how we started. We initially built on the test mechanism.
Pateras: Yes, and now we need to make testing more secure, which is a bit of a contradiction. All this data acquisition provides security risks. So it becomes a double-edged sword that we have to manage.
Brousard: Chiplets are becoming the new basic building blocks of next-generation semiconductors. The vision is that they will be reusable in different projects and designs. Therefore, monitoring for safety, security, reliability and performance needs to be at that level — especially to get comprehensive parametric coverage at the SoC level, which may be mainly a collection of chiplets.
SE: How about adding in resilience? That’s been talked about with failover in vehicles and in data center chips, but most of that is just adding redundancy, and in many cases extra margin is no longer an option.
Pateras: We’re still trying to make chips resilient. From a design perspective, you’re trying to lay things out and add the right guardrails to make them resilient. How effective you’ll be is the question. There are multiple solutions to this problem at the design stage, but you’re still going to have to do a whole bunch of stuff afterward.
Brousard: Resilience definitely is not dead. To replace a blade in a data center, or a component at a remote base station, is a costly operation. We’re also seeing a rise in the expected lifetime of products. Data center products are required to last six to eight years, up from three to four years, and telecom products need to last 15 to 20 years. Guard bands will always be needed, but we have to get wiser about them. Today, each player in the supply chain takes a ‘better safe than sorry’ approach, each layering on more guard band. The high parametric coverage we discussed earlier allows us to be more particular about how much to put in initially, and how to adjust to exactly what is needed at each stage of a product’s lifetime and at each instance of its operation. That same visibility allows us to identify when reliability does become a risk, such as latent defects, so you can take appropriate and precise action, like scaling down the frequency a bit, giving it more of a voltage boost to keep it functioning, or completely circumventing that server rack.
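A toy policy loop can make the adaptive guard-band idea concrete: act on the measured margin instead of a fixed worst-case allowance, boosting voltage or scaling frequency when margin dips and failing over when it collapses. The thresholds, step sizes, and actions below are purely illustrative assumptions.

```python
# Toy adaptive guard-band policy: choose an action from measured timing margin.
def adjust_operating_point(margin_ps, freq_mhz, vdd_v,
                           low_ps=15.0, critical_ps=5.0):
    if margin_ps <= critical_ps:
        return ("circumvent", freq_mhz, vdd_v)            # take the unit out of service
    if margin_ps <= low_ps:
        if vdd_v < 0.80:
            return ("boost_vdd", freq_mhz, round(vdd_v + 0.01, 3))
        return ("scale_freq", round(freq_mhz * 0.95), vdd_v)
    return ("nominal", freq_mhz, vdd_v)

print(adjust_operating_point(12.0, 2000, 0.75))   # -> small voltage boost
print(adjust_operating_point(4.0, 2000, 0.75))    # -> circumvent / fail over
```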
SE: That’s a very physical approach to resiliency, and one that is external. What was done in the past?
Brousard: It’s all hands on deck. You’re pushing the limit, and you have to be ready within minutes.
Read parts one and two of the discussion:
New Challenges In IC Reliability
How advanced packaging, denser circuits, and safety-critical markets are altering chip and system design.
Why Chips Fail, And What To Do About It
Improving reliability in semiconductors is critical for automotive, data centers, and AI systems.