How advanced packaging, denser circuits, and safety-critical markets are altering chip and system design.
Experts at the Table: Semiconductor Engineering sat down to discuss reliability of chips, how it is changing, and where the new challenges are, with Steve Pateras, vice president of marketing and business development at Synopsys; Noam Brousard, vice president of solutions engineering at proteanTecs; Harry Foster, chief verification scientist at Siemens EDA; and Jerome Toublanc, high-tech solutions product manager at Ansys. What follows are excerpts of that conversation.
L-R: Siemens’ Foster; Synopsys’ Pateras; proteanTecs’ Brousard; Ansys’ Toublanc.
SE: How do you define reliability for semiconductors, and how is that changing?
Foster: There are a couple of things happening now that are making functional verification more complex. One is safety, which obviously creates a very big concern over reliability. It complicates functional verification because we’re essentially building fault-tolerant systems. There are a lot more requirements that we have to verify. The other thing that’s happening is that reliability now spans the entire system. It includes everything from vibration in an aircraft or a car all the way down to the chip level.
Brousard: We’re looking at reliability once a product is out in the field and in functional operation. The main thing we’re trying to look at is how close a product is getting to its failure point before it actually fails. This is about maintaining reliability, but it’s also about availability. The more conventional sense of reliability would be to monitor failures at a certain error rate and, if it crosses a certain threshold, to mitigate that. But what we’re trying to do is identify degradation at the point where a part may fail, and give enough input and insight so that it can be mitigated before it becomes an issue, hence maximizing availability.
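To make that distinction concrete, the minimal sketch below contrasts a conventional error-rate threshold check with a degradation-trend prediction that extrapolates a measured margin toward an assumed failure point. It is purely illustrative and is not proteanTecs’ product; the `read_margin`-style data, the numeric thresholds, and the simple linear-trend fit are all assumptions made for the example.

```python
# Illustrative sketch: threshold alarms vs. degradation-trend prediction.
# The margin values stand in for on-chip monitor readouts (hypothetical data).

import numpy as np

FAIL_MARGIN = 0.0        # margin (arbitrary units) at which the block is assumed to fail
ALERT_HORIZON_HRS = 500  # warn if projected failure is closer than this

def conventional_check(error_rate, max_rate=1e-9):
    """Classic approach: react only after the observed error rate crosses a threshold."""
    return error_rate > max_rate

def predicted_hours_to_failure(hours, margins):
    """Fit a linear trend to measured margins and extrapolate to FAIL_MARGIN."""
    slope, intercept = np.polyfit(hours, margins, 1)
    if slope >= 0:                     # no measurable degradation
        return float("inf")
    t_fail = (FAIL_MARGIN - intercept) / slope
    return t_fail - hours[-1]          # remaining hours from the latest sample

# Example: margin readings slowly degrading over operating hours (illustrative).
hours = np.array([0, 1000, 2000, 3000, 4000], dtype=float)
margins = np.array([0.50, 0.46, 0.43, 0.39, 0.36])

remaining = predicted_hours_to_failure(hours, margins)
if remaining < ALERT_HORIZON_HRS:
    print(f"schedule mitigation: ~{remaining:.0f} h of margin left")
else:
    print(f"healthy: ~{remaining:.0f} h of projected margin remaining")
```

In a real system many parameters would feed far more sophisticated models, but the principle is the same: act on the projected time to failure rather than on failures that have already occurred.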
Toublanc: From my side, reliability has a different meaning. We’re dealing with multi-physics simulation at the semiconductor level, but also with packaging and PCBs. Reliability is something people have been talking about for quite a while, but we’ve seen a change recently with new types of IC packaging architectures, and reliability is becoming more and more of a concern. People don’t want to only check for reliability at the end of the flow. They want to anticipate it. As for the meaning of reliability, that depends on who you’re talking to. If you talk to a semiconductor design engineer, they care about the impact of temperature on electromigration because that will reduce a chip’s lifetime. If you talk to someone who’s more mechanically oriented, reliability will be the impact of a film on thermal stress. So reliability is becoming a tricky topic, because you have to understand which phenomenon you want to simulate. We have to anticipate. People now want to understand reliability very early to make sure we design the chip according to those constraints. And this is what’s new. You now have to think twice about the architecture because of reliability.
Pateras: Reliability spans various stages of the design lifecycle. We are focusing on reliability at the design stage. What can we do in terms of design robustness to improve reliability as part of the implementation flow? We then have to deal with reliability during manufacturing. What can we do to ensure the fewest latent defects in semiconductors through proper testing? That includes stress testing, so that the inherent reliability of the part is high. And then, how do we actively manage reliability once the part is deployed? There are a number of different techniques we can use there. Monitoring is one of them. We can use the monitoring data to predict and pre-empt failures. In that area, we’re also finding that we can leverage a lot of the structures we put into the chip for manufacturing quality and test — DFT, built-in self-test. We can leverage those capabilities in the field, as well, to essentially do stress testing in the field. It’s a whole new area. With power-on test, you start your car and run a bunch of different tests to see if everything is okay. Now we’re looking at production stress testing to be able to measure where the part fails. For example, you can vary the frequency and see where it starts to fail. And you can monitor that and track the progression of where things fail over time to predict and pre-empt failures.
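As a rough sketch of the frequency-margining idea Pateras describes, the example below sweeps operating frequency to find where a block first fails, then flags the part when that failure point has drifted down from earlier measurements. The `block_passes_at()` hook, the frequencies, and the guardband are hypothetical stand-ins, not Synopsys’ actual flow; a real implementation would drive on-chip DFT and BIST hardware.

```python
# Hypothetical sketch of in-field frequency margining: find the highest
# passing frequency, then watch that failure point drift down over time.

TRUE_LIMIT_MHZ = 2210   # simulated stand-in for the silicon's real (unknown) limit

def block_passes_at(freq_mhz):
    """Assumed hook that runs BIST/functional patterns at freq_mhz.
    Simulated here; a real system would exercise on-chip test hardware."""
    return freq_mhz <= TRUE_LIMIT_MHZ

def max_passing_freq(lo=800, hi=2400, step=25):
    """Sweep upward until the first failing frequency is found."""
    last_good = None
    for f in range(lo, hi + 1, step):
        if block_passes_at(f):
            last_good = f
        else:
            break
    return last_good

def margin_shrinking(history, new_point, guardband_mhz=50):
    """Flag the part if its failure point has drifted below the guardband."""
    return (history[0] - new_point) > guardband_mhz

# Run at each power-on or service interval and log the result.
history = [2325, 2300, 2275]            # earlier measurements (illustrative)
current = max_passing_freq()            # 2200 MHz with the simulated limit above
if margin_shrinking(history, current):
    print(f"failure point now {current} MHz, schedule preventive action")
```

Tracking where the part fails, rather than a single pass/fail result, yields a trend that can be extrapolated over time, which is what makes pre-empting failures possible.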
SE: Does the definition of reliability change as we get into different markets?
Foster: It certainly becomes more urgent in certain markets.
Pateras: It’s the same definition, but depending on the application it can be more critical. Automotive is an obvious one for reliability because it involves functional safety. But even in data centers with HPC, there’s an issue of silent data corruption. We’re hearing about failures that happen once every thousand hours on a given CPU. That’s a concern in terms of cost. It’s not a safety issue. It’s a cost issue. And so techniques similar to what we’re using in automotive become important.
Brousard: Historically, there were certain verticals that would put more emphasis on reliability. But nowadays, I can’t think of one vertical or domain in which reliability is not important. In HPC data centers, the implications of silent data corruption are huge. And it’s not just financial. Some of these data centers can be vulnerable to security threats because silent data corruption can propagate through a system. And this whole AI/machine learning evolution means that decisions made at the end are based on data that might be corrupted and propagating through. These are huge decisions, and the implications of them failing are huge. But it’s more than just that. If I ask my kids what they expect from a phone, they just expect it to work. That’s the world for them. So reliability is becoming prominent in nearly every vertical that we’re dealing with.
Toublanc: Reliability used to be a sub-topic. Now it’s a key topic. We’ve had many, many customer visits where they just talk about reliability. That includes software and hardware. It also could be the lifecycle or a side-channel attack. Or it could be ESD, which is becoming a big problem.
SE: Chip designers have been pushing so much stuff into a design that we have very little margin left, and if you increase that margin you change the performance or power profile. But we don’t necessarily know how different pieces of a design are going to interact, particularly with advanced packaging, right?
Pateras: In automotive, this has always been a concern, but less at the semiconductor level and more at the system level. You’d worry about your brakes failing, for example. The semiconductors used in cars were always developed on more mature nodes, and the chips themselves were less complex.
SE: But now the carmakers want to get to 5nm so they don’t have to re-do everything by the time they get to market.
Pateras: Yes, and when we talked to the ‘traditional’ OEMs, they’d tell us they don’t have any semiconductor reliability problems. It was the camera or the brakes or the hydraulics. But now that they’re going to leading-edge nodes and creating these supercomputers on wheels, they don’t really know what’s going to happen and they’re nervous about that — and reasonably so. So now they want to understand how we ensure the reliability of the system.
Foster: There’s a lot of confusion there, too. A lot of those are system people jumping into silicon, and they don’t know how to answer these questions.
Brousard: There are two issues here. There’s the insatiable appetite for more performance and lower power, which is driving those margins down. But also, everyone’s doing their own silicon now. In the past, if it wasn’t their business, they’d buy it from a silicon provider. Now, there’s a lot more competition to squeeze that extra ounce of performance out of the chip. The trifecta of reliability, power, and performance is being driven together with advanced nodes. The pace at which a company will look at the next node is exponentially faster. But that’s why you can’t decouple the challenge of reliability from that power/performance equation. You have to keep them in check.
Foster: I was talking to a company in Europe and they had never done silicon themselves. So they were putting together a team that was going to buy a bunch of IPs and put them together. They thought, ‘How difficult can that be?’ A lot of people are getting into this, and they have to learn a lot.
Toublanc: That’s very true for monolithic designs. We have companies moving to monolithic designs, doing the chip themselves, which they were not doing before. For advanced nodes or chiplets, it’s even worse. Everybody’s talking about chiplets. We have lots of meetings where people tell us they have no clue how it works. ‘What’s the danger? What’s the risk? What’s the impact of power on thermal and mechanical and timing?’ They believe that using one die is completely the wrong approach, but this is where reliability pops up very quickly.
SE: One of the problems is that we have very little cross-disciplinary knowledge in chip and system design. Now, suddenly, everything is happening in the same company, where people have been doing very specific things. What’s the impact of that?
Pateras: They are still using third-party tools and IP, even though they are developing their own systems. Our mission is still to provide the technology they will leverage to create their own systems. They are not doing everything from scratch. They will still need design tools, IP, and even sub-systems. So our goal is to create more and more packaged, off-the-shelf components that they can use in that system. System reliability through SLM (silicon lifecycle management) is one of those components.
Toublanc: To have a comprehensive flow for heterogeneous integration of chips, it has to be open because people don’t own everything. If you are concerned about thermal, for instance, you have no choice but to simulate different components together. But if you don’t own everything, how do you do that? You need to find a way to model it, to understand exactly what is happening. But sometimes we need specific information from the foundry, especially when it involves reliability, and the foundry is not used to giving out mechanical information. And sometimes foundries have to think twice about including those parameters to make sure people can enable such advanced simulation.
Brousard: We’re working with two kinds of companies. There are big, traditional companies, and they have reliability experts. They speak the language, they’ve been doing it for years, and they bring in whatever functions they think are needed. But front-end designers, back-end designers, or ASIC designers don’t have the same depth of reliability expertise those big companies have developed over the past 30 years. The solution is kind of like providing IPs, with reliability as a component that you don’t have to worry about. The solution is there, and it offloads that burden from you. You don’t have to be an expert. You just have to have the component that will take care of it.
Foster: But we used to have clear boundaries on the process. There’s much more feeding back today to optimize or learn whatever is going on. That’s a significant change. I don’t want to be buzzword-compliant, but AI is learning from something that should have been designed correctly. ‘Well, let’s not do that.’ And it chops out whatever is incorrect and goes back and automatically fixes it.
Brousard: Following up on that, it’s super important to turn reliability into an end-to-end solution so that it’s not siloed. Whatever the solution is, it needs to communicate between silos.
Toublanc: Those silos can be different steps of a project, or different teams with different expertise. The mechanical guy will not understand the thermal guy, and so on.
Foster: Yes, and in line with what you’re talking about, the process used to be an afterthought. Now it slices through the entire thing.
Toublanc: And that is driven because of reliability. Now people don’t know where they need to go to get it under control.
Pateras: The challenge is data ownership when we talk about automotive reliability and the OEMs, Tier-1s, and Tier-2s — assuming it’s not a fully vertically integrated OEM like Tesla. But even there, they have to deal with the foundry. The question is whether you can get all this data and monitor that data in the field. If you want to share data from a Tier-1 chip, will the foundry allow the detailed yield analysis of that data to go to the Tier-1 or the OEM? That’s the biggest challenge. Technically we can do it. The analytical data would allow us to learn a lot. But the challenge is how to share it.