Experts at the Table, part 2: How the very fast progress of the semiconductor industry is making transistor aging even more difficult.
Semiconductor Engineering sat down to discuss design reliability and circuit aging with João Geada, chief technologist for the semiconductor business unit at ANSYS; Hany Elhak, product management director, simulation and characterization in the custom IC and PCB group at Cadence; Christoph Sohrmann, advanced physical verification at Fraunhofer EAS; and Naseer Khan, vice president of sales at Moortec. What follows are excerpts of that discussion.
L-R, João Geada, Hany Elhak, Christoph Sohrmann, Naseer Khan
SE: Is there a correlation between process variation and aging?
Elhak: Yes. If you take the same transistor with the same stress applied to it, it may age differently from one sample to another. Aging analysis today doesn’t take that into account. Monte Carlo analysis, which takes process variation into account, doesn’t look at aging. What designers do today is run either one or the other independently, and then they make assumptions about how to combine them. For example, if you look at the V threshold of a transistor, it may be -0.43 volts. We run an aging analysis and it shows up 50 millivolts higher, which means V threshold has increased by 50 millivolts. This means you now need to apply 50 millivolts more to turn the transistor on. That’s what you get from aging analysis: my V threshold is 50 millivolts more. Now you would do Monte Carlo analysis and you’ll say, ‘All right, you know, actually V threshold is -0.43, but it has a range. There is a distribution around that depending on process variation, and this distribution has a sigma of .0045, for example.’ But look at it in this graph, it’s that much. There is a spread. V threshold can be anywhere in that range. Now, I should assume, and that’s what every designer does, that after aging I have 50 millivolts more and I have the same distribution around it. That’s what people do today. In reality, if you actually take into account aging and Monte Carlo together, this experiment here shows that the spread now is three times as much. And as we have discussed, designers do margining to take that into account. So I made margin for 2X.
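The gap between the two assumptions can be illustrated with a small Monte Carlo sketch. It uses the numbers quoted above (a 0.43 V threshold magnitude, a sigma of 0.0045 V and a 50 mV mean shift) and treats the threshold as a magnitude so that aging increases it. Everything else is an assumption made for illustration: the Gaussian distributions, the independence of process variation and aging, and in particular `SIGMA_SHIFT`, a hypothetical sample-to-sample spread of the aging shift chosen so that the combined spread lands near the threefold widening mentioned in the discussion. It is not data from the experiment Elhak refers to.

```python
import random
import statistics

VTH_NOMINAL = 0.43      # threshold magnitude in volts (from the discussion)
SIGMA_PROCESS = 0.0045  # process-variation sigma in volts (from the discussion)
MEAN_SHIFT = 0.050      # mean aging shift in volts (from the discussion)
SIGMA_SHIFT = 0.0127    # ASSUMPTION: hypothetical spread of the aging shift


def simulate(n=50_000, seed=0):
    """Compare 'shift only' aging with aging that varies per sample."""
    rng = random.Random(seed)

    # Fresh devices: process variation only.
    fresh = [rng.gauss(VTH_NOMINAL, SIGMA_PROCESS) for _ in range(n)]

    # Common practice: move the whole distribution by the mean shift.
    shifted_only = [v + MEAN_SHIFT for v in fresh]

    # Combined view: every sample ages by its own (random) amount.
    combined = [v + rng.gauss(MEAN_SHIFT, SIGMA_SHIFT) for v in fresh]

    return {
        "fresh_sigma": statistics.pstdev(fresh),
        "shifted_only_sigma": statistics.pstdev(shifted_only),
        "combined_sigma": statistics.pstdev(combined),
    }


result = simulate()
spread_ratio = result["combined_sigma"] / result["fresh_sigma"]
```

In this sketch the shifted-only case keeps the fresh sigma, while the combined case is wider by `spread_ratio`. Because it assumes independent Gaussian terms, it can only widen the spread; it cannot reproduce the narrowing that Sohrmann describes later for some processes.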
SE: Wasn’t it also said that this is not really marginable?
Elhak: Yes, it is not marginable because now I can margin for 2X.
Geada: You have to be able to measure it to deal with it.
Sohrmann: We have seen even the opposite effect, where the distribution sigma is actually going down. So you have aging, but the width of your distribution is narrowing during aging, so you could be narrowing the distribution by aging.
Geada: It’s extremely process-specific.
Elhak: And that will differ from one process to another.
Geada: One of the reasons that people really like FD-SOI is because it has a slightly different behavior that way.
SE: So every time I do a chip, do I have to wait for my test chip to come back to understand how the aging and variation is going to be for that particular situation?
Elhak: You can do that with simulation.
Geada: It also depends how early you are in the process.
Elhak: If you are that early in the process, you don’t have these measurements, you don’t have the aging models ready and characterized. So you probably will need to wait for the test chip. Once you have your aging models for a specific process node extracted and you have models that say, ‘For these levels of stress, this is how my V threshold and mobility will change,’ then you can run the aging simulation that we talked about. You run the fresh simulation, apply the stress, then use the aging model to change the parameter of every transistor according to the stress.
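The flow described here (fresh simulation, stress extraction, parameter update from an aging model, then re-simulation) can be sketched in a few lines. This is a toy illustration, not any vendor's tool flow: the square-law `drain_current`, the `extract_stress` metric and the power-law coefficients in `aging_model` are all hypothetical stand-ins for a real simulator and a foundry-characterized model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Transistor:
    name: str
    vth: float       # threshold magnitude in volts
    mobility: float  # relative mobility, 1.0 when fresh
    vgs: float       # gate bias magnitude in volts
    duty: float      # fraction of time the device is under bias


def drain_current(t, k=1e-3):
    """Toy square-law 'simulation' of one device (hypothetical)."""
    overdrive = max(t.vgs - t.vth, 0.0)
    return 0.5 * k * t.mobility * overdrive ** 2


def extract_stress(t):
    """Hypothetical stress metric: bias weighted by duty cycle."""
    return t.vgs * t.duty


def aging_model(stress, years):
    """Hypothetical power-law model: returns (delta Vth, relative mobility loss)."""
    time_factor = years ** 0.2
    return 0.02 * stress * time_factor, 0.01 * stress * time_factor


def age_circuit(transistors, years):
    """Update every transistor's parameters according to its own stress."""
    aged = []
    for t in transistors:
        d_vth, d_mob = aging_model(extract_stress(t), years)
        aged.append(replace(t, vth=t.vth + d_vth, mobility=t.mobility * (1.0 - d_mob)))
    return aged


circuit = [
    Transistor("M1", vth=0.43, mobility=1.0, vgs=0.8, duty=0.9),  # heavily stressed
    Transistor("M2", vth=0.43, mobility=1.0, vgs=0.8, duty=0.0),  # never stressed
]
fresh_currents = [drain_current(t) for t in circuit]
aged_currents = [drain_current(t) for t in age_circuit(circuit, years=10)]
```

The point of the structure is that each device gets its own parameter change, so two identical transistors with different stress histories end up behaving differently in the aged simulation.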
Sohrmann: We have a bit of a problem here when we work with industry on this. We take exactly the same approach, but this is not completely accepted. People say, ‘Well that’s interesting that you can show this on a single device. I want to see it on my whole design. I want to age the entire design. And then I want to have silicon correlation.’ There we get into a real problem. As we have been saying, you cannot wait for 10 years. You have to accelerate the aging. That’s how you characterize the degradation. You apply a stress, and you over-stress it. Then you can shorten the time to failure, and then you try to extrapolate it back to the normal operating condition. You can do this with a single device, which you can over-stress, but even if you start to have current mirrors or things like this, you cannot over-stress them indefinitely. So if you have a really complex system and your customer wants to see silicon correlation on the system, how do you do that? You can ramp up temperature, but you cannot ramp up the voltage. So how do you accelerate the stress in the system? That’s something we are confronted with. We’ve been asked if it actually still works if you go from a transistor to a million-gate design. Is it possible to extrapolate?
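For the temperature side of this acceleration, a commonly used textbook relation is the Arrhenius acceleration factor. The sketch below is illustrative only: the activation energy `EA_EV` and both temperatures are hypothetical values, not figures from the discussion, and real aging mechanisms also depend on voltage and may not follow a single activation energy.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
EA_EV = 0.7                          # ASSUMPTION: hypothetical activation energy in eV


def acceleration_factor(t_use_c, t_stress_c, ea_ev=EA_EV):
    """Arrhenius acceleration factor between use and stress temperatures (Celsius)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


def equivalent_use_hours(stress_hours, t_use_c, t_stress_c, ea_ev=EA_EV):
    """Hours at the use temperature represented by a stress test of given length."""
    return stress_hours * acceleration_factor(t_use_c, t_stress_c, ea_ev)


# Hypothetical example: 1,000 hours at 125 C mapped back to 55 C operation.
af = acceleration_factor(55.0, 125.0)
use_hours = equivalent_use_hours(1000.0, 55.0, 125.0)
```

This only covers a single mechanism on a single device. It does not answer the system-level question raised here, where voltage cannot be raised and local conditions differ across the design.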
Geada: This is where the physics in certain cases helps. So, for example, even though we’re moving from 28/16/10/7nm, etc., some parts of the physics are the same. It’s mostly the same metals that get laid down in more or less the same layers. There are some changes here and there occasionally, and then all of a sudden we have to relearn things. But we know the behavior of metals under electric stress, for example. We know that for the behavior of these high-k dielectrics under stress, some of the physics is known. Part of what the foundry model is doing is applying these physics to the slightly new geometry that they’re making, and there’s a reason we keep going to these finer processes. But there’s certain history we can borrow from previous generations, and there are certain assumptions that the foundry makes. They are projections, not measurements. They don’t really know. But they’re going to assume that we’re using approximately the same metals, even though they’re slightly smaller this time. The electric fields are stronger, but we’re going to assume that the same physics from before apply.
Sohrmann: But what about local variability in temperature, for instance? You could have local hotspots. You get a different aging at this local hotspot.
Geada: In particular with finFETs, with self-heating behavior, we have some history from the earliest finFETs. This is the part that concerns me the most. We are making parts for which we don’t really have history on the modeling side on the foundry. We have the simulation technology, if we have the models, both for the highly detailed stuff and for the large-scale, chip-wide stuff on our side. We do need both, but we depend critically on models, and that’s still a very challenging area.
Elhak: The problem is that historically most of these aging models were empirical, and they depended on measurements and projections. And the problem is that it depends on how long you can project. So now when we are talking about 10 years and 15 years, that’s a very long time of projection. Yes, you will accelerate aging by increasing the temperature and ramping up the voltage, but there is a limit. Let’s say you can do measurements for one year of aging, and then you will extrapolate for 10 years. The problem with empirical models is that they are only as good as the timespan you did the measurement on. So physics here becomes very valuable because physics is constant, physics doesn’t change. So if you base your models on physics, then even if you measure for one year or for three years, you may still have a little bit more confidence that your models will be valid after 10 or 15 years.
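The extrapolation step can be made concrete with a small sketch that fits an empirical curve to one year of data and projects it to 10 years. The power-law form, a threshold shift of `a * t**n`, is a common empirical choice and is an assumption here, and the data points are synthetic values generated for illustration, not measurements from any process.

```python
import math


def fit_power_law(times, shifts):
    """Least-squares fit of shift = a * t**n, done as a line in log-log space."""
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in shifts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope  # (a, n)


def extrapolate(a, n, t):
    """Project the fitted curve to time t."""
    return a * t ** n


# SYNTHETIC data: times in years within a one-year window, shifts in volts.
measured_years = [0.1, 0.25, 0.5, 1.0]
measured_shift = [0.03 * t ** 0.2 for t in measured_years]

a_fit, n_fit = fit_power_law(measured_years, measured_shift)
shift_at_10_years = extrapolate(a_fit, n_fit, 10.0)
```

The fit reproduces the data inside the measured window by construction. Nothing in it indicates whether the same exponent still holds at 10 or 15 years, which is the limitation of purely empirical models that Elhak describes.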
SE: Aging must be approached from all sides, right?
Khan: We have to spend time. The more we know, the more we can put in the model, the more we can simulate, and then we can trust the outcome.
Sohrmann: We see that, as well, as a research institute. We do a lot of physical modeling where we look at individual traps. If you’re talking about NBTI (negative bias temperature instability), for instance, we look at each trap, we extract trap statistics. Models are 300 MB and above, and it’s individual for each transistor: It doesn’t work; it doesn’t scale. You cannot do that. You have to have empirical models. You need to have equations. You can’t put a name on each individual trap, but that’s what would be required to do it accurately. So you have to find something.
Geada: If something goes wrong, you have to be able to diagnose what went wrong. And so you think of this context, you have a chip in 10 years from now, there is a problem and you need to debug it. How do you recover the environment in the design you had 10 years ago? We can barely run software from 10 years ago on modern machines. We’ve lost all the context. Realistically what happens is, [for certain processor developers] at the end of the design, they take a bunch of machines and they put them in a vault and they boot them up every three to four months to make sure they still work, but it just stays in a vault for 10 or 15 years so that if something goes wrong they can boot up the design again, reanalyze it, then figure out what they missed so they can learn to apply it to the next generation.
Khan: That only works when the problem is only there.
Elhak: The only issue here is that people are interested now in designs built in 7nm, and the design from 10 years ago was a 90nm design. Whatever I learned for 90nm on the transistor doesn’t apply anymore, so the very fast progress of the semiconductor industry really makes aging analysis very, very difficult.
SE: That said, will techniques like on-chip monitoring come more into play in partnership with simulation?
Khan: Yes, because simulation is predicting what will happen. But if you want a monitor, you want to actually see what’s really happening in silicon.
SE: And when it sees something happening incorrectly, what can be done about it?
Khan: It depends on how well you can measure, and how locally you can measure. If you really try to be in an ideal scenario, you have one transistor that is actually doing the job, and one transistor actually measuring. It depends on how far you can go, and how far you can monitor.
Geada: One of the brilliant things about having a monitoring system is that you can run the monitoring system in parallel with your simulation and make sure they agree with each other. If there’s any divergence, that sends a red flag that something unexpected is now happening. It’s not what you predicted.
Sohrmann: What’s wrong, the model or the reality?
Geada: It doesn’t really matter. At this point, whatever you used in design is no longer holding. It’s a red flag and you can deploy this red flag before something bad happens. This is really key.
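The red-flag idea amounts to comparing a predicted trajectory with monitor readings and reporting where they part ways. Below is a minimal sketch; the function name, the millivolt units, the tolerance and all values are hypothetical and chosen only to show the comparison.

```python
def find_divergence(predicted_mv, measured_mv, tolerance_mv):
    """Return the indices of readings where monitor and model disagree."""
    if len(predicted_mv) != len(measured_mv):
        raise ValueError("predicted and measured series must have equal length")
    return [
        index
        for index, (predicted, measured) in enumerate(zip(predicted_mv, measured_mv))
        if abs(measured - predicted) > tolerance_mv
    ]


# Hypothetical threshold-shift series, one entry per check-in, in millivolts.
predicted = [10.0, 20.0, 30.0, 40.0]
measured = [11.0, 19.0, 31.0, 55.0]

red_flags = find_divergence(predicted, measured, tolerance_mv=5.0)
```

As in the exchange above, the check does not say whether the model or the silicon is at fault. It only reports that the assumptions used at design time no longer hold at the flagged points.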
Khan: You don’t know exactly what is wrong, but at least you put a big lens on it, and say, ‘I’m looking at it.’
Sohrmann: Do you have any experience with the aging of your sensor structures? We’ve been working on sensors in a research project and very quickly comes the argument, ‘It’s great that you can measure the aging, but are you actually improving reliability or are you lowering it?’ I was surprised when I read something that said if you want to increase reliability in a system that has a cooling solution, remove the cooler. The cooler is the weakest part, so if you remove it, it’s getting hotter but it’s still living for longer than when you have this mechanical thing on top.
Geada: This is one of those beautiful things about engineering. Sometimes empirical models and predictions are good enough, as long as you have a way of knowing if the empirical model will still match reality. And if there’s a disparity, react preemptively.
Khan: It’s all about statistics at the end. The more you see silicon, the more you try to correlate and say for one particular customer, they do one design in 18 months and things like that or maybe more, but they don’t have this perspective. We work with so many customers that we get the silicon data to see whether it is matching. The first instance of doing a design at 7nm, nobody knows what’s happening and we try to make sense out of it. The more data we get, the more statistics, and then this starts to make sense. But then by the time we start to get things, people move on to another technology.
Sohrmann: Another question, are you looking at aging?
Khan: Yes, we do accelerated time tests and things like that. But as you said, it goes to an extent and then if you want to confirm on maybe 15 years, you have to keep on.
Sohrmann: We think it’s a solution to add these types of sensors and raise a red flag if there is a problem you cannot predict by simulation ahead of time, and so it’s probably the way to go.
Khan: What we do is for each IP, if you’re trying to measure something, we should know whether we are okay within the IP, right? So we actually put some fault detectors inside all over the IP so if there’s any problem with a particular circuit within the IP, it will actually say, ‘The value is coming out, but don’t trust this because there’s something wrong with the ADC or there’s something wrong with this.’ This is the kind of mechanism we put in.
SE: What’s the next step? Is it just trying to get all of these things correlated together: aging, reliability, variability? Do we need to bring these all together?
Elhak: First of all, aging shouldn’t be the concern of the reliability team. It should be the concern of every designer, especially those working on advanced nodes. We have to analyze aging holistically with process variation, with self-heating, with thermal effects in general, because these are all parameters that change how the transistor degrades. Also for aging, we need to study the wear-out. Besides aging of the transistors, there is also electromigration of the interconnects. The device can break because of an EM hotspot or because of transistor aging, for example time-dependent dielectric breakdown. Any of these can make the chip cease to operate.
Geada: Also, a lot of people use bias circuits, and certain configurations actually have a bias runaway where aging changes the behavior of the biasing circuit and it actually just goes into a positive feedback loop. Basically, it turns what used to be an aging problem into an ESD problem.
Elhak: So running a few aging simulations on a number of test transistors does not cut it anymore.
Geada: Aging is now a first-order effect in your design. It needs to be treated the same way that you do timing, the same way that you do power analysis.
Sohrmann: How do you solve two major problems—characterization of all those dimensions, because it’s a multi, high-dimensional problem, and then the complexity in the simulation? There are so many more corners, so many more variables which you have to deal with.
SE: Also, what about the file sizes?
Geada: As Hany [Elhak] pointed out, realistically, like most other things we need the real detailed models, but we’re going to do empirical fits to turn it into a practical problem. And this is EDA. Large-scale problems are our bread and butter, right? In other industries, Google talks about the kind of data that they use. The entire world map for every road in the world fits in a few Gig. Surprisingly, yes. You can carry the world map on your phone. Routable information for every road in the world will fit on your phone. A chip, not so much. We live and breathe big data simulations. This is what makes our industry special. We have the technology to do this, and it’s not just uniquely ANSYS. It’s all of us. This is what we do. This is why our customers pay us money.
Sohrmann: So if you can handle it, you still have to create all this data, and that’s something we’re struggling with. We talk to foundries a lot. They, of course, are not just giving in and saying they will measure everything. No, they wait until it is really required.
Geada: For foundries, it’s also a business. For example, at TSMC last year, the automotive reliability platform is not 7nm. It’s a slightly older node for which they have a lot more data. But they are also a business, and when enough interest gets thrown at a particular domain, they build all the necessary collateral for us to run EDA software on. The awareness that this is a really important problem has come, and the foundries are adapting. They’re realizing they have to capture this data, that they have to measure it, and that they have to provide the EDA community with the models — empirical or otherwise — to deal with it.
Sohrmann: I still think that not all the foundries know how to cope with this. We work with foundries that are clueless, and they ask us how can we do an efficient characterization of this highly complex problem. Then we have to start a research project, and it’s going to take five years.
Geada: There’s a reason certain foundries are a lot more successful than others.
Sohrmann: But in the end, this all has to fit together. We cannot have a million different models.
Geada: Have you seen how many transistor level models we have that we have to deal with on a day-to-day basis?
Sohrmann: How many are really used in the end? It’s astonishingly few.
Geada: Remember that even those within a particular family like, say, for finFETs, BSIM-CMG, there are multiple versions of the things. Depending on which specific process or which specific foundry, it’s a different version of the model because that’s the one that fits. We have to run and support a wealth of models.
SE: Something we didn’t talk about is what happens when you have a customer that wants a second source. Doesn’t this very issue throw everything up in the air?
Geada: Yes. It is punitively difficult.
SE: Will we see an end to that then?
Geada: Absolutely not.
Elhak: Economics is still driving this.
SE: So it’s basically two different designs?
Geada: Effectively that’s it. It’s the same RTL remapped to two physical teams.
SE: But it’s not that easy.
Geada: Oh gosh no, but you throw money at it and the problem goes away.
SE: So the big will get bigger and everybody else will struggle?
Geada: I never underestimate people’s creativity.
SE: At this point in time, what would help the user community the most from the EDA industry?
Sohrmann: What would help most is having a standard model, if there is such a thing. That’s something I doubt, but at least a sort of standardization of this whole thing. Specifically, what to measure and how to get this into all design environments. Especially for the foundries, this is really important. We’ve had discussions where we’ve said, ‘We can help you to characterize certain devices.’ They said, ‘No, we only start with this if we can characterize all the devices, because otherwise this implies that the devices we have not characterized do not age.’
Geada: Standards are great once everybody agrees that there is one right way of doing things. I don’t think we’re anywhere near that yet and the foundries are so differentiated. Standards are only for things that have become commodities.
Sohrmann: We’re at the point where we have to talk about standards here. They are quite important. Also, for the topic we discussed earlier about physical models versus empirical models, there’s still a lot of research necessary, especially as we’ve been doing a lot of work with NBTI and recovery effects. This is hell. From a mathematical point of view, from the physical point of view, it’s not even clear what’s going on. I’m totally sure that the latest theories on this are still wrong to a certain extent. Even those universities that are at the front line of research say, ‘We think it’s like that but we’re not sure. It could be different tomorrow.’ Then you want to simplify this and say there’s a way to measure recovery really quickly. You don’t have to wait for 10 years to age and another 10 to recover because you cannot accelerate recovery. How much more can we actually spend? What’s the most efficient way to do characterization? And what’s the simplest model that still captures the most relevant effects?
Khan: Looking at the idealistic scenario, there are simulations that are available to predict, and then we should counteract that with, ‘If you’re talking about a product, we have to put a red flag there when there is a need for a red flag?’ It should be like a mobile phone when you’re tracking the battery. Depending on the kind of apps and the stress that you put on the mobile phone, it tells you it’s going to last one hour or two hours. If it’s for a product, we do all the prediction that we would do before, and then we’re tracking what’s the status today with the amount of stress that has happened before. That prevents the product from failing without reason.
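The battery-gauge analogy can be sketched as a running lifetime budget that is drawn down faster under heavier stress. This is a deliberately simple linear-accumulation model with hypothetical names and numbers. It assumes that stress can be summarized as a single multiplier per logged interval, which real aging mechanisms do not guarantee.

```python
class LifetimeTracker:
    """Tracks consumed lifetime against a budget, like a battery gauge."""

    def __init__(self, budget_hours):
        self.budget_hours = budget_hours  # rated life at nominal stress
        self.consumed_hours = 0.0         # nominal-equivalent hours used so far

    def log(self, hours, stress_factor):
        """Record an interval; stress_factor > 1 means faster-than-nominal aging."""
        self.consumed_hours += hours * stress_factor

    def remaining_hours(self, stress_factor):
        """Estimated hours left if the given stress level continues."""
        left = max(self.budget_hours - self.consumed_hours, 0.0)
        return left / stress_factor

    def red_flag(self, threshold=0.1):
        """True once less than `threshold` of the budget remains."""
        return (self.budget_hours - self.consumed_hours) <= threshold * self.budget_hours


tracker = LifetimeTracker(budget_hours=1000.0)
tracker.log(hours=100.0, stress_factor=1.0)  # light use
tracker.log(hours=100.0, stress_factor=4.0)  # heavy use
estimate = tracker.remaining_hours(stress_factor=2.0)
```

In practice the stress factor would come from on-chip monitor readings combined with the design-time prediction, so the estimate reflects both the model and what the silicon has actually experienced.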
Geada: That’s going to be a big area of innovation. Big industry, large-scale stuff, has already gone in that direction. Preemptive maintenance, maintenance only when you need it, is the way the airline industry operates today. Those issues are going to migrate to our domain. This is going to be an area of innovation, and maintaining digital twin flight monitoring, interacting, comparing model and reality — it’s an area of research. This is something that has not yet been widely deployed, dealing with scales of billions of these things. But it’s going to happen, even if it’s just a statistical sampling. We’re going to keep track of 10% of all the errors to see if there’s a mismatch between model and reality. In the end, I stand by my statement that this is engineering. We’re going to make do, and the parts we understand we will measure. The parts we don’t understand, we’ll add a little bit of padding. And we’ll just keep making progress. Nobody’s going to wait 10 years for a perfect solution. They want it now.