Process Variation Not A Solved Issue

Experts at the Table: Biggest issues with process variation today, and its impacts on the design process.


Semiconductor Engineering sat down to talk about process variation in advanced nodes, and how design teams are coping, with Christoph Sohrmann, a member of the Advanced Physical Verification group in Fraunhofer’s Division of Engineering of Adaptive Systems (EAS); Juan Rey, vice president of engineering at Mentor, A Siemens Business; and Stephen Crosher, CEO of Moortec Semiconductor. What follows are excerpts of that conversation.

Left to Right: Christoph Sohrmann (Fraunhofer), Juan Rey (Mentor), Stephen Crosher (Moortec)

SE: Especially with advanced nodes, there are many types of variation in the process and today’s design and verification tools have to understand this. How does variation impact the sphere of your work today?

Rey: Variation is definitely having an impact on the daily activities of pretty much everyone. We have seen it increasing complexity over time, but we have observed a couple of things. The first is that the worst-case scenario — everyone was concerned that variability would reach a level where it would be unmanageable — has not happened. Furthermore, there seems to have been a flattening in how variability is dealt with overall. Around the time multi-patterning started being adopted and used in production, we saw the growth in the number of corners that need to be handled essentially stop. Where we have kept seeing exponential growth is in overall complexity. The introduction of new nodes keeps pushing for more and more complex rule decks, so that kept growing. We are now seeing a slowdown in that growth with the introduction of EUV, but we think it is only temporary, because even EUV has been introduced with the need for computational lithography, and multi-patterning will keep growing in future nodes. So we still see growth in the complexity of all the issues that need to be dealt with — not exponential growth going from 7nm to 5nm, but still growth. And that is well beyond the growth that comes with the increased density of features — polygons, transistors, elements, edges, whatever number you use to measure the geometric complexity of the dies. I was talking about the complexity of everything else. That's how we tend to see it: the industry had enough time to start adopting the tools to deal with variability, in EDA and in our customer base, and there seems to be a general overall understanding of how to deal with the situation.

Sohrmann: As a research institute, we're not concerned only with advanced nodes, because variability appeared much earlier. We have had projects where it was relevant at 65nm, and everything in between down to 22nm. What we see from the modeling point of view is that the complexity is blowing up in dimensions, so you have to check so many corners. The number of corners is growing exponentially, and we try to create optimizations for the existing flows to handle the additional parasitic effects, but it just keeps increasing. The question is, how can you combine all of this? We see combinations of aging and variability, for instance. Modeling is one thing — how can you simulate this — but also, how can you do qualification of that? You have to do aging model qualification, and collecting degradation data takes a lot of time. And if you want, in addition, some change in degradation due to variability, it's even tougher, and we've been asked how to handle this. For some things we have solutions, but usually it's very tough.

We have research projects where we try to follow our own ideas, but always with the focus on industry questions. The input from the industry is important for us. We're not a university; we don't want to invent the questions ourselves. We want to bring academic solutions into the industrial world — that's Fraunhofer's goal, to do this transfer. We are confronted with very advanced questions: Can you do all of this thermal, aging and variability analysis at the same time? Sometimes you have to say, 'No, we don't have a solution at the moment.' On the other hand, it's quite tough for us. When we talk about advanced nodes — we had it in the booth today — we have trouble accessing those really advanced nodes as Fraunhofer, because the foundries don't really care much about us. We have Imec as a partner, but the [foundries] don't want to partner with us at all. We're just about to talk to them. Maybe there's a chance for us to get in there, but we look at it in a very general way. From the modeling side, it's not going to change too much — it's just a matter of complexity in the end.

Crosher: We see it very much from the design team's point of view. The customers who are using the foundries on these sorts of advanced nodes, where process variability is quite marked, tend to be digital teams. What that tends to boil down to is really gate delay, so they're not so consumed by the physicality of the individual elements of process variability. It's more to do with the actual gate delay.

SE: That’s based on the models from the foundry, yes?

Crosher: Yes, and I guess they're taking a high-level view of it. What they're really concerned about — when they run through things like timing closure and static timing analysis, and they're having to do a lot more corners than previously — is how that then reflects onto silicon. When you have such localized process variability, it can create a die that has regions of different temperature, different supply, different gate delay. So we're posing the question: how well do static timing analysis and simulation correlate with what's physically on the chip, which is such a changing, dynamic environment? One of the really big concerns of our customers is meeting timing within silicon over a full range of temperatures. They're also concerned about how aging impacts that, and we've seen some customers have failures because of process variability. An important concern is knowing how localized the process variability is, because the foundry can steer a lot of these things. How do you quantify how localized these effects are? I think we're still finding out. Is it transistor to transistor? Is it gate to gate? Is it different gate types? There are areas we still need to understand more about.

SE: What is the best way to find out how localized a process variability effect is?

Rey: This is really an interesting topic for us to discuss and try to figure out what is happening. We, for example, started developing thermal analysis tools in cooperation with Leti, and we developed a methodology that essentially allows you to do a very fast thermal estimation at the floorplan level. Then you can get a more refined model at the place-and-route level, and finally you can have as much accuracy as you need if you have to do thermal certification for signoff-type analysis. We kept showing these tools at conferences like DAC and other places, even with Leti helping on the demonstration — showing how they have a chip, dynamically measuring how the temperature evolves, with modeling results on the side and how well they matched. Everything was very impressive, because you have the level of accuracy that you want and refinement techniques that operate very fast at each of these subsequent levels. But it seems the thermal problem is of very high importance only for some very specific types of designs, because we cannot find broad adoption and interest. We put these tools out essentially as projects and try to engage with customers, but we don't find customers coming to us and saying, 'Hey, we absolutely want to use this.'

SE: Is it maybe too early in the adoption cycle?

Rey: I really think that there are some industries where it is absolutely essential. You go into high speed, the most power consuming designs, and those areas seem to absolutely need it, and have been using some form of thermal simulation for quite some time, but a larger customer base does not seem to be embracing that space.

Sohrmann: We’ve had a similar experience. We also have a tool in the thermal domain, and you have to be very careful who you talk to. Everyone says, 'Yes, that makes sense to use,' but it's not really required for every type of product, every time. I've learned today that apparently thermal becomes very important again in the advanced nodes — FinFET designs and so on — so it will be interesting to see what the combination of variability and thermal is going to become.

SE: Is there possibly another way to deal with it that people are employing? Are they doing extra simulation?

Crosher: The way we see it, by no account are all the customers doing thermal simulations. Some are, but some aren't. For those who aren't doing thermal simulations — and this isn't a plug for what we're doing, but it is true — people are putting down thermal sensors in the chip to see what's going on. Some market environments move so quickly, and they have to complete the chip design within such a short period of time, that they will put thermal sensors down rather than doing the simulation work. Then they'll just have a scheme that works around the simulation.

SE: Is it more cost effective to do that than buy thermal simulation tools?

Crosher: I don't know the cost of thermal simulation tools, but it seems to be a way of getting around the subject slightly. As long as you've got a dynamic clocking scheme and can change the clock frequencies of the system, you can try to bring temperatures down, and things like that. For some customers I think that's okay, but as you say, there is then another category of customer with extended lifetimes for their devices, where thermal management is absolutely critical, so they will do the simulation work.
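The dynamic clocking scheme Crosher describes — stepping the clock down when on-chip sensors run hot, instead of relying on up-front thermal simulation — can be sketched as a simple feedback loop. This is a hypothetical illustration; the thresholds, frequency steps and function names are all invented, not any vendor's actual scheme:

```python
# Hypothetical thermal-throttling loop driven by in-chip sensors.
# All thresholds and frequency steps are illustrative.

THROTTLE_C = 95.0    # step the clock down above this sensor temperature
RESUME_C = 85.0      # step back up only once cooled below this (hysteresis)
FREQ_STEPS_MHZ = [400, 800, 1200, 1600]

def next_freq(current_mhz, max_sensor_c):
    """Pick the next clock frequency from the hottest sensor reading."""
    i = FREQ_STEPS_MHZ.index(current_mhz)
    if max_sensor_c > THROTTLE_C and i > 0:
        return FREQ_STEPS_MHZ[i - 1]   # too hot: throttle down one step
    if max_sensor_c < RESUME_C and i < len(FREQ_STEPS_MHZ) - 1:
        return FREQ_STEPS_MHZ[i + 1]   # cooled off: recover one step
    return current_mhz                 # inside the hysteresis band: hold

print(next_freq(1600, 100.0))  # → 1200 (hot die, throttle)
print(next_freq(800, 80.0))    # → 1200 (cool die, recover)
```

The hysteresis gap between the two thresholds prevents the loop from oscillating between frequencies when the temperature hovers near a single trip point.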

SE: Maybe they are reluctant to pay for more tools.

Rey: I don't think that's the issue, because we haven't even gotten into that type of discussion. We've participated in helping put together reference flows, for example at TSMC, with thermal simulators that can show it is possible to do the analysis, but in the end there is no traction.

Crosher: That doesn't quite make sense. You would think there would be traction for this, because what customers would want to know is where the risk areas are — where the hotspots are within the design — and then tailor their software to make sure everything on the die is kept at a low temperature. But as you say, only a few customers are doing that.

Sohrmann: I can't even imagine integrating sensors without having an idea of what the temperature is going to be like. If you design this type of feedback loop — if you want to, say, switch things off at a certain temperature — you have to know what to expect. We have been talking to designers creating power amplifiers, and they do exactly that. They look at the temperature, and they have to correct for thermal and electro-thermal effects. They have to know it during the design stage.

SE: Do they use specific tools for that?

Sohrmann: They do, yes. So there is some traction, but this market is not very large.

Rey: But if we think about process corners independently of the source of the variation — the concept of encapsulating a series of corners, each of which has a specific characteristic — our observation is that the number grew very fast but settled at a reasonably small number: 14 to 15 corners, of that order, for example, when specifying interconnect parasitics. That seems to be a practical number for the industry to handle, both when the foundries need to specify the information and when the design community needs to deal with the effects. So when I was referring to the thesis that it grew very fast, but the industry adopted a methodology and it flattened, I was thinking in those numbers and those terms. I don't know if your experience roughly matches that.

There was high-high, low-low, and a couple of other combinations in the beginning, and that kept growing, right? But the total number of corners is not in the hundreds or thousands, as people were concerned it would explode to.
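The combinatorial pressure Rey describes can be illustrated with a toy enumeration. The axis and corner names below are hypothetical (not any foundry's actual set); the point is that the naive cross-product of variation axes explodes, which is why the industry signs off on a small curated list of the order he mentions:

```python
from itertools import product

# Illustrative corner axes — names are hypothetical, not a real PDK's set.
device = ["SS", "TT", "FF", "SF", "FS"]                   # global device corners
interconnect = ["Cmin", "Cmax", "RCmin", "RCmax", "typ"]  # parasitic corners
voltage = ["Vmin", "Vnom", "Vmax"]
temp = ["-40C", "25C", "125C"]

# Naive cross-product: every combination of every axis.
full = list(product(device, interconnect, voltage, temp))
print(len(full))  # → 225 combinations (5 * 5 * 3 * 3)

# In practice a curated subset pairing worst-case axes is signed off instead
# of the full cross-product (entries here are purely illustrative).
signed_off = [
    ("SS", "Cmax", "Vmin", "125C"),   # slow devices, heavy parasitics, hot
    ("FF", "Cmin", "Vmax", "-40C"),   # fast devices, light parasitics, cold
    ("TT", "typ", "Vnom", "25C"),     # nominal
]
```

Each additional axis multiplies the naive count, so flattening at ~15 signed-off corners rather than hundreds is what keeps the methodology tractable.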

Crosher: What we have seen is customers who will limit the number of corners they run simulations at, and aim to steer the working chip into those corners. They may do some selection of fast, slow and typical silicon, and they may try to steer the temperature conditions, or the supply, into the regions they've simulated. So instead of doing the complete blanket — all corner cases, thousands of runs — they may do less, but steer the actual physical chip into an operating envelope that they know and understand. That operating envelope may be different for a low-power application versus a high-speed/high-performance computing application. They'll plan the simulation work around those operating regions rather than blanketing everything.

SE: It seems like there’s a lot of activity here with different technologies. Is it perhaps confusing for some users about the need to understand variation, and has the industry communicated clearly enough what needs to be done about it?

Crosher: One thing about variation is that it comes from the manufacturing process, so the foundries probably aren't going to be apt to say how much variability there is in the silicon. They are also trying to steer the process to limit it, but maybe there isn't much of an open dialogue.

SE: I do hear that. Juan, you have close relationships with the foundries, but those are very closely guarded relationships, right? Do they give you all of the data that you need — all of the process variation data?

Rey: As you said, anything that is close to the technology is highly proprietary for any foundry, because that's the core of their technology and how they need to operate. That forces the foundries to take the path of providing the information the design community needs in a way that protects their IP. Ideally, the design tools would have the detailed information, but that's not possible, so the foundries encapsulate this information with different types of technologies, which have been changing over time. The ideal case — when you really need to go all the way and use a field solver to calculate something specific, like the capacitance of a very particular structure — is not possible, because that would reveal proprietary information. So yes, it's an issue that the foundries have helped the industry to resolve. Of course it creates complications, but it is being addressed, because people succeed.

Sohrmann: But if the design teams are asking for it, the foundry has to provide it. If you want to do an ADC very accurately, you need to know this information. It's part of the foundry's business model to have stability; if it's not there — if there's a large variation in the design parameters — you might not be able to get your yield. But it has to be requested by the design companies.

SE: And that’s why the foundries qualify the design tools for their processes. How are design teams accounting for lack of full understanding of the localized process variability?

Crosher: Based on the customers that we talk to pretty much every day, if they haven't got a full understanding of the localized process variability, they'll take a low-risk, worst-case approach, and so we're seeing design schemes coming through that try to minimize the impact of variability. For example, within clock trees or within logic paths there won't necessarily be a mix of Vt types. They'll keep it fairly regular across the die, or for a particular logic path or clock tree they'll keep to one consistent Vt type, and treat them as different domains if there's another logic path.

Sohrmann: It's not just for the process variability. You can capture flop bending, thermal variability within the IC, aging, voltage drop, and so on. You have a certain guardband, and everything is crammed in there — at least that's our approach. We were trying to help with variation-aware design, but it turns out the guardbanding is so large that it doesn't make much sense; you cannot really save much. Still, it seems to me it's becoming important again, especially with local variability. You cannot create corners for the local variability, only for the global variations, right? So you really have to look at each individual device. You also have spatial variability across the wafer, and that's another dimension of complexity. I don't know whether it's something that requires more characterization and more tools, or whether at some point there's just a limit where you say, 'I cannot use the advantages of the advanced nodes anymore. I have to stop, because the variability is getting larger and larger.'
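Sohrmann's point that local variability cannot be folded into a global corner — you have to look at each individual device — is commonly handled by sampling per-device variation, Monte Carlo style. A minimal sketch under invented numbers (the 5% per-gate sigma and 10ps stage delay are illustrative): because each gate draws its own independent variation, the relative spread of the whole path shrinks by roughly the square root of the number of stages, which is exactly why a single global corner cannot represent it.

```python
import math
import random

random.seed(42)

N_STAGES = 20          # inverters in the path (illustrative)
SIGMA_LOCAL = 0.05     # 1-sigma relative delay variation per gate (illustrative)
NOMINAL_DELAY = 10.0   # nominal delay per stage, ps (illustrative)

def path_delay():
    # Each device gets its own independent local-variation draw.
    return sum(NOMINAL_DELAY * (1.0 + random.gauss(0.0, SIGMA_LOCAL))
               for _ in range(N_STAGES))

samples = [path_delay() for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

# Independent per-gate variation averages out along the path: the relative
# sigma of the path is ~SIGMA_LOCAL / sqrt(N_STAGES), not SIGMA_LOCAL itself.
print(std / mean)  # ≈ 0.05 / sqrt(20) ≈ 0.011
```

A global corner, by contrast, moves every device the same way, so it would predict the full 5% spread on the path — overstating the random-local component by a factor of sqrt(20) here.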

SE: That's a good point. If the tools aren't there to give design teams enough information, or a way to figure out any of these pieces — such as the localized aspect — will that limit development?

Crosher: Because we're on the silicon side — we are putting monitors within silicon — there's always this desire from customers to have smaller monitors with less impact on silicon area, so they can place more of them. Ideally they'd have an infinite number of monitors that are infinitely accurate and draw no power at all, but obviously you have to work back from there. It's only then that you start to see what the effects are, and only then that you can start to optimize every chip for the way it was made — and they are all made differently, every single one. You then have the ability to find out how that chip is sensitive, and how it can best perform for its supply condition, for a certain clock frequency, for a certain temperature. But we're not there yet. That's from a silicon point of view rather than a simulation point of view.

Rey: The one comment I will add is that on the process side, one of the first things the foundries focus on — after they define the electrical characteristics they need to achieve with the circuit — is precisely variability: what level of variability they are going to be able to allow. One of the areas where we work is everything related to photolithography and computational lithography, in order to do optical proximity correction and other things, and early on the foundries determine the variability budget that will be allowed for every aspect of the dimensions that ultimately get printed. They have the same focus, and those budgets get smaller and smaller, node after node.



Mehmet Cirit says:

If the problem is misidentified, I suppose it can never be fixed.

The problem is wrong global corners. It is easy to see why. For an inverter at normal threshold and supply, the 1-sigma value of relative delay variation due to local variation is ~5%. Depending on the supply and threshold, it may double or triple. Over a path of 3 to 25 inverters, its relative effect on the path delay is reduced by a factor of sqrt(3) to 5. If you use 3-sigma local variation, its impact on the path delay will be ~8% to 3%, depending on the length of the path. Let us assume 10%. If this variation happens on top of the 3-sigma global corner, it may impact the timing of chips between 2.7 and 3 sigma. The probability that a chip falls in this region is 0.002. Assuming half of them violate 3-sigma timing, about 1 in 1,000 is the worst-case yield loss. If I had allowed 20% variation due to local variations, the worst-case loss would be 34 chips in 10,000. The fact is, the global 3-sigma region is very thinly populated, and variation of timing in that region will have little impact on yield.

If the global corner is not really at 3 sigma, it will cover only a fraction of the chips. My calculations show that one popular fab's 3-sigma corner actually covers only 20% of the chips. There is no reliable way to correct that by derating or different PVT combinations. When there are 10-15 variables controlling the process, guessing the global corner as a method may be only somewhat better than throwing dice. Using the 3-sigma values of these 10-15 variables for corner estimation is a very crude method.
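Cirit's arithmetic can be checked directly against the standard normal CDF. A quick sketch — the 5% per-gate sigma and the mapping of a 10% or 20% local margin onto the tail of the 3-sigma global corner follow his comment, while the function names and the half-violate assumption as coded are mine:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def path_impact(n_stages, per_gate_sigma=0.05):
    # 3-sigma local delay variation, averaged over a path of n_stages gates.
    return 3.0 * per_gate_sigma / sqrt(n_stages)

def worst_case_loss(local_frac, sigma=3.0):
    # Chips whose global timing sits between (1 - local_frac)*sigma and sigma
    # are at risk; per the comment, assume half of those violate timing.
    at_risk = phi(sigma) - phi(sigma * (1.0 - local_frac))
    return at_risk / 2.0

print(path_impact(3))         # ~0.087  → the "~8%" figure
print(path_impact(25))        # 0.03    → the "3%" figure
print(worst_case_loss(0.10))  # ~0.00106 → "about 1 in 1000"
print(worst_case_loss(0.20))  # ~0.0034  → "34 chips in 10000"
```

The numbers reproduce the comment's figures, which rest on the tail being thinly populated: the entire 2.7-to-3-sigma band holds only ~0.2% of all chips.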

Virat N says:

I agree with Mehmet on global corner creation. Given that there are far fewer samples around these created digital corners, 3-sigma isn't enough for local corners.

Also, it would be good to see a quantified study of the effect of local variation on power and noise models for the digital implementation flow.


Mehmet Cirit says:

Let me clarify a little bit more. Statistically, local variation around what is supposed to be a 3-sigma global corner should have very little impact on yield. It certainly impacts the bell curve, but since it is more or less symmetrical, it is equally likely to speed up or slow down chips around the 3-sigma global corner. Unless the global 3-sigma corners are misidentified — which I believe they are — local process variation is unlikely to cause any significant yield loss. The problem is in global process variation. If the global timing corners are wrong, nothing can correct it. If the global corners are defined properly, they should be able to corner local variations as well. Systematic variations, where all devices move in the same direction, can corner any random local variation, including power and noise issues.
