Reliability Concerns Shift Left Into Chip Design

Goals include offsetting rising manufacturing costs and limiting liability over longer lifetimes.

Demand for lower defect rates and higher yields is increasing, in part because chips are now being used for safety- and mission-critical applications, and in part because it’s a way of offsetting rising design and manufacturing costs.

What’s changed is the new emphasis on solving these problems in the initial design. In the past, defectivity and yield were considered problems for the fab. Restrictive design rules (RDRs) were implemented to ensure chips were designed in ways that could be successfully manufactured. But several things have fundamentally changed since then:

  • RDRs add too much margin into designs, particularly at advanced nodes. That negatively impacts performance, power and area.
  • More chips are being customized for specific applications, frequently utilizing some type of advanced packaging, different kinds of processors and memories, and unique architectures that have not been produced in volume in the past — and therefore have not been perfected.
  • Longer life expectancy for chips in some applications means that latent defects that were not a problem in smartphones now can require costly recalls. As a result, design teams are starting to include sensors in designs to determine how a chip is behaving, from the moment it is turned on until its anticipated end of life.

Today, devices and the products they go into may have very different requirements. A chip used in the drive train of an automobile, for example, has entirely different stresses and expectations than one produced for an IoT consumer device. Design teams need to understand how these chips will behave over time, from environmental and use conditions, to aging, electro-thermal, stress, and variation effects.

“Reliability is one of the most important topics that needs to be addressed in today’s circuit design and simulation,” according to Ahmed Ramadan Hassan, product engineering director for ICVS AMS Verification at Siemens EDA. “The products that we have today might not be functioning the same after 2, 5 or 10 years. If you have a processor today that is operating at a certain frequency, you might be expecting the frequency is going to drop after 5 years or more, due to the applied stress on each and every device in the circuit. This stress in terms of bias or temperature is going to degrade the overall performance of that specific device, which is an element in the bigger design. And accordingly, the function of the design might not be performing what it was intended to do, or it will be degraded from what it was intended to perform.”

Designers now must account for reliability in their circuit design and verification, effectively shifting concerns about defectivity, yield, and manufacturability to the left in the design-through-manufacturing flow, and extending them to the right into the field.

“In the past, with the absence of good reliability analysis and simulation techniques and reliability models, designers would over-design, leaving a lot of margin on the table,” Hassan said. “They would add a lot of guard-banding to their design to make sure it’s not going to fail, at least during the warranty period of that product.”

The shift is significant, but it needs to be viewed from a higher level to see just how all-encompassing it really is. “For years, we as an industry have been trying to make better, faster, newer chips,” said Aleksandar Mijatovic, formerly the digital design manager at Vtool.[1] “With that comes a whole set of problems, which arrive when we pull the technology to its borderline. Sometimes it breaks the border and goes to where it was not supposed to. This means if you’re trying to use max frequency to do the maximum density on a chip, it is likely that, working on the edge of the possibilities of a given technology, you are going to break them sometimes. But on the other hand, it’s not the engineers’ fault. We all know that, but the market is demanding better, newer, faster.”

The burden of economics is shifting left, as well. While this dynamic always has been present to some extent, it has come into much tighter focus as chipmakers strive to contain costs.


Fig. 1: Rising SoC costs per process node in millions of dollars. Source: Cadence

“Some companies today are saying, ‘We do not want to have the very latest processes. We want it to be reliable. We do not want to replace the chips too often.’ This is nothing new particularly, it’s just that the focus has shifted,” Mijatovic said. “There are many companies that are doing their chips in very outdated technologies. The entire automotive chip manufacturing effort is done in obsolete technologies because they are good enough, they are proven, and there are not so many surprises there. By chasing the new and the best, we forgot that many times we do not need the latest process node. That it’s not actually required.”

These considerations are made even more complex in light of automotive, medical, industrial, and data center applications, where there is a perfect storm of rising chip costs, demand for longer lifetimes, prohibitive replacement costs, and potential liability if something goes wrong.

“When we started talking about having electronics in automotive applications, it became much more important to make sure that these kinds of failures are not going to happen, not even in a longer or a shorter period of time, and just making sure that it is accounted for,” Hassan said. “Also, it means a lot of guarding in design is taking place.”

At the same time, as increasing levels of autonomy are added to everything from automobiles to robots and drones, reliability has emerged as a top priority.

Security concerns
Closely tied with reliability is security, particularly when it comes to automotive, medical, industrial, and mil/aero applications.

Olivera Stojanovic, project manager at Vtool, recalled that the conclusion of a safety-related conference was that security may be even more important than safety if hackers can drive your car with you in it. “That’s when security becomes more important than safety.”

Mijatovic noted this was much less of a problem when few devices were connected to the internet. “It’s not only our PCs and phones, but also refrigerators, microwave ovens, and heating in our homes. We’re putting everything online.”

All of this increases the complexity of devices, which in turn requires more verification and better compatibility.

“From the design verification perspective, you can consider every additional requirement you make as an additional layer in your specification,” Mijatovic said. “The specification does not mean the device must only perform the function. It needs to do so reliably. It needs to be accurate. It needs to be secure. All of those can be defined as functionalities, and will be implemented as functionalities, when it comes to that end. You will use architectures that are less error- or less hacking-prone, and you will do security checks. In the end, it applies to protocol, and the security or reliability concepts that were in mind from the start. This is driving another set of architectural approaches, and it will cost more effort on every side.”

Ongoing monitoring
These reliability concerns reach well beyond just automotive. “We have started to see that for other applications, design for reliability and reliability verification are becoming more and more important,” said Hassan. “We have seen a lot of EDA vendors work with groups like the Compact Model Coalition to address this reliability from a simulation and modeling point of view that will be needed for circuit design.”

The Compact Model Coalition has developed a standard interface for aging simulation called the Open Model Interface, which creates a way for foundries, or various groups in any design house, to integrate aging models for effects like mechanical degradation and mechanisms like hot carrier injection (HCI) or negative bias temperature instability (NBTI) inside that interface. It also enables them to run simulations with EDA tools and capture the behavior of the design after 5 or 10 years, or for the intended lifetime of that product.
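
As a rough sketch of what such an aging model captures (this is not the actual Open Model Interface API, and every constant here is an invented placeholder rather than a foundry-calibrated value), degradation mechanisms such as NBTI are often approximated as a power-law shift in threshold voltage with stress voltage, temperature, and time:

```python
import math

# Illustrative NBTI-style model: delta_Vth = A * exp(-Ea/kT) * V_stress**gamma * t**n
# All constants are placeholders for illustration, not foundry-calibrated values.
K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def nbti_vth_shift(t_years, v_stress, temp_c,
                   a=5e-3, ea=0.1, gamma=3.0, n=0.16):
    """Approximate threshold-voltage shift (V) after t_years of DC stress."""
    t_sec = t_years * 365 * 24 * 3600
    temp_k = temp_c + 273.15
    return a * math.exp(-ea / (K_BOLTZMANN_EV * temp_k)) * (v_stress ** gamma) * (t_sec ** n)

if __name__ == "__main__":
    for years in (2, 5, 10):
        dvth = nbti_vth_shift(years, v_stress=0.8, temp_c=125)
        print(f"{years:>2} years @ 0.8 V, 125 degC: delta_Vth ~ {dvth * 1000:.1f} mV")
```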

“This work is happening with the intention that by running this kind of analysis, designers will not need to overdesign,” Hassan said. “With this aging simulation, they now can see and predict the behavior of their design after a certain number of years, and can push their design to the limit to gain performance without leaving margin on the table. They actually can add some compensation techniques on their circuits, on their designs, as they start to use it.”

Some of the techniques used include creating on-chip monitors and sensors to detect any degradation of device performance during operation. With that sensing, compensation can be applied to counteract the degradation and avoid overall performance loss in the design.
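
A minimal sketch of how such a monitor-and-compensate loop might work is shown below. The ring-oscillator monitor, degradation rate, thresholds, and supply-voltage knob are all hypothetical, not taken from any particular product:

```python
from dataclasses import dataclass

@dataclass
class AgedRingOscillator:
    """Hypothetical on-chip monitor: frequency drops as the circuit ages."""
    fresh_freq_mhz: float = 500.0
    degradation_per_year: float = 0.01  # 1% frequency loss per year (illustrative)

    def read_frequency(self, age_years: float, vdd: float, nominal_vdd: float = 0.8) -> float:
        aged = self.fresh_freq_mhz * (1.0 - self.degradation_per_year * age_years)
        # First-order assumption: frequency scales roughly linearly with supply voltage.
        return aged * (vdd / nominal_vdd)

def compensate(monitor: AgedRingOscillator, age_years: float,
               target_mhz: float = 495.0, vdd: float = 0.8,
               step_v: float = 0.005, vdd_max: float = 0.9) -> float:
    """Raise VDD in small steps until the monitor frequency recovers (or the cap is hit)."""
    while monitor.read_frequency(age_years, vdd) < target_mhz and vdd + step_v <= vdd_max:
        vdd += step_v
    return vdd

if __name__ == "__main__":
    ro = AgedRingOscillator()
    for years in (0, 5, 10):
        vdd = compensate(ro, years)
        print(f"year {years:>2}: VDD set to {vdd:.3f} V, "
              f"monitor reads {ro.read_frequency(years, vdd):.1f} MHz")
```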

Additional monitors add to the area of the device and the product it goes into, and they may consume extra power or impact performance. But in certain situations, having such monitors and compensation techniques can ensure that corrective action takes place without injury or disruption of service.

Aging and stress
While aging and various types of stress — mechanical, electrical, thermal — are inescapable, being able to predict those effects can have a significant impact on how long a device performs to spec. One of the key elements to making these determinations is understanding the environment in which a chip will be used.

“Automotive is the traditional space where we get into analysis, such as how we can model those stress environments, and how we can give the design engineer confidence that their part will function 15 years into the future,” said Brandon Bautz, senior product management director in the Digital & Signoff Group at Cadence. “In a car I need my device to operate for 10 years, but I need it to only consume so much power because otherwise my electric car is not going to go as far. There is a balance between reliability and the performance of a part. How do I get more accurate analysis so that I can have a clearer picture of the performance of my part versus the necessary reliability? Aging analysis, in particular, from a digital perspective has been around for a while, and we’ve found that that has been a pessimistic view of things. But given the tools that we had 10 years ago, or even 5 years ago, that’s what we needed to do at the time.”

However, given that the automotive industry depends on silicon so much, there are a lot of newer areas where high reliability is sought for cost reasons, as well. “You can make a part extremely reliable, but it may not perform the way you need it to perform,” said Bautz. “These tradeoffs with cost, performance, area, and risk are just getting more intense because the parts themselves are more complicated. As a result, the types of analyses that need to be undertaken are more complicated. Guard-banding and making sure things are reliable is good, but based on some of the research we’ve done, and the improvements that we’ve made in the characterization and analysis algorithms, we’ve shown a percentage of margin that customers have been leaving on the table as a result of these older methodologies. With more accurate analysis, design teams will be able to balance both reliability and performance.”

This opens the door to more contextual analysis, which in turn can have a big impact on reliability.

“Where we start is with an understanding of how things have been done for the past two decades, and the recognition that the limited compute power from 10 years ago couldn’t really capture the true nature of the problem. In this case we mean aging, and the stress dependency of that effect on aging,” Bautz said. “By putting two of the pieces together in digital analysis, the process of characterization captures the performance of the device at a cell level. Then we look at the design level, and observe the particular cell and device performance in the context of the design. If characterization is merged with timing analysis to provide the designer that accuracy, and more specifically provide the designer insight into how their circuit will work in the context of the overall design, that’s a long way of saying by putting the circuit in context of the design we can analyze the actual stress of the device. And therefore, we can more accurately analyze the aging effects of that on the device and understand how it affects the device time overall.”
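
The practical payoff of that context is the derate applied to each cell. The sketch below is purely illustrative, with a made-up delay model and numbers rather than any vendor's characterization data: it compares a flat worst-case aging penalty against a penalty scaled to each cell's actual stress, which is where recovered margin comes from.

```python
# Illustrative comparison: a flat worst-case aging derate vs. a per-cell derate
# driven by how often each cell is actually under stress. The derate model and
# numbers are placeholders, not characterization data from any tool or foundry.

FLAT_WORST_CASE_DERATE = 1.10  # +10% delay after aging, applied to every cell

def stress_aware_derate(stress_fraction: float, max_penalty: float = 0.10) -> float:
    """Scale the aging penalty by the fraction of time the cell is actually stressed."""
    return 1.0 + max_penalty * stress_fraction

cells = [
    # (name, fresh delay in ps, fraction of time under stress)
    ("clk_buf", 42.0, 0.95),
    ("alu_nand", 35.0, 0.30),
    ("scan_mux", 50.0, 0.05),
]

path_flat = sum(d * FLAT_WORST_CASE_DERATE for _, d, _ in cells)
path_aware = sum(d * stress_aware_derate(p) for _, d, p in cells)
print(f"path delay, flat derate:  {path_flat:.1f} ps")
print(f"path delay, stress-aware: {path_aware:.1f} ps")
print(f"margin recovered:         {path_flat - path_aware:.1f} ps")
```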

Analog reliability concerns
And that’s just for digital designs. Analog design adds its own challenges.

Today, nearly all chips have analog content in them. “Even in ones with millions of gates, there’s still some amount of analog there, and that amount is increasing,” said Jay Madiraju, product management director in the Custom IC & PCB Group at Cadence. “One thing that analog design teams are concerned with is not just the functionality, i.e., the part or the block they’ve designed that’s going to interface with the huge piece of digital logic. They want to know if it is reliable.”

Reliability has multiple connotations on the analog side. “When you look at the classic bathtub curve, when do you call a product reliable? The notion of reliability, what does that actually entail? It entails whether it functions well over time,” Madiraju said. “That’s definitely something the analog people are concerned with. Then, how does the circuit function over time? It will get worse for sure. We all know that from years of experience. But how much worse? What does this look like specifically in regard to the carrier mobility, the threshold voltage, and other device characteristics that are fundamental for the overall circuit to function the way it should? How does it degrade over time, and how can I predict that before the part goes out?”

While aging analysis has been available in circuit simulation for a few decades, it has been improved over the past several years to include mission profiles.

“Before mission profiles, engineering teams simulated for the worst-case conditions,” he said. “‘This is going to be my worst case. This chip is going into a car. I’m going to assume that this car is going to be out there in 120° weather forever. How can I simulate for that condition?’ You’ve got to make that device reliable by assuming these worst-case conditions, but the unintended consequence of that is over-design, over-margin, and guard-band. You’re going to design so conservatively that the performance will suffer — performance from different aspects like speed, timing, and power leakage, all the different aspects of how the chip is supposed to behave. Mission profiles help address that so different conditions can be defined, including temperature, voltage, and other conditions over time. You can say there are some times that these parts have different stress modes, or under this operation, such as when it’s going through the calibration process, how much stress it’s going to undergo. Stress drives degradation. It looks different across different modes.”
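
As a hedged illustration of the difference, the sketch below accumulates a toy “effective stress” over a hypothetical automotive mission profile and compares it with assuming the worst-case condition for the entire lifetime. The phases, constants, and degradation model are invented for illustration only.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def stress_rate(temp_c, vdd, ea=0.1, gamma=3.0):
    """Arrhenius temperature term times a voltage power law (placeholder constants)."""
    return math.exp(-ea / (K_BOLTZMANN_EV * (temp_c + 273.15))) * vdd ** gamma

def degradation(effective_stress, a=0.01, n=0.16):
    """Toy degradation metric (arbitrary units), sub-linear in accumulated stress."""
    return a * effective_stress ** n

# Hypothetical 10-year automotive mission profile: (phase, temp degC, VDD, share of lifetime)
LIFETIME_SEC = 10 * 365 * 24 * 3600
profile = [
    ("parked, unpowered",   25.0, 0.00, 0.70),   # no electrical stress while off
    ("normal driving",      60.0, 0.80, 0.25),
    ("hot idle / traffic", 105.0, 0.80, 0.04),
    ("extreme heat soak",  125.0, 0.85, 0.01),
]

# Worst case applied to the whole lifetime vs. stress accumulated phase by phase.
worst_stress = stress_rate(125.0, 0.85) * LIFETIME_SEC
mission_stress = sum(stress_rate(t, v) * LIFETIME_SEC * share for _, t, v, share in profile)

d_worst, d_mission = degradation(worst_stress), degradation(mission_stress)
print(f"worst-case-only estimate: {d_worst:.3f}")
print(f"mission-profile estimate: {d_mission:.3f}  ({d_mission / d_worst:.0%} of worst case)")
```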

Another aspect is manufacturing reliability, and in the analog world that implies several things. “One is the degradation over time. Another is defects that occur in the manufacturing process and escape test, even though the part has come out and the initial testing has been done before releasing it to the OEMs. For example, in the automotive world, some of the parts escape those tests and the customer will see them. The automotive OEM will see those problems. That’s a big problem, and it’s an aspect of reliability that people are absolutely concerned about,” he explained.

This is where analog fault simulation comes in. It’s analogous to DFT on the digital side, where faults are injected during verification prior to a chip getting taped out. “You see which of the faults escape, which of them affect the output, which of them don’t affect the output, and then you try to get a coverage measurement. You’re exercising the design with various tests. Are my tests good enough? Am I catching all of these? What you want to see when you inject a fault is the wrong output. Ultimately, the goal of all this is to see when I exercise these circuits, using this set of tests, am I catching everything I need to so that when the part goes out the customers are not finding bugs? The manufacturing process will create problems. Are you testing for all those problems? That’s another aspect of reliability,” Madiraju said.
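
A toy version of that flow is sketched below, assuming a trivial resistive-divider “circuit,” an invented fault list, and a single tolerance-band test; real analog fault simulation injects defect models into SPICE netlists, but the bookkeeping of detected versus escaped faults and the resulting coverage number is the same idea.

```python
# Toy sketch of analog fault coverage: inject one fault at a time, re-evaluate the
# output, and count a fault as detected if the test sees the output leave its
# tolerance band. The circuit, fault list, and numbers are illustrative only.

VIN = 1.0
R1_NOM, R2_NOM = 10_000.0, 10_000.0   # ohms
TOLERANCE = 0.03                       # test flags the part if output moves > 3% of nominal

def divider_out(r1, r2):
    return VIN * r2 / (r1 + r2)

NOMINAL_OUT = divider_out(R1_NOM, R2_NOM)

def test_detects(r1, r2):
    """A single production test: flag the part if the output leaves the tolerance band."""
    return abs(divider_out(r1, r2) - NOMINAL_OUT) > TOLERANCE * NOMINAL_OUT

faults = {
    "R1 open (very large)":  (1e9, R2_NOM),
    "R1 short (very small)": (1.0, R2_NOM),
    "R2 drifted +10%":       (R1_NOM, R2_NOM * 1.10),
    "R2 drifted +2%":        (R1_NOM, R2_NOM * 1.02),  # small drift: likely a test escape
}

detected = {name: test_detects(r1, r2) for name, (r1, r2) in faults.items()}
for name, hit in detected.items():
    print(f"{'DETECTED' if hit else 'ESCAPED '}  {name}")
print(f"fault coverage: {sum(detected.values()) / len(detected):.0%}")
```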

Electro-thermal effects are another increasingly important aspect in the analog realm, and what has been missing from self-heating models is the impact of heat on adjacent or nearby devices. Capturing that requires electro-thermal simulation.

“Previously, engineering teams would do just thermal simulation, measure that propagation effect, and they would send that information back in terms of how it affects the power, back to the simulation, part of a unidirectional flow between electrical simulation, circuit simulation and thermal,” he said. “Now, it’s becoming clear that that’s not adequate for modern chips and high-voltage devices, and certainly those in the automotive world, and for industrial chips that are subjected to high voltage conditions. You need an integrated approach. That feedback effect needs to be modeled in a single simulation.”
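
The sketch below contrasts the two flows with an invented first-order model: leakage power grows with temperature and temperature grows with power, so the coupled loop iterates to a self-consistent operating point that the one-way hand-off underestimates. The coefficients are illustrative, not from any real device model.

```python
# Minimal sketch of the electro-thermal feedback loop described above: power rises
# with temperature (e.g., leakage), and temperature rises with power. A one-way flow
# evaluates each once; the coupled loop iterates until the answer stops moving.

AMBIENT_C = 85.0
THETA_JA = 20.0          # junction-to-ambient thermal resistance, degC per watt
P_BASE = 2.0             # watts of temperature-independent (dynamic) power
LEAKAGE_PER_DEGC = 0.01  # extra watts of leakage per degree above ambient

def power_at(temp_c):
    return P_BASE + LEAKAGE_PER_DEGC * (temp_c - AMBIENT_C)

def temperature_at(power_w):
    return AMBIENT_C + THETA_JA * power_w

# One-way ("unidirectional") estimate: electrical sim at ambient, thermal sim once.
p_oneway = power_at(AMBIENT_C)
t_oneway = temperature_at(p_oneway)

# Coupled estimate: iterate power <-> temperature to a self-consistent point.
temp = AMBIENT_C
for _ in range(100):
    new_temp = temperature_at(power_at(temp))
    if abs(new_temp - temp) < 0.01:
        break
    temp = new_temp

print(f"one-way: {p_oneway:.2f} W at {t_oneway:.1f} degC")
print(f"coupled: {power_at(temp):.2f} W at {temp:.1f} degC")
```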

Reliability and memory
Memory adds its own twist on reliability, because memory choices can affect everything from power to area. This is particularly evident with DRAM, where choosing high-bandwidth memory or GDDR can have a big impact on how the memory behaves over time in the context of other components.

“You’re going to have lower power with an HBM device, and fewer physical interfaces to deal with compared to DDR, GDDR, or LPDDR,” said Brett Murdock, product marketing manager for memory interface IP at Synopsys. “How you implement them physically on the SoC is the Wild West. You can do whatever you want. You can put a full linear PHY on the side of the die, you can wrap around a corner, you can fold it in on itself. There’s an untold number of ways you can implement that physical interface. But with HBM, you’re dropping down one HBM cube, and JEDEC has defined exactly what the bump map on that cube looks like. That means while there might be less flexibility as far as where the bumps are put, it equates to better predictability and reliability. There are a few different choices for the interposer and how to connect things together, but at the end of the day, if I look at GDDR, LPDDR, DDR, I can build a million different boards, connect them in a million different ways, resulting in a million different implementations, and a million different opportunities for somebody to mess something up. Whereas with HBM, you put in the PHY, you put in the device, and there’s not a lot of variability for how to place the interposer between those two. There will be minimum spacing rules between the SoC and the HBM device, and that’s pretty much it.”

Wherever possible, repeating what worked in the past can go a long way toward ensuring it will work in a new design. “One thing that contributes to reliability is how many times you’re doing something,” Murdock said. “The fact that we’re doing the same thing, or almost the same thing, for every customer means we’re really good at it. That it is tried and true. If I know it’s worked for AMD and millions of units that they ship, why would it be any different for this new AI customer that we’re selling HBM to for the first time? We’re not going to have to reinvent anything.”

Variation
Variation is another aspect that can affect reliability, and it’s particularly important to understand its impact at advanced nodes and in advanced packaging. There are many different causes of variation, from contaminants in materials and leftover particles from CMP, to die shift during packaging and inconsistencies in lithography. At what point these create defects, and how to account for them in the design phase, remains a challenge.

“Design teams are realizing they need to do something about the variation in their design,” said Sathishkumar Balasubramanian, PLM software head of products for AMS verification at Siemens EDA. “People talk about different concepts around this, including robustness and reliability, all of which at the end of the day mean the same thing, which is that the customer expects their device to work wherever they put the end product, and for as long as they reasonably want it to work.”

This is so critical that variation is starting to be included as a high-sigma requirement and made part of the flow, starting early in the design with library components, Balasubramanian said. “They want to make sure that the components are robust. For example, for a given standard cell library on a particular process, they want to know that it satisfies all of the different PVTs, and a wider range, and still meets 3 to 7 sigma.”
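
As a simple illustration (not a real high-sigma methodology, which relies on techniques such as importance sampling because brute-force sampling would need billions of runs to observe a 6-sigma event), a Monte Carlo sweep across a few assumed PVT corners can at least report how many sigmas of margin a cell delay has to a spec. The delay model, corners, and numbers below are made up.

```python
import random
import statistics

random.seed(0)
SPEC_PS = 60.0      # assumed timing spec for the cell
SAMPLES = 20_000    # plain Monte Carlo sample count (far too small for true high sigma)

def cell_delay_ps(nominal, sigma_pct):
    """Toy delay model: Gaussian local variation around a corner-dependent nominal."""
    return random.gauss(nominal, nominal * sigma_pct)

corners = {
    # corner name: (nominal delay in ps, relative sigma of local variation)
    "SS / 0.72V / 125C": (52.0, 0.030),
    "TT / 0.80V /  25C": (45.0, 0.025),
    "FF / 0.88V / -40C": (40.0, 0.020),
}

for name, (nominal, sigma_pct) in corners.items():
    delays = [cell_delay_ps(nominal, sigma_pct) for _ in range(SAMPLES)]
    mu, sd = statistics.fmean(delays), statistics.stdev(delays)
    margin_sigma = (SPEC_PS - mu) / sd
    print(f"{name}: mean {mu:.1f} ps, sigma {sd:.2f} ps, "
          f"margin to {SPEC_PS:.0f} ps spec ~ {margin_sigma:.1f} sigma")
```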

Conclusion
Putting all these pieces together at the far left side of the design-through-manufacturing flow is a complex undertaking. In effect, fixing these problems in manufacturing is no longer sufficient. They now have to be addressed much earlier, which means design teams are wrestling with concepts that typically were reserved for process engineers, while process engineers feed data back to EDA vendors to make adjustments in the tools, along with a wish list for new capabilities.

Reliability is now a universal challenge, and it’s one that from here on will require diligence on the part of the entire supply chain, from initial design to monitoring of products in the field.

[1] Aleksandar Mijatovic left Vtool after this interview was completed.

