Finding root causes of problems shifts left, fueled by better data and improvements in tools.
The front end of design is becoming more tightly integrated with the back end of manufacturing, driven by the rising cost and impact of failures in advanced chips and critical applications.
Ironically, the starting point for this shift is failure analysis (FA), which typically happens when a device fails to yield, or worse, when it is returned due to some problem. In production, that leads to an expensive and destructive offline examination of a troublesome wafer or individual die. For a returned device, it results in a time-consuming search to determine whether the failure was an isolated instance or the first occurrence of a systemic issue.
“Failure analysis typically happens a while after the design has been taped out, and can lag design work by several months and sometimes years,” said Jayant D’Souza, principal product manager at Siemens EDA. “Parts can fail during production manufacturing test, and the failing parts are analyzed and root-caused so immediate action can be taken to improve yield. RMA parts that have eventually failed in the field also go through the FA process to determine the point of failure and mechanism.”
The potential causes are varied and growing. Something may have gone wrong somewhere in the production process, whether it’s a nanometer difference in deposition or die shift in packaging. Or it may have been designed wrong from the start, not taking into account all the potential interactions for a particular application or use case.
“The general assumption in a crisis situation is that everyone is guilty until proven innocent,” noted Rob Aitken, distinguished architect at Synopsys. “You have to say, ‘What is the nature of this? Does it repeat? How often does it show up? Does it show up in exactly the same place? Does it always show up on exactly the same cell? Does it always show up in exactly the same timing corner?’ It’s the essential search of what is a design problem versus a manufacturing problem. There’s often a gray area right at the boundary of process variability.”
Patience and tenacity are essential, given the dimensions of the challenge. “The first thing a designer does for FA is to isolate the failure to a specific part of the system,” said Ashraf Takla, CEO of Mixel. “Is the IC set up in the appropriate mode of operation, and is it getting all the correct inputs? If so, what BiST modes can I exercise to isolate the problem to a specific block? You just go down the hierarchy to isolate the problem. It might not be a specific block, but maybe it’s the buffering in between. Once you get to that point, you can use simulations to try to duplicate the problem. Does the problem only occur in a particular PVT corner? What is the yield fallout because of this specific failure? How repeatable is the failure? All those questions are relevant to help diagnose and zero in on the root cause. Using the 5 Whys Analysis technique can be helpful. You can’t be confident in the solution to the failure without being able to simulate both the failure and the fix.”
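As a rough illustration of that isolation flow, the sketch below walks a hypothetical design hierarchy, exercises a placeholder BIST hook on each block across a handful of PVT corners, and records which combinations fail. The hierarchy, corner names, and run_bist() function are stand-ins for whatever bench or tester interface is actually available, not a real API.

```python
# Rough sketch of hierarchical failure isolation: run BIST per block across PVT
# corners and record which (block, corner) pairs fail. Everything here is a
# placeholder -- the hierarchy, corner names, and run_bist() are not a real API.
HIERARCHY = {
    "top": ["serdes", "pll", "dsp_core"],
    "serdes": ["tx", "rx", "buffer"],
}
PVT_CORNERS = ["ss_0.72V_125C", "tt_0.80V_25C", "ff_0.88V_-40C"]

# Stand-in for bench/ATE results; a failing sub-block also fails its parent.
OBSERVED_FAILS = {("serdes", "ss_0.72V_125C"), ("rx", "ss_0.72V_125C")}

def run_bist(block: str, corner: str) -> bool:
    """Placeholder for a real BIST hook: True means the block passed."""
    return (block, corner) not in OBSERVED_FAILS

def isolate(block: str = "top"):
    """Depth-first walk of the hierarchy, logging failing (block, corner) pairs."""
    failures = []
    for child in HIERARCHY.get(block, []):
        failing_corners = [c for c in PVT_CORNERS if not run_bist(child, c)]
        failures.extend((child, c) for c in failing_corners)
        if failing_corners:
            failures.extend(isolate(child))  # descend into the failing block
    return failures

# A failure seen only at the slow, low-voltage corner points toward timing or
# margin; a repeatable failure at every corner points toward a hard defect.
print(isolate())  # -> [('serdes', 'ss_0.72V_125C'), ('rx', 'ss_0.72V_125C')]
```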
Design has been playing an essential role in failure analysis for some time. Much of FA couldn’t even be done without design for test (DFT) techniques, which cover what to test, when to test it, and the best and most cost-effective way to achieve sufficient coverage. But the emphasis on avoiding potential problems starting with initial design has been increasing, and so has the focus on catching potential problems from initial layout all the way through manufacturing — a problem made increasingly difficult by the fact that there are more processes at each foundry, and widening differences between similar processes across different foundries.
“Design is incredibly important in enabling failure analysis and yield improvement,” said Matt Knowles, product management director for hardware analytics and test products at Synopsys’ EDA Group. “DFT has to be put into the chip as it’s being designed in order to have the scan test and any diagnosis at all. Where you put that in the design can happen at RTL or at the gate level. What we’re seeing is people want to get it as far upstream as possible, so that they can optimize it at the RTL level and avoid some iterative cycles.”
Fig. 1: Yield optimization flow through failure analysis. Source: Synopsys
The number of failures can be reduced, and designing redundancy and resiliency into a chip, package, or system can cut the number of returned parts and increase yield.
“The root cause is often used to improve reliability,” said Siemens’ D’Souza. “Volume scan diagnosis and machine-learning based root cause deconvolution (RCD) are highly effective for FA from a design POV. The reason is that this technique uses design collateral to localize defects for FA. Using this approach reduces the time to perform FA by an order of magnitude (by weeks in some cases). There have been several advances in scan-based diagnosis including chain diagnosis to significantly improve the resolution of diagnosis results especially for failing scan chains. Additionally, volume scan diagnosis with RCD can be used to build a root-cause pareto that can accurately estimate the root causes for FA on a large population. The best benefit to the user is when volume scan diagnosis and RCD results are generated for all failing parts over the entire lifecycle of the production. This gives the user the ability to trend the defect Pareto charts over time, enabling early course correction in case of wafer excursions.”
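To make the Pareto idea concrete, here is a minimal sketch of how volume diagnosis results might be aggregated and trended over time. The record format and root-cause names are purely illustrative and do not reflect the output of any particular diagnosis tool.

```python
# Hypothetical sketch of building and trending a root-cause Pareto from volume
# scan-diagnosis results. The record format is illustrative only.
from collections import Counter, defaultdict

diagnosis_records = [
    # (wafer_week, root_cause) pairs, e.g. as deconvolved by an RCD step
    ("2024-W10", "M2 open"), ("2024-W10", "via12 resistive"),
    ("2024-W10", "M2 open"), ("2024-W11", "M2 open"),
    ("2024-W11", "cell internal"), ("2024-W12", "via12 resistive"),
]

def pareto(records):
    """Overall Pareto: root causes ranked by how many failing parts they explain."""
    return Counter(cause for _, cause in records).most_common()

def weekly_trend(records):
    """Per-week Paretos, useful for spotting an excursion early."""
    by_week = defaultdict(Counter)
    for week, cause in records:
        by_week[week][cause] += 1
    return {week: counts.most_common() for week, counts in sorted(by_week.items())}

print(pareto(diagnosis_records))       # which root causes dominate overall
print(weekly_trend(diagnosis_records)) # how the mix shifts week to week
```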
Failures will still happen, however, and in the middle of an ongoing failure analysis of a complex device, the search for the root cause can swing the investigation in many directions.
“In that case, there are tools that can help,” Aitken said. “You can effectively take silicon measurements on the die to get a handle on what the SPICE model for that individual die would be, feed that back into static timing, and figure out for this particular chip, ‘Here’s where the likely fail paths are,’ as opposed to, ‘Here’s your general sign-off set.’ Essentially, where in the process space is this chip and what implications does that have?”
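A toy version of that feedback loop, with made-up numbers: use an on-die measurement such as a ring-oscillator frequency to estimate where in process space a particular die sits, then re-run timing at the nearest characterized corner rather than the generic sign-off set. The corner names and frequencies below are invented for illustration.

```python
# Hypothetical sketch: place an individual die in process space from an on-die
# ring-oscillator reading, then pick the closest characterized corner for a
# die-specific timing re-run. Frequencies and corner names are made up.
CORNER_RO_MHZ = {          # simulated ring-oscillator frequency per corner
    "ss_0.72V_125C": 410.0,
    "tt_0.80V_25C": 520.0,
    "ff_0.88V_-40C": 650.0,
}

def nearest_corner(measured_mhz: float) -> str:
    """Return the characterized corner whose RO frequency is closest to the die's."""
    return min(CORNER_RO_MHZ, key=lambda c: abs(CORNER_RO_MHZ[c] - measured_mhz))

die_ro_mhz = 438.0                     # measurement from the failing die
corner = nearest_corner(die_ro_mhz)    # -> "ss_0.72V_125C"
print(f"Re-run static timing at {corner} to rank likely fail paths for this die")
```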
The tools are getting better, too. DFT has been evolving for some time, and it is becoming more customizable and less static. There are papers going back at least a decade looking at the concept of design for diagnosis, and much of that now has been incorporated into the DFT process. What’s changed is there is far more data flowing from the fab into EDA to prevent these issues, allowing chipmakers to develop plans about where and when to test, and how often. In some cases, this now includes in-chip and in-package monitors, which feed data back to the chipmaker that can be used to prevent problems in the future.
“In the past, there wasn’t a lot of communication, especially in the fabless foundry model, on how to improve the design upstream so that you get better diagnostics downstream,” Knowles said. “If you get really poor diagnostics data that’s very noisy, why is that? Nobody knows, and there’s no action you can take. So, there’s a concept called ‘test points’ in design for test that makes it more testable. People are thinking about how to do things that are designed to make it more diagnosable.”
Types of failures
Pinpointing the source of a problem is as complicated as the design itself.
“Most of the time, unless the process is not mature, you start with the design and then try to figure out if the failure is process related,” said Mixel’s Takla. “Correlation between the simulation results and the measured results is highly dependent on the accuracy of the SPICE models. As an example, sometimes you run into inaccuracy in modeling of leakage current at high temperature in corner cases.”
The good news is there are a lot of processes that can simulate and catch problems before they make it out of design into production. The bad news is there will still be escapes. But not all failures are equal.
“There’s really two kinds of failures,” noted Marc Swinnen, director of product marketing for the semiconductor division at Ansys. “There’s a hard failure in which a chip just doesn’t work — wires short, it burns up, or otherwise. What is far more common are soft failures. A soft failure is more the designer’s responsibility. A soft failure means the chip works, but it doesn’t run at the speed you want, or the bandpass filter doesn’t recognize the frequencies you wanted, or the latency is too long. So it’s not that it doesn’t work. It just doesn’t work as well or within the parameters that you designed for.”
For example, chips may come back with a lower maximum frequency than the spec calls for due to unanticipated voltage drop. “There’s a certain combination of switching events, where if you have enough neighbors switching around a certain victim cell, that victim cell will see its local voltage take a significant drop and be slower,” Swinnen said. “You’re expecting that path to be performing at top speed, but in fact it can’t because every time it tries to, there are neighboring cells switching at the same time. That drags down its voltage and slows it down.”
A chip will still work with a soft failure. It just doesn’t work at the speed you want. With billions of possible interactions, designers are turning to multi-physics simulations to help anticipate the problems.
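A first-order, back-of-the-envelope view of how local droop becomes a soft failure is sketched below using the alpha-power delay approximation. The supply, threshold, exponent, and droop values are illustrative, not taken from any PDK.

```python
# First-order illustration of how local voltage droop from neighbor switching
# turns into a soft (speed) failure. Uses the alpha-power delay approximation
# delay ~ V / (V - Vt)**alpha; all numbers are illustrative, not from any PDK.
VDD_NOM = 0.80     # nominal supply (V)
VTH = 0.35         # effective threshold voltage (V)
ALPHA = 1.3        # velocity-saturation exponent

def rel_delay(vdd: float) -> float:
    """Gate delay relative to nominal supply, per the alpha-power model."""
    d = lambda v: v / (v - VTH) ** ALPHA
    return d(vdd) / d(VDD_NOM)

droop = 0.06                               # 60 mV local droop from aggressors
slowdown = rel_delay(VDD_NOM - droop)      # roughly 1.1x on this path
print(f"Path is ~{(slowdown - 1) * 100:.0f}% slower under droop; a path signed "
      f"off with less margin than that misses its target frequency")
```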
Those interactions can raise the hunt from an individual chip to the system level, particularly as leading-edge designs decompose functions in an SoC into chips or chiplets in an advanced package.
“One of the key platforms is system design and analysis, and simulation of systems,” said Anirudh Devgan, president and CEO of Cadence, during a recent presentation. “This whole analysis of power and thermal is going to become super-critical. Whether it’s 3D-ICs, data centers or electric cars, simulation coupled with the merger of semiconductors and systems requires the analysis of them together.”
This is why all of the major EDA players are investing heavily in reinforcement learning. One of the biggest challenges today is more functionality coupled with more customization, and all of that needs to be architected, floor-planned, routed, and verified with design teams that are not getting any larger. The only way to accomplish that without creating massive numbers of new failures is with machine learning.
“Reinforcement learning is the next model for re-use,” Devgan said in an interview. “All the data is there in a company. This allows you to mine your own data. It’s a different way of looking at the problem. Data science is all about looking outside-in. Physics is looking inside-out.”
The new approach is to leverage what is known to work, to ensure it works in the context of a system in which it is being used, and to provide sufficient coverage for whatever is new and different. That will help reduce failures, which can be costly — particularly in markets such as automotive.
Even failures caught in manufacturing come at a high price. Losing a spot in a foundry production queue can result in missed deadlines, which can damage customer relationships and result in supply chain glitches further down the line.
Foundries guard against hard failures with extensive design rules, which describe the parameters for a successful design. Those rules include details as basic as how much space there should be between wires. But the rule deck for foundries has been growing at each new process node, and with different advanced packaging options. There are more corner cases to worry about, and more options to create new ones. At the same time, there is less margin to work with than in the past, because that margin can reduce performance and increase power.
So while foundries still utilize design rule checks (DRC) and may insist on layout versus schematic (LVS) checks to ensure the netlist extracted from the final layout matches the original schematic, they also have pushed the reliability problem further left in the flow — along with much of the liability in case a chip or system doesn’t work.
In the past, Swinnen noted, “once the foundry was satisfied, they would accept the design, and if something went wrong it was their fault because they had agreed it was manufacturable.” That is no longer the case. “To a degree, the foundry’s attitude is, ‘That’s your problem.’ If you put in an extra wire, they’ll happily print it. It’s your responsibility not to have done that. But foundries have an interest in making sure their customers are successful, so they also include LVS in sign-off.”
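At its simplest, a design rule is just a geometric constraint that can be checked mechanically. The toy check below flags same-layer shapes that violate a made-up minimum-spacing rule; production rule decks contain thousands of far more intricate checks, and the layers, numbers, and geometry here are invented.

```python
# Toy illustration of the kind of rule a foundry deck encodes: a minimum-spacing
# check between rectangles on the same metal layer. All values are made up.
from itertools import combinations

MIN_SPACING_NM = {"M1": 40, "M2": 48}     # illustrative minimum spacing per layer

# Rectangles as (layer, x1, y1, x2, y2) in nanometers
shapes = [
    ("M1", 0, 0, 100, 20),
    ("M1", 0, 55, 100, 75),    # 35 nm from the shape above -> violates the 40 nm rule
    ("M2", 0, 0, 100, 20),
]

def spacing(a, b):
    """Edge-to-edge distance between two axis-aligned rectangles (0 if they touch)."""
    dx = max(b[1] - a[3], a[1] - b[3], 0)
    dy = max(b[2] - a[4], a[2] - b[4], 0)
    return (dx * dx + dy * dy) ** 0.5

violations = [
    (a, b) for a, b in combinations(shapes, 2)
    if a[0] == b[0] and spacing(a, b) < MIN_SPACING_NM[a[0]]
]
print(violations)   # the two M1 shapes are reported as a spacing violation
```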
While a failure is bad, trying too hard not to fail can be bad too. The military requires every chip to be tested, but many commercial production runs are satisfied with statistical averages. “If you want to be safe, you stay nicely within the statistical margin so that no matter the statistical fluctuation, you’re going to get a working part that is safe and secure. But it leaves a lot of performance on the table,” Swinnen explained. “You’re backing away from the edge so much that you actually couldn’t get higher speed, and your competitor might get that higher speed and higher performance you could have gotten if you just pushed a little bit.”
It’s a fine balance, which is often solved by binning, the industry practice of testing and separating chips according to their performance. But being too conservative risks putting too many lower-performing, lower-priced chips in the mix.
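That trade-off is easy to see in a toy binning example: the larger the guard band subtracted from each die's measured fmax, the more parts slide into lower-priced bins. The bin thresholds, guard bands, and measurements below are invented for illustration.

```python
# Simple illustration of speed binning and the cost of excess guard band.
# Bin thresholds, measured frequencies, and guard bands are all made up.
BINS = [(3.2, "premium"), (2.8, "standard"), (0.0, "value")]   # GHz floor, bin name

def bin_part(measured_ghz: float, guard_band_ghz: float) -> str:
    """Assign a bin based on measured fmax minus the guard band."""
    usable = measured_ghz - guard_band_ghz
    return next(name for floor, name in BINS if usable >= floor)

parts = [3.25, 3.31, 2.95, 3.22, 2.84]          # measured fmax per die (GHz)

for gb in (0.05, 0.20):                          # modest vs. conservative guard band
    counts = {}
    for f in parts:
        b = bin_part(f, gb)
        counts[b] = counts.get(b, 0) + 1
    print(f"guard band {gb:.2f} GHz -> {counts}")
# With the conservative guard band, no die qualifies for the premium bin.
```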
A collaborative role
Simulation has long been a critical component in preventing, or at least analyzing, failures. What’s changing is that those simulations are becoming larger, more complex, and more versatile. The ability to look across a design, and then to drill down deep into it, is now a requirement.
“There are some very common process signatures, like something that’s under-etched or over-etched,” said Knowles. “You can see that signature in a 3D view of the device. So one way design is now participating all the way down into root cause analysis is through 3D CAD visualizations of that particular device in that particular location. They have a defect image versus the simulation, and they can say absolutely, ‘That’s an etch that didn’t go all the way through,’ and provide that feedback to the foundry.”
Ultimately, the hope with automating tools and approaches is to enable faster, smoother communications and save time.
“Having a piece of silicon going through FA and looking for something takes a long time. You want to avoid doing that as much as possible,” said Aitken. “For example, if you go through the whole flow into a scanning electron microscope and say, ‘I found a problem on this particular gate’ when you run standard ATPG-type fault diagnostics, and it comes up with 10 more chips that have a problem on that specific gate somewhere else in the design, you can see it’s probably the same thing, check the box, and move on to a different problem. That ability to take these complicated, expensive pieces of physical failure analysis, and use the results elsewhere in the process on things with a shorter turnaround cycle to save you time and effort is a key part of making the whole ramp cycle effective.”
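The reuse Aitken describes can be as simple as cross-referencing diagnosis callouts across the failing population. The sketch below, with invented device and instance names, counts how many devices implicate the suspect that was confirmed by a single physical-FA result.

```python
# Hypothetical sketch of reusing one physical-FA result across a population:
# group ATPG diagnosis callouts by suspect instance, then find the devices whose
# suspects match a PFA-confirmed defect. Device and instance names are made up.
from collections import Counter

# device_id -> suspect instances reported by scan diagnosis
diagnosis = {
    "die_01": ["u_core/u_alu/g1234", "u_core/u_alu/g1188"],
    "die_02": ["u_core/u_alu/g1234"],
    "die_03": ["u_io/u_tx/g0071"],
    "die_04": ["u_core/u_alu/g1234", "u_mem/u_bist/g0442"],
}

pfa_confirmed = "u_core/u_alu/g1234"   # defect confirmed by SEM on one die

suspect_counts = Counter(s for suspects in diagnosis.values() for s in suspects)
matching = [d for d, suspects in diagnosis.items() if pfa_confirmed in suspects]

print(suspect_counts.most_common(3))   # suspects ranked by how many dies implicate them
print(f"{len(matching)} of {len(diagnosis)} failing dies share the confirmed "
      f"suspect and likely need no further physical FA")
```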
— Ed Sperling contributed to this report.