Making 3D Structures And Packages More Reliable

Challenges and solutions for finding defects in new structures and multi-chip integration.


The move to smaller vertical structures and complex packaging schemes is straining existing testing approaches, particularly in heterogeneous combinations on a single chip and in multi-die packages.

The complexity of these devices has exploded with the slowdown in scaling, as chipmakers turn to architectural solutions and new transistor structures rather than just relying on shrinking features to boost performance and reduce power. Increases in leakage current at each new node, the inability to scale wires, and an overall rise in the amount of data that needs to be processed have forced chipmakers to add different gate structures to their transistors and move certain components off chip, including analog blocks and some memories.

While all of this has been shown to work, several problems have cropped up. Among them:

  • Existing test and inspection methods are being strained, in part because there are simply more components and in part because existing tools can’t see everything in tightly packed 3D structures.
  • Vertical packaging options have made parts or even entire chips inaccessible to classic testing and inspection methods once they are bound together.
  • Some of the most advanced devices are expected to work for longer periods than in the past, with parts that are consistently in the fully “on” state expected to last decades rather than a couple of years of intermittent service. This is particularly true in safety-critical applications, where failures are potentially dangerous.

Solutions do exist for some of these issues, but none is perfect. In many cases they add time to the manufacturing process and increase the cost, which is already high. And even then, there are plenty of things that can go wrong. For example, a latent defect that never raised concern when advanced chips were in service for only a few years may cause problems when usage is extended beyond that, particularly in automotive applications where environmental conditions can be extreme. Moreover, when dies or wafers are bonded together, they can be damaged or warp, putting extreme stress on solder balls and various sizes of bumps.

In the past, foundries and OSATs solved these issues and boosted yield by limiting the number of options. But with advanced packaging, many of the solutions implemented to date have been highly customized.

“Each manufacturer seems to be using slightly different processes for advanced packaging, creating challenges — as well as opportunities — for suppliers of inspection equipment, like us,” said Subodh Kulkarni, CEO of CyberOptics. “Non-contact and non-destructive optical imaging always has been very important for inspection of semiconductor processes. However, with many advanced packaging processes like stacking or embedding, direct line of sight of important areas is lost once the process step is completed. So, many manufacturers are keen to do non-contact optical inspection in between processes while the opportunity exists, and correlating that to functional testing, defects and final yields to improve overall yields and productivity. Simulation and destructive testing continue to play their roles in advanced packaging, but with the value of advanced packages going up, destructive testing isn’t desired.”

Destructive testing involves kitchen-like approaches — putting chips in ovens to accelerate aging — as well as insertion of multiple probes into a device, selectively etching parts of the package and examining them under powerful microscopes to search for defects. But as geometries continue to shrink, not all of those defects can be spotted during manufacturing. And while modeling and simulation are important, they don’t always capture everything.

Normally those kinds of problems are identified over time and fed back into the manufacturing process. But with widespread customization at advanced nodes, often accompanied by advanced packaging, that kind of feedback loop is limited.

New approaches for new problems
This has prompted a number of new ways to track defects, both during manufacturing and after devices are in use.

For example, chip behavior data can be analyzed from a high level, and changes made based upon that behavior.

“One of the solutions for 2.5D packaging is to monitor the HBM interface,” said Evelyn Landman, CTO of proteanTecs. “With 2.5D, you may have chips made by different companies. That in itself increases variability and the risk of problems. On top of that, there are limits to what you can directly test, and there is no possibility after assembly. What companies need is a way to gain visibility into the assembled device and system, in order to detect marginal parameters, and that way avoid walking wounded and system failures in the field.”

What’s important to note, though, is that nothing happens immediately. Chips fail or degrade over time. By closely monitoring in-circuit activity on a device, particularly at the interface level, subtle changes can be detected that are not visible through external testing.

“If you’re heading toward a physical failure, you will be able to see signs of this if you check in periodically and monitor degradation over time,” Landman said. “With HBM, you may have degradation from the I/O, the interconnect, or the HBM DRAM itself. With on-chip deep-data monitoring, you can know in advance if there is weakness in one of the lanes and make a decision to shift the traffic to an available strong lane. That’s the true essence of in-field predictive maintenance.”
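The lane-monitoring idea can be sketched in a few lines. This is a hypothetical illustration, not proteanTecs’ actual interface: per-lane margin readings are sampled over time, a lane whose rolling average drops below a threshold is flagged as weak before it hard-fails, and its traffic is remapped to a spare.

```python
# Hypothetical sketch of in-field lane monitoring. Class names, the
# margin metric, and the thresholds are all illustrative assumptions.

from collections import deque
from statistics import mean

class LaneMonitor:
    """Tracks a rolling window of margin readings for one HBM lane."""
    def __init__(self, window=4, min_margin=0.25):
        self.readings = deque(maxlen=window)
        self.min_margin = min_margin  # below this average, the lane is "weak"

    def record(self, margin):
        self.readings.append(margin)

    def is_weak(self):
        # Flag the lane once its recent average margin drops below the
        # threshold -- i.e., on a degradation trend, not a hard failure.
        return (len(self.readings) == self.readings.maxlen
                and mean(self.readings) < self.min_margin)

def remap_traffic(lanes, spare_lanes):
    """Return a mapping from weak lanes to available spare lanes."""
    remap = {}
    spares = iter(spare_lanes)
    for lane_id, monitor in lanes.items():
        if monitor.is_weak():
            spare = next(spares, None)
            if spare is not None:
                remap[lane_id] = spare
    return remap
```

For example, a lane whose margin trends from 0.3 down to 0.18 would be flagged and remapped, while a lane holding above 0.25 would not. The point of the rolling window is the one Landman makes: the decision is based on degradation over time, not a single reading.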

Raising the abstraction level of data analysis is especially important in multi-chip packages and systems. While the defects themselves may be highly localized, they may not be recognized until one part of a system interacts with another. In addition, some defects may be subtle, showing up over time, such as drift in sensors.

“There’s a lot of talk about adaptive test in manufacturing, where one key input keeps changing,” said Michael Schuldenfrei, corporate technology fellow at OptimalPlus. “But instead of just information from the device being tested and manufactured, you want to look at all data holistically. That allows you to detect things like drift.”

Drift is a problem with all devices, but the time element makes it hard to pinpoint the cause up front. While drift can be modeled and observed, because it is typically associated with aging and stress, the typical solution is recalibration rather than finding the problem in design or in manufacturing.

“The lack of drift detection explains a lot of RMAs,” said Schuldenfrei. “You need real-time access to data from previous tests and current tests to determine what has changed.”

This also is where the idea of a digital twin fits in, providing a reference point against which to measure that drift. While the terminology is relatively new, the idea has been around for years. It is becoming more important, though, as the number of components increases. Understanding how things have changed requires an extra level of abstraction, as well as a highly detailed digital twin against which changes can be measured.
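Under that framing, drift detection reduces to comparing current measurements against a stored reference. A minimal sketch, assuming the “twin” is simply the mean and spread of a parameter from earlier test insertions (the function name and the 3-sigma limit are hypothetical):

```python
# Illustrative drift check against a historical baseline. The baseline
# stands in for the digital-twin reference described in the article.

from statistics import mean, stdev

def detect_drift(baseline, current, sigma_limit=3.0):
    """Return True if 'current' has drifted from 'baseline'.

    baseline: measurements of a parameter from previous tests (>= 2 values)
    current:  recent measurements of the same parameter
    Drift is declared when the current mean sits more than sigma_limit
    baseline standard deviations away from the baseline mean.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(current) != mu
    z = abs(mean(current) - mu) / sigma
    return z > sigma_limit
```

The same comparison can run across units at test or across time for one unit in the field. Note the division of labor the article implies: recalibration addresses the symptom, while the baseline comparison shows when the change began, which is what feeds back into design or manufacturing.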

This path isn’t as straightforward as it might appear, however. In many cases domain knowledge needs to be included to determine what needs immediate attention and what can be ignored. And in multi-die packages, it becomes even more complex due to siloed expertise. An analog expert, for example, typically has very little insight into what’s going on in the digital circuitry or the software.

“Experts are experts in a specific domain, like packaging,” said Andy Heinig, group manager for system integration in Fraunhofer’s Engineering of Adaptive Systems Division. “So they have to discuss this with others. With an SoC, it’s possible for one system architect to make decisions. With packaging, you’ve also got software and hardware. So now you have three people involved and you need to test different iterations, and there may be 10 options. The package expert may not have knowledge of software support.”

It becomes even trickier with AI, which is being added into many devices to optimize performance. “You see higher-level simulation, but now you’ve also added intelligence into the package,” said Heinig. “What does that mean for all of those different options?”

Machine learning and testing in context
The impact on detecting failures can be significant. But it’s one thing to pinpoint the failure. It’s quite another to figure out what caused it.

Machine learning can help in this regard, because it can identify patterns in large volumes of production and operational data. “In what appears to be noise to humans, machine learning can find signals,” said Jeff David, vice president of AI solutions at PDF Solutions. “One of the problems is that data is constantly drifting and shifting, so customer failure modes change from month to month. That requires tool recalibration. But you also may get a shift in the way you sense data. And then, what do you do if the data is corrupt or you don’t make the correct prediction?”

This is a particular problem in chip manufacturing, where steps such as etch, chemical vapor deposition and back-end services are siloed.

“All of these silos of data are not necessarily working together for the highest yield within process flows,” David said. “What you need in this case is active learning, where machine learning is used to augment the processes you’re already doing. So if you use machine learning to predict yield on wafers, a human needs to review the results. Machine learning can take it 80% of the way, the human can take it the other 20%. So basically you’re using machine learning to make analytics better. Machine learning can get you so far and humans can get you so far. You need to marry the two. That also helps take some of the opaqueness out of machine learning and give control back to customers.”
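The 80/20 split David describes can be sketched as confidence-based triage: the model decides the cases it is sure about and queues the rest for human review. The function, the scoring interface, and the threshold below are illustrative assumptions, not PDF Solutions’ tooling:

```python
# Hypothetical human-in-the-loop triage for ML yield predictions.

def triage_predictions(wafers, score_fn, confidence=0.8):
    """Split wafers into auto-decided and human-review queues.

    wafers:   mapping of wafer id -> feature data
    score_fn: returns (label, confidence) for one wafer's features
    """
    auto, review = [], []
    for wafer_id, features in wafers.items():
        label, conf = score_fn(features)
        if conf >= confidence:
            auto.append((wafer_id, label))   # machine learning takes it this far
        else:
            review.append(wafer_id)          # the human takes the rest
    return auto, review
```

One design point worth noting: routing low-confidence cases to a person, rather than forcing a prediction, is also what gives customers back some control and takes “some of the opaqueness out of machine learning,” as David puts it.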

Testing in context is another important part of the solution. In a planar world, that was largely confined to real-world use cases. As more chips are added into packages, though, and as the transistors and memories move into the third dimension, testing in context becomes more convoluted. With densely packed 3D transistors, different use cases affect dynamic power density, current leakage, heat, stress and aging. This is one of the reasons that so much attention is being paid to processing closer to the end point. If data can be reduced at the source then processing can be spread out across multiple devices rather than concentrated in a single chip or package.

It still is unknown how 7nm logic using finFETs will perform over time in automotive applications, particularly in vehicles used more intensively for taxi services than ones used for commuting and occasional outings. Down time is not an option in those applications, which means there is no time for cooling down circuits.

“We may need to add tests we didn’t think of in the past,” said Lee Harrison, automotive IC marketing manager at Mentor, a Siemens Business. “What that means is test has to be modifiable. This is a lot of work because you are not putting this into final test. You are putting it into the infrastructure. What’s tricky is you have to understand the full function of all the IP to know where to put the safety mechanisms. Maybe the IP provider is not putting it into the final solution, so the infrastructure has to do the final testing. So you’re shipping IP with hooks, but the customer needs to define the final configuration to get the actual numbers.”

TSMC noted a similar trend at its Open Innovation Platform last fall. Suk Lee, senior director of TSMC’s Design Infrastructure Management Division, said customers want test capabilities built into third-party IP.

There are other problems, as well. At each successive process node, variation between what is in the original design and what gets printed on a chip has become more challenging. Much of this can be calculated up front, and the rule deck for designs has been getting larger at each new node. But the impact of variation also can be compounded in advanced packaging where multiple die are involved because those effects are additive. On top of that, dies and various interconnects physically can shift during the packaging phase.

Variation also plays an increasing role in new processes, particularly with the introduction of nanosheet FETs, where a gate-all-around approach is used to control leakage and turn devices completely off.

“We’re seeing some huge challenges in the front-end-of-line for nanosheets in particular, where we are rapidly developing new features and capabilities in our tools to meet those total measurement uncertainty (TMU) requirements,” said Chet Lenox, director of process control solutions for new technology and R&D at KLA. “There are two big challenges the nanosheet is highlighting. They’re challenges we already have, but nanosheets are making them tougher. The first is the z-direction challenge. For years we have cared about the z-axis for things like fin and gate profiles. Those always have been super-important parameters, and we’ve measured them at the top, middle and bottom using optical inspection tools and CD-SEM tools. So that’s not new, but it’s becoming much more critical in terms of making sure those measurements are accurate. In addition to fins and gates, the nanosheet itself adds a more complicated structure in z to those profiles. Those measurements are harder to make. You have a much more complicated 3D structure that has to be modeled, or it has to be fed into a machine learning algorithm to be able to interpret the optical data that’s coming off of the wafer. That’s a much bigger challenge than the old z-height measurements on gates and fins on finFETs.”
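To illustrate why the TMU requirement bites, consider a check of critical dimensions measured at the top, middle and bottom of a structure, where measurement uncertainty eats into the tolerance budget. All names and numbers here are invented for illustration, not KLA’s metrology flow:

```python
# Hypothetical z-profile check: a CD measurement is only trustworthy
# within (tolerance - TMU) of target, so rising TMU shrinks the usable
# process window even when the process itself has not changed.

def check_z_profile(cd_by_height, target, tolerance, tmu):
    """Return the heights whose CD falls outside the usable tolerance.

    cd_by_height: mapping of height label -> measured CD (e.g. in nm)
    target:       nominal CD
    tolerance:    allowed deviation from target
    tmu:          total measurement uncertainty of the tool
    """
    usable = tolerance - tmu
    if usable <= 0:
        raise ValueError("TMU exceeds the tolerance budget")
    return [height for height, cd in cd_by_height.items()
            if abs(cd - target) > usable]
```

With a 12nm target, a 0.5nm tolerance and 0.1nm of TMU, a bottom CD of 12.6nm would be flagged while 12.1nm at the top would pass, which is the practical meaning of the accuracy pressure Lenox describes: as TMU grows, measurements that are actually in spec start failing the usable budget.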

Fig. 1: Latent defects, which were less of a problem in controlled use conditions, can turn into killer defects over time and under stress. Source: KLA

The future
None of this even begins to address such new technologies as silicon photonics, which is being looked at for die-to-die communication within a package. While the rest of the chip industry wrestles with standard ways to test packages, photonics is barely on the radar for most test companies.

“A big issue in photonics is what is the equivalent in the digital world of sidewall roughness,” said Lumerical CTO James Pond. “There has been test and packaging and standardization around that, but in photonics this still needs to be more standardized. There are mechanisms to automate some of that testing, but it has to become more widely adopted. When you test systems, you make optical measurements. That still has to get standardized, and it will take time. But testing and packaging is still a huge fraction of the cost.”

Cryogenic technology, including quantum computing, as well as some other novel techniques, such as DNA storage, open up vast new challenges for reliability, as well. So far, these technologies are still in the lab phase. As they begin rolling out, an entire infrastructure will need to be established to ensure everything works as planned. Viewed from a high level, widespread customization, new structures and a long list of packaging options and approaches have made assessing and improving reliability in ICs much more complicated than in the past, and it is happening at a time when devices are supposed to last longer and behave reliably throughout their projected lifecycle.

Creativity in design is exploding, but there are potential pitfalls to the kind of creativity that had been systematically isolated and minimized while a single roadmap for chip design existed. In its place, external and in-circuit data analytics will be required, coupled with domain expertise in multiple areas and potentially some new tools and methodologies. What those changes ultimately will cost, and how that cost will be amortized across designs, remains to be seen.
