Improving Yield, Reliability With Data

Outlier detection gaining attention as way of improving test and manufacturing methodologies.


Big data techniques for sorting through massive amounts of data to identify aberrations are beginning to find a home in semiconductor manufacturing, fueled by new requirements in safety-critical markets such as automotive as well as the rising price of packaged chips in smartphones.

Outlier detection, the process of finding data points outside the normal distribution, isn’t a new idea. It has been used in one form or another for years. But as the need for quality and reliability increases, more formalized approaches and tools are emerging and being deployed.

In the automotive industry, part of this is being driven by the Automotive Electronics Council (AEC), which establishes recommended criteria for parts used in critical automotive systems such as airbag sensors and automatic braking system controls. Automotive OEMs have passed these requirements to their semiconductor suppliers.

“People building chips for automotive companies have had to do this additional step, which is to weed out the outliers based on a higher probability of failure down the road,” noted Wesley Smith, product director for the Quantix group at Mentor, a Siemens Business. “But it’s kind of a black art. You’re asking people to throw away parts that pass all of their specifications and seem like otherwise perfectly good parts, but which are just a little bit different from parts in the same wafer or lot. Some people do it even though they don’t believe in it because their customers demand it.”

This doesn’t always go smoothly, either. Customers often are not willing to pay for what is thrown away.

Automotive is just one place where this is happening. Top smartphone manufacturers have jumped in, as well, insisting their suppliers meet very stringent reliability guidelines.

“When phones come back to the shop they have to replace them, and it hurts their brand reputation,” Smith said. “The leading phone manufacturers are pushing their suppliers, bringing a resurgence of need for outlier removal driven by super-high volumes. In the context of 10 billion phones, if you consider that even if 1 in 1,000 came back, it’s still a huge number.”

Test vs. outlier detection

Some of these problems are caught during manufacturing or test. Outlier detection takes that a step further, digging into the root cause of the problems rather than just making sure a design works as intended.

“If there’s a fundamental problem with the chip being designed, that’s not what manufacturing is doing,” said David Park, vice president of worldwide marketing at Optimal+. “The concerns on the manufacturing side include making sure there is nothing being shipped that doesn’t meet the quality requirements of the semiconductor vendor. That’s very different than reliability. If someone builds a chip that is inherently unreliable, such as it’s supposed to last for 10 years and realistically only lasts for three years, the manufacturing test process will say, ‘It passed all the tests so it’s good, it functions.’ Reliability is an add-on to that, which is why burn-in test is done to make sure the chip lasts as long as the expected lifetime. If a chip doesn’t have fail-safe measures, you can do burn-in test. But without the proper design methodologies being applied, a la ISO 26262, the chip can fail abruptly. The manufacturing process isn’t going to be able to verify that.”

This is where outlier detection practices come into play.

“With the concept of parts per million (PPM), the question is whether your manufacturing process is good enough so that the vast majority of what you’re shipping is good,” said Park. “This is where big data analytics come in. You want to ensure those devices will function the way the designers and the company expects them to function. A ‘quality firewall’ is essential for semiconductor manufacturers to make sure they are shipping only the highest quality devices, which is different than the highest reliability devices. Manufacturing does not guarantee reliability. It guarantees quality. The quality is still important, because you don’t want devices that are going to fail early. And a big part of quality test is outlier detection, which is a way of looking at all of the good parts because everything has passed every test. But when they are all lined up, there are some that will not look like the others. It doesn’t mean they are bad. It just means they are different. In the concept of manufacturing where you are hoping to manufacture everything that is the same, the outlier detection is really key.”

Using data more effectively
The key to making all of this work is combing through data that already exists for discrepancies. That data comes from a variety of sources, including wafer sort, final test, system-level test and burn-in test, said Karthik Ranganathan, director of engineering at Astronics.

This is a common practice at the wafer sort stage, Ranganathan noted. “Let’s say a particular die is bad. The [test engineers] have modified the wafer map to have all the adjacent die around it also marked ‘bad.’ Or they do an Fmax (frequency maximum) distribution at the wafer level, and specify that anything off by three sigma in terms of Fmax from the mean will be re-marked on the wafer map so that all those die are deemed bad. It’s fairly easy to do this on a wafer-sort level before the die gets packaged.”
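The two wafer-sort screens Ranganathan describes can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names, the (x, y) die coordinates, the 5x5 wafer and the Fmax values in MHz are all hypothetical, and "adjacent" is assumed here to mean the eight surrounding die.

```python
from statistics import mean, stdev

def three_sigma_outliers(fmax_by_die, k=3.0):
    """Return die whose Fmax is more than k sigma from the wafer mean.

    fmax_by_die maps hypothetical (x, y) die coordinates to Fmax values.
    """
    values = list(fmax_by_die.values())
    mu, sigma = mean(values), stdev(values)
    return {die for die, f in fmax_by_die.items() if abs(f - mu) > k * sigma}

def mark_neighbors(bad_die, all_die):
    """Return bad_die plus every adjacent die that exists on the wafer.

    Assumption: "adjacent" means the 8 surrounding positions.
    """
    marked = set(bad_die)
    for (x, y) in bad_die:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbor = (x + dx, y + dy)
                if neighbor in all_die:
                    marked.add(neighbor)
    return marked

# Made-up 5x5 wafer: every die passes spec, but one runs noticeably slow.
all_die = {(x, y) for x in range(5) for y in range(5)}
fmax = {(x, y): 2000.0 + (y - 2) * 5 for (x, y) in all_die}
fmax[(2, 2)] = 1700.0

outliers = three_sigma_outliers(fmax)
remarked = mark_neighbors(outliers, all_die)
print("outlier die:", sorted(outliers))
print("die re-marked bad on the wafer map:", sorted(remarked))
```

On this made-up wafer the screen flags only die (2, 2), and the neighbor rule re-marks nine die in total. That is the trade-off the article describes: eight parts that passed every test are discarded to remove one suspicious one.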

Once the die is packaged, a database must be used to deal with the volume of data. But there is still a gap that needs to be closed.

“There is a fundamental flaw in the architecture,” he said. “You can only use the parametric data collected in all of the preceding test steps. You can’t use the data collected in the current test step because you don’t have all the other pieces of data available. That has prevented a lot of people from going mass scale with this approach. I’ve seen this at the wafer level, I’ve seen it at final test, at system-level test and burn-in. People use the data from final test to drive a binning decision. However if they wanted to use the data from system-level test or from burn-in to drive a binning decision, highly quality conscious customers do that by adding an additional [outlier detection] test step or an additional handling step. If they post-process the data after system-level test is done, and after burn in is done, then you introduce another step to be able to do this kind of work.”

Fig. 1: Multiple device volume diagnostics. Source: Cadence

Setting expectations
The advantage of screening after packaging is that it takes into account anything that’s been introduced in the packaging process, Mentor’s Smith pointed out. “If you introduced any false marginalities when you went through packaging and handling of some form, then you do this as a final test. The downside is that if you bin something out you are losing more than a die. You’re losing a packaged part that has accumulated a lot more value than just raw die because it has gone through that many more steps. It’s generally more cost-effective to catch them at the wafer level, but at the final test level some people don’t do wafer sort. They do blind build, where it just goes directly into packages and they only test it once. In that case you don’t have much choice.”

The key here is understanding the real value of outlier detection, and that may vary by customer or by market.

“You need to understand what is the main objective,” said Calvin Cheung, vice president of business development and engineering at ASE. “What you’re trying to do is make sure you have enough margin in your design even if you encounter the worst corners. This is essential because there is less time to do full characterization in some cases where companies are rushing products to market.”

But fitting this into the design-through-manufacturing flow requires some planning.

“The ideal solution would allow this kind of post processing on the fly without the need to add a test insert,” Ranganathan said. “That’s what the customer is looking for. Until then, this is a niche feature that some military customers, quality-conscious automotive customers or high-end server customers will use. But it’s not going to receive mass adoption at the package level.”

Specifically, what needs to be addressed is the cost of test associated with outlier screening, he said. “Most companies are happy to invest in a database. They want traceability for multiple reasons. For instance, if they have an RMA come back, they want to be able to track that die to everything. They want to use that data for yield improvement, among other things. What has been a fundamental issue is the addition of both the test step or a handling step to do this, and the time spent on that particular step. If you have to run a script while system-level test or burn-in is being done, and that adds even five seconds of time to parse through this data while the DUT is sitting in the socket, that’s completely unacceptable to the customer. That’s taking away time from the test itself.”

Root-cause analysis
Achieving reliability and quality requirements also depends on understanding why a device failed to begin with. This is the focus of root-cause analysis, which is another aspect of outlier detection.

“Outlier detection is like a detective game,” pointed out Rob Knoth, product manager for digital and signoff in test and automotive at Cadence. “There may or may not be a problem. Its severity may or may not take down a system. It can still be a usable system but just not a top bin system. This is a great area for EDA to help with because there are a lot of different ways that you can attack this in terms of what is the root cause of the failure or a performance degradation.”

This is more than just identifying a bad part.

“That’s interesting and it’s good to know so you don’t ship a bad part and a customer pays for it and they get mad, but what’s more interesting is why,” said Knoth. “Why is it a bad part? Preventing a bad part being shipped is important but it’s way more valuable to the customer and the foundry to know why. Are they having a problem with a via on a certain layer? Or is the routing of a certain block too congested, and maybe they should give it more room? Or did the foundry define too aggressive spacing rules?”

There are two ways to look at root-cause analysis—single die/precision and volume, which involves looking at massive numbers of die.

From the ‘precision’ angle, the concepts of accuracy and resolution are critical. “Accuracy is about, ‘This is why I think there is a failure.’ How right are you? It is essentially a scoring. I can tell you what I think is the right answer, but if it is wrong I’m not helping you. Then, there’s resolution. Because we are gathering statistics I’m going to come up with maybe 5 or 6 or 10 or however many potential causes for failure. The lower number of those the better, because then there are fewer to investigate. We want to strive for high accuracy, but a very fine resolution so that it’s not like, ‘Here’s the 50 reasons you could be having a failure.’ That doesn’t help. We have found it’s very helpful not to look at the problem in an isolated context. Very tight integration between custom layout tools and digital analysis is very key.”

He noted that many customers assume the foundry will take care of problems with yield ramp. “The ones who do step up to the plate and really invest in their own understanding of diagnostics — this is fabless companies — they get a return on investment because they are informed participants in that discussion, whether it’s with a customer with a field return or packaging or foundry in terms of yield ramp and getting a product to volume. Very quickly they see that investment of a different piece of EDA infrastructure and a little bit of engineering time. That’s money in the bank as far as time to market, reliability and quality. It lets them be informed participants and not passengers.”

Ultimately what this all comes down to is the fact that outliers do count as yield loss. For multiple reasons, fewer is better.

“The more centered your process is, and the more in control it is, the less likely you’re going to have outliers,” Smith said. “The most enlightened customers take the outlier data and they use it not just for binning, but to feed back to the fab process to figure out how things could be improved and more controlled in the manufacturing process. If you use the data well, instead of seeing additional yield loss you can actually gain yield, because in the process of minimizing your outliers you are also minimizing the failed die. Failed die automatically gets thrown away but by working on the outlier problem you may end up seeing fewer failed die.”

He suggested a future direction in all of this may include automating the process by which RMA parts are analyzed to automatically figure out which kind of algorithms might have caught them rather than having to do that as a manual process. “If all the data is stored, you could basically reprocess that data multiple times and come up with an answer saying, ‘No, we couldn’t have detected it,’ or, ‘It looks like we could have.’ And if you use multi-variant PAT (part average test) instead of straight dynamic PAT, what would be the impact on all the other die? Let’s say you analyze those 50, but what about all the other 5,000 in the same lot? How much yield would you have given up if you ran that? So you run it against all the parts from the same wafer or lot and figure out if the yield loss would be acceptable and you can balance those two. There could also be some applications of machine learning and those kinds of things to help figure this stuff out.”
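The replay Smith describes, re-running a screen over stored lot data to see which returned parts it would have caught and how much yield it would have cost, might look roughly like the Python sketch below. Everything in it is an assumption for illustration: the function names, the die IDs and the measurements are hypothetical, and the limits use one common robust formulation of dynamic PAT (median plus or minus k times IQR/1.35), which is not necessarily the algorithm any particular tool applies. Only a single-parameter screen is shown, with no multivariate version.

```python
from statistics import median, quantiles

def dynamic_pat_limits(values, k=6.0):
    """Robust screening limits: median +/- k * (IQR / 1.35).

    Assumption: this robust-sigma form stands in for whatever a given
    test flow actually uses.
    """
    q1, _, q3 = quantiles(values, n=4)
    robust_sigma = (q3 - q1) / 1.35
    center = median(values)
    return center - k * robust_sigma, center + k * robust_sigma

def replay_screen(lot, rma_ids, k=6.0):
    """Re-run the screen on stored data for one lot.

    lot:     dict of die id -> stored measurement (all die passed spec)
    rma_ids: set of die ids that later came back as field returns
    """
    lo, hi = dynamic_pat_limits(list(lot.values()), k)
    flagged = {die for die, v in lot.items() if not lo <= v <= hi}
    return {
        "rma_caught": sorted(flagged & rma_ids),
        "rma_missed": sorted(rma_ids - flagged),
        # Fraction of the lot the screen would have thrown away.
        "yield_loss": len(flagged) / len(lot),
    }

# Made-up lot of 100 passing die; die 7 and die 42 later came back as RMAs.
lot = {i: round(1.00 + (i % 11 - 5) * 0.01, 2) for i in range(100)}
lot[7] = 1.40  # passed spec, but far from the rest of the lot
rma_ids = {7, 42}

for k in (6.0, 1.0):
    print(k, replay_screen(lot, rma_ids, k))
```

Running the same stored data through several settings of k is the balancing act in the quote. On this made-up lot a much tighter limit discards far more die without catching the second return, which answers “No, we couldn’t have detected it” for that part with this single parameter.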

Much of this process is not automated today. But there are opportunities to share data through the test and manufacturing process, and possibly even back up to initial design so that problems can be avoided from the outset. And there are plenty of people looking at how all of these pieces can be brought together much better in the future.
