Reliability Costs Becoming Harder To Track

Test, metrology and inspection costs blur as the number of options increases and chips are developed for specific applications.


Ensuring reliability in chips is becoming more complex and significantly more expensive, shifting left into the design cycle and right into the field. But those costs also are becoming more difficult to define and track, varying greatly from one design to the next based upon process node, package technology, market segment, and which fab or OSAT is used.

As the number of options increases for how a chip is designed and assembled, so do the ways in which reliability can be infused into the design-through-manufacturing flows and post-manufacturing monitoring. There are tradeoffs for all of these in terms of effectiveness and cost, and possibilities extend well beyond traditional reliability metrics into areas such as security.

In the past, many of these functions were segmented into well-defined silos, where costs were relatively simple to understand. For example, from the late 1990s to the introduction of finFETs, the cost of test was a flat 2% of the overall design and manufacturing budget. But with systems vendors such as Google, Facebook, and Apple now developing their own customized chips and systems, the economic formulas have changed. This is true for automakers, as well, which are pushing for zero defects over a decade or more in order to avoid costly recalls. For each of these companies, and many others, the price tag for a new chip may be a much smaller percentage of a much larger system, and the cost of reliability will be analyzed in the context of liability if something fails.

Using data from inside a chip or package over its projected lifetime only adds to the pricing confusion. In-chip, in-package, and in-system sensors to monitor heat, electrical activity, and various types of noise rely on data comparisons, in some cases at a multi-device level. That also opens the door to additional kinds of data to identify and track aging of circuits, the impact of latent defects, and even suspicious activity that could indicate a cyberattack.

“This all varies by industry, and it varies by application,” said Keith Schaub, vice president of technology and strategy at Advantest America. “Some industries don’t want to spare the space, and they don’t want the incremental cost. But you have to weigh that against what is the cost to solve the problem, because device architectures are changing so fast in a lot of these markets.”

Smaller, slower, more expensive
At least part of this pricing uncertainty stems from the fact that it’s getting harder to test, inspect, and measure devices at each new process node. The smaller the features and the higher the density, the harder it is to get test probes to make consistent contact, for example. That, in turn, requires more expensive equipment to structure, process, and distill all of that data, and to make high-probability guesses where data doesn’t exist.

At the same time, that equipment needs to be amortized in the context of other savings. If older, fully depreciated equipment can do the job, test costs generally are assumed to be lower. But if new approaches can replace some of the steps done by older equipment, or move chips through the fab more quickly, that needs to be factored into the total cost. And if new equipment can be upgraded over time, the amortization formula needs to reflect that, as well.
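The amortization tradeoff described above can be sketched in a few lines. This is an illustrative model only, and all figures in it are hypothetical, but it shows why a new tester's purchase price has to be weighed against throughput gains and other savings:

```python
# Illustrative sketch (all figures hypothetical): effective per-wafer test
# cost for fully depreciated equipment vs. new equipment whose purchase
# price is amortized over its service life.

def cost_per_wafer(capital_cost, service_years, wafers_per_year,
                   operating_cost_per_wafer, throughput_gain=1.0):
    """Amortized capital plus operating cost, adjusted for throughput.

    throughput_gain > 1.0 models newer equipment moving wafers through
    the fab faster, which spreads fixed costs over more wafers.
    """
    effective_wafers = wafers_per_year * throughput_gain
    amortized = capital_cost / (service_years * effective_wafers)
    return amortized + operating_cost_per_wafer / throughput_gain

# Fully depreciated tester: no capital cost left to amortize.
old = cost_per_wafer(capital_cost=0, service_years=15,
                     wafers_per_year=50_000, operating_cost_per_wafer=4.00)

# New tester: $3M purchase, 15-year life, 30% faster throughput.
new = cost_per_wafer(capital_cost=3_000_000, service_years=15,
                     wafers_per_year=50_000, operating_cost_per_wafer=4.00,
                     throughput_gain=1.3)

print(f"old: ${old:.2f}/wafer, new: ${new:.2f}/wafer")
```

Under these assumed numbers the new tester costs more per wafer, which is exactly why the calculation has to include upgradeability and savings elsewhere in the flow before it pencils out.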

“The cost is usually the (test) system itself, so we have to make sure that the system is future-proof,” said Jens Klattenhoff, vice president and general manager of FormFactor’s Systems Business Unit. “The average age of systems is 15 years, but within 15 years there is a lot of change in the requirements for measurements. You have to make sure that you can modify those systems to the requirements that are maybe coming in five years. That requires a solid platform, so then we can focus on the application layers. It also requires automation, and we’re putting a lot of effort into automation right now, starting with automatic temperature control and realignment.”

The same is true for inspection. The real workhorse in the fab today is optical inspection. But as features shrink and density increases at each new process node, existing inspection technology becomes harder to use, both because of limited resolution and because of throughput. Technology exists for much finer inspection, but it takes more time. E-beam inspection, for example, offers better resolution than optical, with sensitivity down to 1nm, but it is much slower.

To complicate matters, fewer inspections might be required once sufficient volume has been established with an acceptable defectivity rate. And in some markets, only parts of a wafer may be inspected, while in other cases, entire wafers might be randomly sampled.

In other markets, such as automotive or medical, the emphasis on reliability may require much more intensive inspection with more sampling, using either optical or e-beam inspection. It also may require much tighter specifications to reduce margin, and thereby reduce costs in a different way, which is where metrology fits into this picture. The same trends that are prevalent in test and inspection are at work in metrology, with technologies such as atomic force microscopy (AFM), and future technologies such as CD-SAXS, which uses an X-ray source to measure pitch variations.

Take AFM, for example, which plays a critical role in packaging. “One of the critical measurements we’re doing is all the slopes around the pads, and the local topography, all of which are going to affect the bonding capability,” said Sean Hand, senior staff applications scientist at Bruker.

AFM also is used for precise patterning issues on the wafer in a fab. “With AFM, we’re inspecting nominally about a 50µm area over different chips and dies throughout the wafer,” Hand said. “One of the key applications is looking at top-line roughness — being able to correlate line breaks and defectivity in those prints with subsequent defectivity.”

But as costs rise, equipment also is being extended into other areas to improve the return on investment.

“Traditionally, we’ve been developing AFM for depth and roughness, and we expanded to CD,” said Ingo Schmitz, technical marketer at Bruker. “We have been chasing smaller and smaller line trench geometries. Now we’re moving toward measuring things that are more specific to EUV, like top-line roughness because of the off-axis illumination.”

All of this needs to be viewed in a larger context, which is increasing granularity and the ability to pick and choose from a variety of different options, depending upon need. Unlike in the past, one size no longer fits all applications.

“The fragmentation of processes is playing a role here,” said Subodh Kulkarni, CEO of CyberOptics. “If there was some kind of a standardization, it would be a lot easier to design our sensors to accommodate for that. But because every fab and every OSAT seems to follow its own processes, that has led to a plethora of permutations and combinations, making it harder for us. Distances are getting smaller, the number of layers is increasing — all those things are at play. That’s a given. But it’s the fragmentation of how different technologies are playing with each other right now that is causing the confusion and chaos in our area.”

What comes next in many of these technologies isn’t completely clear. In metrology alone, multiple possibilities are on the horizon.

“People are starting to look at X-ray technology, particularly for the hybrid bonding, because once that’s bonded — whether it’s chip-to-wafer or wafer-to-wafer — if there are any voids or organic residue inside the bump pads, that will cause a problem because the resistance will increase,” said Damon Tsai, director of inspection product management at Onto Innovation. “I’ve discussed this with a hybrid bonding customer and they think X-ray, IR, and ultrasonic technology will all be important, but they’re concerned that X-ray will damage the material and that compared to traditional inspection technology, it’s slower.”

The alternatives are infrared inspection, which cannot see through the metal from the top, and ultrasonic, which has experienced issues with reflection and scattering of waves.

“With IR, you have to flip the wafer and inspect it from the back side,” said Tsai. “But if you want to see through the back side, you have to see through the fin, which will decrease the sensitivity level. As a result, we have to work with customers to use a proper fin type. Throughput is still very slow. Typically we will only do the corner inspection by IR. That will improve the throughput a lot. It’s still not the whole wafer. And as for ultrasonic, I still don’t see there’s any path to really improve this technology.”

Longer lifetimes
One of the thorniest problems chipmakers face is longer lifetimes for chips and packages, and in the case of safety-critical applications such as automotive or industrial, these chips also have to function in environments that are sometimes extreme.

“To be able to predict changes in reliability, and in particular changes in performance, you need a baseline understanding of electronics and semiconductors and the ability to extrapolate — and the ability to update those extrapolations with new data,” said Steve Pateras, senior director of marketing for test products at Synopsys. “You need to have robust solutions to do performance optimization, security optimization, and ensure reliability. But it’s not just a question of maintaining reliability. You want to be able to predict things. If a car is going to start failing, and that failure is imminent, you want to get off the road.”

Pateras noted that this stretches from the initial architecture all the way into the field. “You need redundancy, you need a dual-lockstep processor, and you need to be monitoring all of this activity. This will be a requirement. Twenty years ago people were saying they didn’t have room for DFT, or that they were running a business. Now it’s something nobody talks about, and everybody puts it into a chip. In some cases, it can even take up 10% to 20% of the chip area.”

On-chip monitoring is an extension of test, inspection, and metrology. For chips with long lifetimes, or for derivative chips, it also provides a potential loop back into design and manufacturing to avoid future problems. This is particularly important in markets such as automotive, but it’s made more complex by the fact that many of these devices are heterogeneous, combining analog as well as digital chips.

“Test itself is complex, and the fabrication process is so complex that it’s creating defects the test cannot even recognize,” said Uzi Baruch, chief strategy officer at proteanTecs, which uses ‘Agents’ inserted into different areas of a chip and telemetry to output data in real time. “Some of them are sensitive to the variation of the process itself, so they know how to measure the expected performance of that specific chip. Some of them are looking at overall performance and degradation of the chip over time. Some of them are looking at interconnects, specifically chip-to-chip in advanced packaging. And some of them are looking at what is causing a chip to degrade, or the voltage to drop, or things that are related to temperature.”

This becomes even more complex in packages, where the number of options skyrockets, along with the possibility for interactions between different chips. The challenge there is to ensure the chips in the package work as expected, both before and after the assembly process.

“The complexity of the process is being pushed downstream,” said Dave Huntley, business development manager at PDF Solutions and the Single Device Tracking task force leader. “The assembly part of putting all this together with interconnects is becoming a significant contributor to the overall performance. But there’s no point in talking about the performance or reliability of a device if you can’t guarantee the reliability of the things that went into it.”

This is where data analytics fit into the picture, and there are some new alliances forming in the industry around that kind of approach. Advantest’s recent deal with PDF Solutions is a case in point.

“PDF Solutions has a bunch of data exchange (DEX) networks, and all that data is already flowing through those DEX networks,” said Advantest’s Schaub. “Now, we’re adding an edge compute capability throughout our product chain, and you can tie those two things together. They’re actually very complementary. We are really good at the edge, and PDF Solutions is really good in the cloud. You need both. You need to be able to run models at the edge and in the cloud, and if you can connect them, then you can have them feed each other information and improve the models.”

Other costs and options
How all the pieces go together for a particular chip design in a specific market may vary greatly, and so can the costs associated with test, metrology, and inspection. For example, OSAT costs lie in the equipment, the personnel, lab operations (building, power), and some data collection. The costs to the semiconductor company include working with the OSAT to design tests, data analytics and personnel to analyze the data, costs for moving data and product around, the use of OSAT services, and the time it takes to test. Costs go up if chips do not pass, and sometimes even more if they pass but fail in the field.

In the past, many of these steps were well defined. That is no longer the case. Beyond the variety of chips that have existed for many years, newer designs are adding test requirements. Among them:

  • 3D device testing, which includes four test process steps
  • Chiplets
  • SoCs
  • Mixed signal devices, including millimeter wave devices
  • Advanced packaging
  • Large AI chips, often at or beyond the limits of reticles.

AI chips in particular pose yet another test challenge. “It’s very hierarchical,” said Randy Fish, director of marketing for Silicon Lifecycle Management at Synopsys. “What’s interesting is that a lot of these chips are arrayed structures. There are ways you can attack the testing problem that way, and so that is something you can leverage.”

In addition, problems with scaling continue to mount. Defects can increase with the advanced nodes, and not just one at a time. It is normal to have multiple defects happening at the same time. Sometimes what appears to be a single defect is masking multiple defects. That, in turn, can cause fault masking and reinforcing, said Sameer Chillarige, senior software engineering manager at Cadence, in an ETS2021 presentation. Using fault simulation on all possible fault combinations would take too much time.
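The scale problem behind that observation is easy to see with a back-of-the-envelope count. This is a hedged sketch with a hypothetical number of fault sites, but it shows why simulating every possible combination of simultaneous faults is impractical:

```python
from math import comb

# Sketch: why exhaustive multi-fault simulation does not scale.
# fault_sites is hypothetical; real netlists can have far more.
fault_sites = 1_000_000  # candidate stuck-at fault locations in a design

single = fault_sites
pairs = comb(fault_sites, 2)    # simultaneous two-fault combinations
triples = comb(fault_sites, 3)  # three-fault combinations

print(f"single faults: {single:,}")
print(f"fault pairs:   {pairs:,}")    # ~5e11, already intractable
print(f"fault triples: {triples:,}")  # ~1.7e17
```

Even at one million fault sites, the number of two-fault combinations is roughly half a trillion, which is why tools have to be selective rather than exhaustive.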

There also are some new strategies to reduce costs. On the test side, parallel testing has been gaining ground as a way of improving throughput. But tests also can be more efficient, with faster and more accurate diagnoses from analytical data if a chip fails, and design for test techniques can be improved, such as compressing wirelengths and improving scan access.
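The throughput case for parallel (multi-site) testing can be sketched simply. The timings below are hypothetical, and the per-site overhead term is an assumption standing in for the real-world costs that keep multi-site efficiency below the ideal:

```python
# Sketch (hypothetical timings): throughput gain from parallel
# (multi-site) testing. Perfect N-site parallelism would divide test
# time per device by N; per-site overhead erodes that gain.

def devices_per_hour(test_time_s, sites, overhead_per_site_s=0.0):
    """Devices tested per hour on one tester with `sites` sites."""
    insertion_time = test_time_s + overhead_per_site_s * sites
    return 3600.0 / insertion_time * sites

single = devices_per_hour(test_time_s=10, sites=1)
quad = devices_per_hour(test_time_s=10, sites=4, overhead_per_site_s=0.5)

print(single, quad)  # quad sites yield well under a 4x gain per site
```

Under these assumptions, four sites deliver a 3.3x throughput gain rather than 4x, which is the kind of tradeoff that gets folded into the overall cost-of-test calculation.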

Diagnostics throughput — the number of failing devices diagnosed within a given time on compute hardware — is particularly important because thousands of failing chips have to be diagnosed, and that data processed to figure out the systemic issues causing the fails.
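That throughput metric can be expressed directly. The numbers below are hypothetical, and the efficiency factor is an assumption standing in for scheduling, data-movement, and memory overhead, but the sketch shows why larger designs force either smarter tools or more compute:

```python
# Sketch (hypothetical numbers): volume-diagnosis throughput as failing
# devices diagnosed per wall-clock hour across a distributed farm.

def diagnosis_throughput(devices_per_job_hour, workers, efficiency=1.0):
    """Failing devices diagnosed per hour across the farm.

    efficiency < 1.0 models overhead (scheduling, data movement,
    memory pressure) that grows with design size and compression ratio.
    """
    return devices_per_job_hour * workers * efficiency

baseline = diagnosis_throughput(devices_per_job_hour=2, workers=100)

# Larger design: each job slows down and farm efficiency drops, so
# holding throughput flat means smarter tools or many more workers.
larger_design = diagnosis_throughput(devices_per_job_hour=0.5,
                                     workers=100, efficiency=0.8)

print(baseline, larger_design)
```

In this toy case throughput drops from 200 to 40 devices per hour, a 5x gap that would take 5x the compute to close by brute force.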

“This methodology is called volume diagnosis methodology,” said Chillarige. “To achieve high throughputs in volume diagnosis methodology, our customers have been running diagnosis jobs on distributed farms for many, many years. However, with increasing design sizes, introduction of advanced fault models, increasing compression ratios, diagnosis run times and memory consumption have shot up significantly. To maintain the throughput, either diagnostics tools have to become smart or customers have to add significantly higher compute resources to meet the demands, which is not always practical.”

Reliability costs are no longer segmented as they were in the past. The effort to improve reliability starts all the way at the front end of the design flow in the architecture and design for test blueprint for that design, and it continues throughout the lifetime of a chip, package or system.

How the pieces go together will vary greatly by market, by design, and by the overall budget for that chip or package. As a percentage of total chip development cost, ensuring reliability always will be more costly in safety- and mission-critical applications. But as a percentage of overall system cost, those numbers can vary widely.

“The idea is to eliminate the fixed cost across devices and amortize the variable cost across all devices,” said Chillarige. But just how important that is to different companies and applications is relative to what they’re trying to achieve.
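One way to read that framing is the standard amortization arithmetic, sketched below with hypothetical figures: fixed costs such as test-program development are spread across the whole device volume, while variable per-device costs are not, so volume dominates what reliability costs per part:

```python
# Sketch (figures hypothetical): how fixed costs amortize across volume
# while variable per-device costs do not.

def cost_per_device(fixed_cost, variable_cost_per_device, volume):
    """Total reliability-related cost allocated to each device."""
    return fixed_cost / volume + variable_cost_per_device

# The same $500K fixed cost at two different volumes:
low_volume  = cost_per_device(500_000, 1.50, volume=100_000)     # 6.50
high_volume = cost_per_device(500_000, 1.50, volume=10_000_000)  # 1.55

print(low_volume, high_volume)
```

The same reliability investment that looks prohibitive on a low-volume part nearly disappears at high volume, which is one reason the calculus differs so much between companies and applications.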

Related stories
Recalculating The Cost Of Test
Why it’s becoming so difficult to put a number on a well-known process.

Test Costs Spiking
Use of more complex chips in safety- and mission-critical markets is changing pricing formulas for manufacturing.
